The Storage Insights datasets feature helps you understand, organize, and manage your data at scale. You can choose an organization, or one or more projects or folders containing the buckets and objects whose metadata you want to index. A queryable metadata index for the included buckets and objects is made available as a BigQuery linked dataset.
To get insights about your Cloud Storage resources exported to BigQuery, use Storage Insights datasets. These insights can help you with data exploration, cost optimization, security enforcement, and governance. Storage Insights datasets are available only through the Storage Intelligence subscription.
Overview
A Storage Insights dataset is a rolling snapshot of metadata for all the buckets and objects within one or multiple specified source projects within an organization. The information provided by datasets lets you better understand and routinely audit your Cloud Storage data.
To create a dataset, you first create a dataset configuration in a project. You can choose an organization, or one or multiple projects or folders containing buckets and objects you want to view the metadata for. The dataset configuration generates datasets daily. Both dataset configurations and datasets are resources stored within Cloud Storage.
To view a dataset, you must first link the dataset to BigQuery.
Dataset configuration properties
When you create a dataset configuration, you set the following properties of the dataset:
Name: a name that's used to reference the dataset. Names identify dataset configurations and cannot be changed after the configuration is created. A name can contain up to 128 characters, using letters, numbers, and underscores, and must begin with a letter.
Description (optional): a description of the dataset. You can edit the description at any time.
Dataset scope: an organization, projects, or folders containing the buckets and objects for which you want metadata. You can specify projects or folders individually or as a CSV file, with each project or folder number on a separate line. You can specify up to 10,000 projects or folders in one dataset configuration.
Bucket filters (optional): filters used to include and exclude specific buckets from the dataset by bucket name or by regions.
Retention period: the number of days of data that the dataset captures and retains, including the creation date of the dataset. Datasets update with metadata every 24 hours and can retain data for up to 90 days. Data that falls outside the retention window is automatically deleted. For example, suppose you have a dataset that was created on October 1, 2023 with a retention window set to 30 days. On October 30, the dataset reflects the past 30 days of data, from October 1 to October 30. On October 31, the dataset reflects the data from October 2 to October 31. You can modify the retention window at any time.
Location: a location to store the dataset and its data. For example, us-central1. The location must be supported by BigQuery. We recommend that you select the location of your BigQuery tables, if you have any.
Service agent type: either a configuration-scoped service agent or a project-scoped service agent.
Creating a dataset configuration provisions a service agent for you. To generate datasets, the service agent must be granted the required permissions to read data from your Cloud Storage buckets.
A project-scoped service agent can access and write datasets that are generated from all the dataset configurations in the project. For example, if you have multiple dataset configurations within a project, then you only need to grant required permissions to the project-scoped service agent once for it to be able to read and write datasets for all the dataset configurations within the project. For more information about the permissions required to read and write datasets, see Permissions. When a dataset configuration is deleted, the project-scoped service agent is not deleted.
A configuration-scoped service agent can only access and write the dataset generated by the particular dataset configuration. This means if you have multiple dataset configurations, you'll need to grant required permissions to each configuration-scoped service agent. When a dataset configuration is deleted, the configuration-scoped service agent is deleted.
Link the dataset to BigQuery after creating a dataset configuration. Linking a dataset to BigQuery creates a linked dataset in BigQuery for querying. You can link or unlink the dataset at any point.
For more information about the properties you set when creating or updating a dataset configuration, see the DatasetConfigs resource in the JSON API documentation.
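As a quick illustration of the constraints described above, the following sketch validates candidate dataset configuration values locally before you submit them. The helper names and the minimum retention of one day are assumptions for illustration, not part of any Cloud Storage API.

```python
import re

# Constraints paraphrased from the dataset configuration properties above.
# A minimum retention of 1 day is an assumption; the docs only state the
# 90-day maximum.
MAX_NAME_LENGTH = 128
MAX_RETENTION_DAYS = 90
MAX_SCOPE_ENTRIES = 10_000

def validate_dataset_config(name: str, retention_days: int, scope: list) -> list:
    """Return a list of human-readable problems; empty means the values look valid."""
    problems = []
    # Names must begin with a letter and use only letters, numbers, underscores.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        problems.append("name must start with a letter and use only letters, "
                        "numbers, and underscores")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name must be at most {MAX_NAME_LENGTH} characters")
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        problems.append(f"retention period must be between 1 and {MAX_RETENTION_DAYS} days")
    if len(scope) > MAX_SCOPE_ENTRIES:
        problems.append(f"at most {MAX_SCOPE_ENTRIES} projects or folders "
                        "per configuration")
    return problems

print(validate_dataset_config("insights_config_1", 30, ["123456"]))  # []
```

A check like this only mirrors the documented limits; the service performs its own authoritative validation when the configuration is created.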
Supported locations
The following BigQuery locations are supported for creating linked datasets:
EU
US
asia-southeast1
europe-west1
us-central1
us-east1
us-east4
Dataset schema of metadata
The following metadata fields are included in datasets. For more information about BigQuery column modes, see Modes. The column modes determine how BigQuery stores and queries the data.
The snapshotTime field stores the time of the bucket metadata snapshot refresh in RFC 3339 format.
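RFC 3339 timestamps such as snapshotTime can be parsed with the Python standard library alone; this small sketch assumes a typical UTC-suffixed value.

```python
from datetime import datetime, timezone

def parse_snapshot_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as the snapshotTime column.

    datetime.fromisoformat accepts a trailing 'Z' directly on Python 3.11+;
    rewriting it as '+00:00' keeps the sketch working on older versions.
    """
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

ts = parse_snapshot_time("2023-10-01T06:00:00Z")
print(ts.astimezone(timezone.utc).date())  # 2023-10-01
```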
Bucket metadata
Unless otherwise noted, you can find more detailed descriptions of the following bucket metadata fields by referring to the Buckets resource representation for the JSON API.
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
name | NULLABLE | STRING |
location | NULLABLE | STRING |
project | NULLABLE | INTEGER |
storageClass | NULLABLE | STRING |
versioning | NULLABLE | BOOLEAN |
lifecycle | NULLABLE | BOOLEAN |
metageneration | NULLABLE | INTEGER |
timeCreated | NULLABLE | TIMESTAMP |
public | NULLABLE | RECORD |
public.bucketPolicyOnly | NULLABLE | BOOLEAN |
public.publicAccessPrevention | NULLABLE | STRING |
autoclass | NULLABLE | RECORD |
autoclass.enabled | NULLABLE | BOOLEAN |
autoclass.toggleTime | NULLABLE | TIMESTAMP |
softDeletePolicy | NULLABLE | RECORD |
softDeletePolicy.effectiveTime | NULLABLE | DATETIME |
softDeletePolicy.retentionDurationSeconds | NULLABLE | INTEGER |
tags* | NULLABLE | RECORD |
tags.lastUpdatedTime | NULLABLE | TIMESTAMP |
tags.tagMap | REPEATED | RECORD |
tags.tagMap.key | NULLABLE | STRING |
tags.tagMap.value | NULLABLE | STRING |
labels | REPEATED | RECORD |
labels.key | NULLABLE | STRING |
labels.value | NULLABLE | STRING |
* The bucket's tags. For more information, see Cloud Resource Manager API.
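Once the dataset is linked, the bucket snapshot can be queried with standard SQL. The sketch below only builds a query string that lists buckets not protected by public access prevention; the fully qualified view name passed in is a placeholder for your own linked dataset, and the view name `bucket_attributes_view` is an assumption for illustration.

```python
def public_buckets_query(table: str) -> str:
    """Build a SQL query listing buckets whose public access prevention
    isn't enforced, using the bucket metadata columns described above.

    `table` is a placeholder for the fully qualified linked-dataset view,
    e.g. "my_project.my_linked_dataset.bucket_attributes_view".
    """
    return f"""
SELECT name, location, storageClass, public.publicAccessPrevention
FROM `{table}`
WHERE public.publicAccessPrevention != 'enforced'
ORDER BY name
""".strip()

sql = public_buckets_query("my_project.my_linked_dataset.bucket_attributes_view")
print(sql)
```

You would run the resulting string in the BigQuery console or through a BigQuery client library against your own linked dataset.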
Object metadata
Unless otherwise noted, you can find more detailed descriptions of the following object metadata fields by referring to the Objects resource representation for the JSON API.
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
bucket | NULLABLE | STRING |
location | NULLABLE | STRING |
componentCount | NULLABLE | INTEGER |
contentDisposition | NULLABLE | STRING |
contentEncoding | NULLABLE | STRING |
contentLanguage | NULLABLE | STRING |
contentType | NULLABLE | STRING |
crc32c | NULLABLE | INTEGER |
customTime | NULLABLE | TIMESTAMP |
etag | NULLABLE | STRING |
eventBasedHold | NULLABLE | BOOLEAN |
generation | NULLABLE | INTEGER |
md5Hash | NULLABLE | STRING |
mediaLink | NULLABLE | STRING |
metageneration | NULLABLE | INTEGER |
name | NULLABLE | STRING |
selfLink | NULLABLE | STRING |
size | NULLABLE | INTEGER |
storageClass | NULLABLE | STRING |
temporaryHold | NULLABLE | BOOLEAN |
timeCreated | NULLABLE | TIMESTAMP |
timeDeleted | NULLABLE | TIMESTAMP |
updated | NULLABLE | TIMESTAMP |
timeStorageClassUpdated | NULLABLE | TIMESTAMP |
retentionExpirationTime | NULLABLE | TIMESTAMP |
softDeleteTime | NULLABLE | DATETIME |
hardDeleteTime | NULLABLE | DATETIME |
metadata | REPEATED | RECORD |
metadata.key | NULLABLE | STRING |
metadata.value | NULLABLE | STRING |
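As a quick illustration of what the object metadata enables, the following sketch totals object size per storage class over rows shaped like the columns above, mirroring what a GROUP BY storageClass query against the linked dataset would compute. The sample rows are invented for illustration.

```python
from collections import defaultdict

def size_by_storage_class(rows):
    """Sum object `size` per `storageClass`, mirroring a
    GROUP BY storageClass aggregation over the object metadata."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["storageClass"]] += row["size"]
    return dict(totals)

# Invented sample rows shaped like the object metadata columns above.
rows = [
    {"name": "a.csv", "storageClass": "STANDARD", "size": 1_000},
    {"name": "b.csv", "storageClass": "STANDARD", "size": 2_500},
    {"name": "c.bak", "storageClass": "ARCHIVE", "size": 9_000},
]
print(size_by_storage_class(rows))  # {'STANDARD': 3500, 'ARCHIVE': 9000}
```

In practice you would push this aggregation down into BigQuery rather than fetching rows, but the shape of the result is the same.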
Project metadata
The project metadata is exposed as a view named project_attributes_view in the linked dataset:
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
name | NULLABLE | STRING |
id | NULLABLE | STRING |
number | NULLABLE | INTEGER |
Dataset schema for events and errors
In the linked dataset, you can also view the snapshot processing events and errors in the events_view and error_attributes_view views. To learn how to troubleshoot the snapshot processing errors, see Troubleshoot dataset errors.
Events log
You can view event logs in the events_view view in the linked dataset:
Column name | Mode | Type | Description |
---|---|---|---|
manifest.snapshotTime | NULLABLE | TIMESTAMP | The time, in RFC 3339 format, at which the snapshot of the events was refreshed. |
manifest.viewName | NULLABLE | STRING | The name of the view that was refreshed. |
manifest.location | NULLABLE | STRING | The source location of the data that was refreshed. |
eventTime | NULLABLE | STRING | The time at which the event occurred. |
eventCode | NULLABLE | STRING | The event code associated with the corresponding entry. Event code 1 indicates that the manifest.viewName view was refreshed with all entries for the source location manifest.location within snapshot manifest.snapshotTime. |
Error codes
You can view error codes in the error_attributes_view view in the linked dataset:
Column name | Mode | Type | Description |
---|---|---|---|
errorCode | NULLABLE | INTEGER | The error code associated with this entry. For a list of valid values and how to resolve them, see Troubleshoot dataset errors. |
errorSource | NULLABLE | STRING | The source of the error. Valid value: CONFIGURATION_PREPROCESSING. |
errorTime | NULLABLE | TIMESTAMP | The time at which the error occurred. |
sourceGcsLocation | NULLABLE | STRING | The source Cloud Storage location of the error. For projects, this field is null because projects don't have a location. |
bucketErrorRecord.bucketName | NULLABLE | STRING | The name of the bucket involved in the error. You can use this information to debug a bucket error. |
bucketErrorRecord.serviceAccount | NULLABLE | STRING | The service account that needs permission to ingest objects from the bucket. You can use this information to debug a bucket error. |
projectErrorRecord.projectNumber | NULLABLE | INTEGER | The number of the project involved in the error. You can use this information to debug a project error. |
projectErrorRecord.organizationName | NULLABLE | STRING | The number of the organization the project must belong to in order to be processed. A value of 0 indicates that the dataset is not in an organization. You can use this information to debug a project error. |
Troubleshoot dataset errors
To learn how to troubleshoot the snapshot processing errors that are logged in the error_attributes_view view in the linked dataset, see the following table:
Error code | Error case | Error message | Troubleshooting |
---|---|---|---|
1 | Source project doesn't belong to the organization | Source project projectErrorRecord.projectNumber doesn't belong to the organization projectErrorRecord.organizationName. | Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions, see Migrate projects between organizations. |
2 | Bucket authorization error | Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName. | Grant service account bucketErrorRecord.serviceAccount the Identity and Access Management (IAM) permissions needed to ingest objects from bucket bucketErrorRecord.bucketName. For more information, see Grant required permissions to the service agent. |
3 | Destination project doesn't belong to the organization | Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName. | Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions, see Migrate projects between organizations. |
4 | Source project doesn't have Storage Intelligence configured | Source project projectErrorRecord.projectNumber doesn't have Storage Intelligence configured. | Configure Storage Intelligence for source project projectErrorRecord.projectNumber. For more information, see Configure and manage Storage Intelligence. |
5 | Bucket doesn't have Storage Intelligence configured | Bucket bucketErrorRecord.bucketName doesn't have Storage Intelligence configured. | Configure Storage Intelligence for bucket bucketErrorRecord.bucketName. For more information, see Configure and manage Storage Intelligence. |
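When post-processing error_attributes_view rows, the table above can be folded into a small lookup. The action summaries below paraphrase the troubleshooting column, and the helper itself is illustrative rather than part of any API.

```python
# Action summaries paraphrased from the troubleshooting table above.
TROUBLESHOOTING = {
    1: "Add the source project to the organization.",
    2: "Grant the service agent IAM permissions on the bucket.",
    3: "Add the destination project to the organization.",
    4: "Configure Storage Intelligence for the source project.",
    5: "Configure Storage Intelligence for the bucket.",
}

def troubleshoot(error_code: int) -> str:
    """Map an errorCode from error_attributes_view to a suggested action."""
    return TROUBLESHOOTING.get(error_code,
                               "Unknown error code; see Troubleshoot dataset errors.")

print(troubleshoot(2))  # Grant the service agent IAM permissions on the bucket.
```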
Considerations
Consider the following for dataset configurations:
When you rename a folder in a bucket with hierarchical namespace enabled, the names of the objects in that folder change. When these object snapshots are ingested, they appear as new entries in the linked dataset.
Datasets are supported only in the BigQuery locations listed in Supported locations.
What's next
- Use Storage Insights datasets.
- Learn about Storage Intelligence.
- Run SQL queries on the datasets in BigQuery.
- Learn about BigQuery analytics.