Storage Insights datasets

The Storage Insights datasets feature helps you understand, organize, and manage your data at scale. You can choose an organization, or one or more projects or folders, containing the buckets and objects whose metadata you want to view. A queryable metadata index for the included buckets and objects is made available as a BigQuery linked dataset.

If you want insights about your Cloud Storage resources that are exported to BigQuery, use Storage Insights datasets. These insights can help you with data exploration, cost optimization, security enforcement, and governance implementation. Storage Insights datasets is available only through the Storage Intelligence subscription.

Overview

A Storage Insights dataset is a rolling snapshot of metadata for all the buckets and objects within one or multiple specified source projects within an organization. The information provided by datasets lets you better understand and routinely audit your Cloud Storage data.

To create a dataset, you first create a dataset configuration in a project. You can choose an organization, or one or multiple projects or folders containing buckets and objects you want to view the metadata for. The dataset configuration generates datasets daily. Both dataset configurations and datasets are resources stored within Cloud Storage.

To view a dataset, you must first link the dataset to BigQuery.

Dataset configuration properties

When you create a dataset configuration, you set these properties of the dataset. It can take up to 48 hours for you to see the first data populated as a linked dataset in BigQuery after you configure the dataset. Any newly added objects or buckets are included in the next daily snapshot.

  • Name: a name that's used to reference the dataset. Names are used as the identifier of dataset configurations and cannot be changed after the configuration is created. The name can contain up to 128 characters, consisting of letters, numbers, and underscores, and must begin with a letter.

  • Description (optional): a description of the dataset. You can edit the description at any time.

  • Dataset scope: a required field that specifies an organization, projects, or folders containing the buckets and objects for which you want metadata. You can specify projects or folders individually or as a CSV file, with each project or folder number on a separate line. You can specify up to 10,000 projects or folders in one dataset configuration. Datasets are configured for the specified dataset scope. Only one dataset scope can be specified for each dataset configuration. You can update the dataset scope when editing the dataset configuration.

  • Bucket filters (optional): filters used to include and exclude specific buckets from the dataset by bucket name or by regions.

  • Retention period: the number of days that the dataset captures and retains data for, including the creation date of the dataset. Datasets update with metadata every 24 hours and can retain data for up to 90 days. Data captured outside of the retention period is automatically deleted. For example, suppose you have a dataset that was created on October 1, 2023 with a retention period of 30 days. On October 30, the dataset reflects the past 30 days of data, from October 1 to October 30. On October 31, the dataset reflects the data from October 2 to October 31. You can modify the retention period at any time.

  • Location: a location to store the dataset and its data. For example, us-central1. The location must be supported by BigQuery. We recommend that you select the location of your BigQuery tables, if you have any.

  • Service agent type: either a configuration-scoped service agent or a project-scoped service agent.

    Creating a dataset configuration provisions a service agent for you. In order to read and write datasets, the service agent must be granted the required permissions.

    A project-scoped service agent can access and write datasets that are generated from all the dataset configurations in the project. For example, if you have multiple dataset configurations within a project, then you only need to grant required permissions to the project-scoped service agent once for it to be able to read and write datasets for all the dataset configurations within the project. When a dataset configuration is deleted, the project-scoped service agent is not deleted.

    A configuration-scoped service agent can only access and write the dataset generated by the particular dataset configuration. This means if you have multiple dataset configurations, you'll need to grant required permissions to each configuration-scoped service agent. When a dataset configuration is deleted, the configuration-scoped service agent is deleted.
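
The naming rules listed under Name can be checked locally before a configuration is created. The following is a minimal sketch, not part of any Storage Insights client library: the function name is_valid_dataset_config_name is hypothetical, and the pattern assumes that "letters" means ASCII letters.

```python
import re

# Assumption: "letters" means ASCII letters (A-Z, a-z).
# One leading letter plus up to 127 further characters gives the 128-character limit.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_dataset_config_name(name: str) -> bool:
    """Return True if name follows the naming rules described above."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_dataset_1", "1_dataset", "my-dataset"]:
        print(candidate, is_valid_dataset_config_name(candidate))
```

Because the name cannot be changed after the configuration is created, a check like this is cheapest to run before the create request is sent.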

Link the dataset to BigQuery after creating a dataset configuration. Linking a dataset to BigQuery creates a linked dataset in BigQuery for querying. You can link or unlink the dataset at any point.

For more information about the properties you set when creating or updating a dataset configuration, see the DatasetConfigs resource in the JSON API documentation.
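
The retention period example given earlier can be reproduced with simple date arithmetic. The sketch below is illustrative only: retention_window is a hypothetical helper, and the lower bound of one day is an assumption, since only the 90-day maximum is stated.

```python
from datetime import date, timedelta

MAX_RETENTION_DAYS = 90  # upper limit stated in the retention period property


def retention_window(created: date, retention_days: int, today: date) -> tuple[date, date]:
    """Return the (first, last) snapshot dates that the dataset retains on `today`."""
    # Assumption: at least one day must be retained.
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError("retention_days must be between 1 and 90")
    if today < created:
        raise ValueError("today is before the dataset creation date")
    # The window counts today as one of the retained days.
    earliest = today - timedelta(days=retention_days - 1)
    # Nothing can be retained from before the dataset was created.
    return max(created, earliest), today


if __name__ == "__main__":
    created = date(2023, 10, 1)
    print(retention_window(created, 30, date(2023, 10, 30)))
    print(retention_window(created, 30, date(2023, 10, 31)))
```

With a creation date of October 1, 2023 and a 30-day retention period, the function returns October 1 to October 30 on October 30, and October 2 to October 31 on October 31, matching the example.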

Supported locations

The following BigQuery locations are supported for creating linked datasets:

  • EU
  • US
  • asia-southeast1
  • europe-west1
  • us-central1
  • us-east1
  • us-east4

Dataset schema of metadata

The following sections describe the metadata fields included in datasets. For more information about BigQuery column modes, see Modes. The column modes determine how BigQuery stores and queries the data.

Bucket metadata

The following table describes the bucket metadata fields:

Metadata field Mode Type Description
snapshotTime NULLABLE TIMESTAMP The snapshotTime field stores the time of the bucket metadata snapshot refresh in RFC 3339 format.
name NULLABLE STRING The name of the bucket.
location NULLABLE STRING The location of the bucket. Object data for objects in the bucket resides in physical storage within this location.
project NULLABLE INTEGER The project number of the project the bucket belongs to.
storageClass NULLABLE STRING The bucket's default storage class.
public NULLABLE RECORD Deprecated. This field indicates whether a bucket was publicly accessible. Use iamConfiguration instead.
public.bucketPolicyOnly NULLABLE BOOLEAN Deprecated. This field, part of the public record, indicates whether uniform bucket-level access was enabled, which prevents access being granted through object-level ACLs.
public.publicAccessPrevention NULLABLE STRING Deprecated. This field, part of the public record, indicates whether public access to the bucket was prevented.
autoclass NULLABLE RECORD The bucket's Autoclass configuration, which, when enabled, controls the storage class of objects based on how and when the objects are accessed.
autoclass.enabled NULLABLE BOOLEAN Whether or not Autoclass is enabled.
autoclass.toggleTime NULLABLE TIMESTAMP The time at which Autoclass was last enabled or disabled for this bucket, in RFC 3339 format.
versioning NULLABLE BOOLEAN Whether or not the bucket has versioning enabled. For more information, see Object Versioning.
lifecycle NULLABLE BOOLEAN Whether or not the bucket has a lifecycle configuration. See lifecycle management for more information.
metageneration NULLABLE INTEGER The metadata generation of this bucket.
timeCreated NULLABLE TIMESTAMP The creation time of the bucket in RFC 3339 format.
tags NULLABLE RECORD Deprecated. This field contains user-defined key-value pairs associated with the bucket. Use resource tags instead.
tags.lastUpdatedTime NULLABLE TIMESTAMP Deprecated. This field, part of the tags record, indicates the last time the tags were updated.
tags.tagMap REPEATED RECORD Deprecated. This field, part of the tags record, contains the map of tag keys and values.
tags.tagMap.key NULLABLE STRING Deprecated. This field, part of the tags.tagMap record, represents the key of a tag.
tags.tagMap.value NULLABLE STRING Deprecated. This field, part of the tags.tagMap record, represents the value of a tag.
labels REPEATED RECORD User-provided bucket labels, in key-value pairs.
labels.key NULLABLE STRING An individual label entry.
labels.value NULLABLE STRING The label's value.
softDeletePolicy NULLABLE OBJECT The bucket's soft delete policy, which defines the period of time during which objects in the bucket are retained in a soft-deleted state after being deleted. Objects in a soft-deleted state cannot be permanently deleted, and are restorable until their hardDeleteTime.
softDeletePolicy.effectiveTime NULLABLE DATETIME The datetime at which the soft delete policy becomes effective, in RFC 3339 format. softDeletePolicy.effectiveTime is updated whenever softDeletePolicy.retentionDurationSeconds is increased.
softDeletePolicy.retentionDurationSeconds NULLABLE LONG The period of time during which a soft-deleted object is retained and cannot be permanently deleted, in seconds. The value must be greater than or equal to 604800 (7 days) and less than 7776000 (90 days). The value can also be set to 0, which disables the soft delete policy.
iamConfiguration NULLABLE RECORD The IAM configuration for a bucket.
iamConfiguration.uniformBucketLevelAccess NULLABLE RECORD The bucket's uniform bucket-level access configuration.
iamConfiguration.uniformBucketLevelAccess.enabled NULLABLE BOOLEAN Whether or not the bucket uses uniform bucket-level access.
iamConfiguration.publicAccessPrevention NULLABLE STRING The bucket's public access prevention status, which is either "inherited" or "enforced".
resourceTags REPEATED RECORD The bucket's tags. For more information, see Cloud Resource Manager API.
resourceTags.key NULLABLE STRING The resource tag key.
resourceTags.value NULLABLE STRING The resource tag value.
totalSize NULLABLE INTEGER The size of the bucket in bytes.
objectCount NULLABLE INTEGER Total number of objects in the bucket.
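
The labels and resourceTags columns are REPEATED RECORD fields with key and value subfields, so each row carries a list of key-value records rather than a single map. The following sketch shows one way to flatten such a list into a dictionary after the rows have been fetched. It assumes that each record arrives as a mapping with "key" and "value" entries; the helper name key_value_records_to_dict is hypothetical.

```python
from typing import Iterable, Mapping, Optional


def key_value_records_to_dict(
    records: Optional[Iterable[Mapping[str, Optional[str]]]],
) -> dict:
    """Flatten repeated {key, value} records into a plain dict."""
    result = {}
    for record in records or []:
        key = record.get("key")
        # Both subfields are NULLABLE, so skip records that have no key.
        if key is not None:
            result[key] = record.get("value")
    return result


if __name__ == "__main__":
    # Hypothetical label records for one bucket row.
    labels = [{"key": "env", "value": "prod"}, {"key": "team", "value": "data"}]
    print(key_value_records_to_dict(labels))
```

If the same key appears more than once, the last record wins in this sketch.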

Object metadata

The following table describes the object metadata fields:

Metadata field Mode Type Description
snapshotTime NULLABLE TIMESTAMP The snapshotTime field stores the time of the object metadata snapshot refresh in RFC 3339 format.
bucket NULLABLE STRING The name of the bucket containing this object.
location NULLABLE STRING The location of the bucket. Object data for objects in the bucket resides in physical storage within this location.
componentCount NULLABLE INTEGER Returned for composite objects only. Number of non-composite objects in the composite object. componentCount includes non-composite objects that were part of any composite objects used to compose the current object.
contentDisposition NULLABLE STRING Content-Disposition of the object data.
contentEncoding NULLABLE STRING Content-Encoding of the object data.
contentLanguage NULLABLE STRING Content-Language of the object data.
contentType NULLABLE STRING Content-Type of the object data.
crc32c NULLABLE INTEGER CRC32c checksum, as described in RFC 4960, Appendix B; encoded using base64 in big-endian byte order.
customTime NULLABLE TIMESTAMP A user-specified timestamp for the object in RFC 3339 format.
etag NULLABLE STRING HTTP 1.1 Entity tag for the object.
eventBasedHold NULLABLE BOOLEAN Whether or not the object is subject to an event-based hold.
generation NULLABLE INTEGER The content generation of this object.
md5Hash NULLABLE STRING MD5 hash of the data, encoded using base64. This field is not present for composite objects.
metadata REPEATED RECORD User-provided metadata, in key-value pairs.
metadata.key NULLABLE STRING An individual metadata entry.
metadata.value NULLABLE STRING The metadata value.
metageneration NULLABLE INTEGER The version of the metadata for this object at this generation.
name NULLABLE STRING The name of the object.
size NULLABLE INTEGER Content-Length of the data in bytes.
storageClass NULLABLE STRING Storage class of the object.
temporaryHold NULLABLE BOOLEAN Whether or not the object is subject to a temporary hold.
timeCreated NULLABLE TIMESTAMP The creation time of the object in RFC 3339 format.
timeDeleted NULLABLE TIMESTAMP The deletion time of the object in RFC 3339 format.
updated NULLABLE TIMESTAMP The modification time of the object metadata in RFC 3339 format.
timeStorageClassUpdated NULLABLE TIMESTAMP The time at which the object's storage class was last changed.
retentionExpirationTime NULLABLE TIMESTAMP The earliest time that the object can be deleted, which depends on any retention configuration set for the object and any retention policy set for the bucket that contains the object. The value for retentionExpirationTime is given in RFC 3339 format.
softDeleteTime NULLABLE DATETIME The time at which the object was soft deleted. Only available for objects in buckets with a soft delete policy.
hardDeleteTime NULLABLE DATETIME The time at which a soft-deleted object is permanently deleted and can no longer be restored. The value is the sum of the softDeleteTime value and the softDeletePolicy.retentionDurationSeconds value of the bucket. Only available for objects in buckets with a soft delete policy.
project NULLABLE INTEGER The project number of the project the bucket belongs to.
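
The hardDeleteTime field is described as the sum of the object's softDeleteTime and the bucket's softDeletePolicy.retentionDurationSeconds. As a minimal sketch of that arithmetic (the function name hard_delete_time and the sample timestamp are illustrative, not taken from a real dataset):

```python
from datetime import datetime, timedelta, timezone


def hard_delete_time(soft_delete_time: datetime, retention_duration_seconds: int) -> datetime:
    """Return softDeleteTime plus the bucket's retentionDurationSeconds."""
    return soft_delete_time + timedelta(seconds=retention_duration_seconds)


if __name__ == "__main__":
    soft_deleted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # 604800 seconds is the 7-day minimum mentioned in the bucket schema.
    print(hard_delete_time(soft_deleted, 604800).isoformat())
```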

Latest bucket and object metadata snapshot

The linked dataset exposes the latest snapshot of the bucket and object metadata through the following dedicated views:

  • The bucket_attributes_latest_snapshot_view provides the latest metadata for your Cloud Storage buckets. Its structure matches the Bucket metadata schema.

  • The object_attributes_latest_snapshot_view provides the latest metadata for your Cloud Storage objects. Its structure matches the Object metadata schema.
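
As an illustration of querying these views, the sketch below builds a SQL statement that lists the largest buckets in the latest snapshot, using the totalSize and objectCount fields from the bucket metadata schema. The project ID and linked dataset ID are hypothetical placeholders, and the function only returns the query text; running it (for example, through a BigQuery client) is left to the caller.

```python
def top_buckets_query(project_id: str, linked_dataset_id: str, limit: int = 10) -> str:
    """Build a query for the largest buckets in the latest snapshot view."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # The IDs are interpolated directly, so pass only trusted values.
    view = f"`{project_id}.{linked_dataset_id}.bucket_attributes_latest_snapshot_view`"
    return (
        "SELECT name, location, storageClass, totalSize, objectCount\n"
        f"FROM {view}\n"
        "ORDER BY totalSize DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # "my-project" and "my_linked_dataset" are placeholder names.
    print(top_buckets_query("my-project", "my_linked_dataset", limit=5))
```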

Project metadata

The project metadata is exposed as a view named project_attributes_view in the linked dataset:

Metadata field Mode Type Description
snapshotTime NULLABLE TIMESTAMP The snapshotTime field stores the time of the project metadata snapshot refresh in RFC 3339 format.
name NULLABLE STRING The name of the project.
id NULLABLE STRING The unique identifier for the project.
number NULLABLE NUMBER A numeric value associated with the project.

Dataset schema for events and errors

In the linked dataset, you can also view the snapshot processing events and errors in the events_view and error_attributes_view views. To learn how to troubleshoot the snapshot processing errors, see Troubleshoot dataset errors.

Events log

You can view event logs in the events_view view in the linked dataset:

Column name Mode Type Description
manifest.snapshotTime NULLABLE TIMESTAMP The time, in RFC 3339 format, at which the snapshot of the events was refreshed.
manifest.viewName NULLABLE STRING The name of the view that was refreshed.
manifest.location NULLABLE STRING The source location of the data that was refreshed.
globalManifest.snapshotTime NULLABLE TIMESTAMP The time, in RFC 3339 format, at which the snapshot of the events was refreshed.
eventTime NULLABLE STRING The time at which the event happened.
eventCode NULLABLE STRING The event code associated with the corresponding entry. The event code 1 refers to the manifest.viewName view being refreshed with all entries for the source location manifest.location within the snapshot manifest.snapshotTime. The event code 2 indicates that the dataset is refreshed with the bucket and object entries for all source locations. This refresh occurs within the snapshot globalManifest.snapshotTime.
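
The two event codes described above can be turned into readable messages once rows from events_view have been fetched. This is a sketch under assumptions: it expects each row as a dictionary in which manifest and globalManifest are nested dictionaries, and describe_event is a hypothetical helper name.

```python
def describe_event(row: dict) -> str:
    """Summarize one events_view row based on its eventCode."""
    # eventCode is a STRING column, so compare against string values.
    code = str(row.get("eventCode"))
    if code == "1":
        manifest = row.get("manifest") or {}
        return (
            f"View {manifest.get('viewName')} refreshed for location "
            f"{manifest.get('location')} in snapshot {manifest.get('snapshotTime')}"
        )
    if code == "2":
        global_manifest = row.get("globalManifest") or {}
        return (
            "Dataset refreshed for all source locations in snapshot "
            f"{global_manifest.get('snapshotTime')}"
        )
    return f"Unrecognized event code: {code}"


if __name__ == "__main__":
    # Sample row with made-up values.
    print(describe_event({"eventCode": "2", "globalManifest": {"snapshotTime": "2024-01-01T00:00:00Z"}}))
```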

Error codes

You can view error codes in the error_attributes_view view in the linked dataset:

Column name Mode Type Description
errorCode NULLABLE INTEGER The error code associated with this entry. For a list of valid values and how to resolve them, see Troubleshoot dataset errors.
errorSource NULLABLE STRING The source of the error. Valid value: CONFIGURATION_PREPROCESSING.
errorTime NULLABLE TIMESTAMP The time the error happened.
sourceGcsLocation NULLABLE STRING The source Cloud Storage location of the error. For projects this field is null because they are locationless.
bucketErrorRecord.bucketName NULLABLE STRING The name of the bucket involved in the error. You can use this information to debug a bucket error.
bucketErrorRecord.serviceAccount NULLABLE STRING The service account that needs permission to ingest objects from the bucket. You can use this information to debug a bucket error.
projectErrorRecord.projectNumber NULLABLE INTEGER The number of the project involved in the error. You can use this information to debug a project error.
projectErrorRecord.organizationName NULLABLE STRING The number of the organization the project must belong to in order to be processed. A value of 0 indicates that the dataset is not in the organization. You can use this information to debug a project error.

Troubleshoot dataset errors

To learn how to troubleshoot the snapshot processing errors that are logged into the error_attributes_view view in the linked dataset, see the following table:

Error Code | Error Case | Error Message | Troubleshooting
1 | Source project doesn't belong to the organization | Source project projectErrorRecord.projectNumber doesn't belong to the organization projectErrorRecord.organizationName. | Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
2 | Bucket authorization error | Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName. | Give service account bucketErrorRecord.serviceAccount Identity and Access Management (IAM) permissions to allow ingestion of objects for bucket bucketErrorRecord.bucketName. For more information, see Grant required permissions to the service agent.
3 | Destination project doesn't belong to the organization | Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName. | Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
4 | Source project doesn't have Storage Intelligence configured | Source project projectErrorRecord.projectNumber doesn't have Storage Intelligence configured. | Configure Storage Intelligence for the source project projectErrorRecord.projectNumber. For more information, see Configure and manage Storage Intelligence.
5 | Bucket doesn't have Storage Intelligence configured | Bucket bucketErrorRecord.bucketName doesn't have Storage Intelligence configured. | Configure Storage Intelligence for the bucket bucketErrorRecord.bucketName. For more information, see Configure and manage Storage Intelligence.
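
The troubleshooting table can be applied programmatically to rows read from error_attributes_view. The sketch below maps each error code to a short remediation hint and fills in the record fields named in the table. It assumes that rows arrive as dictionaries with nested bucketErrorRecord and projectErrorRecord dictionaries; troubleshooting_hint is a hypothetical helper, and the hint wording is abbreviated from the table.

```python
# Remediation hints keyed by errorCode, abbreviated from the table above.
_TROUBLESHOOTING = {
    1: "Add source project {project} to organization {org}.",
    2: "Grant service account {account} IAM permissions to ingest objects from bucket {bucket}.",
    3: "Add destination project {project} to organization {org}.",
    4: "Configure Storage Intelligence for source project {project}.",
    5: "Configure Storage Intelligence for bucket {bucket}.",
}


def troubleshooting_hint(row: dict) -> str:
    """Return a remediation hint for one error_attributes_view row."""
    template = _TROUBLESHOOTING.get(row.get("errorCode"))
    if template is None:
        return f"Unknown error code: {row.get('errorCode')}"
    bucket_record = row.get("bucketErrorRecord") or {}
    project_record = row.get("projectErrorRecord") or {}
    # Each template uses only the fields relevant to its error code.
    return template.format(
        project=project_record.get("projectNumber"),
        org=project_record.get("organizationName"),
        account=bucket_record.get("serviceAccount"),
        bucket=bucket_record.get("bucketName"),
    )


if __name__ == "__main__":
    # Sample row with a made-up bucket name.
    print(troubleshooting_hint({"errorCode": 5, "bucketErrorRecord": {"bucketName": "my-bucket"}}))
```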

Considerations

Consider the following for dataset configurations:

  • When you rename a folder in a bucket with hierarchical namespace enabled, the names of the objects in that bucket are updated. When the linked dataset ingests these object snapshots, it treats them as new entries.

  • Datasets are supported only in the BigQuery locations listed under Supported locations.
