Storage Insights datasets

The Storage Insights datasets feature helps you understand, organize, and manage your data at scale. You can choose an organization, or one or more projects or folders containing the buckets and objects whose metadata you want to view. A queryable metadata index for the included buckets and objects is made available as a BigQuery linked dataset.

To get insights about your Cloud Storage resources exported to BigQuery, use Storage Insights datasets. These insights can help you with data exploration, cost optimization, security enforcement, and governance implementation. Storage Insights datasets are available only with a Storage Intelligence subscription.

Overview

A Storage Insights dataset is a rolling snapshot of metadata for all the buckets and objects in one or more specified source projects within an organization. The information provided by datasets lets you better understand and routinely audit your Cloud Storage data.

To create a dataset, you first create a dataset configuration in a project. You can choose an organization, or one or more projects or folders containing the buckets and objects whose metadata you want to view. The dataset configuration generates datasets daily. Both dataset configurations and datasets are resources stored within Cloud Storage.

To view a dataset, you must first link the dataset to BigQuery.

Dataset configuration properties

When you create a dataset configuration, you set the following properties of the dataset:

  • Name: a name that's used to reference the dataset. Names serve as the identifier of dataset configurations and cannot be changed after the configuration is created. The name can contain up to 128 characters, using letters, numbers, and underscores, and must begin with a letter.

  • Description (optional): a description of the dataset. You can edit the description at any time.

  • Dataset scope: an organization, projects, or folders containing the buckets and objects for which you want metadata. You can specify projects or folders individually or as a CSV file, with each project or folder number on a separate line. You can specify up to 10,000 projects or folders in one dataset configuration.

  • Bucket filters (optional): filters used to include or exclude specific buckets from the dataset by bucket name or region.

  • Retention period: the number of days of data that the dataset captures and retains, including the creation date of the dataset. Datasets update with metadata every 24 hours and can retain data for up to 90 days. Data that falls outside of the retention window is automatically deleted. For example, suppose a dataset was created on October 1, 2023 with a retention window of 30 days. On October 30, the dataset reflects the past 30 days of data, from October 1 to October 30. On October 31, it reflects the data from October 2 to October 31. You can modify the retention window at any time.

  • Location: a location to store the dataset and its data. For example, us-central1. The location must be supported by BigQuery. We recommend that you select the location of your BigQuery tables, if you have any.

  • Service agent type: either a configuration-scoped service agent or a project-scoped service agent.

    Creating a dataset configuration provisions a service agent for you. Before datasets can be generated, the service agent must be granted the required permissions to read data from your Cloud Storage buckets.

    A project-scoped service agent can access and write datasets that are generated from all the dataset configurations in the project. For example, if you have multiple dataset configurations within a project, then you only need to grant required permissions to the project-scoped service agent once for it to be able to read and write datasets for all the dataset configurations within the project. For more information about the permissions required to read and write datasets, see Permissions. When a dataset configuration is deleted, the project-scoped service agent is not deleted.

    A configuration-scoped service agent can access and write only the datasets generated by its particular dataset configuration. This means that if you have multiple dataset configurations, you must grant the required permissions to each configuration-scoped service agent. When a dataset configuration is deleted, its configuration-scoped service agent is also deleted.
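The retention-window behavior described in the properties above can be sketched in Python. This is an illustrative helper, not part of any Cloud Storage API; it reproduces the October example from the retention period description:

```python
from datetime import date, timedelta

def retention_window(snapshot_date: date, retention_days: int) -> tuple[date, date]:
    """Return the (start, end) dates a dataset snapshot covers.

    The window includes the snapshot date itself, so a 30-day
    retention on October 30 covers October 1 through October 30.
    """
    start = snapshot_date - timedelta(days=retention_days - 1)
    return start, snapshot_date

# Example from the text: dataset created October 1, 2023, retention 30 days.
print(retention_window(date(2023, 10, 30), 30))  # Oct 1 through Oct 30
print(retention_window(date(2023, 10, 31), 30))  # Oct 2 through Oct 31
```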

Link the dataset to BigQuery after creating a dataset configuration. Linking a dataset to BigQuery creates a linked dataset in BigQuery for querying. You can link or unlink the dataset at any point.

For more information about the properties you set when creating or updating a dataset configuration, see the DatasetConfigs resource in the JSON API documentation.

Supported locations

The following BigQuery locations are supported for creating linked datasets:

  • EU
  • US
  • asia-southeast1
  • europe-west1
  • us-central1
  • us-east1
  • us-east4
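A minimal pre-flight check against this list might look like the following sketch. The set is copied from the list above; treat it as a snapshot of the documentation, not an authoritative source:

```python
# BigQuery locations that support linked datasets, per the list above.
SUPPORTED_LOCATIONS = {
    "EU", "US",
    "asia-southeast1", "europe-west1",
    "us-central1", "us-east1", "us-east4",
}

def is_supported_location(location: str) -> bool:
    """Check a candidate dataset location against the supported list."""
    return location in SUPPORTED_LOCATIONS

print(is_supported_location("us-central1"))   # True
print(is_supported_location("europe-west4"))  # False
```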

Dataset schema of metadata

The following metadata fields are included in datasets. For more information about BigQuery column modes, see Modes. The column modes determine how BigQuery stores and queries the data.

The snapshotTime field stores the time of the bucket metadata snapshot refresh in RFC 3339 format.
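For reference, an RFC 3339 timestamp like a snapshotTime value can be parsed with the Python standard library (the timestamp below is an invented example, not real data):

```python
from datetime import datetime, timezone

# A representative RFC 3339 timestamp with a UTC offset.
snapshot_time = "2023-10-30T00:00:00+00:00"
parsed = datetime.fromisoformat(snapshot_time)
print(parsed.date(), parsed.tzinfo)  # 2023-10-30 UTC
```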

Bucket metadata

Unless otherwise noted, you can find more detailed descriptions of the following bucket metadata fields by referring to the Buckets resource representation for the JSON API.

Metadata field Mode Type
snapshotTime NULLABLE TIMESTAMP
name NULLABLE STRING
location NULLABLE STRING
project NULLABLE INTEGER
storageClass NULLABLE STRING
versioning NULLABLE BOOLEAN
lifecycle NULLABLE BOOLEAN
metageneration NULLABLE INTEGER
timeCreated NULLABLE TIMESTAMP
public NULLABLE RECORD
public.bucketPolicyOnly NULLABLE BOOLEAN
public.publicAccessPrevention NULLABLE STRING
autoclass NULLABLE RECORD
autoclass.enabled NULLABLE BOOLEAN
autoclass.toggleTime NULLABLE TIMESTAMP
softDeletePolicy NULLABLE RECORD
softDeletePolicy.effectiveTime NULLABLE DATETIME
softDeletePolicy.retentionDurationSeconds NULLABLE INTEGER
tags* NULLABLE RECORD
tags.lastUpdatedTime NULLABLE TIMESTAMP
tags.tagMap REPEATED RECORD
tags.tagMap.key NULLABLE STRING
tags.tagMap.value NULLABLE STRING
labels REPEATED RECORD
labels.key NULLABLE STRING
labels.value NULLABLE STRING

* The bucket's tags. For more information, see Cloud Resource Manager API.
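As an illustration of how the bucket schema might be used, the following sketch runs a typical governance check over snapshot rows. The dictionaries stand in for rows shaped like the table above, and the sample data is invented:

```python
# Sample rows shaped like the bucket metadata schema above (invented data).
buckets = [
    {"name": "bucket-a", "storageClass": "STANDARD",
     "public": {"bucketPolicyOnly": True, "publicAccessPrevention": "enforced"}},
    {"name": "bucket-b", "storageClass": "NEARLINE",
     "public": {"bucketPolicyOnly": False, "publicAccessPrevention": "inherited"}},
]

# Flag buckets where public access prevention is not enforced.
unenforced = [b["name"] for b in buckets
              if b["public"]["publicAccessPrevention"] != "enforced"]
print(unenforced)  # ['bucket-b']
```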

Object metadata

Unless otherwise noted, you can find more detailed descriptions of the following object metadata fields by referring to the Objects resource representation for the JSON API.

Metadata field Mode Type
snapshotTime NULLABLE TIMESTAMP
bucket NULLABLE STRING
location NULLABLE STRING
componentCount NULLABLE INTEGER
contentDisposition NULLABLE STRING
contentEncoding NULLABLE STRING
contentLanguage NULLABLE STRING
contentType NULLABLE STRING
crc32c NULLABLE INTEGER
customTime NULLABLE TIMESTAMP
etag NULLABLE STRING
eventBasedHold NULLABLE BOOLEAN
generation NULLABLE INTEGER
md5Hash NULLABLE STRING
metageneration NULLABLE INTEGER
name NULLABLE STRING
size NULLABLE INTEGER
storageClass NULLABLE STRING
temporaryHold NULLABLE BOOLEAN
timeCreated NULLABLE TIMESTAMP
timeDeleted NULLABLE TIMESTAMP
updated NULLABLE TIMESTAMP
timeStorageClassUpdated NULLABLE TIMESTAMP
retentionExpirationTime NULLABLE TIMESTAMP
softDeleteTime NULLABLE DATETIME
hardDeleteTime NULLABLE DATETIME
metadata REPEATED RECORD
metadata.key NULLABLE STRING
metadata.value NULLABLE STRING
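A common use of the object schema is cost analysis, such as summing size by storageClass across a snapshot. The sketch below uses invented sample dictionaries shaped like rows from the table above:

```python
from collections import defaultdict

# Sample rows shaped like the object metadata schema above (invented data).
objects = [
    {"bucket": "bucket-a", "name": "logs/1", "size": 100, "storageClass": "STANDARD"},
    {"bucket": "bucket-a", "name": "logs/2", "size": 250, "storageClass": "STANDARD"},
    {"bucket": "bucket-b", "name": "cold/1", "size": 400, "storageClass": "ARCHIVE"},
]

# Total bytes stored, grouped by storage class.
bytes_by_class: dict[str, int] = defaultdict(int)
for obj in objects:
    bytes_by_class[obj["storageClass"]] += obj["size"]

print(dict(bytes_by_class))  # {'STANDARD': 350, 'ARCHIVE': 400}
```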

Project metadata

The project metadata is exposed as a view named project_attributes_view in the linked dataset:

Metadata field Mode Type
snapshotTime NULLABLE TIMESTAMP
name NULLABLE STRING
id NULLABLE STRING
number NULLABLE INTEGER

Dataset schema for events and errors

In the linked dataset, you can also view the snapshot processing events and errors in the events_view and error_attributes_view views. To learn how to troubleshoot the snapshot processing errors, see Troubleshoot dataset errors.

Events log

You can view event logs in the events_view view in the linked dataset:

Column name Mode Type Description
manifest.snapshotTime NULLABLE TIMESTAMP The time, in RFC 3339 format, at which the snapshot of the events was refreshed.
manifest.viewName NULLABLE STRING The name of the view that was refreshed.
manifest.location NULLABLE STRING The source location of the data that was refreshed.
eventTime NULLABLE STRING The time at which the event occurred.
eventCode NULLABLE STRING The event code associated with the corresponding entry. The event code 1 refers to the manifest.viewName view being refreshed with all entries for the source location manifest.location within snapshot manifest.snapshotTime.

Error codes

You can view error codes in the error_attributes_view view in the linked dataset:

Column name Mode Type Description
errorCode NULLABLE INTEGER The error code associated with this entry. For a list of valid values and how to resolve them, see Troubleshoot dataset errors.
errorSource NULLABLE STRING The source of the error. Valid value: CONFIGURATION_PREPROCESSING.
errorTime NULLABLE TIMESTAMP The time the error happened.
sourceGcsLocation NULLABLE STRING The source Cloud Storage location of the error. For project-related errors, this field is null because projects aren't associated with a location.
bucketErrorRecord.bucketName NULLABLE STRING The name of the bucket involved in the error. You can use this information to debug a bucket error.
bucketErrorRecord.serviceAccount NULLABLE STRING The service account that needs permission to ingest objects from the bucket. You can use this information to debug a bucket error.
projectErrorRecord.projectNumber NULLABLE INTEGER The number of the project involved in the error. You can use this information to debug a project error.
projectErrorRecord.organizationName NULLABLE STRING The name of the organization that the project must belong to in order to be processed. A value of 0 indicates that the dataset is not in the organization. You can use this information to debug a project error.

Troubleshoot dataset errors

To troubleshoot the snapshot processing errors that are logged in the error_attributes_view view in the linked dataset, see the following table:

Error Code Error Case Error Message Troubleshooting
1 Source project doesn't belong to the organization Source project projectErrorRecord.projectNumber doesn't belong to the organization projectErrorRecord.organizationName. Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
2 Bucket authorization error Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName. Give service account bucketErrorRecord.serviceAccount Identity and Access Management (IAM) permissions to allow ingestion of objects for bucket bucketErrorRecord.bucketName. For more information, see Grant required permissions to the service agent.
3 Destination project doesn't belong to the organization Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName. Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions about how to migrate a project between organizations, see Migrate projects between organizations.
4 Source project doesn't have Storage Intelligence configured. Source project projectErrorRecord.projectNumber doesn't have Storage Intelligence configured. Configure Storage Intelligence for the source project projectErrorRecord.projectNumber. For more information, see Configure and manage Storage Intelligence.
5 Bucket doesn't have Storage Intelligence configured. Bucket bucketErrorRecord.bucketName doesn't have Storage Intelligence configured. Configure Storage Intelligence for the bucket bucketErrorRecord.bucketName. For more information, see Configure and manage Storage Intelligence.
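When post-processing error_attributes_view rows, the table above can be encoded as a simple lookup keyed by errorCode. The sketch below paraphrases the error cases; it's an illustrative helper, not part of any API:

```python
# Error cases from the troubleshooting table above, keyed by errorCode.
ERROR_CASES = {
    1: "Source project doesn't belong to the organization",
    2: "Bucket authorization error",
    3: "Destination project doesn't belong to the organization",
    4: "Source project doesn't have Storage Intelligence configured",
    5: "Bucket doesn't have Storage Intelligence configured",
}

def describe_error(error_code: int) -> str:
    """Return a short description for a row's errorCode."""
    return ERROR_CASES.get(error_code, f"Unknown error code: {error_code}")

print(describe_error(2))  # Bucket authorization error
```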

Considerations

Consider the following for dataset configurations:

  • When you rename a folder in a bucket with hierarchical namespace enabled, the object names in that bucket are updated. When these object snapshots are ingested, they are treated as new entries in the linked dataset.

  • Datasets are supported only in the BigQuery locations listed in Supported locations.

What's next