The Storage Insights datasets feature helps you understand, organize, and manage your data at scale. You can choose an organization, or one or more projects or folders containing the buckets and objects whose metadata you want to index. A queryable metadata index for the included buckets and objects is made available as a BigQuery linked dataset.
To get insights about your Cloud Storage resources exported to BigQuery, use Storage Insights datasets. These insights can help you with data exploration, cost optimization, security enforcement, and governance. Storage Insights datasets are available only through the Storage Intelligence subscription.
Overview
A Storage Insights dataset is a rolling snapshot of metadata for all the buckets and objects within one or multiple specified source projects within an organization. The information provided by datasets lets you better understand and routinely audit your Cloud Storage data.
To create a dataset, you first create a dataset configuration in a project. You can choose an organization, or one or multiple projects or folders containing buckets and objects you want to view the metadata for. The dataset configuration generates datasets daily. Both dataset configurations and datasets are resources stored within Cloud Storage.
To view a dataset, you must first link the dataset to BigQuery.
Dataset configuration properties
When you create a dataset configuration, you set the following properties of the dataset:
Name: a name that's used to reference the dataset. Names identify dataset configurations and cannot be changed after the configuration is created. A name can contain up to 128 characters, using letters, numbers, and underscores, and must begin with a letter.
Description (optional): a description of the dataset. You can edit the description at any time.
Dataset scope: an organization, projects, or folders containing the buckets and objects for which you want metadata. You can specify projects or folders individually or as a CSV file, with each project or folder number on a separate line. You can specify up to 10,000 projects or folders in one dataset configuration.
Bucket filters (optional): filters used to include and exclude specific buckets from the dataset by bucket name or by regions.
Retention period: the number of days of data that the dataset captures and retains, including the creation date of the dataset. Datasets update with metadata every 24 hours and can retain data for up to 90 days. Data that falls outside the retention window is automatically deleted. For example, suppose you have a dataset that was created on October 1, 2023 with a retention window set to 30 days. On October 30, the dataset reflects the past 30 days of data, from October 1 to October 30. On October 31, the dataset reflects the data from October 2 to October 31. You can modify the retention window at any time.
Location: a location to store the dataset and its data. For example, us-central1. The location must be supported by BigQuery. We recommend that you select the location of your BigQuery tables, if you have any.
Service agent type: either a configuration-scoped service agent or a project-scoped service agent.
Creating a dataset configuration provisions a service agent for you. To generate datasets, the service agent must be granted the required permissions to read data from your Cloud Storage buckets.
A project-scoped service agent can access and write datasets that are generated from all the dataset configurations in the project. For example, if you have multiple dataset configurations within a project, then you only need to grant required permissions to the project-scoped service agent once for it to be able to read and write datasets for all the dataset configurations within the project. For more information about the permissions required to read and write datasets, see Permissions. When a dataset configuration is deleted, the project-scoped service agent is not deleted.
A configuration-scoped service agent can only access and write the dataset generated by the particular dataset configuration. This means if you have multiple dataset configurations, you'll need to grant required permissions to each configuration-scoped service agent. When a dataset configuration is deleted, the configuration-scoped service agent is deleted.
Link the dataset to BigQuery after creating a dataset configuration. Linking a dataset to BigQuery creates a linked dataset in BigQuery for querying. You can link or unlink the dataset at any point.
For more information about the properties you set when creating or updating a dataset configuration, see the DatasetConfigs resource in the JSON API documentation.
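As a quick illustration of the constraints described above, the following sketch validates candidate dataset configuration values locally before you submit them. The helper names and the minimum retention of one day are assumptions for illustration, not part of any Cloud Storage API.

```python
import re

# Constraints paraphrased from the dataset configuration properties above.
# A minimum retention of 1 day is an assumption; the docs only state the
# 90-day maximum.
MAX_NAME_LENGTH = 128
MAX_RETENTION_DAYS = 90
MAX_SCOPE_ENTRIES = 10_000

def validate_dataset_config(name: str, retention_days: int, scope: list) -> list:
    """Return a list of human-readable problems; empty means the values look valid."""
    problems = []
    # Names must begin with a letter and use only letters, numbers, underscores.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        problems.append("name must start with a letter and use only letters, "
                        "numbers, and underscores")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name must be at most {MAX_NAME_LENGTH} characters")
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        problems.append(f"retention period must be between 1 and {MAX_RETENTION_DAYS} days")
    if len(scope) > MAX_SCOPE_ENTRIES:
        problems.append(f"at most {MAX_SCOPE_ENTRIES} projects or folders "
                        "per configuration")
    return problems

print(validate_dataset_config("insights_config_1", 30, ["123456"]))  # []
```

A check like this only mirrors the documented limits; the service performs its own authoritative validation when the configuration is created.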
Supported locations
The following BigQuery locations are supported for creating linked datasets:
EU
US
asia-southeast1
europe-west1
us-central1
us-east1
us-east4
Dataset schema of metadata
The following metadata fields are included in datasets. For more information about BigQuery column modes, see Modes. The column modes determine how BigQuery stores and queries the data.
The snapshotTime field stores the time of the bucket metadata snapshot refresh in RFC 3339 format.
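RFC 3339 timestamps such as snapshotTime can be parsed with the Python standard library alone; this small sketch assumes a typical UTC-suffixed value.

```python
from datetime import datetime, timezone

def parse_snapshot_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as the snapshotTime column.

    datetime.fromisoformat accepts a trailing 'Z' directly on Python 3.11+;
    rewriting it as '+00:00' keeps the sketch working on older versions.
    """
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

ts = parse_snapshot_time("2023-10-01T06:00:00Z")
print(ts.astimezone(timezone.utc).date())  # 2023-10-01
```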
Bucket metadata
Unless otherwise noted, you can find more detailed descriptions of the following bucket metadata fields by referring to the Buckets resource representation for the JSON API.
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
name | NULLABLE | STRING |
location | NULLABLE | STRING |
project | NULLABLE | INTEGER |
storageClass | NULLABLE | STRING |
versioning | NULLABLE | BOOLEAN |
lifecycle | NULLABLE | BOOLEAN |
metageneration | NULLABLE | INTEGER |
timeCreated | NULLABLE | TIMESTAMP |
public | NULLABLE | RECORD |
public.bucketPolicyOnly | NULLABLE | BOOLEAN |
public.publicAccessPrevention | NULLABLE | STRING |
autoclass | NULLABLE | RECORD |
autoclass.enabled | NULLABLE | BOOLEAN |
autoclass.toggleTime | NULLABLE | TIMESTAMP |
softDeletePolicy | NULLABLE | RECORD |
softDeletePolicy.effectiveTime | NULLABLE | DATETIME |
softDeletePolicy.retentionDurationSeconds | NULLABLE | INTEGER |
tags* | NULLABLE | RECORD |
tags.lastUpdatedTime | NULLABLE | TIMESTAMP |
tags.tagMap | REPEATED | RECORD |
tags.tagMap.key | NULLABLE | STRING |
tags.tagMap.value | NULLABLE | STRING |
labels | REPEATED | RECORD |
labels.key | NULLABLE | STRING |
labels.value | NULLABLE | STRING |
* The bucket's tags. For more information, see Cloud Resource Manager API.
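Once the dataset is linked, the bucket snapshot can be queried with standard SQL. The sketch below only builds a query string that lists buckets not protected by public access prevention; the fully qualified view name passed in is a placeholder for your own linked dataset, and the view name `bucket_attributes_view` is an assumption for illustration.

```python
def public_buckets_query(table: str) -> str:
    """Build a SQL query listing buckets whose public access prevention
    isn't enforced, using the bucket metadata columns described above.

    `table` is a placeholder for the fully qualified linked-dataset view,
    e.g. "my_project.my_linked_dataset.bucket_attributes_view".
    """
    return f"""
SELECT name, location, storageClass, public.publicAccessPrevention
FROM `{table}`
WHERE public.publicAccessPrevention != 'enforced'
ORDER BY name
""".strip()

sql = public_buckets_query("my_project.my_linked_dataset.bucket_attributes_view")
print(sql)
```

You would run the resulting string in the BigQuery console or through a BigQuery client library against your own linked dataset.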
Object metadata
Unless otherwise noted, you can find more detailed descriptions of the following object metadata fields by referring to the Objects resource representation for the JSON API.
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
bucket | NULLABLE | STRING |
location | NULLABLE | STRING |
componentCount | NULLABLE | INTEGER |
contentDisposition | NULLABLE | STRING |
contentEncoding | NULLABLE | STRING |
contentLanguage | NULLABLE | STRING |
contentType | NULLABLE | STRING |
crc32c | NULLABLE | INTEGER |
customTime | NULLABLE | TIMESTAMP |
etag | NULLABLE | STRING |
eventBasedHold | NULLABLE | BOOLEAN |
generation | NULLABLE | INTEGER |
md5Hash | NULLABLE | STRING |
mediaLink | NULLABLE | STRING |
metageneration | NULLABLE | INTEGER |
name | NULLABLE | STRING |
selfLink | NULLABLE | STRING |
size | NULLABLE | INTEGER |
storageClass | NULLABLE | STRING |
temporaryHold | NULLABLE | BOOLEAN |
timeCreated | NULLABLE | TIMESTAMP |
timeDeleted | NULLABLE | TIMESTAMP |
updated | NULLABLE | TIMESTAMP |
timeStorageClassUpdated | NULLABLE | TIMESTAMP |
retentionExpirationTime | NULLABLE | TIMESTAMP |
softDeleteTime | NULLABLE | DATETIME |
hardDeleteTime | NULLABLE | DATETIME |
metadata | REPEATED | RECORD |
metadata.key | NULLABLE | STRING |
metadata.value | NULLABLE | STRING |
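As a quick illustration of what the object metadata enables, the following sketch totals object size per storage class over rows shaped like the columns above, mirroring what a GROUP BY storageClass query against the linked dataset would compute. The sample rows are invented for illustration.

```python
from collections import defaultdict

def size_by_storage_class(rows):
    """Sum object `size` per `storageClass`, mirroring a
    GROUP BY storageClass aggregation over the object metadata."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["storageClass"]] += row["size"]
    return dict(totals)

# Invented sample rows shaped like the object metadata columns above.
rows = [
    {"name": "a.csv", "storageClass": "STANDARD", "size": 1_000},
    {"name": "b.csv", "storageClass": "STANDARD", "size": 2_500},
    {"name": "c.bak", "storageClass": "ARCHIVE", "size": 9_000},
]
print(size_by_storage_class(rows))  # {'STANDARD': 3500, 'ARCHIVE': 9000}
```

In practice you would push this aggregation down into BigQuery rather than fetching rows, but the shape of the result is the same.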
Project metadata
The project metadata is exposed as a view named project_attributes_view in the linked dataset:
Metadata field | Mode | Type |
---|---|---|
snapshotTime | NULLABLE | TIMESTAMP |
name | NULLABLE | STRING |
id | NULLABLE | STRING |
number | NULLABLE | INTEGER |
Dataset schema for events and errors
In the linked dataset, you can also view the snapshot processing events and errors in the events_view and error_attributes_view views. To learn how to troubleshoot the snapshot processing errors, see Troubleshoot dataset errors.
Events log
You can view event logs in the events_view view in the linked dataset:
Column name | Mode | Type | Description |
---|---|---|---|
manifest.snapshotTime | NULLABLE | TIMESTAMP | The time, in RFC 3339 format, at which the snapshot of the events was refreshed. |
manifest.viewName | NULLABLE | STRING | The name of the view that was refreshed. |
manifest.location | NULLABLE | STRING | The source location of the data that was refreshed. |
eventTime | NULLABLE | STRING | The time at which the event occurred. |
eventCode | NULLABLE | STRING | The event code associated with the corresponding entry. Event code 1 indicates that the manifest.viewName view was refreshed with all entries for the source location manifest.location within snapshot manifest.snapshotTime. |
Error codes
You can view error codes in the error_attributes_view view in the linked dataset:
Column name | Mode | Type | Description |
---|---|---|---|
errorCode | NULLABLE | INTEGER | The error code associated with this entry. For a list of valid values and how to resolve them, see Troubleshoot dataset errors. |
errorSource | NULLABLE | STRING | The source of the error. Valid value: CONFIGURATION_PREPROCESSING. |
errorTime | NULLABLE | TIMESTAMP | The time at which the error occurred. |
sourceGcsLocation | NULLABLE | STRING | The source Cloud Storage location of the error. For projects, this field is null because projects don't have a location. |
bucketErrorRecord.bucketName | NULLABLE | STRING | The name of the bucket involved in the error. You can use this information to debug a bucket error. |
bucketErrorRecord.serviceAccount | NULLABLE | STRING | The service account that needs permission to ingest objects from the bucket. You can use this information to debug a bucket error. |
projectErrorRecord.projectNumber | NULLABLE | INTEGER | The number of the project involved in the error. You can use this information to debug a project error. |
projectErrorRecord.organizationName | NULLABLE | STRING | The number of the organization the project must belong to in order to be processed. A value of 0 indicates that the dataset is not in an organization. You can use this information to debug a project error. |
Troubleshoot dataset errors
To learn how to troubleshoot the snapshot processing errors that are logged in the error_attributes_view view in the linked dataset, see the following table:
Error code | Error case | Error message | Troubleshooting |
---|---|---|---|
1 | Source project doesn't belong to the organization | Source project projectErrorRecord.projectNumber doesn't belong to the organization projectErrorRecord.organizationName. | Add source project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions, see Migrate projects between organizations. |
2 | Bucket authorization error | Permission denied for ingesting objects for bucket bucketErrorRecord.bucketName. | Grant service account bucketErrorRecord.serviceAccount the Identity and Access Management (IAM) permissions needed to ingest objects from bucket bucketErrorRecord.bucketName. For more information, see Grant required permissions to the service agent. |
3 | Destination project doesn't belong to the organization | Destination project projectErrorRecord.projectNumber not in organization projectErrorRecord.organizationName. | Add destination project projectErrorRecord.projectNumber to organization projectErrorRecord.organizationName. For instructions, see Migrate projects between organizations. |
4 | Source project doesn't have Storage Intelligence configured | Source project projectErrorRecord.projectNumber doesn't have Storage Intelligence configured. | Configure Storage Intelligence for source project projectErrorRecord.projectNumber. For more information, see Configure and manage Storage Intelligence. |
5 | Bucket doesn't have Storage Intelligence configured | Bucket bucketErrorRecord.bucketName doesn't have Storage Intelligence configured. | Configure Storage Intelligence for bucket bucketErrorRecord.bucketName. For more information, see Configure and manage Storage Intelligence. |
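When post-processing error_attributes_view rows, the table above can be folded into a small lookup. The action summaries below paraphrase the troubleshooting column, and the helper itself is illustrative rather than part of any API.

```python
# Action summaries paraphrased from the troubleshooting table above.
TROUBLESHOOTING = {
    1: "Add the source project to the organization.",
    2: "Grant the service agent IAM permissions on the bucket.",
    3: "Add the destination project to the organization.",
    4: "Configure Storage Intelligence for the source project.",
    5: "Configure Storage Intelligence for the bucket.",
}

def troubleshoot(error_code: int) -> str:
    """Map an errorCode from error_attributes_view to a suggested action."""
    return TROUBLESHOOTING.get(error_code,
                               "Unknown error code; see Troubleshoot dataset errors.")

print(troubleshoot(2))  # Grant the service agent IAM permissions on the bucket.
```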
Considerations
Consider the following for dataset configurations:
When you rename a folder in a bucket with hierarchical namespace enabled, the names of the objects in that folder change. When these object snapshots are ingested, they appear as new entries in the linked dataset.
Datasets are supported only in the BigQuery locations listed in Supported locations.
What's next
- Use Storage Insights datasets.
- Learn about Storage Intelligence.
- Run SQL queries on the datasets in BigQuery.
- Learn about BigQuery analytics.