Use Storage Insights datasets

This page shows you how to create and manage Storage Insights datasets and dataset configurations. Learn more about Storage Insights datasets.

Before you begin

Before you begin creating and managing datasets and dataset configurations, follow the instructions in the subsequent subsections.

Get the required roles

To get the permissions that you need to create and manage datasets, ask your administrator to grant you the following IAM roles on your source projects:

To create, manage, and view dataset configurations: Storage Insights Admin (roles/storageinsights.admin)
To view, link, and unlink datasets:
- Storage Insights Analyst (roles/storageinsights.analyst)
- BigQuery Admin (roles/bigquery.admin)
To delete linked datasets: BigQuery Admin (roles/bigquery.admin)
To view and query datasets in BigQuery:
- Storage Insights Viewer (roles/storageinsights.viewer)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to create and manage datasets. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create and manage datasets:

Create dataset configuration: storageinsights.datasetConfigs.create
View dataset configuration:
- storageinsights.datasetConfigs.get
- storageinsights.datasetConfigs.list
Manage dataset configuration:
- storageinsights.datasetConfigs.update
- storageinsights.datasetConfigs.delete
Link to BigQuery dataset: storageinsights.datasetConfigs.linkDataset
Unlink from BigQuery dataset: storageinsights.datasetConfigs.unlinkDataset
Query BigQuery linked datasets: bigquery.jobs.create or bigquery.jobs.*

You might also be able to get these permissions with custom roles or other predefined roles.

Enable the Storage Insights API

Console

Enable the storageinsights.googleapis.com API

Command line

To enable the Storage Insights API in your current project, run the following command:

gcloud services enable storageinsights.googleapis.com

For more details about enabling services for a Google Cloud project, see Enabling and disabling services.

Configure Storage Intelligence

Ensure that Storage Intelligence is configured on the project, folder, or organization that you want to analyze with datasets.

Create a dataset configuration

To create a dataset configuration and generate a dataset, follow these steps. For more information about the fields that you can specify as you create the dataset configuration, see Dataset configuration properties.

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click Configure dataset.
In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset.
In the Define dataset scope section, do the following:
- Select one of the following options:
  - To get storage metadata for all projects in the current organization, select Include the organization.
  - To get storage metadata for all projects in the selected folders, select Include folders(Sub-organization/departments). For information about how to get folder IDs, see Viewing or listing folders and projects. To add folders, do the following:
    1. In the Folder 1 field, enter the folder ID.
    2. Optionally, to add multiple folder IDs, click + Add another folder.
  - To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find the project numbers, see Find the project name, number, and ID. To add projects, do the following:
    1. In the Project 1 field, enter the project number.
    2. Optionally, to add multiple project numbers, click + Add another project.
  - To add projects or folders in bulk, select Upload a list of projects /folders via CSV file. The CSV file must contain the project numbers or folder IDs that you want to include in the dataset.
- Specify if you want to automatically include future buckets in the selected resource.
- Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.
  
  You can include or exclude buckets from specific regions. For example, you can exclude buckets that are in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, if you want to exclude buckets that start with my-bucket, enter the my-bucket* prefix.
Click Continue.
In the Select retention period section, select a retention period for the data in the dataset.
In the Select location to store configured dataset section, select a location to store the dataset and dataset configuration.
In the Select service account type section, select a service agent type for your dataset. This service agent is created on your behalf when you create the dataset configuration. You can select one of the following service agents:
- Configuration-scoped service account: This service agent can only access and write the dataset generated by the particular dataset configuration.
- Project-scoped service account: This service agent can access and write datasets that are generated from all the dataset configurations in the project.
After the service agent is created, you must grant it the required permissions. For more information about these service agents, see Dataset configuration properties.
Click Configure. It can take up to 48 hours for you to view the first load of data in the linked datasets after you configure the dataset.

Command line

To create a dataset configuration, run the gcloud storage insights dataset-configs create command with the required flags:
```
gcloud storage insights dataset-configs create DATASET_CONFIG_ID \
  --source-projects=SOURCE_PROJECT_NUMBERS \
  --location=LOCATION \
  --retention-period-days=RETENTION_PERIOD_DAYS \
  --organization=ORGANIZATION_ID
```
Replace:
- DATASET_CONFIG_ID with the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are immutable. The name can contain up to 128 characters using letters, numbers, and underscores. The name must begin with a letter.
- SOURCE_PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. For example, 464036093014. You can specify one or multiple projects. To learn how to find your project number, see Find the project name, number, and ID.
  
  As an alternative to using the --source-projects flag, you can use the --source-projects-file=FILE_PATH flag, which lets you specify several project numbers at a time by uploading a file containing the project numbers. The file must be in CSV format and must be uploaded to Cloud Storage.
- LOCATION with the location the dataset configuration and dataset will be stored in.
- RETENTION_PERIOD_DAYS with the retention period for the data in the dataset.
- ORGANIZATION_ID with the resource ID of the organization that the source projects belong to. Source projects outside of the specified organization are excluded from the dataset configuration. To learn how to find your organization ID, see Getting your organization resource ID.
Optionally, you can use additional flags to finely configure the dataset:
- Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. If this flag is used, --exclude-buckets can't be used.
- Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. If this flag is used, --include-buckets can't be used.
- Use --project=DESTINATION_PROJECT_ID to specify a project to use for storing your dataset configuration and generated dataset. If this flag is unused, the destination project will be your active project. For more information about project IDs, see Creating and managing projects.
- Use --auto-add-new-buckets to automatically include any buckets that get added to source projects in the future.
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --identity=IDENTITY_TYPE to specify the type of service agent that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, defaults to IDENTITY_TYPE_PER_CONFIG.
- Use --description=DESCRIPTION to write a description for the dataset configuration.
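
As a rough illustration of how the required and optional flags combine, the following sketch wraps one create call in a shell function. The configuration name my_dataset_config, the organization ID 123456789012, the location, and the description are hypothetical placeholders; the project number is the example number used above. The DRY_RUN variable is a convention of this sketch, not a gcloud feature: when it is set, the command is printed instead of run.

```shell
# Hypothetical example values; replace them with your own before running.
create_example_config() {
  # When DRY_RUN is set, "echo" is prefixed so the command is only printed.
  ${DRY_RUN:+echo} gcloud storage insights dataset-configs create my_dataset_config \
    --source-projects=464036093014 \
    --location=us-central1 \
    --retention-period-days=90 \
    --organization=123456789012 \
    --auto-add-new-buckets \
    --description="Example dataset configuration"
}

# Preview the command without calling gcloud:
#   DRY_RUN=1 create_example_config
```

Previewing the command first is a cheap way to confirm the flag values before a configuration is created.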

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

{
  "organizationNumber": "ORGANIZATION_ID",
  "sourceProjects": {
    "project_numbers": ["PROJECT_NUMBERS", ...]
  },
  "retentionPeriodDays": "RETENTION_PERIOD_DAYS",
  "identity": {
    "type": "IDENTITY_TYPE"
  }
}

Replace:

ORGANIZATION_ID with the resource ID of the organization to which the source projects belong. To learn how to find your organization ID, see Getting your organization resource ID.
PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified as a list of strings.

Alternatively, you can add an organization, or one or multiple folders containing buckets and objects you want to update the metadata for. To include folders or organizations, use the sourceFolders or organizationScope field respectively. For more information, see the DatasetConfig reference.
RETENTION_PERIOD_DAYS with the number of days of data to capture in the dataset snapshot. For example, 90.
IDENTITY_TYPE with the type of service account that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT.
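
The request body can also be generated from a script, which helps keep the project numbers formatted as a list of strings. The following is a minimal sketch, not an official tool: dataset_config_body is a hypothetical helper name, the values in the usage comment are placeholders, and the field names are copied from the JSON file shown above.

```shell
# Prints a DatasetConfig request body to stdout.
# Usage: dataset_config_body ORG_ID RETENTION_DAYS IDENTITY_TYPE PROJECT_NUMBER...
# Assumption: values are inserted verbatim, so they must not contain
# double quotes or backslashes.
dataset_config_body() {
  org_id=$1 retention_days=$2 identity_type=$3
  shift 3

  # Join the remaining arguments into a comma-separated list of JSON strings.
  projects=""
  for number in "$@"; do
    projects="${projects:+$projects, }\"$number\""
  done

  cat <<EOF
{
  "organizationNumber": "$org_id",
  "sourceProjects": {
    "project_numbers": [$projects]
  },
  "retentionPeriodDays": "$retention_days",
  "identity": {
    "type": "$identity_type"
  }
}
EOF
}

# Example (placeholder values):
#   dataset_config_body 123456789012 90 IDENTITY_TYPE_PER_CONFIG 464036093014 > config.json
```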

To create the dataset configuration, use cURL to call the JSON API with a Create DatasetConfig request:
```
curl -X POST --data-binary @JSON_FILE_NAME \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs?datasetConfigId=DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- JSON_FILE_NAME with the path to the JSON file you created in the previous step. Alternatively, you can pass an instance of DatasetConfig in the request body.
- PROJECT_ID with the ID of the project that the dataset configuration and dataset will belong to.
- LOCATION with the location the dataset and dataset configuration will reside in. For example, us-central1.
- DATASET_CONFIG_ID with the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are immutable. The name can contain up to 128 characters using letters, numbers, and underscores. The name must begin with a letter.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
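
A candidate name can be checked locally against the naming rule before you send the request. The sketch below encodes only the rule stated above (begins with a letter; letters, numbers, and underscores; at most 128 characters). is_valid_dataset_config_id is a hypothetical helper, and the service might enforce constraints that this check doesn't cover.

```shell
# Returns 0 if the argument satisfies the documented naming rule, 1 otherwise.
is_valid_dataset_config_id() {
  id=$1

  # Length must be between 1 and 128 characters.
  [ "${#id}" -ge 1 ] && [ "${#id}" -le 128 ] || return 1

  # The first character must be a letter.
  case $id in
    [A-Za-z]*) ;;
    *) return 1 ;;
  esac

  # Reject any character other than letters, digits, and underscores.
  case $id in
    *[!A-Za-z0-9_]*) return 1 ;;
  esac

  return 0
}

# Example:
#   is_valid_dataset_config_id my_dataset_config && echo "name looks valid"
```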

To troubleshoot snapshot processing errors that are logged in error_attributes_view, see Storage Insights dataset errors.

Grant the required permissions to the service agent

Google Cloud creates a configuration-scoped or project-scoped service agent on your behalf when you create a dataset configuration. The service agent follows the naming format service-PROJECT_NUMBER@gcp-sa-storageinsights.iam.gserviceaccount.com and appears on the IAM page of the Google Cloud console when you select the Include Google-provided role grants checkbox. You can also find the name of the service agent by viewing the DatasetConfig resource using the JSON API.

To enable Storage Insights to generate and write datasets, ask your administrator to grant the service agent the Storage Insights Collector Service role (roles/storage.insightsCollectorService) on the organization that contains the source projects. If you use configuration-scoped service agents, this role must be granted to the service agent of every dataset configuration that you want data from. If you use a project-scoped service agent, the role needs to be granted only once for the service agent to read and write datasets for all dataset configurations within the project.

For instructions about granting roles on projects, see Manage access.
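
The grant described above can be scripted. The following sketch assumes that your service agent follows the naming format shown above and that you have permission to change the organization's IAM policy; service_agent_email and grant_collector_role are hypothetical helper names. It relies on the gcloud organizations add-iam-policy-binding command. If your service agent has a different name, use the name reported by the DatasetConfig resource instead.

```shell
# Prints the service agent email for a project number, using the naming
# format given in this section (an assumption for your particular agent).
service_agent_email() {
  printf 'service-%s@gcp-sa-storageinsights.iam.gserviceaccount.com\n' "$1"
}

# Grants the Storage Insights Collector Service role on an organization.
# Usage: grant_collector_role ORGANIZATION_ID PROJECT_NUMBER
grant_collector_role() {
  organization_id=$1 project_number=$2
  gcloud organizations add-iam-policy-binding "$organization_id" \
    --member="serviceAccount:$(service_agent_email "$project_number")" \
    --role="roles/storage.insightsCollectorService"
}

# Example (placeholder IDs):
#   grant_collector_role 123456789012 464036093014
```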

Link a dataset

To link a dataset to BigQuery, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click the name of the dataset configuration that generated the dataset you want to link.
In the BigQuery linked dataset section, click Link dataset to link your dataset.

Command line

To link a dataset to BigQuery, run the gcloud storage insights dataset-configs create-link command:
```
gcloud storage insights dataset-configs create-link DATASET_CONFIG_ID --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
```
gcloud storage insights dataset-configs create-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

{
  "name": "DATASET_NAME"
}

Replace:

DATASET_NAME with the name of the dataset you want to link. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.

Use cURL to call the JSON API with a linkDataset DatasetConfig request:
```
curl --request POST --data-binary @JSON_FILE_NAME \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:linkDataset" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location in which the dataset and dataset configuration reside. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
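
The link and unlink requests address the same dataset configuration path as the other JSON API calls, with a custom method name appended after a colon. The sketch below builds that URL so the path is assembled in one place. dataset_config_url is a hypothetical helper, the values in the usage comment are placeholders, and gcloud auth print-access-token is shown as one way to obtain the access token.

```shell
# Prints the JSON API URL for a dataset configuration.
# Usage: dataset_config_url PROJECT_ID LOCATION DATASET_CONFIG_ID [METHOD]
# METHOD is an optional custom method such as linkDataset or unlinkDataset.
dataset_config_url() {
  url="https://storageinsights.googleapis.com/v1/projects/$1/locations/$2/datasetConfigs/$3"
  if [ -n "${4:-}" ]; then
    url="${url}:$4"
  fi
  printf '%s\n' "$url"
}

# Example (placeholder values; link.json is the JSON file created earlier):
#   curl --request POST --data-binary @link.json \
#     "$(dataset_config_url my-project us-central1 my_dataset_config linkDataset)" \
#     --header "Authorization: Bearer $(gcloud auth print-access-token)" \
#     --header "Content-Type: application/json"
```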

View and query linked datasets

To view and query linked datasets, follow these steps:

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights

A list of dataset configurations that are created in your project appears.
Click the BigQuery linked dataset of the dataset configuration that you want to view.

The BigQuery linked dataset appears in the Google Cloud console. For information about the dataset schema of metadata, see Dataset schema of metadata.
You can query tables and views in your linked datasets in the same way you would query any other BigQuery table.
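
For example, a linked dataset can be queried from the command line with the bq tool. The following sketch only assembles a query against error_attributes_view, the view mentioned earlier on this page; error_view_query is a hypothetical helper, and the project and dataset names in the usage comment are placeholders for your own linked dataset.

```shell
# Prints a GoogleSQL query that samples rows from error_attributes_view.
# Usage: error_view_query PROJECT_ID LINKED_DATASET [ROW_LIMIT]
error_view_query() {
  printf 'SELECT * FROM `%s.%s.error_attributes_view` LIMIT %s\n' "$1" "$2" "${3:-10}"
}

# Example (placeholder names):
#   bq query --use_legacy_sql=false "$(error_view_query my-project my_linked_dataset)"
```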

Unlink a dataset

To stop the dataset configuration from publishing to the BigQuery dataset, unlink the dataset. To unlink a dataset, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click the name of the dataset configuration that generated the dataset you want to unlink.
In the BigQuery linked dataset section, click Unlink dataset to unlink your dataset.

Command line

To unlink the dataset, run the gcloud storage insights dataset-configs delete-link command:
```
gcloud storage insights dataset-configs delete-link DATASET_CONFIG_ID --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
```
gcloud storage insights dataset-configs delete-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following information:

{
  "name": "DATASET_NAME"
}

Replace:

DATASET_NAME with the name of the dataset you want to unlink. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.

Use cURL to call the JSON API with an unlinkDataset DatasetConfig request:
```
curl --request POST --data-binary @JSON_FILE_NAME \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:unlinkDataset" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.

View a dataset configuration

To view a dataset configuration, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click the name of the dataset configuration you want to view.

The dataset configuration details are displayed.

Command line

To describe a dataset configuration, run the gcloud storage insights dataset-configs describe command:
```
gcloud storage insights dataset-configs describe DATASET_CONFIG_ID \
  --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
```
gcloud storage insights dataset-configs describe projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to view.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a Get DatasetConfig request:
```
curl -X GET \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.

List dataset configurations

To list the dataset configurations in a project, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights

The list of dataset configurations is displayed.

Command line

To list dataset configurations in a project, run the gcloud storage insights dataset-configs list command:
```
gcloud storage insights dataset-configs list --location=LOCATION
```
Replace:
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
You can use the following optional flags to specify the behavior of the listing call:
- Use --page-size to specify the maximum number of results to return per page.
- Use --filter=FILTER to filter results. For more information on how to use the --filter flag, run gcloud topic filters and refer to the documentation.
- Use --sort-by=SORT_BY_VALUE to specify a comma-separated list of resource field key names to sort by. For example, --sort-by=DATASET_CONFIG_NAME.

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a List DatasetConfigs request:
```
curl -X GET \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.

Update a dataset configuration

To update a dataset configuration, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click the name of the dataset configuration you want to update.
In the Dataset configuration tab that appears, click Edit to update the fields.

Command line

To update a dataset configuration, run the gcloud storage insights dataset-configs update command:
```
gcloud storage insights dataset-configs update DATASET_CONFIG_ID \
  --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
Use the following flags to update properties of the dataset configuration:
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --retention-period-days=DAYS to specify the moving number of days of data to capture in the dataset snapshot. For example, 90.
- Use --description=DESCRIPTION to write a description for the dataset configuration.

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Create a JSON file that contains the following optional information:

{
  "organization_number": "ORGANIZATION_ID",
  "source_projects": {
    "project_numbers": ["PROJECT_NUMBERS", ...]
  },
  "retention_period_days": "RETENTION_PERIOD"
}

Replace:

ORGANIZATION_ID with the resource ID of the organization to which the source projects belong.
PROJECT_NUMBERS with the project numbers you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified in a list format.
RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.

To update the dataset configuration, use cURL to call the JSON API with a Patch DatasetConfig request:
```
curl -X PATCH --data-binary @JSON_FILE_NAME \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID?updateMask=RETENTION_PERIOD" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- JSON_FILE_NAME with the path to the JSON file you created in the previous step.
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration you want to update.
- RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.

Delete a dataset configuration

To delete a dataset configuration, complete the following steps:

Console

In the Google Cloud console, go to the Cloud Storage Storage Insights page.

Go to Storage Insights
Click the name of the dataset configuration you want to delete.
Click Delete.

Command line

To delete a dataset configuration, run the gcloud storage insights dataset-configs delete command:
```
gcloud storage insights dataset-configs delete DATASET_CONFIG_ID \
  --location=LOCATION
```
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
Optionally, use the following flags with the delete command:
- Use --auto-delete-link to unlink the dataset that was generated from the dataset configuration you want to delete. You must unlink a dataset before you can delete the dataset configuration that generated the dataset.
- Use --retention-period-days=DAYS to specify the number of days of data to capture in the dataset snapshot. For example, 90.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
```
gcloud storage insights dataset-configs delete projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
```

REST APIs

JSON API

Have the gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.

Use cURL to call the JSON API with a Delete DatasetConfig request:
```
curl -X DELETE \
"https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
  --header "Authorization: Bearer ACCESS_TOKEN" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json"
```
Replace:
- PROJECT_ID with the ID of the project that the dataset configuration belongs to.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
- DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
- ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
