This page shows you how to create and manage Storage Insights datasets and dataset configurations. Learn more about Storage Insights datasets.
Before you begin
Before you begin creating and managing datasets and dataset configurations, follow the instructions in the subsequent subsections.
Get the required roles
To get the permissions that you need to create and manage datasets, ask your administrator to grant you the following IAM roles on your source projects:
- To create, manage, and view dataset configurations: Storage Insights Admin (roles/storageinsights.admin)
- To view, link, and unlink datasets:
  - Storage Insights Analyst (roles/storageinsights.analyst)
  - BigQuery Admin (roles/bigquery.admin)
- To delete linked datasets: BigQuery Admin (roles/bigquery.admin)
- To view and query datasets in BigQuery:
  - Storage Insights Viewer (roles/storageinsights.viewer)
  - BigQuery Job User (roles/bigquery.jobUser)
  - BigQuery Data Viewer (roles/bigquery.dataViewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create and manage datasets. The exact permissions that are required are listed in the Required permissions section:
Required permissions
The following permissions are required to create and manage datasets:
- Create dataset configuration: storageinsights.datasetConfigs.create
- View dataset configuration:
  - storageinsights.datasetConfigs.get
  - storageinsights.datasetConfigs.list
- Manage dataset configuration:
  - storageinsights.datasetConfigs.update
  - storageinsights.datasetConfigs.delete
- Link to BigQuery dataset: storageinsights.datasetConfigs.linkDataset
- Unlink from BigQuery dataset: storageinsights.datasetConfigs.unlinkDataset
- Query BigQuery linked datasets: bigquery.jobs.create or bigquery.jobs.*
You might also be able to get these permissions with custom roles or other predefined roles.
Enable the Storage Insights API
Command line
To enable the Storage Insights API in your current project, run the following command:
gcloud services enable storageinsights.googleapis.com
For more details about enabling services for a Google Cloud project, see Enabling and disabling services.
Configure Storage Intelligence
Ensure that Storage Intelligence is configured on the project, folder, or organization that you want to analyze with datasets.
Create a dataset configuration
To create a dataset configuration and generate a dataset, follow these steps. For more information about the fields that you can specify as you create the dataset configuration, see Dataset configuration properties.
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click Configure dataset.
- In the Name your dataset section, enter a name for your dataset. Optionally, enter a description for the dataset.
- In the Define dataset scope section, do the following:
  - Select one of the following options:
    - To get storage metadata for all projects in the current organization, select Include the organization.
    - To get storage metadata for all projects in the selected folders, select Include folders(Sub-organization/departments). For information about how to get folder IDs, see Viewing or listing folders and projects. To add folders, do the following:
      - In the Folder 1 field, enter the folder ID.
      - Optionally, to add multiple folder IDs, click + Add another folder.
    - To get storage metadata for the selected projects, select Include projects by providing project numbers. To learn how to find the project numbers, see Find the project name, number, and ID. To add projects, do the following:
      - In the Project 1 field, enter the project number.
      - Optionally, to add multiple project numbers, click + Add another project.
    - To add projects or folders in bulk, select Upload a list of projects /folders via CSV file. The CSV file must contain the project numbers or folder IDs that you want to include in the dataset.
  - Specify whether you want to automatically include future buckets in the selected resource.
  - Optionally, to specify filters on buckets based on regions and bucket prefixes, expand the Filters (optional) section. Filters are applied additively on buckets.
    You can include or exclude buckets from specific regions. For example, you can exclude buckets that are in the me-central1 and me-central2 regions. You can also include or exclude buckets by prefix. For example, if you want to exclude buckets that start with my-bucket, enter the my-bucket* prefix.
- Click Continue.
- In the Select retention period section, select a retention period for the data in the dataset.
- In the Select location to store configured dataset section, select a location to store the dataset and dataset configuration.
- In the Select service account type section, select a service agent type for your dataset. This service agent is created on your behalf when you create the dataset configuration. You can select one of the following service agents:
  - Configuration-scoped service account: This service agent can only access and write the dataset generated by the particular dataset configuration.
  - Project-scoped service account: This service agent can access and write datasets that are generated from all the dataset configurations in the project.
  After the service agent is created, you must grant it the required permissions. For more information about these service agents, see Dataset configuration properties.
- Click Configure.
Command line
To create a dataset configuration, run the gcloud storage insights dataset-configs create command with the required flags:
    gcloud storage insights dataset-configs create DATASET_CONFIG_ID \
      --source-projects=SOURCE_PROJECT_NUMBERS \
      --location=LOCATION \
      --retention-period-days=RETENTION_PERIOD_DAYS \
      --organization=ORGANIZATION_ID
Replace:
- DATASET_CONFIG_ID with the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are mutable. The name can contain up to 128 characters using letters, numbers, and underscores.
- SOURCE_PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. For example, 464036093014. You can specify one or multiple projects. To learn how to find your project number, see Find the project name, number, and ID.
  As an alternative to using the --source-projects flag, you can use the --source-projects-file=FILE_PATH flag, which lets you specify several project numbers at a time by uploading a file containing the project numbers. The file must be in CSV format and must be uploaded to Cloud Storage.
- LOCATION with the location the dataset configuration and dataset will be stored in.
- RETENTION_PERIOD_DAYS with the retention period for the data in the dataset.
- ORGANIZATION_ID with the resource ID of the organization the source projects belong to. Source projects outside of the specified location are excluded from the dataset configuration. To learn how to find your organization ID, see Getting your organization resource ID.
Optionally, you can use additional flags to fine-tune the dataset configuration:
- Use --include-buckets=BUCKET_NAMES_OR_REGEX to include specific buckets by name or regular expression. If this flag is used, --exclude-buckets can't be used.
- Use --exclude-buckets=BUCKET_NAMES_OR_REGEX to exclude specific buckets by name or regular expression. If this flag is used, --include-buckets can't be used.
- Use --project=DESTINATION_PROJECT_ID to specify a project to use for storing your dataset configuration and generated dataset. If this flag is unused, the destination project will be your active project. For more information about project IDs, see Creating and managing projects.
- Use --auto-add-new-buckets to automatically include any buckets that get added to source projects in the future.
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --identity=IDENTITY_TYPE to specify the type of service agent that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT. If unspecified, defaults to IDENTITY_TYPE_PER_CONFIG.
- Use --description=DESCRIPTION to write a description for the dataset configuration.
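The naming rules for DATASET_CONFIG_ID can be checked locally before running the create command. The following is a minimal sketch, not part of the gcloud CLI: the function names are hypothetical, and the rule that the name must begin with a letter is taken from the JSON API description on this page and assumed to apply here as well. The second function only prints the command for review; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the ID starts with a letter, contains only
# letters, numbers, and underscores, and is at most 128 characters long.
is_valid_dataset_config_id() {
  case "$1" in
    ''|[!A-Za-z]*|*[!A-Za-z0-9_]*) return 1 ;;
  esac
  [ "${#1}" -le 128 ]
}

# Hypothetical helper: prints (does not run) the create command.
# $1: dataset config ID, $2: value for --source-projects,
# $3: location, $4: retention period in days, $5: organization ID
print_create_command() {
  is_valid_dataset_config_id "$1" || {
    echo "invalid dataset configuration ID: $1" >&2
    return 1
  }
  printf 'gcloud storage insights dataset-configs create %s --source-projects=%s --location=%s --retention-period-days=%s --organization=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

For example, print_create_command my_config 464036093014 us-central1 90 123456789 prints a command that you can inspect and then run yourself.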
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Create a JSON file that contains the following information:
    {
      "organization_number": "ORGANIZATION_ID",
      "source_projects": {
        "project_numbers": ["PROJECT_NUMBERS", ...]
      },
      "retention_period_days": "RETENTION_PERIOD_DAYS",
      "identity": {
        "type": "IDENTITY_TYPE"
      }
    }
  Replace:
  - ORGANIZATION_ID with the resource ID of the organization to which the source projects belong. To learn how to find your organization ID, see Getting your organization resource ID.
  - PROJECT_NUMBERS with the numbers of the projects you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified as a list of strings.
  - RETENTION_PERIOD_DAYS with the number of days of data to capture in the dataset snapshot. For example, 90.
  - IDENTITY_TYPE with the type of service account that gets created alongside the dataset configuration. Values are IDENTITY_TYPE_PER_CONFIG or IDENTITY_TYPE_PER_PROJECT.
- To create the dataset configuration, use cURL to call the JSON API with a Create DatasetConfig request:
    curl -X POST --data-binary @JSON_FILE_NAME \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs?datasetConfigId=DATASET_CONFIG_ID" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - JSON_FILE_NAME with the path to the JSON file you created in the previous step. Alternatively, you can pass an instance of DatasetConfig in the request body.
  - PROJECT_ID with the ID of the project that the dataset configuration and dataset will belong to.
  - LOCATION with the location the dataset and dataset configuration will reside in. For example, us-central1.
  - DATASET_CONFIG_ID with the name you want to give to your dataset configuration. Names are used as the identifier of dataset configurations and are mutable. The name can contain up to 128 characters using letters, numbers, and underscores. The name must begin with a letter.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
To troubleshoot snapshot processing errors that are logged in error_attributes_view, see Storage Insights dataset errors.
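The JSON request body for the Create DatasetConfig request can be generated from shell variables rather than written by hand. The following is a sketch under stated assumptions: the function names are hypothetical, the field names are the ones shown in the JSON file on this page, and the values are assumed to be plain numbers or identifiers, so no JSON escaping is performed.

```shell
#!/bin/sh
# Hypothetical helper: prints its arguments as a JSON array of strings.
json_string_array() {
  out=''
  for item in "$@"; do
    out="${out}${out:+, }\"${item}\""
  done
  printf '[%s]' "$out"
}

# Hypothetical helper: prints the request body to standard output.
# $1: organization ID, $2: retention period in days, $3: identity type,
# remaining arguments: source project numbers
write_dataset_config_body() {
  org=$1
  days=$2
  identity=$3
  shift 3
  printf '{\n  "organization_number": "%s",\n  "source_projects": {\n    "project_numbers": %s\n  },\n  "retention_period_days": "%s",\n  "identity": {\n    "type": "%s"\n  }\n}\n' \
    "$org" "$(json_string_array "$@")" "$days" "$identity"
}
```

Redirect the output to a file, for example write_dataset_config_body 123456789 90 IDENTITY_TYPE_PER_CONFIG 111 222 > body.json, and pass that file as JSON_FILE_NAME.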
Grant the required permissions to the service agent
Google Cloud creates a configuration-scoped or project-scoped service agent on your behalf when you create a dataset configuration. The service agent follows the naming format service-PROJECT_NUMBER@gcp-sa-storageinsights.iam.gserviceaccount.com and appears on the IAM page of the Google Cloud console when you select the Include Google-provided role grants checkbox. You can also find the name of the service agent by viewing the DatasetConfig resource using the JSON API.
To enable Storage Insights to generate and write datasets, ask your administrator to grant the service agent the Storage Insights Collector Service role (roles/storage.insightsCollectorService) on the organization that contains the source projects. This role must be granted to every configuration-scoped service agent that gets created for each dataset configuration you want data from. If you're using a project-scoped service agent, this role needs to be granted only once for the service agent to be able to read and write datasets for all dataset configurations within the project.
For instructions about granting roles on projects, see Manage access.
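As an illustration of the role grant described in this section, the following sketch builds the service agent email from the naming format given here and prints a gcloud organizations add-iam-policy-binding command for an administrator to review. The function names are hypothetical, and the use of that gcloud command is an assumption that this page doesn't confirm. The email of a configuration-scoped service agent might not follow this format, so prefer the name shown on the IAM page or in the DatasetConfig resource.

```shell
#!/bin/sh
# Hypothetical helper: builds the service agent email from a project number,
# using the naming format given in this section.
service_agent_email() {
  printf 'service-%s@gcp-sa-storageinsights.iam.gserviceaccount.com\n' "$1"
}

# Hypothetical helper: prints (does not run) the command that grants the
# Storage Insights Collector Service role at the organization level.
# $1: organization ID, $2: project number
print_grant_command() {
  printf 'gcloud organizations add-iam-policy-binding %s --member=serviceAccount:%s --role=roles/storage.insightsCollectorService\n' \
    "$1" "$(service_agent_email "$2")"
}
```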
Link a dataset
To link a dataset to BigQuery, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click the name of the dataset configuration that generated the dataset you want to link.
- In the BigQuery linked dataset section, click Link dataset to link your dataset.
Command line
To link a dataset to BigQuery, run the gcloud storage insights dataset-configs create-link command:
    gcloud storage insights dataset-configs create-link DATASET_CONFIG_ID --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
    gcloud storage insights dataset-configs create-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Create a JSON file that contains the following information:
    { "name": "DATASET_NAME" }
  Replace:
  - DATASET_NAME with the name of the dataset you want to link. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.
- Use cURL to call the JSON API with a linkDataset DatasetConfig request:
    curl --request POST --data-binary @JSON_FILE_NAME \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:linkDataset" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - JSON_FILE_NAME with the path to the JSON file you created in the previous step.
  - PROJECT_ID with the ID of the project that the dataset configuration belongs to.
  - LOCATION with the location in which the dataset and dataset configuration reside. For example, us-central1.
  - DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to link.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
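The request URL for a dataset configuration, including custom methods such as linkDataset, follows a fixed pattern, so it can be assembled by a small helper to avoid a missing slash or colon. This is an illustrative sketch; the function name is hypothetical, and the URL layout is copied from the cURL examples on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints the JSON API URL for a dataset configuration.
# $1: project ID, $2: location, $3: dataset config ID,
# $4 (optional): custom method, such as linkDataset or unlinkDataset
dataset_config_url() {
  url="https://storageinsights.googleapis.com/v1/projects/$1/locations/$2/datasetConfigs/$3"
  if [ -n "${4:-}" ]; then
    # Custom methods are appended to the resource name after a colon.
    url="${url}:$4"
  fi
  printf '%s\n' "$url"
}
```

For example, dataset_config_url my-project us-central1 my_config linkDataset prints the URL to use in the linkDataset request.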
View and query linked datasets
To view and query linked datasets, follow these steps:
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
  A list of dataset configurations that are created in your project appears.
- Click the BigQuery linked dataset of the dataset configuration that you want to view.
  The BigQuery linked dataset appears in the Google Cloud console. For information about the dataset schema of metadata, see Dataset schema of metadata.
You can query tables and views in your linked datasets in the same way you would query any other BigQuery table.
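As a sketch of what such a query might look like, the following helper builds a simple preview statement for a table or view in a linked dataset. The function name is hypothetical, and the three-part project.dataset.view reference is standard BigQuery naming rather than something specific to this page. The only view name used in the example is error_attributes_view, which is mentioned earlier on this page; see Dataset schema of metadata for the others.

```shell
#!/bin/sh
# Hypothetical helper: prints a GoogleSQL statement that previews rows.
# $1: project ID, $2: linked dataset ID, $3: table or view name, $4: row limit
build_preview_query() {
  printf 'SELECT * FROM `%s.%s.%s` LIMIT %s\n' "$1" "$2" "$3" "$4"
}
```

You could then pass the statement to the bq tool, for example bq query --use_legacy_sql=false "$(build_preview_query my_project my_dataset error_attributes_view 10)", or paste it into the BigQuery query editor.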
Unlink a dataset
To stop the dataset configuration from publishing to the BigQuery dataset, unlink the dataset. To unlink a dataset, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click the name of the dataset configuration that generated the dataset you want to unlink.
- In the BigQuery linked dataset section, click Unlink dataset to unlink your dataset.
Command line
To unlink the dataset, run the gcloud storage insights dataset-configs delete-link command:
    gcloud storage insights dataset-configs delete-link DATASET_CONFIG_ID --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
    gcloud storage insights dataset-configs delete-link projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Create a JSON file that contains the following information:
    { "name": "DATASET_NAME" }
  Replace:
  - DATASET_NAME with the name of the dataset you want to unlink. For example, my_project.my_dataset276daa7e_2991_4f4f_b9d4_e354b48426a2.
- Use cURL to call the JSON API with an unlinkDataset DatasetConfig request:
    curl --request POST --data-binary @JSON_FILE_NAME \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID:unlinkDataset" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - JSON_FILE_NAME with the path to the JSON file you created in the previous step.
  - PROJECT_ID with the ID of the project that the dataset configuration belongs to.
  - LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
  - DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to unlink.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
View a dataset configuration
To view a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click the name of the dataset configuration you want to view.
  The dataset configuration details are displayed.
Command line
To describe a dataset configuration, run the gcloud storage insights dataset-configs describe command:
    gcloud storage insights dataset-configs describe DATASET_CONFIG_ID \
      --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
    gcloud storage insights dataset-configs describe projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
Replace:
- DESTINATION_PROJECT_ID with the ID of the project that contains the dataset configuration. For more information about project IDs, see Creating and managing projects.
- DATASET_CONFIG_ID with the name of the dataset configuration that generated the dataset you want to view.
- LOCATION with the location of your dataset and dataset configuration. For example, us-central1.
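When a script receives a full dataset configuration path, it can be useful to split the path back into its project, location, and dataset configuration ID. The following sketch does this with plain shell word splitting. The function name is hypothetical, and the path layout is the one shown in the examples on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints "PROJECT LOCATION DATASET_CONFIG_ID" for a path
# of the form projects/P/locations/L/datasetConfigs/ID, and fails otherwise.
# The body runs in a subshell so the IFS and globbing changes stay local.
parse_dataset_config_path() (
  case "$1" in
    projects/*/locations/*/datasetConfigs/*) ;;
    *) exit 1 ;;
  esac
  set -f      # disable globbing while the path is split
  IFS=/
  set -- $1   # split the path on "/" into positional parameters
  [ "$#" -eq 6 ] || exit 1
  [ -n "$2" ] && [ -n "$4" ] && [ -n "$6" ] || exit 1
  printf '%s %s %s\n' "$2" "$4" "$6"
)
```

For example, parse_dataset_config_path projects/my-project/locations/us-central1/datasetConfigs/my_config prints the three components separated by spaces.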
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Use cURL to call the JSON API with a Get DatasetConfig request:
    curl -X GET \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - PROJECT_ID with the ID of the project that the dataset configuration belongs to.
  - LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
  - DATASET_CONFIG_ID with the name of the dataset configuration.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
List dataset configurations
To list the dataset configurations in a project, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
The list of dataset configurations is displayed.
Command line
To list dataset configurations in a project, run the gcloud storage insights dataset-configs list command:
    gcloud storage insights dataset-configs list --location=LOCATION
Replace:
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
You can use the following optional flags to specify the behavior of the listing call:
- Use --page-size to specify the maximum number of results to return per page.
- Use --filter=FILTER to filter results. For more information on how to use the --filter flag, run gcloud topic filters and refer to the documentation.
- Use --sort-by=SORT_BY_VALUE to specify a comma-separated list of resource field key names to sort by. For example, --sort-by=DATASET_CONFIG_NAME.
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Use cURL to call the JSON API with a List DatasetConfigs request:
    curl -X GET \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - PROJECT_ID with the ID of the project that the dataset configurations belong to.
  - LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
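The listing request can be wrapped in a small function that obtains the access token at call time. This is a sketch under assumptions: the function names are hypothetical, and it assumes that gcloud auth print-access-token is an acceptable way to produce ACCESS_TOKEN in your environment, which this page doesn't state. The URL is the one used in the cURL example for listing.

```shell
#!/bin/sh
# Hypothetical helper: prints the URL that lists dataset configurations.
# $1: project ID, $2: location
list_dataset_configs_url() {
  printf 'https://storageinsights.googleapis.com/v1/projects/%s/locations/%s/datasetConfigs\n' "$1" "$2"
}

# Hypothetical helper: sends the list request with a freshly printed token.
# Requires the gcloud CLI and curl. $1: project ID, $2: location
list_dataset_configs() {
  token=$(gcloud auth print-access-token) || return 1
  curl -X GET "$(list_dataset_configs_url "$1" "$2")" \
    --header "Authorization: Bearer ${token}" \
    --header "Accept: application/json"
}
```

Calling list_dataset_configs my-project us-central1 sends the request. Only the URL helper works without network access and credentials.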
Update a dataset configuration
To update a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click the name of the dataset configuration you want to update.
- In the Dataset configuration tab that appears, click Edit to update the fields.
Command line
To update a dataset configuration, run the gcloud storage insights dataset-configs update command:
    gcloud storage insights dataset-configs update DATASET_CONFIG_ID \
      --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration.
- LOCATION with the location of the dataset and dataset configuration.
Use the following flags to update properties of the dataset configuration:
- Use --skip-verification to skip checks and failures from the verification process, which includes checks for required IAM permissions. If used, some or all buckets might be excluded from the dataset.
- Use --retention-period-days=DAYS to specify the moving number of days of data to capture in the dataset snapshot. For example, 90.
- Use --description=DESCRIPTION to write a description for the dataset configuration.
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Create a JSON file that contains the following optional information:
    {
      "organization_number": "ORGANIZATION_ID",
      "source_projects": {
        "project_numbers": ["PROJECT_NUMBERS", ...]
      },
      "retention_period_days": "RETENTION_PERIOD"
    }
  Replace:
  - ORGANIZATION_ID with the resource ID of the organization to which the source projects belong.
  - PROJECT_NUMBERS with the project numbers you want to include in the dataset. You can specify one project or multiple projects. Projects must be specified in a list format.
  - RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.
- To update the dataset configuration, use cURL to call the JSON API with a Patch DatasetConfig request:
    curl -X PATCH --data-binary @JSON_FILE_NAME \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID?updateMask=RETENTION_PERIOD" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - JSON_FILE_NAME with the path to the JSON file you created in the previous step.
  - PROJECT_ID with the ID of the project that the dataset configuration belongs to.
  - LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
  - DATASET_CONFIG_ID with the name of the dataset configuration you want to update.
  - RETENTION_PERIOD with the moving number of days of data to capture in the dataset snapshot. For example, 90.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.
Delete a dataset configuration
To delete a dataset configuration, complete the following steps:
Console
- In the Google Cloud console, go to the Cloud Storage Storage Insights page.
- Click the name of the dataset configuration you want to delete.
- Click Delete.
Command line
To delete a dataset configuration, run the gcloud storage insights dataset-configs delete command:
    gcloud storage insights dataset-configs delete DATASET_CONFIG_ID \
      --location=LOCATION
Replace:
- DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
- LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
You can use the following optional flags with the delete command:
- Use --auto-delete-link to unlink the dataset that was generated from the dataset configuration you want to delete. You must unlink a dataset before you can delete the dataset configuration that generated the dataset.
- Use --retention-period-days=DAYS to specify the number of days of data to capture in the dataset snapshot. For example, 90.
As an alternative to specifying DATASET_CONFIG_ID and LOCATION, you can specify a full dataset configuration path. For example:
    gcloud storage insights dataset-configs delete projects/DESTINATION_PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID
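Because a dataset must be unlinked before its dataset configuration can be deleted, a cleanup script has two options: run delete-link and then delete, or run delete with --auto-delete-link. The following sketch prints either sequence for review without running anything. The function name and its "auto" argument are hypothetical; the commands and flags are the ones described on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the commands needed to delete a
# dataset configuration, including the unlink step.
# $1: dataset config ID, $2: location,
# $3 (optional): "auto" to rely on --auto-delete-link
print_delete_plan() {
  if [ "${3:-}" = "auto" ]; then
    printf 'gcloud storage insights dataset-configs delete %s --location=%s --auto-delete-link\n' "$1" "$2"
  else
    # Unlink first, then delete.
    printf 'gcloud storage insights dataset-configs delete-link %s --location=%s\n' "$1" "$2"
    printf 'gcloud storage insights dataset-configs delete %s --location=%s\n' "$1" "$2"
  fi
}
```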
REST APIs
JSON API
- Have gcloud CLI installed and initialized, which lets you generate an access token for the Authorization header.
- Use cURL to call the JSON API with a Delete DatasetConfig request:
    curl -X DELETE \
      "https://storageinsights.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasetConfigs/DATASET_CONFIG_ID" \
      --header "Authorization: Bearer ACCESS_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json"
  Replace:
  - PROJECT_ID with the ID of the project that the dataset configuration belongs to.
  - LOCATION with the location of the dataset and dataset configuration. For example, us-central1.
  - DATASET_CONFIG_ID with the name of the dataset configuration you want to delete.
  - ACCESS_TOKEN with the access token you generated when you installed and initialized the Google Cloud CLI.