Create Spanner external datasets

This document describes how to create an external dataset (also known as a federated dataset) in BigQuery that's linked to an existing database in Spanner.

An external dataset is a connection between BigQuery and an external data source at the dataset level. It lets you query transactional data in Spanner databases with GoogleSQL without moving data from Spanner to BigQuery storage.

The tables in an external dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.

Required permissions

To get the permission that you need to create an external dataset, ask your administrator to grant you the BigQuery User (roles/bigquery.user) IAM role. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the bigquery.datasets.create permission, which is required to create an external dataset.

You might also be able to get this permission with custom roles or other predefined roles.

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Create an external dataset

To create a external dataset, do the following:

Console

  1. Open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the Explorer panel, select the project where you want to create the dataset.

  3. Expand the Actions option and click Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter a unique dataset name.
    • For Location type, choose a location for the dataset, such as us-central1 or multiregion us. After you create a dataset, the location can't be changed.
    • For External Dataset, do the following:

      • Check the box next to Link to an external dataset.
      • For External dataset type, select Spanner.
      • For External source, enter the full identifier of your Spanner database in the following format: projects/PROJECT_ID/instances/INSTANCE/databases/DATABASE. For example: projects/my_project/instances/my_instance/databases/my_database.
      • Optionally, for Database role enter the name of a Spanner database role. For more information read about Database roles used for creating Spanner Connections
    • Leave the other default settings as they are.

  5. Click Create dataset.

SQL

Use the CREATE EXTERNAL SCHEMA data definition language (DDL) statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL SCHEMA DATASET_NAME
      OPTIONS (
        external_source = 'SPANNER_EXTERNAL_SOURCE',
        location = 'LOCATION');

    Replace the following:

    • DATASET_NAME: the name of your new dataset in BigQuery.
    • SPANNER_EXTERNAL_SOURCE: the full, qualified Spanner database name, with a prefix identifying the source, in the following format: google-cloudspanner://[DATABASE_ROLE@]/projects/PROJECT_ID/instances/INSTANCE/databases/DATABASE. For example: google-cloudspanner://admin@/projects/my_project/instances/my_instance/databases/my_database or google-cloudspanner:/projects/my_project/instances/my_instance/databases/my_database.
    • LOCATION: the location of your new dataset in BigQuery, for example, us-central1. After you create a dataset, you can't change its location.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

In a command-line environment, create an external dataset by using the bq mk command:

bq --location=LOCATION mk --dataset \
    --external_source SPANNER_EXTERNAL_SOURCE \
    DATASET_NAME

Replace the following:

  • LOCATION: the location of your new dataset in BigQuery—for example, us-central1. After you create a dataset, you can't change its location. You can set a default location value by using the .bigqueryrc file.
  • SPANNER_EXTERNAL_SOURCE: the full, qualified Spanner database name, with a prefix identifying the source, in the following format: google-cloudspanner://[DATABASE_ROLE@]/projects/PROJECT_ID/instances/INSTANCE/databases/DATABASE. For example: google-cloudspanner://admin@/projects/my_project/instances/my_instance/databases/my_database or google-cloudspanner:/projects/my_project/instances/my_instance/databases/my_database.
  • DATASET_NAME: the name of your new dataset in BigQuery. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

Terraform

Use the google_bigquery_dataset resource.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

The following example creates a Spanner external dataset:

resource "google_bigquery_dataset" "dataset" {
  provider                    = google-beta
  dataset_id                  = "example_dataset"
  friendly_name               = "test"
  description                 = "This is a test description."
  location                    = "us-central1"

external_dataset_reference {
  external_source = "google-cloudspanner:/projects/my_project/instances/my_instance/databases/my_database"
  }
}

To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.

Prepare Cloud Shell

  1. Launch Cloud Shell.
  2. Set the default Google Cloud project where you want to apply your Terraform configurations.

    You only need to run this command once per project, and you can run it in any directory.

    export GOOGLE_CLOUD_PROJECT=PROJECT_ID

    Environment variables are overridden if you set explicit values in the Terraform configuration file.

Prepare the directory

Each Terraform configuration file must have its own directory (also called a root module).

  1. In Cloud Shell, create a directory and a new file within that directory. The filename must have the .tf extension—for example main.tf. In this tutorial, the file is referred to as main.tf.
    mkdir DIRECTORY && cd DIRECTORY && touch main.tf
  2. If you are following a tutorial, you can copy the sample code in each section or step.

    Copy the sample code into the newly created main.tf.

    Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution.

  3. Review and modify the sample parameters to apply to your environment.
  4. Save your changes.
  5. Initialize Terraform. You only need to do this once per directory.
    terraform init

    Optionally, to use the latest Google provider version, include the -upgrade option:

    terraform init -upgrade

Apply the changes

  1. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
    terraform plan

    Make corrections to the configuration as necessary.

  2. Apply the Terraform configuration by running the following command and entering yes at the prompt:
    terraform apply

    Wait until Terraform displays the "Apply complete!" message.

  3. Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.

API

Call the datasets.insert method with a defined dataset resource and externalDatasetReference field for your Spanner database.

Note that names of the tables in the external datasets are case insensitive.

List tables in an external dataset

To list the tables that are available for query in your external dataset, see Listing datasets.

Get table information

To get information on the tables in your external dataset, such as schema details, see Get table information.

Control access to tables

Spanner external datasets support end-user credentials (EUC). That means that access to the Spanner tables from external datasets is controlled by Spanner. Users can query these tables only if they have access granted in Spanner.

Query Spanner data

Querying tables in external datasets is the same as querying tables in any other BigQuery dataset. However, data modification operations (DML) aren't supported.

Queries against tables in Spanner external datasets use Data Boost by default and it cannot be changed. Because of that you need additional permissions to run such queries.

Create a view in an external dataset

You can't create a view in a external dataset. However, you can create a view in a standard dataset that's based on a table in an external dataset. For more information, see Create views.

Delete a external dataset

Deleting a external dataset is the same as deleting any other BigQuery dataset. Deleting external datasets does not impact tables in the Spanner database. For more information, see Delete datasets.

Limitations

  • BigQuery federated queries limitations apply.
  • If a table in Spanner database contains a column of a type that isn't supported by BigQuery then this column won't be accessible on BigQuery side.
  • You can't add, delete, or update data or metadata in tables in a Spanner external dataset.
  • You can't create new tables, views, or materialized views in a Spanner external dataset.
  • INFORMATION_SCHEMA views aren't supported.
  • Metadata caching isn't supported.
  • Dataset-level settings that are related to table creation defaults don't affect external datasets because you can't create tables manually.
  • Spanner databases that use PostgreSQL dialect aren't supported.
  • Write API and Read API aren't supported.
  • Row-level security, column-level security, and data masking aren't supported.
  • Materialized views based on tables from Spanner external datasets aren't supported.
  • Integration with Dataplex isn't supported. For example, data profiles and data quality scans aren't supported.
  • Tags on a table level aren't supported.
  • Metadata caching isn't supported.
  • SQL auto completion does not work with Spanner external tables when you write queries.
  • Scan with Sensitive Data Protection isn't supported for external datasets.
  • Sharing with Analytics Hub isn't supported for external datasets.
  • You can create an authorized view which references Spanner external dataset. However, when this view is queried, then EUC of a person who executes a query will be sent to Spanner.

What's next