Spanner to BigQuery template

The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.

Pipeline requirements

  • The source Spanner table must exist prior to running the pipeline.
  • The BigQuery dataset must exist prior to running the pipeline.
  • A JSON file that describes your BigQuery schema.

    The file must contain a top-level JSON array titled fields. The contents of the fields array must use the following pattern:
    {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

    The following JSON describes an example BigQuery schema:

    {
      "fields": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    

    The Spanner to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

Template parameters

Required parameters

  • spannerInstanceId : The Spanner instance to read from.
  • spannerDatabaseId : The Spanner database to read from.
  • spannerTableId : The Spanner table to read from.
  • sqlQuery : Query used to read Spanner table.
  • outputTableSpec : BigQuery table location to write the output to. The name should be in the format <project>:<dataset>.<table_name>. The table's schema must match input objects.

Optional parameters

  • spannerProjectId : The project where the Spanner instance to read from is located. The default for this parameter is the project where the Dataflow pipeline is running.
  • spannerRpcPriority : The priority of Spanner job. Must be one of the following: [HIGH, MEDIUM, LOW]. Default is HIGH.
  • bigQuerySchemaPath : The Cloud Storage path for the BigQuery JSON schema. If createDisposition is not set, or set to CREATE_IF_NEEDED, this parameter must be specified. (Example: gs://your-bucket/your-schema.json).
  • writeDisposition : BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE. Defaults to: WRITE_APPEND.
  • createDisposition : BigQuery CreateDisposition. For example, CREATE_IF_NEEDED, CREATE_NEVER. Defaults to: CREATE_IF_NEEDED.
  • useStorageWriteApi : If enabled (set to true) the pipeline will use Storage Write API when writing the data to BigQuery (see https://cloud.google.com/blog/products/data-analytics/streaming-data-into-bigquery-using-storage-write-api). Defaults to: false.
  • useStorageWriteApiAtLeastOnce : This parameter takes effect only if "Use BigQuery Storage Write API" is enabled. If enabled the at-least-once semantics will be used for Storage Write API, otherwise exactly-once semantics will be used. Defaults to: false.

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       spannerInstanceId=SPANNER_INSTANCE_ID,\
       spannerDatabaseId=SPANNER_DATABASE_ID,\
       spannerTableId=SPANNER_TABLE_ID,\
       sqlQuery=SQL_QUERY,\
       outputTableSpec=OUTPUT_TABLE_SPEC,\

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_INSTANCE_ID: the Spanner instance ID
  • SPANNER_DATABASE_ID: the Spanner database ID
  • SPANNER_TABLE_ID: the Spanner table name
  • SQL_QUERY: the SQL query
  • OUTPUT_TABLE_SPEC: the BigQuery table location

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launchParameter": {
     "jobName": "JOB_NAME",
     "parameters": {
       "spannerInstanceId": "SPANNER_INSTANCE_ID",
       "spannerDatabaseId": "SPANNER_DATABASE_ID",
       "spannerTableId": "SPANNER_TABLE_ID",
       "sqlQuery": "SQL_QUERY",
       "outputTableSpec": "OUTPUT_TABLE_SPEC",
     },
     "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
     "environment": { "maxWorkers": "10" }
  }
}

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_INSTANCE_ID: the Spanner instance ID
  • SPANNER_DATABASE_ID: the Spanner database ID
  • SPANNER_TABLE_ID: the Spanner table name
  • SQL_QUERY: the SQL query
  • OUTPUT_TABLE_SPEC: the BigQuery table location

What's next