The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.
Pipeline requirements
- The source Spanner table must exist prior to running the pipeline.
- The BigQuery dataset must exist prior to running the pipeline.
- A JSON file that describes your BigQuery schema. The file must contain a top-level JSON array titled `fields`. The contents of the `fields` array must use the following pattern: `{"name": "COLUMN_NAME", "type": "DATA_TYPE"}`.

The following JSON describes an example BigQuery schema:

```json
{
  "fields": [
    {"name": "location", "type": "STRING"},
    {"name": "name", "type": "STRING"},
    {"name": "age", "type": "STRING"},
    {"name": "color", "type": "STRING"},
    {"name": "coffee", "type": "STRING"}
  ]
}
```
The Spanner to BigQuery batch template doesn't support importing data into `STRUCT` (Record) fields in the target BigQuery table.
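The `fields` array pattern described above can be checked before launching the pipeline. The following is a minimal sketch of such a check; it is not part of the template, and the helper name and example values are hypothetical:

```python
import json

def validate_bigquery_schema(schema_json: str) -> list:
    """Check that a BigQuery schema JSON string has a top-level
    "fields" array whose entries follow {"name": ..., "type": ...}."""
    schema = json.loads(schema_json)
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        raise ValueError('schema must contain a non-empty top-level "fields" array')
    for field in fields:
        if not {"name", "type"} <= set(field.keys()):
            raise ValueError("each field needs 'name' and 'type': %r" % field)
    return fields

# Example: a fragment of the schema shown above
example = '{"fields": [{"name": "location", "type": "STRING"}]}'
print([f["name"] for f in validate_bigquery_schema(example)])  # ['location']
```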
Template parameters
Required parameters
- spannerInstanceId : The Spanner instance to read from.
- spannerDatabaseId : The Spanner database to read from.
- spannerTableId : The Spanner table to read from.
- sqlQuery : The SQL query used to read from the Spanner table.
- outputTableSpec : The BigQuery table to write the output to, in the format `<project>:<dataset>.<table_name>`. The table's schema must match the input objects.
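A quick way to catch a malformed table spec before submitting a job is a format check on the `<project>:<dataset>.<table_name>` pattern. This is a hypothetical sanity check, not part of the template; the regular expression is a simplification of the real naming rules:

```python
import re

# Simplified pattern for <project>:<dataset>.<table_name>.
# Project IDs may contain hyphens; dataset and table names are word characters here.
TABLE_SPEC_RE = re.compile(r"^[\w.-]+:\w+\.\w+$")

assert TABLE_SPEC_RE.match("my-project:my_dataset.my_table")
assert not TABLE_SPEC_RE.match("my_dataset.my_table")  # missing the project prefix
```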
Optional parameters
- spannerProjectId : The project where the Spanner instance to read from is located. The default for this parameter is the project where the Dataflow pipeline is running.
- spannerRpcPriority : The request priority for the Spanner job. Must be one of the following: [HIGH, MEDIUM, LOW]. Defaults to: HIGH.
- bigQuerySchemaPath : The Cloud Storage path for the BigQuery JSON schema. If `createDisposition` is not set, or is set to CREATE_IF_NEEDED, this parameter must be specified. (Example: gs://your-bucket/your-schema.json).
- writeDisposition : BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY, or WRITE_TRUNCATE. Defaults to: WRITE_APPEND.
- createDisposition : BigQuery CreateDisposition. For example, CREATE_IF_NEEDED, CREATE_NEVER. Defaults to: CREATE_IF_NEEDED.
- useStorageWriteApi : If enabled (set to `true`), the pipeline uses the BigQuery Storage Write API when writing data to BigQuery (see https://cloud.google.com/blog/products/data-analytics/streaming-data-into-bigquery-using-storage-write-api). Defaults to: false.
- useStorageWriteApiAtLeastOnce : This parameter takes effect only if `useStorageWriteApi` is enabled. If enabled, at-least-once semantics are used for the Storage Write API; otherwise, exactly-once semantics are used. Defaults to: false.
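Taken together, the required and optional parameters form a flat string-to-string map that is passed to the template. The following is a minimal sketch of assembling one; the helper name and all values are hypothetical, and only parameters that differ from the documented defaults are emitted:

```python
def build_parameters(instance, database, table, query, output_table,
                     rpc_priority="HIGH", write_disposition="WRITE_APPEND"):
    """Assemble the flat parameter map for the Spanner to BigQuery template.

    Required parameters are always present; optional ones are added only
    when they differ from the defaults documented above."""
    params = {
        "spannerInstanceId": instance,
        "spannerDatabaseId": database,
        "spannerTableId": table,
        "sqlQuery": query,
        "outputTableSpec": output_table,
    }
    if rpc_priority != "HIGH":
        params["spannerRpcPriority"] = rpc_priority
    if write_disposition != "WRITE_APPEND":
        params["writeDisposition"] = write_disposition
    return params

# Hypothetical example values
params = build_parameters(
    "my-instance", "my-database", "my-table",
    "SELECT * FROM my_table", "my-project:my_dataset.my_table",
)
```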
Run the template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is `us-central1`. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
```
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       spannerInstanceId=SPANNER_INSTANCE_ID,\
       spannerDatabaseId=SPANNER_DATABASE_ID,\
       spannerTableId=SPANNER_TABLE_ID,\
       sqlQuery=SQL_QUERY,\
       outputTableSpec=OUTPUT_TABLE_SPEC
```
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
```
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launchParameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "spannerInstanceId": "SPANNER_INSTANCE_ID",
      "spannerDatabaseId": "SPANNER_DATABASE_ID",
      "spannerTableId": "SPANNER_TABLE_ID",
      "sqlQuery": "SQL_QUERY",
      "outputTableSpec": "OUTPUT_TABLE_SPEC"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
    "environment": {
      "maxWorkers": "10"
    }
  }
}
```
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
- LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
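The request body above can also be assembled programmatically before sending the POST. The following is a minimal sketch that builds the same structure; the helper name and example values are hypothetical, and sending the request (with authentication) is left out:

```python
import json

def build_launch_request(job_name, parameters, container_spec_gcs_path, max_workers=10):
    """Build the flexTemplates:launch request body shown above."""
    return {
        "launchParameter": {
            "jobName": job_name,
            "parameters": parameters,
            "containerSpecGcsPath": container_spec_gcs_path,
            # The environment block expects string values.
            "environment": {"maxWorkers": str(max_workers)},
        }
    }

# Hypothetical example values
body = build_launch_request(
    "spanner-to-bq-job",
    {
        "spannerInstanceId": "my-instance",
        "spannerDatabaseId": "my-database",
        "spannerTableId": "my-table",
        "sqlQuery": "SELECT * FROM my_table",
        "outputTableSpec": "my-project:my_dataset.my_table",
    },
    "gs://dataflow-templates-us-central1/latest/flex/Cloud_Spanner_to_BigQuery_Flex",
)
print(json.dumps(body, indent=2))
```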
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.