Export data as Protobuf columns
This document describes how you can export BigQuery data as Protocol Buffers (Protobuf) columns by using BigQuery user-defined functions (UDFs).
When to use Protobuf columns
BigQuery offers a number of built-in functions to format selected data. One option is to merge multiple column values into a single Protobuf value, which has the following benefits:
- Object type safety.
- Improved compression, data transfer time, and cost as compared with JSON.
- Flexibility, as most programming languages have libraries to handle Protobuf.
- Less overhead when reading from multiple columns and building a single object.
While other column types can also provide type safety, using Protobuf columns provides a fully typed object, which can reduce the amount of work that needs to be done on the application layer or in another part of the pipeline.
However, there are limitations to exporting BigQuery data as Protobuf columns:
- Protobuf columns are not well indexed or filtered, so searching by the content of a Protobuf column can be less effective.
- Sorting data in Protobuf format can be difficult.
If these limitations affect your export workflow, consider other methods of exporting BigQuery data:
- Use scheduled queries with `EXPORT DATA` statements to sort the exported BigQuery data by date or time, and to schedule exports on a recurring basis (see the sketch after this list). BigQuery supports exporting data into Avro, CSV, JSON, and Parquet formats.
- Use Dataflow to export BigQuery data in either the Avro or CSV file format.
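As a rough illustration of the first option, the following is a minimal sketch that runs an `EXPORT DATA` statement once through the BigQuery Python client. The bucket path, project, dataset, table, and `created_at` column are placeholder assumptions; to run the export on a recurring basis, you would save the same statement as a scheduled query instead.

```python
# Minimal sketch: run an EXPORT DATA statement with the BigQuery Python client.
# The bucket, project, dataset, table, and created_at column are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

export_sql = """
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/mytable-*.avro',
  format = 'AVRO',
  overwrite = true
) AS
SELECT *
FROM `my-project.mydataset.mytable`
ORDER BY created_at
"""

# Runs the export once; a recurring export would be configured as a
# scheduled query that uses the same statement.
client.query(export_sql).result()
```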
Required roles
To get the permissions that you need to export BigQuery data as Protobuf columns, ask your administrator to grant you the following IAM roles on your project:
- Create a user-defined function: BigQuery Data Editor (`roles/bigquery.dataEditor`)
- Export data from a BigQuery table: BigQuery Data Viewer (`roles/bigquery.dataViewer`)
- Read and upload files to Cloud Storage: Storage Object Creator (`roles/storage.objectCreator`)

For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Create a UDF
Create a UDF that converts a BigQuery `STRUCT` data type into a Protobuf column:
1. In a command line, clone the `bigquery-utils.git` repository:

   ```sh
   git clone https://github.com/GoogleCloudPlatform/bigquery-utils.git
   ```

2. Navigate to the Protobuf export folder:

   ```sh
   cd bigquery-utils/tools/protobuf_export
   ```

3. Use the `cp` command or your operating system's file browser to copy your proto file to the `./protos` child folder. There is a sample proto file named `dummy.proto` already in the `./protos` folder.

4. Install the necessary packages from the GitHub repository:

   ```sh
   npm install
   ```

5. Bundle the package by using webpack:

   ```sh
   npx webpack --config webpack.config.js --stats-error-details
   ```

6. Locate the `pbwrapper.js` file in the `./dist` child folder, and then upload the file to a Cloud Storage bucket.

7. Go to the BigQuery page.

8. Using the query editor, create a UDF named `toMyProtoMessage` that builds a Protobuf column from existing BigQuery table columns:

   ```sql
   CREATE FUNCTION DATASET_ID.toMyProtoMessage(input STRUCT<INPUT_FIELDS>)
     RETURNS BYTES
     LANGUAGE js
     OPTIONS (library = ["gs://BUCKET_NAME/pbwrapper.js"])
   AS r"""
   let message = pbwrapper.setup("PROTO_PACKAGE.PROTO_MESSAGE")
   return pbwrapper.parse(message, input)
   """;
   ```

Replace the following:

- `DATASET_ID`: the ID of the dataset to contain the UDF.
- `INPUT_FIELDS`: the fields used in the proto message type for the proto file, in the format `field_name_1 field_type_1 [, field_name_2 field_type_2, ...]`.

  You must translate any message type fields that use underscores to use camel case instead. For example, if the message type looks like the following, then the input fields value must be `itemId int64, itemDescription string`:

  ```proto
  message ThisMessage {
    int64 item_id = 1;
    string item_description = 2;
  }
  ```

- `BUCKET_NAME`: the name of the Cloud Storage bucket that contains the `pbwrapper.js` file.
- `PROTO_PACKAGE`: the package for the proto file.
- `PROTO_MESSAGE`: the message type for the proto file.

For example, if you use the provided `dummy.proto` file, the `CREATE FUNCTION` statement looks as follows:

```sql
CREATE OR REPLACE FUNCTION mydataset.toMyProtoMessage(input STRUCT<dummyField STRING>)
  RETURNS BYTES
  LANGUAGE js
  OPTIONS (library = ["gs://mybucket/pbwrapper.js"])
AS r"""
let message = pbwrapper.setup("dummypackage.DummyMessage")
return pbwrapper.parse(message, input)
""";
```

Format columns as Protobuf values
Run the `toMyProtoMessage` UDF to format BigQuery table columns as Protobuf values:

```sql
SELECT
  UDF_DATASET_ID.toMyProtoMessage(STRUCT(INPUT_COLUMNS)) AS protoResult
FROM
  `PROJECT_ID.DATASET_ID.TABLE_NAME`
LIMIT
  100;
```

Replace the following:

- `UDF_DATASET_ID`: the ID of the dataset that contains the UDF.
- `INPUT_COLUMNS`: the names of the columns to format as a Protobuf value, in the format `column_name_1 [, column_name_2, ...]`. Columns can be of any supported scalar value type or non-scalar type, including `ARRAY` and `STRUCT`. Input columns must match the type and number of the proto message type fields.
- `PROJECT_ID`: the ID of the project that contains the table. You can skip identifying the project if the dataset is in your current project.
- `DATASET_ID`: the ID of the dataset that contains the table.
- `TABLE_NAME`: the name of the table that contains the columns to format.

For example, if you use a `toMyProtoMessage` UDF based on `dummy.proto`, the following `SELECT` statement works:

```sql
SELECT
  mydataset.toMyProtoMessage(STRUCT(word)) AS protoResult
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 100;
```

Work with Protobuf values
With the BigQuery data exported in the Protobuf format, you can now work with the data as a fully typed object or struct.
The following code samples provide several examples of ways that you can process or work with the exported data:
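For instance, the following is a minimal Python sketch (not an official sample) that reads rows produced by the `toMyProtoMessage` UDF and deserializes each `protoResult` value. It assumes you generated Python classes from `dummy.proto` (for example, with `protoc --python_out=. dummy.proto`); the `dummy_pb2` module name is an assumption based on that convention, and `DummyMessage` matches the message type used in the examples above.

```python
# Minimal sketch: query the Protobuf column and parse each value into a typed
# message object. The dummy_pb2 module name is an assumption; it is whatever
# protoc generates from your proto file.
from google.cloud import bigquery

import dummy_pb2  # generated from dummy.proto with protoc --python_out

client = bigquery.Client()

query = """
SELECT
  mydataset.toMyProtoMessage(STRUCT(word)) AS protoResult
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 100
"""

for row in client.query(query).result():
    message = dummy_pb2.DummyMessage()
    # BigQuery returns BYTES columns as Python bytes, which ParseFromString accepts.
    message.ParseFromString(row["protoResult"])
    print(message)
```

Because `ParseFromString` populates a class generated from your proto file, downstream code works with the fully typed object described earlier rather than with loosely typed column values.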