User-defined functions in Python
A Python user-defined function (UDF) lets you implement a scalar function in Python and use it in a SQL query. Python UDFs are similar to SQL and JavaScript UDFs, but with additional capabilities. Python UDFs let you install third-party libraries from the Python Package Index (PyPI) and let you access external services using a Cloud resource connection.
Python UDFs are built and run on BigQuery managed resources.
Limitations
- `python-3.11` is the only supported runtime.
- You cannot create a temporary Python UDF.
- You cannot use a Python UDF with a materialized view.
- The results of a query that calls a Python UDF are not cached, because the return value of a Python UDF is always assumed to be non-deterministic.
- Python UDFs are not fully supported in `INFORMATION_SCHEMA` views.
- You cannot create or update a Python UDF using the Routine API.
- VPC Service Controls are not supported.
- Customer-managed encryption keys (CMEK) are not supported.
- These data types are not supported: `JSON`, `RANGE`, `INTERVAL`, and `GEOGRAPHY`.
Required IAM roles
The required IAM roles are based on whether you are a Python UDF owner or a Python UDF user. A Python UDF owner typically creates or updates a UDF. A Python UDF user invokes a UDF created by someone else.
Additional roles are also required if you create or run a Python UDF that references a Cloud resource connection.
UDF owners
If you're creating or updating a Python UDF, the following predefined IAM roles should be granted on the appropriate resource:
| Role | Required permissions | Resource |
| --- | --- | --- |
| BigQuery Data Editor (`roles/bigquery.dataEditor`) | `bigquery.routines.create` to create the UDF.<br>`bigquery.routines.update` to update the UDF. | The dataset where the Python UDF is created or updated. |
| BigQuery Job User (`roles/bigquery.jobUser`) | `bigquery.jobs.create` to run the `CREATE FUNCTION` statement as a query job. | The project where you're running the `CREATE FUNCTION` statement. |
| BigQuery Connection Admin (`roles/bigquery.connectionAdmin`) | `bigquery.connections.delegate` to delegate use of the connection. | The connection you're giving access to an external resource. This connection is required only if your UDF uses the `WITH CONNECTION` clause to access an external service. |
UDF users
If you're invoking a Python UDF, the following predefined IAM roles should be granted on the appropriate resource:
| Role | Required permissions | Resource |
| --- | --- | --- |
| BigQuery User (`roles/bigquery.user`) | `bigquery.jobs.create` to run a query job that references the UDF. | The project where you're running a query job that invokes the Python UDF. |
| BigQuery Data Viewer (`roles/bigquery.dataViewer`) | `bigquery.routines.get` to run a UDF created by someone else. | The dataset where the Python UDF is stored. |
| BigQuery Connection User (`roles/bigquery.connectionUser`) | `bigquery.connections.use` to run a Python UDF that references a Cloud resource connection. | The Cloud resource connection referenced by the Python UDF. This connection is required only if your UDF references a connection. |
For more information about roles in BigQuery, see Predefined IAM roles.
Create a persistent Python UDF
Follow these rules when you create a Python UDF:
- The body of the Python UDF must be a quoted string literal that represents the Python code. To learn more about quoted string literals, see Formats for quoted literals.
- The body of the Python UDF must include a Python function that is used in the `entry_point` argument in the Python UDF options list.
- A Python runtime version needs to be specified in the `runtime_version` option. The only supported Python runtime version is `python-3.11`. For a full list of available options, see the function option list for the `CREATE FUNCTION` statement.
To create a persistent Python UDF, use the `CREATE FUNCTION` statement without the `TEMP` or `TEMPORARY` keyword. To delete a persistent Python UDF, use the `DROP FUNCTION` statement.
When you create a Python UDF using the `CREATE FUNCTION` statement, BigQuery creates or updates a container image that is based on a base image. The container is built on the base image using your code and any specified package dependencies. Creating the container is a long-running process. The first query after you run the `CREATE FUNCTION` statement might automatically wait for the image build to complete. Without any external dependencies, the container image should typically be created in less than a minute.
The following example creates a persistent Python UDF named `multiplyInputs` and calls the UDF from within a `SELECT` statement:
Go to the BigQuery page.
In the query editor, enter the following `CREATE FUNCTION` statement:

```sql
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE python
OPTIONS(runtime_version="python-3.11", entry_point="multiply")
AS r'''

def multiply(x, y):
  return x * y

''';

-- Call the Python UDF.
WITH numbers AS (
  SELECT 1 AS x, 5 AS y
  UNION ALL
  SELECT 2 AS x, 10 AS y
  UNION ALL
  SELECT 3 AS x, 15 AS y)
SELECT
  x, y,
  `PROJECT_ID.DATASET_ID`.multiplyInputs(x, y) AS product
FROM numbers;
```
Replace PROJECT_ID.DATASET_ID with your project ID and dataset ID.
Click Run.

This example produces the following output:

```
+-----+-----+---------+
|  x  |  y  | product |
+-----+-----+---------+
|  1  |  5  |   5.0   |
|  2  | 10  |  20.0   |
|  3  | 15  |  45.0   |
+-----+-----+---------+
```
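Because the UDF body is ordinary Python, you can unit-test the entry-point function locally before running the `CREATE FUNCTION` statement. A minimal sketch using the `multiply` function from the example above (for a standard, non-vectorized UDF, BigQuery calls the entry point once per row):

```python
# The UDF body from the example above, runnable as ordinary Python.
def multiply(x, y):
    return x * y

# A standard (non-vectorized) UDF is invoked once per input row;
# FLOAT64 parameters arrive as Python floats.
rows = [(1.0, 5.0), (2.0, 10.0), (3.0, 15.0)]
products = [multiply(x, y) for x, y in rows]
print(products)  # [5.0, 20.0, 45.0]
```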
Create a vectorized Python UDF
You can implement your Python UDF to process a batch of rows instead of a single row by using vectorization. Vectorization can improve query performance.
To control batching behavior, specify the maximum number of rows in each batch by using the `max_batching_rows` option in the `CREATE OR REPLACE FUNCTION` option list. If you specify `max_batching_rows`, BigQuery determines the number of rows in a batch, up to the `max_batching_rows` limit. If `max_batching_rows` is not specified, the number of rows to batch is determined automatically.
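To illustrate the semantics, here is a small sketch in plain Python (not BigQuery internals) showing how a set of rows could be split into batches that never exceed `max_batching_rows`; the actual batch sizes BigQuery chooses are internal and may be smaller than the limit:

```python
# Illustration only: max_batching_rows is an upper bound on batch size,
# not an exact batch size. BigQuery chooses the actual sizes internally.
def batches(rows, max_batching_rows):
    for i in range(0, len(rows), max_batching_rows):
        yield rows[i:i + max_batching_rows]

rows = list(range(10))
sizes = [len(batch) for batch in batches(rows, max_batching_rows=4)]
print(sizes)  # [4, 4, 2] -- every batch is within the limit
```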
A vectorized Python UDF has a single `pandas.DataFrame` argument that must be annotated. The `pandas.DataFrame` argument has the same number of columns as the Python UDF parameters defined in the `CREATE FUNCTION` statement. The column names in the `pandas.DataFrame` argument have the same names as the UDF's parameters.

Your function needs to return either a `pandas.Series` or a single-column `pandas.DataFrame` with the same number of rows as the input.
The following example creates a vectorized Python UDF named `multiplyVectorized` with two parameters, `x` and `y`:
Go to the BigQuery page.
In the query editor, enter the following `CREATE FUNCTION` statement:

```sql
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.multiplyVectorized(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE python
OPTIONS(runtime_version="python-3.11", entry_point="vectorized_multiply")
AS r'''

import pandas as pd

def vectorized_multiply(df: pd.DataFrame):
  return df['x'] * df['y']

''';
```
Replace PROJECT_ID.DATASET_ID with your project ID and dataset ID.
Calling the UDF is the same as in the previous example.
Click Run.
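You can also exercise the vectorized entry point locally with pandas before deploying it. This sketch reuses the `vectorized_multiply` function from the example above; as described earlier, the DataFrame column names match the UDF parameter names:

```python
import pandas as pd

# Entry point from the vectorized example above.
def vectorized_multiply(df: pd.DataFrame):
    return df['x'] * df['y']

# Simulate one batch locally: column names match the UDF's parameters (x, y).
batch = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [5.0, 10.0, 15.0]})
result = vectorized_multiply(batch)

# The return value is a pandas.Series with one row per input row.
print(result.tolist())  # [5.0, 20.0, 45.0]
```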
Supported Python UDF data types
The following table defines the mapping between BigQuery data types, Python data types, and Pandas data types:
| BigQuery data type | Python built-in data type used by standard UDF | Pandas data type used by vectorized UDF | PyArrow data type used for ARRAY and STRUCT in vectorized UDF |
| --- | --- | --- | --- |
| `BOOL` | `bool` | `BooleanDtype` | `DataType(bool)` |
| `INT64` | `int` | `Int64Dtype` | `DataType(int64)` |
| `FLOAT64` | `float` | `FloatDtype` | `DataType(double)` |
| `STRING` | `str` | `StringDtype` | `DataType(string)` |
| `BYTES` | `bytes` | `binary[pyarrow]` | `DataType(binary)` |
| `TIMESTAMP` | Function parameter: `datetime.datetime` (with timezone)<br>Function return value: `datetime.datetime` (with or without timezone) | Function parameter: `pandas.Timestamp` (with timezone)<br>Function return value: `pandas.Timestamp` (with or without timezone) | `TimestampType(timestamp[us])`, with timezone |
| `DATE` | `datetime.date` | `date32[pyarrow]` | `DataType(date32[day])` |
| `TIME` | `datetime.time` | `time64[pyarrow]` | `Time64Type(time64[us])` |
| `DATETIME` | `datetime.datetime` (without timezone) | `timestamp[us][pyarrow]` | `TimestampType(timestamp[us])`, without timezone |
| `ARRAY` | `list` | `list<...>[pyarrow]`, where the element data type is a `pandas.ArrowDtype` | `ListType` |
| `STRUCT` | `dict` | `struct<...>[pyarrow]`, where the field data type is a `pandas.ArrowDtype` | `StructType` |
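The nullable pandas dtypes in the vectorized-UDF column matter in practice. A minimal sketch, assuming only pandas is installed, showing that an `INT64` column arrives as the nullable `Int64Dtype` (so SQL `NULL` appears as `pd.NA`), not as plain `int64`:

```python
import pandas as pd

# Per the table above, vectorized UDFs receive INT64 columns as the
# nullable Int64Dtype, so NULL values arrive as pd.NA instead of NaN.
ints = pd.Series([1, 2, None], dtype=pd.Int64Dtype())
print(str(ints.dtype))       # Int64
print(ints.isna().tolist())  # [False, False, True]

# STRING columns arrive as StringDtype; NULLs are pd.NA there as well.
strs = pd.Series(['a', None], dtype=pd.StringDtype())
print(strs.isna().tolist())  # [False, True]
```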
Supported runtime versions
BigQuery Python UDFs support the `python-3.11` runtime. This Python version includes some additional pre-installed packages. For system libraries, check the runtime base image.

| Runtime version | Python version | Includes | Runtime base image |
| --- | --- | --- | --- |
| `python-3.11` | Python 3.11 | `numpy` 1.26.3<br>`pyarrow` 14.0.2<br>`pandas` 2.1.4<br>`python-dateutil` 2.8.2 | `google-22-full/python311` |
Use third-party packages
You can use the `CREATE FUNCTION` option list to use modules other than those provided by the Python standard library and pre-installed packages. You can install packages from the Python Package Index (PyPI), or you can import Python files from Cloud Storage.
Install a package from the Python package index
When you install a package, you must provide the package name, and you can optionally provide the package version using Python package version specifiers. If the package is in the runtime, that package is used unless a particular version is specified in the `CREATE FUNCTION` option list. If a package version is not specified, and the package isn't in the runtime, the latest available version is used. Only packages with the wheel binary format are supported.
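Entries in the `packages` option combine a package name with an optional standard Python (PEP 440) version specifier. As an illustration of the format only, here is a hypothetical helper (not part of any BigQuery API) that splits an entry into its name and specifier parts:

```python
import re

# Hypothetical helper, for illustrating the entry format only:
# split 'google-cloud-translate>=3.11' into name and version specifier.
def split_package_spec(entry):
    match = re.match(r'^([A-Za-z0-9][A-Za-z0-9._-]*)(.*)$', entry)
    return match.group(1), match.group(2).strip()

print(split_package_spec('google-cloud-translate>=3.11'))
# ('google-cloud-translate', '>=3.11')
print(split_package_spec('numpy'))
# ('numpy', '') -- no specifier: the latest available version is used
```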
The following example shows you how to create a Python UDF that installs the Cloud Translation API client library package using the `CREATE OR REPLACE FUNCTION` option list:
Go to the BigQuery page.
In the query editor, enter the following `CREATE FUNCTION` statement:

```sql
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.translate(src STRING)
RETURNS STRING
LANGUAGE python
OPTIONS (
  entry_point='do_translate',
  runtime_version='python-3.11',
  packages=['google-cloud-translate>=3.11'])
AS r"""

from google.cloud import translate

def do_translate(src):
  # See the example in the following section for the detailed guide and
  # the implementation.
  return

""";
```
Replace PROJECT_ID.DATASET_ID with your project ID and dataset ID.
Click Run.
Import additional Python files as libraries
You can extend your Python UDFs using the Function option list by importing Python files from Cloud Storage.
In your UDF's Python code, you can import the Python files from Cloud Storage as modules by using the import statement followed by the path to the Cloud Storage object. For example, if you are importing `gs://BUCKET_NAME/path/to/lib1.py`, then your import statement would be `import path.to.lib1`.
The Python filename needs to be a valid Python identifier. Each folder name in the object name (after the `/`) should also be a valid Python identifier. Within the ASCII range (U+0001..U+007F), the following characters can be used in identifiers:
- Uppercase and lowercase letters A through Z.
- Underscores.
- The digits zero through nine, but a number cannot appear as the first character in the identifier.
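A small sketch of these rules, using a hypothetical helper (not a BigQuery API): it derives the import statement for a Cloud Storage object path and checks each path component with Python's own `str.isidentifier()`, which is slightly more permissive than the ASCII-only rules above because it also accepts non-ASCII identifier characters:

```python
# Hypothetical helper: derive the import statement for a Python file
# imported from Cloud Storage, validating that each path component is a
# Python identifier, as the rules above require.
def gcs_path_to_import(object_path):
    if not object_path.endswith('.py'):
        raise ValueError('expected a .py object')
    parts = object_path[:-3].split('/')
    for part in parts:
        # Note: str.isidentifier() also accepts non-ASCII identifier
        # characters; the rules above are ASCII-only.
        if not part.isidentifier():
            raise ValueError(f'{part!r} is not a valid Python identifier')
    return 'import ' + '.'.join(parts)

# gs://my_bucket/path/to/lib1.py -> object name 'path/to/lib1.py'
print(gcs_path_to_import('path/to/lib1.py'))  # import path.to.lib1
```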
The following example shows you how to create a Python UDF that imports the `lib1.py` client library package from a Cloud Storage bucket named `my_bucket`:
Go to the BigQuery page.
In the query editor, enter the following `CREATE FUNCTION` statement:

```sql
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.myFunc(a FLOAT64, b STRING)
RETURNS STRING
LANGUAGE python
OPTIONS (
  entry_point='compute',
  runtime_version='python-3.11',
  library=['gs://my_bucket/path/to/lib1.py'])
AS r"""

import path.to.lib1 as lib1

def compute(a, b):
  # doInterestingStuff is a function defined in
  # gs://my_bucket/path/to/lib1.py
  return lib1.doInterestingStuff(a, b)

""";
```
Replace PROJECT_ID.DATASET_ID with your project ID and dataset ID.
Click Run.
Call Google Cloud or online services in Python code
A Python UDF accesses a Google Cloud service or an external service by using the Cloud resource connection service account. The connection's service account must be granted permissions to access the service. The permissions required vary depending on the service that is accessed and the APIs that are called from your Python code.
If you create a Python UDF without using a Cloud resource connection, the function is executed in an environment that blocks network access. If your UDF accesses online services, you must create the UDF with a Cloud resource connection. If you don't, the UDF is blocked from accessing the network until an internal connection timeout is reached.
The following example shows you how to access the Cloud Translation service from a Python UDF. This example has two projects: a project named `my_query_project` where you create the UDF and the Cloud resource connection, and a project named `my_translate_project` where you run Cloud Translation.
Create a Cloud resource connection
First, you create a Cloud resource connection in `my_query_project`. To create the Cloud resource connection, follow the steps on the Create a Cloud resource connection page.
After you create the connection, open it, and in the Connection info pane, copy the service account ID. You need this ID when you configure permissions for the connection. When you create a connection resource, BigQuery creates a unique system service account and associates it with the connection.
Grant access to the connection's service account
To grant the Cloud resource connection service account access to your projects, grant the service account the Service Usage Consumer role (`roles/serviceusage.serviceUsageConsumer`) in `my_query_project` and the Cloud Translation API User role (`roles/cloudtranslate.user`) in `my_translate_project`.
Go to the IAM page.
Verify that `my_query_project` is selected.

Click Grant Access.

In the New principals field, enter the Cloud resource connection's service account ID that you copied previously.
In the Select a role field, choose Service usage, and then select Service usage consumer.
Click Save.
In the project selector, choose `my_translate_project`.

Go to the IAM page.

Click Grant Access.

In the New principals field, enter the Cloud resource connection's service account ID that you copied previously.
In the Select a role field, choose Cloud translation, and then select Cloud Translation API user.
Click Save.
Create a Python UDF that calls the Cloud Translation service
In `my_query_project`, create a Python UDF that calls the Cloud Translation service using your Cloud resource connection.
Go to the BigQuery page.
Enter the following `CREATE FUNCTION` statement in the query editor:

```sql
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.translate_to_es(x STRING)
RETURNS STRING
LANGUAGE python
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  entry_point='do_translate',
  runtime_version='python-3.11',
  packages=['google-cloud-translate>=3.11', 'google-api-core'])
AS r"""

from google.api_core.retry import Retry
from google.cloud import translate

project = "my_translate_project"
translate_client = translate.TranslationServiceClient()

def do_translate(x: str) -> str:
  response = translate_client.translate_text(
      request={
          "parent": f"projects/{project}/locations/us-central1",
          "contents": [x],
          "target_language_code": "es",
          "mime_type": "text/plain",
      },
      retry=Retry(),
  )
  return response.translations[0].translated_text

""";

-- Call the UDF.
WITH text_table AS (
  SELECT "Hello" AS text
  UNION ALL
  SELECT "Good morning" AS text
  UNION ALL
  SELECT "Goodbye" AS text)
SELECT
  text,
  `PROJECT_ID.DATASET_ID`.translate_to_es(text) AS translated_text
FROM text_table;
```
Replace the following:
- `PROJECT_ID.DATASET_ID`: your project ID and dataset ID
- `REGION.CONNECTION_ID`: your connection's region and connection ID
Click Run. The output should look like the following:

```
+--------------+-----------------+
| text         | translated_text |
+--------------+-----------------+
| Hello        | Hola            |
| Good morning | Buen dia        |
| Goodbye      | Adios           |
+--------------+-----------------+
```
Supported locations
During preview, Python UDFs are supported in all BigQuery multi-region and regional locations except for the following:
- Mexico (the `northamerica-south1` region is not supported)
- Stockholm (the `europe-north2` region is not supported)
Pricing
During preview, Python UDFs are offered without any additional charges. When billing is enabled, the following will apply:
- Python UDF charges are billed using the BigQuery Services SKU.
- The charges are proportional to the amount of compute and memory consumed when the Python UDF is invoked.
- Python UDF customers are also charged for the cost of building or rebuilding the UDF container image. This charge is proportional to the resources used to build the image with customer code and dependencies.
- If Python UDFs result in external or internet network egress, you also see a Premium Tier internet egress charge from Cloud Networking.