Use BigQuery DataFrames
BigQuery DataFrames provides a Pythonic DataFrame and machine
learning (ML) API powered by the BigQuery engine.
BigQuery DataFrames is an open-source package. To install the latest version, run `pip install --upgrade bigframes`.
BigQuery DataFrames provides two libraries:

- `bigframes.pandas`, which provides a pandas-compatible API for analytics.
- `bigframes.ml`, which provides a scikit-learn-like API for machine learning (ML).
Required permissions
You must have the BigQuery Job User and BigQuery Read Session User roles to use BigQuery DataFrames.
When you are performing end user authentication in an interactive environment like a notebook, Python REPL, or the command line, BigQuery DataFrames prompts for authentication if needed. Otherwise, see how to set up application default credentials for various environments.
Additional Identity and Access Management (IAM) requirements apply for using remote functions and ML remote models.
To use BigQuery DataFrames in a BigQuery notebook, you need the following IAM roles:
Options
After installation, you need to specify the location and project in which you want to use BigQuery DataFrames.
You can define the location and project in your notebook in the following way:
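A minimal sketch (the project ID and location values below are placeholders, not defaults):

```python
import bigframes.pandas as bpd

# Placeholder values; replace with your own project ID and location.
PROJECT_ID = "your-project-id"
LOCATION = "US"

bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = LOCATION
```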
Data processing location
BigQuery DataFrames is designed for scale, which it achieves by keeping data and processing on the BigQuery service. However, you can bring data into the memory of your client machine by calling `.to_pandas()` on a DataFrame or Series object. If you choose to do this, the memory limitation of your client machine applies.
Session location
BigQuery DataFrames uses a local session object internally to manage metadata. This session is tied to a location. BigQuery DataFrames uses the `US` multi-region as the default location, but you can use `session_options.location` to set a different location. Every query in a session is executed in the location where the session was created.
BigQuery DataFrames auto-populates `bf.options.bigquery.location` with the location of the table if the user starts with `read_gbq`/`read_gbq_table`/`read_gbq_query()` and specifies a table, either directly or in a SQL statement.
If you want to reset the location of the created DataFrame or Series objects, you can close the session by executing `bigframes.pandas.close_session()`. After that, you can reuse `bigframes.pandas.options.bigquery.location` to specify another location.
`read_gbq()` requires you to specify a location if the dataset you are querying is not in the `US` multi-region. If you try to read a table from another location, you get a `NotFound` exception.
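For example, a hedged sketch of resetting the session location (the dataset path is a placeholder):

```python
import bigframes.pandas as bpd

# Close the current session so that the location can be changed.
bpd.close_session()

# Point subsequent queries at a non-US location.
bpd.options.bigquery.location = "europe-west1"

# Placeholder table path; this read now runs in europe-west1.
df = bpd.read_gbq("your-project.your_eu_dataset.your_table")
```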
Data types
BigQuery DataFrames supports the following numpy and pandas dtypes:
| BigQuery | BigQuery DataFrames and pandas |
|---|---|
| ARRAY | `pandas.ArrowDtype(pa.list_())` |
| BOOL | `pandas.BooleanDtype()` |
| DATE | `pandas.ArrowDtype(pa.date32())` |
| DATETIME | `pandas.ArrowDtype(pa.timestamp("us"))` |
| FLOAT64 | `pandas.Float64Dtype()` |
| GEOGRAPHY | `geopandas.array.GeometryDtype` (supported by `to_pandas()` only) |
| INT64 | `pandas.Int64Dtype()` |
| STRING | `pandas.StringDtype(storage="pyarrow")` |
| STRUCT | `pandas.ArrowDtype(pa.struct())` |
| TIME | `pandas.ArrowDtype(pa.time64("us"))` |
| TIMESTAMP | `pandas.ArrowDtype(pa.timestamp("us", tz="UTC"))` |
BigQuery DataFrames doesn't support the following BigQuery data types:

- NUMERIC
- BIGNUMERIC
- INTERVAL
- RANGE
- JSON

All other BigQuery data types display as the object type.
Partial ordering mode
BigQuery DataFrames provides an ordering mode feature. Set the `ordering_mode` to `partial` to generate more efficient queries.
The `partial` ordering mode contrasts with the default `strict` mode, which creates a total ordering over all rows. A total ordering makes BigQuery DataFrames more compatible with pandas by providing order-based access to rows with the `DataFrame.iloc` property. However, total ordering and the default sequential index over that ordering mean that neither column filters nor row filters reduce the number of bytes scanned, unless those filters are applied as parameters to the `read_gbq` and `read_gbq_table` functions. To provide a total ordering over all the rows in the DataFrame, BigQuery DataFrames creates a hash of all rows. This can result in a full data scan that ignores row and column filters.
Setting the `ordering_mode` property to `partial` stops BigQuery DataFrames from generating a total ordering over all the rows. The partial ordering mode also turns off features that require a total ordering over all rows, such as the `DataFrame.iloc` property. The partial ordering mode sets the `DefaultIndexKind` to a null index, instead of to a sequential index over the ordering.
When filtering a DataFrame with `ordering_mode` set to `partial`, BigQuery DataFrames no longer has to compute which rows are missing in the sequential index, so it generates faster and more efficient queries. The BigQuery DataFrames API is still pandas-like, just like the default experience with the strict ordering mode. However, the partial ordering mode differs from common pandas behavior; for example, the partial ordering mode does not perform implicit joins by index.
With both the partial and strict ordering modes, you pay for the BigQuery resources you use. However, using the partial ordering mode can reduce costs when working with large clustered tables and partitioned tables, because row filters on cluster and partition columns reduce the number of bytes processed.
Usage
To use partial ordering, set `ordering_mode` to `partial` before performing any other operation with BigQuery DataFrames, as shown in the following code sample:
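```python
import bigframes.pandas as bpd

# Must be set before any other BigQuery DataFrames operation in the session.
bpd.options.bigquery.ordering_mode = "partial"
```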
Since there is no sequential index with the partial ordering mode, unrelated BigQuery DataFrames aren't implicitly joined. Instead, you must explicitly call the `DataFrame.merge` method to join two BigQuery DataFrames that derive from different table expressions.
The `Series.unique()` and `Series.drop_duplicates()` features are not compatible with the partial ordering mode. Instead, use the `groupby` method to find unique values in this way:
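A minimal sketch, assuming a DataFrame `df` with a column named `"column"`:

```python
# Equivalent to Series.unique() or Series.drop_duplicates() under partial
# ordering: group by the column, then drop the group-size column.
unique_col = (
    df.groupby(["column"], as_index=False)
    .size()
    .drop(columns="size")
)
```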
With the partial ordering mode, the output of the `DataFrame.head(n)` and `Series.head(n)` functions isn't idempotent across all invocations. To download a small, arbitrary sample of the data, use the `DataFrame.peek()` or `Series.peek()` methods.
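For example, assuming a DataFrame `df`:

```python
# Fetch an arbitrary sample of up to 5 rows without requiring an ordering.
df.peek(5)
```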
For a detailed tutorial in which you use the `ordering_mode = "partial"` property, see the BigQuery DataFrames notebook demonstrating use of the partial ordering mode.
Troubleshooting
Because DataFrames in partial ordering mode don't always have an ordering or an index, some pandas-compatible methods can fail in the following ways.
Order required error
Some features require an ordering, such as the `DataFrame.head()` and `DataFrame.iloc` functions. For a list of features that require ordering, see the Requires ordering column in Supported pandas APIs.
When there is no ordering on the object, the operation fails with an `OrderRequiredError` message like the following:
OrderRequiredError: Op iloc requires an ordering. Use .sort_values or .sort_index to provide an ordering.
As the error message describes, you can provide an ordering using the `DataFrame.sort_values()` method to sort by a column or columns. Other operations, such as the `DataFrame.groupby()` operation, implicitly provide a total ordering over the group by keys. If the ordering can't be determined to be a fully stable total ordering over all the rows, subsequent operations might warn you with an `AmbiguousWindowWarning` message like the following:
AmbiguousWindowWarning: Window ordering may be ambiguous, this can cause unstable results.
If your workload can accommodate non-deterministic results, or you can manually verify that the ordering you provide is a total ordering, you can filter the `AmbiguousWindowWarning` message in this way:
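A minimal sketch using the standard `warnings` module:

```python
import warnings

import bigframes.exceptions

# Suppress the ambiguous-window warning for the rest of the session.
warnings.simplefilter(
    "ignore", category=bigframes.exceptions.AmbiguousWindowWarning
)
```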
Null index error
Some features require an index, such as the `DataFrame.unstack()` and `Series.interpolate()` methods. For a list of features that require an index, see the Requires index column in Supported pandas APIs.
When you use an operation that requires an index with the partial ordering mode, the operation raises a `NullIndexError` message like the following:
NullIndexError: DataFrame cannot perform interpolate as it has no index. Set an index using set_index.
As the error message describes, you can provide an index using the `DataFrame.set_index()` method to index by a column or columns. Other operations, such as the `DataFrame.groupby()` operation, implicitly provide an index over the group by keys, unless the `as_index=False` parameter is set.
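For example, assuming a DataFrame `df` with an `id` column:

```python
# Give the DataFrame an index so that index-dependent operations work.
df = df.set_index("id")
```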
Using the `bigframes.pandas` library

The `bigframes.pandas` library provides a pandas-like API that you can use to analyze and manipulate data in BigQuery. The `bigframes.pandas` API is scalable to support processing terabytes of BigQuery data, and uses the BigQuery query engine to perform calculations. For more information, see Supported pandas APIs.
The `bigframes.pandas` API provides the following capabilities:
Input and output
You can access data from a variety of sources including local CSV files,
Cloud Storage files, pandas
DataFrames, BigQuery
models, and BigQuery functions, and load it into a
BigQuery DataFrames DataFrame. You can also create
BigQuery tables from BigQuery DataFrames.
Data manipulation
You can use Python instead of SQL for your development. You can develop all BigQuery data manipulations in Python, eliminating the need to switch between languages and try to capture SQL statements as text strings. The `bigframes.pandas` API offers over 750 pandas functions.
Python ecosystem and visualizations
The `bigframes.pandas` API is a gateway to the full Python ecosystem of tools. The API supports advanced statistical operations, and you can visualize the aggregations generated from BigQuery DataFrames. You can also switch from a BigQuery DataFrames DataFrame to a pandas DataFrame with built-in sampling operations.
Custom Python functions
BigQuery DataFrames gives you the ability to turn your custom scalar functions into BigQuery remote functions. Creating a remote function in BigQuery DataFrames creates the following:

1. A Cloud Run function.
2. A BigQuery connection. By default, a connection named `bigframes-default-connection` is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Cloud Run Invoker (`roles/run.invoker`) IAM role.
3. A BigQuery remote function that uses the cloud function (1) through the BigQuery connection (2).
For an example, see Create a remote function.
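As an illustration, a hedged sketch of the decorator-based flow (the function body and bucket names are made up, and depending on your bigframes version, `remote_function` may require additional arguments such as a connection name or service account):

```python
import bigframes.pandas as bpd

# Deploys a Cloud Run function and wraps it as a BigQuery remote function.
@bpd.remote_function()
def get_bucket(num: float) -> str:
    # Toy bucketing logic for illustration only.
    return "at_or_above_4000" if num >= 4000 else "below_4000"

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Apply the remote function to a column; runs in BigQuery, not locally.
df = df.assign(bucket=df["body_mass_g"].apply(get_bucket))
df.peek()
```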
BigQuery connections are created in the same location as the BigQuery DataFrames session, using the name you provide in the custom function definition. To view and manage connections, do the following:

1. Select the project in which you created the remote function.
2. In the Explorer pane, expand that project and then expand External connections.
BigQuery remote functions are created in the dataset you specify, or in an anonymous dataset, a type of hidden dataset. If you don't set a name for a remote function during its creation, BigQuery DataFrames applies a default name that begins with the `bigframes` prefix. To view and manage remote functions created in a user-specified dataset, do the following:

1. Select the project in which you created the remote function.
2. In the Explorer pane, expand that project, expand the dataset in which you created the remote function, and then expand Routines.
To view and manage Cloud Run functions, use the Functions page and use the project picker to select the project in which you created the function. To identify functions created by BigQuery DataFrames, look for the functions whose names are prefixed by `bigframes`.
You can clean up unnamed BigQuery remote functions and their associated Cloud Run functions in the following ways:

- For a BigQuery DataFrames session, use `session.close()`.
- For the default BigQuery DataFrames session, use `bigframes.pandas.close_session()`.
- For a past session with `session_id`, use `bigframes.pandas.clean_up_by_session_id(session_id)`.
Requirements
To use BigQuery DataFrames remote functions, you must enable the following APIs:

- The BigQuery API (`bigquery.googleapis.com`)
- The BigQuery Connection API (`bigqueryconnection.googleapis.com`)
- The Cloud Functions API (`cloudfunctions.googleapis.com`)
- The Cloud Run Admin API (`run.googleapis.com`)
- The Artifact Registry API (`artifactregistry.googleapis.com`)
- The Cloud Build API (`cloudbuild.googleapis.com`)
- The Compute Engine API (`compute.googleapis.com`)
- The Cloud Resource Manager API (`cloudresourcemanager.googleapis.com`)

You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (either default or pre-configured) is used as-is without checking for the existence of the connection or verifying its permissions.
To use BigQuery DataFrames remote functions, you must be granted the following IAM roles in the project:

- BigQuery Data Editor (`roles/bigquery.dataEditor`)
- BigQuery Connection Admin (`roles/bigquery.connectionAdmin`)
- Cloud Functions Developer (`roles/cloudfunctions.developer`)
- Service Account User (`roles/iam.serviceAccountUser`)
- Storage Object Viewer (`roles/storage.objectViewer`)
- Project IAM Admin (`roles/resourcemanager.projectIamAdmin`) if using the default BigQuery connection, or Browser (`roles/browser`) if using a pre-configured connection. You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you are using a pre-configured connection and skipping the connection check, make sure the connection is created in the right location and that its service account has the Cloud Run Invoker (`roles/run.invoker`) role on the project.
Limitations
- Remote functions take about 90 seconds to become available when you first create them.
- Trivial changes in the notebook, such as inserting a new cell or renaming a variable, might cause the remote function to be re-created, even if these changes are unrelated to the remote function code.
- BigQuery DataFrames does not differentiate any personal data you include in the remote function code. The remote function code is serialized as an opaque box to deploy it as a Cloud Run function.
- The Cloud Run functions (2nd gen), BigQuery connections, and BigQuery remote functions created by BigQuery DataFrames persist in Google Cloud. If you don't want to keep these resources, you must delete them separately using an appropriate Cloud Run functions or BigQuery interface.
- A project can have up to 1,000 Cloud Run functions (2nd gen) at a time. See Cloud Run functions quotas for all the limits.
`bigframes.pandas` examples

The following examples show common ways of using `bigframes.pandas`.
Load data from a BigQuery table or query
You can create a DataFrame from a BigQuery table or query in the following way:
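A minimal sketch using a public dataset:

```python
import bigframes.pandas as bpd

# Create a DataFrame from a table.
df_from_table = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Create a DataFrame from a query.
df_from_query = bpd.read_gbq(
    """
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
    WHERE body_mass_g > 3000
    """
)
```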
Load data from a CSV file
You can create a DataFrame from a local or Cloud Storage CSV file in the following way:
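For example (the file paths are placeholders):

```python
import bigframes.pandas as bpd

# From a local file:
df_local = bpd.read_csv("penguins.csv")

# From a Cloud Storage object:
df_gcs = bpd.read_csv("gs://your-bucket/penguins.csv")
```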
Inspect and manipulate data
You can use `bigframes.pandas` to perform data inspection and calculation operations. The following code sample shows using `bigframes.pandas` to inspect the `body_mass_g` column, calculate the mean `body_mass`, and calculate the mean `body_mass` by `species`:
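A hedged sketch against the public penguins dataset:

```python
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Inspect one of the columns.
df["body_mass_g"].head(10)

# Compute the mean of this column.
average_body_mass = df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Calculate the mean body_mass_g by species.
df["body_mass_g"].groupby(by=df["species"]).mean()
```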
Using the `bigframes.bigquery` library

The `bigframes.bigquery` library provides many BigQuery SQL functions which might not have a pandas equivalent.

The following examples show common ways of using `bigframes.bigquery`.
Create a struct series
The `bigframes.bigquery.struct()` function creates a new struct series with subfields for each column in a DataFrame.
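For example (column names follow the public penguins dataset):

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Create a struct series with one subfield per selected column.
lengths = bbq.struct(
    df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
)
lengths.peek()
```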
Use the SQL scalar function
Use the `bigframes.bigquery.sql_scalar()` function to access arbitrary SQL syntax representing a single column expression.
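A minimal sketch; `{0}` refers to the first series passed in the `columns` list:

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Apply an arbitrary scalar SQL expression to a column.
capped = bbq.sql_scalar(
    "LEAST({0}, 18)",
    columns=[df["culmen_depth_mm"]],
)
capped.peek()
```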
Using the `bigframes.ml` library
The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.
ML locations
`bigframes.ml` supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.
Preprocess data
Create transformers to prepare data for use in estimators (models) by using the `bigframes.ml.preprocessing` module and the `bigframes.ml.compose` module. BigQuery DataFrames offers the following transformations, with a short example after this list:

- Use the KBinsDiscretizer class in the `bigframes.ml.preprocessing` module to bin continuous data into intervals.
- Use the LabelEncoder class in the `bigframes.ml.preprocessing` module to normalize the target labels as integer values.
- Use the MaxAbsScaler class in the `bigframes.ml.preprocessing` module to scale each feature to the range `[-1, 1]` by its maximum absolute value.
- Use the MinMaxScaler class in the `bigframes.ml.preprocessing` module to standardize features by scaling each feature to the range `[0, 1]`.
- Use the StandardScaler class in the `bigframes.ml.preprocessing` module to standardize features by removing the mean and scaling to unit variance.
- Use the OneHotEncoder class in the `bigframes.ml.preprocessing` module to transform categorical values into numeric format.
- Use the ColumnTransformer class in the `bigframes.ml.compose` module to apply transformers to DataFrame columns.
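A hedged sketch of one transformer (column names follow the public penguins dataset):

```python
from bigframes.ml.preprocessing import StandardScaler

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Standardize two numeric features to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["body_mass_g", "flipper_length_mm"]])
scaled.peek()
```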
Train models
Create estimators to train models in BigQuery DataFrames.
Clustering models
Create estimators for clustering models by using the bigframes.ml.cluster module.
- Use the KMeans class to create K-means clustering models. Use these models for data segmentation, for example, identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.

The following code sample shows using the `bigframes.ml.cluster` KMeans class to create a K-means clustering model for data segmentation:
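A hedged sketch using the public penguins dataset:

```python
from bigframes.ml.cluster import KMeans

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Train a K-means model with four clusters on the numeric features.
features = df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
].dropna()
model = KMeans(n_clusters=4)
model.fit(features)

# Assign each row to a cluster.
predictions = model.predict(features)
predictions.peek()
```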
Decomposition models
Create estimators for decomposition models by using the bigframes.ml.decomposition module.
- Use the PCA class to create principal component analysis (PCA) models. Use these models for computing principal components and using them to perform a change of basis on the data. This provides dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
Ensemble models
Create estimators for ensemble models by using the bigframes.ml.ensemble module.
- Use the RandomForestClassifier class to create random forest classifier models. Use these models for constructing multiple learning method decision trees for classification.
- Use the RandomForestRegressor class to create random forest regression models. Use these models for constructing multiple learning method decision trees for regression.
- Use the XGBClassifier class to create gradient boosted tree classifier models. Use these models for additively constructing multiple learning method decision trees for classification.
- Use the XGBRegressor class to create gradient boosted tree regression models. Use these models for additively constructing multiple learning method decision trees for regression.
Forecasting models
Create estimators for forecasting models by using the bigframes.ml.forecasting module.
- Use the ARIMAPlus class to create time series forecasting models.
Imported models
Create estimators for imported models by using the bigframes.ml.imported module.
- Use the ONNXModel class to import Open Neural Network Exchange (ONNX) models.
- Use the TensorFlowModel class to import TensorFlow models.
- Use the XGBoostModel class to import XGBoost models.
Linear models
Create estimators for linear models by using the bigframes.ml.linear_model module.
- Use the LinearRegression class to create linear regression models. Use these models for forecasting, for example, forecasting the sales of an item on a given day.
- Use the LogisticRegression class to create logistic regression models. Use these models for the classification of two or more possible values, such as whether an input is `low-value`, `medium-value`, or `high-value`.
The following code sample shows using `bigframes.ml` to do the following:

- Load data from BigQuery
- Clean and prepare training data
- Create and apply a `bigframes.ml.LinearRegression` regression model
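A hedged sketch following those steps with the public penguins dataset:

```python
from bigframes.ml.linear_model import LinearRegression

import bigframes.pandas as bpd

# Load data from BigQuery.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Clean and prepare training data: drop rows with missing values.
training_data = df.dropna()

# Specify the feature columns and the label column.
X = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
y = training_data[["body_mass_g"]]

# Create and train the linear regression model, then evaluate it.
model = LinearRegression()
model.fit(X, y)
model.score(X, y)
```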
Large language models
Create estimators for LLMs by using the bigframes.ml.llm module.
- Use the GeminiTextGenerator class to create Gemini text generator models. Use these models for text generation tasks.
- Use the PaLM2TextGenerator class to create PaLM2 text generator models. Use these models for text generation tasks.
- Use the PaLM2TextEmbeddingGenerator class to create PaLM2 text embedding generator models. Use these models for text embedding generation tasks.
The following code sample shows using the `bigframes.ml.llm` GeminiTextGenerator class to create a Gemini model for code generation:
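A hedged sketch; constructor arguments such as a specific model name or connection are omitted here and may be required in your environment:

```python
from bigframes.ml.llm import GeminiTextGenerator

import bigframes.pandas as bpd

# Create the Gemini text generator model with default settings.
model = GeminiTextGenerator()

# Prompts are passed in as a DataFrame column named "prompt".
df = bpd.DataFrame(
    {"prompt": ["Write a Python function that reverses a string."]}
)

# Generate code from the prompt.
result = model.predict(df)
result.peek()
```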
Remote models
To use BigQuery DataFrames ML remote models (`bigframes.ml.remote` or `bigframes.ml.llm`), you must enable the following APIs:

- The BigQuery API (`bigquery.googleapis.com`)
- The BigQuery Connection API (`bigqueryconnection.googleapis.com`)
- The Vertex AI API (`aiplatform.googleapis.com`)
- The Cloud Resource Manager API (`cloudresourcemanager.googleapis.com`)

You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (either default or pre-configured) is used as-is without checking for the existence of the connection or verifying its permissions.
You must also be granted the following IAM roles in the project:

- BigQuery Connection Admin (`roles/bigquery.connectionAdmin`)
- Project IAM Admin (`roles/resourcemanager.projectIamAdmin`) if using the default BigQuery connection, or Browser (`roles/browser`) if using a pre-configured connection. You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you are using a pre-configured connection and skipping the connection check, make sure the connection is created in the right location and that its service account has the Vertex AI User (`roles/aiplatform.user`) role on the project.
Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named `bigframes-default-connection` is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User (`roles/aiplatform.user`) IAM role.
Create pipelines
Create ML pipelines by using the `bigframes.ml.pipeline` module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.

Use the Pipeline class to create a pipeline of transforms with a final estimator.
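A hedged sketch chaining a scaler and a linear regression (column names follow the public penguins dataset):

```python
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import StandardScaler

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Assemble preprocessing and the estimator into a single pipeline.
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("linreg", LinearRegression()),
    ]
)

X = df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = df[["body_mass_g"]]

# Fitting the pipeline fits the scaler, then the model on the scaled data.
pipeline.fit(X, y)
```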