Use BigQuery DataFrames
BigQuery DataFrames provides a Pythonic DataFrame and machine
learning (ML) API powered by the BigQuery engine.
BigQuery DataFrames is an open-source package. To install the latest version, run `pip install --upgrade bigframes`.
BigQuery DataFrames provides two libraries:

- `bigframes.pandas`, which provides a pandas-compatible API for analytics.
- `bigframes.ml`, which provides a scikit-learn-like API for machine learning (ML).
Required permissions
You must have the BigQuery Job User and BigQuery Read Session User roles to use BigQuery DataFrames.
When you are performing end user authentication in an interactive environment like a notebook, Python REPL, or the command line, BigQuery DataFrames prompts for authentication if needed. Otherwise, see how to set up application default credentials for various environments.
Additional Identity and Access Management (IAM) requirements apply for using remote functions and ML remote models.
To use BigQuery DataFrames in a BigQuery notebook, you need the following IAM roles:
Options
After installation, you need to specify the location and project in which you want to use BigQuery DataFrames.
You can define the location and project in your notebook in the following way:
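A minimal sketch (the project ID and location values below are placeholders, not defaults):

```python
import bigframes.pandas as bpd

# Placeholder values; replace with your own project ID and location.
PROJECT_ID = "your-project-id"
LOCATION = "US"

bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = LOCATION
```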
Data processing location
BigQuery DataFrames is designed for scale, which it achieves by keeping data and processing on the BigQuery service. However, you can bring data into the memory of your client machine by calling `.to_pandas()` on a DataFrame or Series object. If you choose to do this, the memory limitation of your client machine applies.
Session location
BigQuery DataFrames uses a local session object internally to manage metadata. This session is tied to a location. BigQuery DataFrames uses the `US` multi-region as the default location, but you can use `session_options.location` to set a different location. Every query in a session is executed in the location where the session was created.
BigQuery DataFrames auto-populates `bf.options.bigquery.location` with the location of the table if the user starts with `read_gbq`/`read_gbq_table`/`read_gbq_query()` and specifies a table, either directly or in a SQL statement.
If you want to reset the location of the created DataFrame or Series objects, you can close the session by executing `bigframes.pandas.close_session()`. After that, you can reuse `bigframes.pandas.options.bigquery.location` to specify another location.
`read_gbq()` requires you to specify a location if the dataset you are querying is not in the `US` multi-region. If you try to read a table from another location, you get a `NotFound` exception.
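For example, a hedged sketch of resetting the session location (the dataset path is a placeholder):

```python
import bigframes.pandas as bpd

# Close the current session so that the location can be changed.
bpd.close_session()

# Point subsequent queries at a non-US location.
bpd.options.bigquery.location = "europe-west1"

# Placeholder table path; this read now runs in europe-west1.
df = bpd.read_gbq("your-project.your_eu_dataset.your_table")
```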
Data types
BigQuery DataFrames supports the following numpy and pandas dtypes:
| BigQuery | BigQuery DataFrames and pandas |
|---|---|
| ARRAY | `pandas.ArrowDtype(pa.list_())` |
| BOOL | `pandas.BooleanDtype()` |
| DATE | `pandas.ArrowDtype(pa.date32())` |
| DATETIME | `pandas.ArrowDtype(pa.timestamp("us"))` |
| FLOAT64 | `pandas.Float64Dtype()` |
| GEOGRAPHY | `geopandas.array.GeometryDtype` (supported by `to_pandas()` only) |
| INT64 | `pandas.Int64Dtype()` |
| STRING | `pandas.StringDtype(storage="pyarrow")` |
| STRUCT | `pandas.ArrowDtype(pa.struct())` |
| TIME | `pandas.ArrowDtype(pa.time64("us"))` |
| TIMESTAMP | `pandas.ArrowDtype(pa.timestamp("us", tz="UTC"))` |
BigQuery DataFrames doesn't support the following BigQuery data types:

- NUMERIC
- BIGNUMERIC
- INTERVAL
- RANGE
- JSON

All other BigQuery data types display as the object type.
Partial ordering mode
BigQuery DataFrames provides an ordering mode feature. Set the `ordering_mode` to `partial` to generate more efficient queries.
The `partial` ordering mode contrasts with the default `strict` mode, which creates a total ordering over all rows. A total ordering makes BigQuery DataFrames more compatible with pandas by providing order-based access to rows with the `DataFrame.iloc` property. However, total ordering and the default sequential index over that ordering mean that neither column filters nor row filters reduce the number of bytes scanned, unless those filters are applied as parameters to the `read_gbq` and `read_gbq_table` functions. To provide a total ordering over all the rows in the DataFrame, BigQuery DataFrames creates a hash of all rows. This can result in a full data scan that ignores row and column filters.
Setting the `ordering_mode` property to `partial` stops BigQuery DataFrames from generating a total ordering over all the rows. The partial ordering mode also turns off features that require a total ordering over all rows, such as the `DataFrame.iloc` property. The partial ordering mode sets the `DefaultIndexKind` to a null index, instead of to a sequential index over the ordering.
When filtering a DataFrame with `ordering_mode` set to `partial`, BigQuery DataFrames no longer has to compute which rows are missing in the sequential index, so it generates faster and more efficient queries. The BigQuery DataFrames API is still pandas-like, just like the default experience with the strict ordering mode. However, the partial ordering mode differs from common pandas behavior; for example, the partial ordering mode does not perform implicit joins by index.
With both the partial and strict ordering modes, you pay for the BigQuery resources you use. However, using the partial ordering mode can reduce costs when working with large clustered tables and partitioned tables, because row filters on cluster and partition columns reduce the number of bytes processed.
Usage
To use partial ordering, set `ordering_mode` to `partial` before performing any other operation with BigQuery DataFrames, as shown in the following code sample:
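```python
import bigframes.pandas as bpd

# Must be set before any other BigQuery DataFrames operation in the session.
bpd.options.bigquery.ordering_mode = "partial"
```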
Since there is no sequential index with the partial ordering mode, unrelated BigQuery DataFrames aren't implicitly joined. Instead, you must explicitly call the `DataFrame.merge` method to join two BigQuery DataFrames that derive from different table expressions.
The `Series.unique()` and `Series.drop_duplicates()` features are not compatible with the partial ordering mode. Instead, use the `groupby` method to find unique values in this way:
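A minimal sketch, assuming a DataFrame `df` with a column named `"column"`:

```python
# Equivalent to Series.unique() or Series.drop_duplicates() under partial
# ordering: group by the column, then drop the group-size column.
unique_col = (
    df.groupby(["column"], as_index=False)
    .size()
    .drop(columns="size")
)
```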
With the partial ordering mode, the output of the `DataFrame.head(n)` and `Series.head(n)` functions isn't idempotent across all invocations. To download a small, arbitrary sample of the data, use the `DataFrame.peek()` or `Series.peek()` methods.
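For example, assuming a DataFrame `df`:

```python
# Fetch an arbitrary sample of up to 5 rows without requiring an ordering.
df.peek(5)
```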
For a detailed tutorial in which you use the `ordering_mode = "partial"` property, see the BigQuery DataFrames notebook demonstrating use of the partial ordering mode.
Troubleshooting
Because DataFrames in partial ordering mode don't always have an ordering or an index, some pandas-compatible methods can fail in the following ways.
Order required error
Some features require an ordering, such as the `DataFrame.head()` and `DataFrame.iloc` functions. For a list of features that require ordering, see the Requires ordering column in Supported pandas APIs.
When there is no ordering on the object, the operation fails with an `OrderRequiredError` message like the following:
OrderRequiredError: Op iloc requires an ordering. Use .sort_values or .sort_index to provide an ordering.
As the error message describes, you can provide an ordering using the `DataFrame.sort_values()` method to sort by a column or columns. Other operations, such as the `DataFrame.groupby()` operation, implicitly provide a total ordering over the group by keys. If the ordering can't be determined to be a fully stable total ordering over all the rows, subsequent operations might warn you with an `AmbiguousWindowWarning` message like the following:
AmbiguousWindowWarning: Window ordering may be ambiguous, this can cause unstable results.
If your workload can accommodate non-deterministic results, or you can manually verify that the ordering you provide is a total ordering, you can filter the `AmbiguousWindowWarning` message in this way:
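A minimal sketch using the standard `warnings` module:

```python
import warnings

import bigframes.exceptions

# Suppress the ambiguous-window warning for the rest of the session.
warnings.simplefilter(
    "ignore", category=bigframes.exceptions.AmbiguousWindowWarning
)
```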
Null index error
Some features require an index, such as the `DataFrame.unstack()` and `Series.interpolate()` methods. For a list of features that require an index, see the Requires index column in Supported pandas APIs.
When you use an operation that requires an index with the partial ordering mode, the operation raises a `NullIndexError` message like the following:
NullIndexError: DataFrame cannot perform interpolate as it has no index. Set an index using set_index.
As the error message describes, you can provide an index using the `DataFrame.set_index()` method to index by a column or columns. Other operations, such as the `DataFrame.groupby()` operation, implicitly provide an index over the group by keys, unless the `as_index=False` parameter is set.
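For example, assuming a DataFrame `df` with an `id` column:

```python
# Give the DataFrame an index so that index-dependent operations work.
df = df.set_index("id")
```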
Using the `bigframes.pandas` library

The `bigframes.pandas` library provides a pandas-like API that you can use to analyze and manipulate data in BigQuery. The `bigframes.pandas` API is scalable to support processing terabytes of BigQuery data, and uses the BigQuery query engine to perform calculations. For more information, see Supported pandas APIs.
The `bigframes.pandas` API provides the following capabilities:
Input and output
You can access data from a variety of sources including local CSV files,
Cloud Storage files, pandas
DataFrames, BigQuery
models, and BigQuery functions, and load it into a
BigQuery DataFrames DataFrame. You can also create
BigQuery tables from BigQuery DataFrames.
Data manipulation
You can use Python instead of SQL for your development. You can develop all BigQuery data manipulations in Python, eliminating the need to switch between languages and try to capture SQL statements as text strings. The `bigframes.pandas` API offers over 750 pandas functions.
Python ecosystem and visualizations
The `bigframes.pandas` API is a gateway to the full Python ecosystem of tools. The API supports advanced statistical operations, and you can visualize the aggregations generated from BigQuery DataFrames. You can also switch from a BigQuery DataFrames DataFrame to a pandas DataFrame with built-in sampling operations.
Custom Python functions
BigQuery DataFrames gives you the ability to turn your custom scalar functions into BigQuery remote functions. Creating a remote function in BigQuery DataFrames creates the following:

1. A Cloud Run function.
2. A BigQuery connection. By default, a connection named `bigframes-default-connection` is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Cloud Run Invoker (`roles/run.invoker`) IAM role.
3. A BigQuery remote function that uses the cloud function (1) through the BigQuery connection (2).
For an example, see Create a remote function.
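As an illustration, a hedged sketch of the decorator-based flow (the function body and bucket names are made up, and depending on your bigframes version, `remote_function` may require additional arguments such as a connection name or service account):

```python
import bigframes.pandas as bpd

# Deploys a Cloud Run function and wraps it as a BigQuery remote function.
@bpd.remote_function()
def get_bucket(num: float) -> str:
    # Toy bucketing logic for illustration only.
    return "at_or_above_4000" if num >= 4000 else "below_4000"

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Apply the remote function to a column; runs in BigQuery, not locally.
df = df.assign(bucket=df["body_mass_g"].apply(get_bucket))
df.peek()
```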
BigQuery connections are created in the same location as the BigQuery DataFrames session, using the name you provide in the custom function definition. To view and manage connections, do the following:

1. Select the project in which you created the remote function.
2. In the Explorer pane, expand that project and then expand External connections.
BigQuery remote functions are created in the dataset you specify, or in an anonymous dataset, a type of hidden dataset. If you don't set a name for a remote function during its creation, BigQuery DataFrames applies a default name that begins with the `bigframes` prefix. To view and manage remote functions created in a user-specified dataset, do the following:

1. Select the project in which you created the remote function.
2. In the Explorer pane, expand that project, expand the dataset in which you created the remote function, and then expand Routines.
To view and manage Cloud Run functions, use the Functions page and use the project picker to select the project in which you created the function. To identify functions created by BigQuery DataFrames, look for the functions whose names are prefixed by `bigframes`.
You can clean up unnamed BigQuery remote functions and their associated Cloud Run functions in the following ways:

- For a BigQuery DataFrames session, use `session.close()`.
- For the default BigQuery DataFrames session, use `bigframes.pandas.close_session()`.
- For a past session with `session_id`, use `bigframes.pandas.clean_up_by_session_id(session_id)`.
Requirements
To use BigQuery DataFrames remote functions, you must enable the following APIs:

- The BigQuery API (`bigquery.googleapis.com`)
- The BigQuery Connection API (`bigqueryconnection.googleapis.com`)
- The Cloud Functions API (`cloudfunctions.googleapis.com`)
- The Cloud Run Admin API (`run.googleapis.com`)
- The Artifact Registry API (`artifactregistry.googleapis.com`)
- The Cloud Build API (`cloudbuild.googleapis.com`)
- The Compute Engine API (`compute.googleapis.com`)
- The Cloud Resource Manager API (`cloudresourcemanager.googleapis.com`)

You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (either default or pre-configured) is used as-is without checking for the existence of the connection or verifying its permissions.
To use BigQuery DataFrames remote functions, you must be granted the following IAM roles in the project:

- BigQuery Data Editor (`roles/bigquery.dataEditor`)
- BigQuery Connection Admin (`roles/bigquery.connectionAdmin`)
- Cloud Functions Developer (`roles/cloudfunctions.developer`)
- Service Account User (`roles/iam.serviceAccountUser`)
- Storage Object Viewer (`roles/storage.objectViewer`)
- Project IAM Admin (`roles/resourcemanager.projectIamAdmin`) if using the default BigQuery connection, or Browser (`roles/browser`) if using a pre-configured connection. You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you are using a pre-configured connection and skipping the connection check, make sure the connection is created in the right location and that its service account has the Cloud Run Invoker (`roles/run.invoker`) role on the project.
Limitations
- Remote functions take about 90 seconds to become available when you first create them.
- Trivial changes in the notebook, such as inserting a new cell or renaming a variable, might cause the remote function to be re-created, even if these changes are unrelated to the remote function code.
- BigQuery DataFrames does not differentiate any personal data you include in the remote function code. The remote function code is serialized as an opaque box to deploy it as a Cloud Run function.
- The Cloud Run functions (2nd gen), BigQuery connections, and BigQuery remote functions created by BigQuery DataFrames persist in Google Cloud. If you don't want to keep these resources, you must delete them separately using an appropriate Cloud Run functions or BigQuery interface.
- A project can have up to 1,000 Cloud Run functions (2nd gen) at a time. See Cloud Run functions quotas for all the limits.
`bigframes.pandas` examples

The following examples show common ways of using `bigframes.pandas`.
Load data from a BigQuery table or query
You can create a DataFrame from a BigQuery table or query in the following way:
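A minimal sketch using a public dataset:

```python
import bigframes.pandas as bpd

# Create a DataFrame from a table.
df_from_table = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Create a DataFrame from a query.
df_from_query = bpd.read_gbq(
    """
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
    WHERE body_mass_g > 3000
    """
)
```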
Load data from a CSV file
You can create a DataFrame from a local or Cloud Storage CSV file in the following way:
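For example (the file paths are placeholders):

```python
import bigframes.pandas as bpd

# From a local file:
df_local = bpd.read_csv("penguins.csv")

# From a Cloud Storage object:
df_gcs = bpd.read_csv("gs://your-bucket/penguins.csv")
```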
Inspect and manipulate data
You can use `bigframes.pandas` to perform data inspection and calculation operations. The following code sample shows using `bigframes.pandas` to inspect the `body_mass_g` column, calculate the mean `body_mass`, and calculate the mean `body_mass` by `species`:
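A hedged sketch against the public penguins dataset:

```python
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Inspect one of the columns.
df["body_mass_g"].head(10)

# Compute the mean of this column.
average_body_mass = df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Calculate the mean body_mass_g by species.
df["body_mass_g"].groupby(by=df["species"]).mean()
```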
Using the `bigframes.bigquery` library

The `bigframes.bigquery` library provides many BigQuery SQL functions which might not have a pandas equivalent.

The following examples show common ways of using `bigframes.bigquery`.
Create a struct series
The `bigframes.bigquery.struct()` function creates a new struct series with subfields for each column in a DataFrame.
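For example (column names follow the public penguins dataset):

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Create a struct series with one subfield per selected column.
lengths = bbq.struct(
    df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
)
lengths.peek()
```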
Use the SQL scalar function
Use the `bigframes.bigquery.sql_scalar()` function to access arbitrary SQL syntax representing a single column expression.
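A minimal sketch; `{0}` refers to the first series passed in the `columns` list:

```python
import bigframes.bigquery as bbq
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Apply an arbitrary scalar SQL expression to a column.
capped = bbq.sql_scalar(
    "LEAST({0}, 18)",
    columns=[df["culmen_depth_mm"]],
)
capped.peek()
```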
Using the `bigframes.ml` library
The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.
ML locations
`bigframes.ml` supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.
Preprocess data
Create transformers to prepare data for use in estimators (models) by using the `bigframes.ml.preprocessing` module and the `bigframes.ml.compose` module. BigQuery DataFrames offers the following transformations, with a short example after this list:

- Use the KBinsDiscretizer class in the `bigframes.ml.preprocessing` module to bin continuous data into intervals.
- Use the LabelEncoder class in the `bigframes.ml.preprocessing` module to normalize the target labels as integer values.
- Use the MaxAbsScaler class in the `bigframes.ml.preprocessing` module to scale each feature to the range `[-1, 1]` by its maximum absolute value.
- Use the MinMaxScaler class in the `bigframes.ml.preprocessing` module to standardize features by scaling each feature to the range `[0, 1]`.
- Use the StandardScaler class in the `bigframes.ml.preprocessing` module to standardize features by removing the mean and scaling to unit variance.
- Use the OneHotEncoder class in the `bigframes.ml.preprocessing` module to transform categorical values into numeric format.
- Use the ColumnTransformer class in the `bigframes.ml.compose` module to apply transformers to DataFrame columns.
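A hedged sketch of one transformer (column names follow the public penguins dataset):

```python
from bigframes.ml.preprocessing import StandardScaler

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Standardize two numeric features to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["body_mass_g", "flipper_length_mm"]])
scaled.peek()
```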
Train models
Create estimators to train models in BigQuery DataFrames.
Clustering models
Create estimators for clustering models by using the bigframes.ml.cluster module.
- Use the KMeans class to create K-means clustering models. Use these models for data segmentation, for example, identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.

The following code sample shows using the `bigframes.ml.cluster` KMeans class to create a K-means clustering model for data segmentation:
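A hedged sketch using the public penguins dataset:

```python
from bigframes.ml.cluster import KMeans

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Train a K-means model with four clusters on the numeric features.
features = df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
].dropna()
model = KMeans(n_clusters=4)
model.fit(features)

# Assign each row to a cluster.
predictions = model.predict(features)
predictions.peek()
```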
Decomposition models
Create estimators for decomposition models by using the bigframes.ml.decomposition module.
- Use the PCA class to create principal component analysis (PCA) models. Use these models for computing principal components and using them to perform a change of basis on the data. This provides dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
Ensemble models
Create estimators for ensemble models by using the bigframes.ml.ensemble module.
- Use the RandomForestClassifier class to create random forest classifier models. Use these models for constructing multiple learning method decision trees for classification.
- Use the RandomForestRegressor class to create random forest regression models. Use these models for constructing multiple learning method decision trees for regression.
- Use the XGBClassifier class to create gradient boosted tree classifier models. Use these models for additively constructing multiple learning method decision trees for classification.
- Use the XGBRegressor class to create gradient boosted tree regression models. Use these models for additively constructing multiple learning method decision trees for regression.
Forecasting models
Create estimators for forecasting models by using the bigframes.ml.forecasting module.
- Use the ARIMAPlus class to create time series forecasting models.
Imported models
Create estimators for imported models by using the bigframes.ml.imported module.
- Use the ONNXModel class to import Open Neural Network Exchange (ONNX) models.
- Use the TensorFlowModel class to import TensorFlow models.
- Use the XGBoostModel class to import XGBoost models.
Linear models
Create estimators for linear models by using the bigframes.ml.linear_model module.
- Use the LinearRegression class to create linear regression models. Use these models for forecasting, for example, forecasting the sales of an item on a given day.
- Use the LogisticRegression class to create logistic regression models. Use these models for the classification of two or more possible values, such as whether an input is `low-value`, `medium-value`, or `high-value`.
The following code sample shows using `bigframes.ml` to do the following:

- Load data from BigQuery
- Clean and prepare training data
- Create and apply a `bigframes.ml.LinearRegression` regression model
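A hedged sketch following those steps with the public penguins dataset:

```python
from bigframes.ml.linear_model import LinearRegression

import bigframes.pandas as bpd

# Load data from BigQuery.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Clean and prepare training data: drop rows with missing values.
training_data = df.dropna()

# Specify the feature columns and the label column.
X = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
y = training_data[["body_mass_g"]]

# Create and train the linear regression model, then evaluate it.
model = LinearRegression()
model.fit(X, y)
model.score(X, y)
```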
Large language models
Create estimators for LLMs by using the bigframes.ml.llm module.
- Use the GeminiTextGenerator class to create Gemini text generator models. Use these models for text generation tasks.
- Use the PaLM2TextGenerator class to create PaLM2 text generator models. Use these models for text generation tasks.
- Use the PaLM2TextEmbeddingGenerator class to create PaLM2 text embedding generator models. Use these models for text embedding generation tasks.
The following code sample shows using the `bigframes.ml.llm` GeminiTextGenerator class to create a Gemini model for code generation:
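A hedged sketch; constructor arguments such as a specific model name or connection are omitted here and may be required in your environment:

```python
from bigframes.ml.llm import GeminiTextGenerator

import bigframes.pandas as bpd

# Create the Gemini text generator model with default settings.
model = GeminiTextGenerator()

# Prompts are passed in as a DataFrame column named "prompt".
df = bpd.DataFrame(
    {"prompt": ["Write a Python function that reverses a string."]}
)

# Generate code from the prompt.
result = model.predict(df)
result.peek()
```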
Remote models
To use BigQuery DataFrames ML remote models (`bigframes.ml.remote` or `bigframes.ml.llm`), you must enable the following APIs:

- The BigQuery API (`bigquery.googleapis.com`)
- The BigQuery Connection API (`bigqueryconnection.googleapis.com`)
- The Vertex AI API (`aiplatform.googleapis.com`)
- The Cloud Resource Manager API (`cloudresourcemanager.googleapis.com`)

You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (either default or pre-configured) is used as-is without checking for the existence of the connection or verifying its permissions.
You must also be granted the following IAM roles in the project:

- BigQuery Connection Admin (`roles/bigquery.connectionAdmin`)
- Project IAM Admin (`roles/resourcemanager.projectIamAdmin`) if using the default BigQuery connection, or Browser (`roles/browser`) if using a pre-configured connection. You can avoid this requirement by setting the `bigframes.pandas.options.bigquery.skip_bq_connection_check` option to `True`, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you are using a pre-configured connection and skipping the connection check, make sure the connection is created in the right location and that its service account has the Vertex AI User (`roles/aiplatform.user`) role on the project.
Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named `bigframes-default-connection` is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User (`roles/aiplatform.user`) IAM role.
Create pipelines
Create ML pipelines by using the `bigframes.ml.pipeline` module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.

Use the Pipeline class to create a pipeline of transforms with a final estimator.
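A hedged sketch chaining a scaler and a linear regression (column names follow the public penguins dataset):

```python
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import StandardScaler

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Assemble preprocessing and the estimator into a single pipeline.
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("linreg", LinearRegression()),
    ]
)

X = df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = df[["body_mass_g"]]

# Fitting the pipeline fits the scaler, then the model on the scaled data.
pipeline.fit(X, y)
```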