Best practices for implementing machine learning on Google Cloud

Last reviewed 2022-12-15 UTC

This document introduces best practices for implementing machine learning (ML) on Google Cloud, with a focus on custom-trained models based on your data and code. We provide recommendations on how to develop a custom-trained model throughout the machine learning workflow, including key actions and links for further reading.

The following diagram gives a high-level overview of the stages in the ML workflow addressed in this document, which include:

  1. ML development
  2. Data processing
  3. Operationalized training
  4. Model deployment and serving
  5. ML workflow orchestration
  6. Artifact organization
  7. Model monitoring

Figure: Machine learning workflow on Google Cloud

The document is not an exhaustive list of recommendations; its goal is to help data scientists and machine learning architects understand the scope of activities involved in using ML on Google Cloud and plan accordingly. While ML development alternatives like AutoML are mentioned in Use recommended tools and products, this document focuses on custom-trained models.

Before following the best practices in this document, we recommend that you read Introduction to Vertex AI.

For the purposes of this document, it is assumed that:

  • You are primarily using Google Cloud services; hybrid and on-premises approaches are not addressed in this document.

  • You plan to collect training data and store it in Google Cloud.

  • You have an intermediate-level knowledge of machine learning, big data tools, and data preprocessing, as well as a familiarity with Cloud Storage, BigQuery, and Google Cloud fundamentals.

If you are new to machine learning, check out Google's Machine Learning Crash Course.

The following table lists recommended tools and products for each phase of the ML workflow as outlined in this document:

  • ML environment setup: Vertex AI Workbench notebooks; Vertex AI SDK for Python
  • ML development: Vertex AI Workbench notebooks; Vertex AI Data Labeling; Vertex AI Feature Store; Vertex AI TensorBoard; Vertex AI Experiments
  • Data processing: BigQuery; Dataflow; Dataproc; managed datasets
  • Operationalized training: Vertex AI Training
  • Model deployment and serving: Vertex AI Prediction
  • ML workflow orchestration: Vertex AI Pipelines
  • Artifact organization: Vertex ML Metadata
  • Model monitoring: Vertex AI Model Monitoring; Vertex Explainable AI

Google offers AutoML, Vertex AI Forecasting, and BigQuery ML as pre-built training alternatives to Vertex AI custom-trained models. The following table provides recommendations about when to use each of these environments.

BigQuery ML
BigQuery ML brings together data, infrastructure, and pre-defined model types into a single system. Choose this environment if:

  • All of your data is contained in BigQuery.
  • You are comfortable with SQL.
  • The set of models available in BigQuery ML matches the problem you are trying to solve.

AutoML (in the context of Vertex AI)
AutoML provides training routines for common problems like image classification and tabular regression. Nearly all aspects of training and serving a model, like choosing an architecture, hyperparameter tuning, and provisioning machines, are handled for you. Choose this environment if:

  • Your problem matches one of the problem types that AutoML supports.

Vertex AI custom-trained models
Vertex AI lets you run your own custom training routines and deploy models of any type on serverless architecture. Vertex AI offers additional services, like hyperparameter tuning and monitoring, to make it easier to develop a model. See Choosing a custom training method. Choose this environment if:

  • Your problem does not match the criteria listed above for BigQuery ML or AutoML.
  • You are already running training on-premises or on another cloud platform, and you need consistency across the platforms.

Machine learning environment setup

Use Vertex AI Workbench notebooks for experimentation and development

Regardless of your tooling, we recommend that you use Vertex AI Workbench notebooks for experimentation and development, including writing code, starting jobs, running queries, and checking status. Notebook instances let you access all of Google Cloud's data and artificial intelligence (AI) services in a simple, reproducible way.

Notebook instances also give you a secure set of software and access patterns out of the box. It is common practice to customize the Google Cloud properties associated with a notebook instance, such as its network and Identity and Access Management (IAM) configuration, and its software (through a container). See Components of Vertex AI and Introduction to user-managed notebooks for more information.

Create a notebook instance for each team member

Create a user-managed notebooks instance for each member of your data science team. If a team member is involved in multiple projects, especially projects that have different dependencies, we recommend using multiple notebook instances, treating each instance as a virtual workspace. Note that you can stop notebook instances when they are not being used.

Store your ML resources and artifacts based on your corporate policy

The simplest access control is to store both your raw data and your Vertex AI resources and artifacts, such as datasets and models, in the same Google Cloud project. More typically, your corporation has policies that control access. In cases where your resources and artifacts are stored across projects, you can configure cross-project access control with Identity and Access Management (IAM) according to your corporate policy.

Use Vertex AI SDK for Python

Use the Vertex AI SDK for Python, a Pythonic way to use Vertex AI for your end-to-end model-building workflows. It works seamlessly with your favorite ML frameworks, including PyTorch, TensorFlow, XGBoost, and scikit-learn.
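As a minimal sketch of the SDK workflow, the following initializes the SDK and launches a custom training job; the project ID, staging bucket, script path, and container image are hypothetical placeholders to adapt to your environment.

```python
# Minimal sketch of the Vertex AI SDK for Python. The project ID, bucket,
# script path, and container image below are hypothetical placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                     # your project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # bucket for job artifacts
)

# Package a local training script as a custom training job.
job = aiplatform.CustomTrainingJob(
    display_name="example-training-job",
    script_path="task.py",                    # your training code
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
)
job.run(machine_type="n1-standard-4", replica_count=1)
```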

Alternatively, you can use the Google Cloud console, which supports the functionality of Vertex AI as a user interface through the browser.

Machine learning development

Machine learning development addresses preparing the data, experimenting, and evaluating the model. When solving a machine learning problem, it is typically necessary to build and compare many different models to figure out what works best.

Typically, data scientists train models using different architectures, input data sets, hyperparameters, and hardware. Data scientists evaluate the resulting models by looking at aggregate performance metrics like accuracy, precision, and recall on test datasets. Finally, data scientists evaluate the performance of the models against particular subsets of their data, different model versions, and different model architectures.

Prepare training data

The data used to train a model can originate from any number of systems, for example, logs from an online service system, images from a local device, or documents scraped from the web.

Regardless of your data's origin, extract data from the source systems and convert it to a format and storage location (separate from the operational source) that is optimized for ML training. For more information on preparing training data for use with Vertex AI, see Prepare training data for use with Vertex AI.

Store structured and semi-structured data in BigQuery

If you're working with structured or semi-structured data, we recommend that you store all data in BigQuery, following BigQuery's recommendation for project structure. In most cases, you can store intermediate, processed data in BigQuery as well. For maximum speed, store materialized data rather than using views or subqueries for training data.

Read data out of BigQuery using the BigQuery Storage API. For artifact tracking, consider using a managed tabular dataset. The following table lists Google Cloud tools that make it easier to use the API:

  • TensorFlow or Keras: tf.data.dataset reader for BigQuery
  • TFX: BigQuery client
  • Dataflow: BigQuery I/O Connector
  • Any other framework (such as PyTorch, XGBoost, or scikit-learn): BigQuery Python client library
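As a minimal sketch, the following reads a BigQuery table into a pandas DataFrame with the BigQuery Python client library; the project, dataset, table, and column names are hypothetical.

```python
# Minimal sketch: read training data from BigQuery into pandas.
# The project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT features, label
    FROM `my-project.my_dataset.training_table`
"""
# to_dataframe() uses the BigQuery Storage API for fast reads when the
# google-cloud-bigquery-storage package is installed.
training_df = client.query(query).to_dataframe()
```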

Store image, video, audio, and unstructured data on Cloud Storage

Store this data in large container formats on Cloud Storage: sharded TFRecord files if you're using TensorFlow, or Avro files if you're using any other framework.

Combine many individual images, videos, or audio clips into large files, because this improves your read and write throughput to Cloud Storage. Aim for files of at least 100 MB, and between 100 and 10,000 shards.
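For illustration, here is a minimal sketch of writing records into a fixed number of TFRecord shards on Cloud Storage; the bucket path, shard count, and record fields are hypothetical.

```python
# Minimal sketch: combine many small records into sharded TFRecord files on
# Cloud Storage. Bucket path, shard count, and record fields are hypothetical.
import tensorflow as tf

def serialize_example(image_bytes, label):
    # Pack one record as a serialized tf.train.Example.
    features = tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    return tf.train.Example(features=features).SerializeToString()

records = []      # replace with your iterable of (image_bytes, label) pairs
NUM_SHARDS = 128  # size shards so each file is at least 100 MB

writers = [
    tf.io.TFRecordWriter(
        f"gs://my-bucket/train/data-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
    for i in range(NUM_SHARDS)
]
for i, (image_bytes, label) in enumerate(records):
    writers[i % NUM_SHARDS].write(serialize_example(image_bytes, label))
for writer in writers:
    writer.close()
```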

To enable data management, use Cloud Storage buckets and directories to group the shards. For more information, see What is Cloud Storage?

Use Vertex AI Data Labeling for unstructured data

You might need humans to provide labels to your data, especially when it comes to unstructured data. Use Vertex AI Data Labeling for this work. You can hire your own labelers and use Google Cloud's software for managing their work, or you can use Google's in-house labelers for the task. For more information, see Requesting data labeling.

Use Vertex AI Feature Store with structured data

When you're training a model with structured data, irrespective of where you're training that model, follow these steps:

  1. Search Vertex AI Feature Store to determine if existing features satisfy your requirements.

    1. Open Vertex AI Feature Store and do a search to see if a feature already exists that relates to your use case or covers the signal that you're interested in passing to the model.

    2. If there are features in Vertex AI Feature Store that you want to use, fetch those features for your training labels using Vertex AI Feature Store's batch serving capability (see the batch serving sketch after these steps).

  2. Create a new feature. If Vertex AI Feature Store doesn't have the features you need, create a new feature using data from your data lake.

    1. Fetch raw data from your data lake and write your scripts to perform the necessary feature processing and engineering.

    2. Join the feature values you fetched from Vertex AI Feature Store and the new feature values that you created from the data lake. Merging those feature values produces the training data set.

    3. Set up a periodic job to compute updated values of the new feature. When you determine that a feature is useful and you want to put it into production, set up a regularly scheduled job with the required cadence to compute updated values of that feature and ingest it into Vertex AI Feature Store. Adding your new feature to Vertex AI Feature Store automatically gives you a solution for online serving of the feature (for online prediction use cases) and lets you share the feature with others in the organization who may get value from it for their own ML models.

To learn more, see Vertex AI Feature Store.
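As a sketch of the batch serving step above, the following fetches training features with the SDK's batch serving method; the featurestore name, entity type, entity IDs, and feature IDs are hypothetical.

```python
# Minimal sketch: fetch features for training labels using Vertex AI
# Feature Store batch serving. All resource and feature names are hypothetical.
import pandas as pd
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

featurestore = aiplatform.Featurestore("my_featurestore")

# Entity IDs and label timestamps for your training examples.
read_instances = pd.DataFrame({
    "users": ["user_1", "user_2"],
    "timestamp": pd.to_datetime(["2022-12-01", "2022-12-01"]),
})

training_df = featurestore.batch_serve_to_df(
    serving_feature_ids={"users": ["age", "last_purchase_amount"]},
    read_instances_df=read_instances,
)
```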

Avoid storing data in block storage

Avoid storing data in block storage, like Network File Systems or on virtual machine (VM) hard disks. Those tools are harder to manage than Cloud Storage or BigQuery, and often come with challenges in tuning performance. Similarly, avoid reading data directly from databases like Cloud SQL. Instead, store data in BigQuery and Cloud Storage. For more information, see Cloud Storage documentation and Introduction to loading data for BigQuery.

Use Vertex AI TensorBoard and Vertex AI Experiments for analyzing experiments

When developing models, use Vertex AI TensorBoard to visualize and compare specific experiments, for example, based on hyperparameters. Vertex AI TensorBoard is an enterprise-ready, managed service that provides a cost-effective, secure solution that lets data scientists and ML researchers collaborate easily by making it seamless to track, compare, and share their experiments. Vertex AI TensorBoard enables tracking experiment metrics like loss and accuracy over time, visualizing the model graph, projecting embeddings to a lower-dimensional space, and much more.

Use Vertex AI Experiments to integrate with Vertex ML Metadata and to log and build linkage across parameters, metrics, and dataset and model artifacts.
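For example, here is a minimal sketch of logging an experiment run with the Vertex AI SDK; the experiment name, run name, parameters, and metric values are hypothetical.

```python
# Minimal sketch: track an experiment run with Vertex AI Experiments.
# The experiment name, run name, parameters, and metrics are hypothetical.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="my-experiment",
)

aiplatform.start_run("run-1")
aiplatform.log_params({"learning_rate": 0.01, "batch_size": 64})
# ... train the model ...
aiplatform.log_metrics({"accuracy": 0.94, "loss": 0.18})
aiplatform.end_run()
```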

Train a model within a notebook instance for small datasets

Training a model within the notebook instance may be sufficient for small datasets or subsets of a larger dataset. It may be helpful to use the Vertex AI training service for larger datasets or for distributed training. Using the Vertex AI training service is also recommended to productionize training, even on small datasets, if the training is carried out on a schedule or in response to the arrival of additional data.

Maximize your model's predictive accuracy with hyperparameter tuning

To maximize your model's predictive accuracy, use hyperparameter tuning, the automated model enhancer provided by Vertex AI Training. It takes advantage of the processing infrastructure of Google Cloud and Vertex AI Vizier to test different hyperparameter configurations when training your model. Hyperparameter tuning removes the need to manually adjust hyperparameters over the course of numerous training runs to arrive at the optimal values.

To learn more about hyperparameter tuning, see Overview of hyperparameter tuning and Using hyperparameter tuning.
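The following is a minimal sketch of a hyperparameter tuning job with the Vertex AI SDK; the container image, metric name, and parameter ranges are hypothetical, and your training code must report the metric (for example, with the cloudml-hypertune library).

```python
# Minimal sketch: hyperparameter tuning with Vertex AI. The container image,
# metric, and parameter ranges are hypothetical; the training container must
# report the metric (for example, with the cloudml-hypertune library).
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

base_job = aiplatform.CustomJob(
    display_name="trainer",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="example-tuning-job",
    custom_job=base_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale="linear"),
    },
    max_trial_count=20,      # total configurations to try
    parallel_trial_count=4,  # trials that run at once
)
tuning_job.run()
```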

Use a notebook instance to understand your models

Use a notebook instance to evaluate and understand your models. In addition to built-in common libraries like scikit-learn, notebook instances include the What-If Tool (WIT) and the Language Interpretability Tool (LIT). WIT lets you interactively analyze your models for bias using multiple techniques, while LIT helps you understand natural language processing model behavior through a visual, interactive, and extensible tool.

Use feature attributions to gain insights into model predictions

Vertex Explainable AI is an integral part of the ML implementation process, offering feature attributions to provide insights into why models generate predictions. By detailing the importance of each feature that a model uses as input to make a prediction, Vertex Explainable AI helps you better understand your model's behavior and build trust in your models.

Vertex Explainable AI supports custom-trained models based on tabular and image data.
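As a minimal sketch, assuming a model already deployed to an endpoint with an explanation configuration, you can request feature attributions as follows; the endpoint ID and instance fields are hypothetical.

```python
# Minimal sketch: request feature attributions from a deployed model.
# The endpoint ID and instance fields are hypothetical; the model must have
# been deployed with an explanation spec.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("1234567890")  # your endpoint ID
response = endpoint.explain(instances=[{"age": 42, "country": "US"}])

for explanation in response.explanations:
    for attribution in explanation.attributions:
        print(attribution.feature_attributions)
```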

For more information, see the Vertex Explainable AI documentation.

Data processing

The recommended approach for processing your data depends on the framework and data types you're using. This section provides high-level recommendations for common scenarios.

Use BigQuery to process structured and semi-structured data

Use BigQuery for storing unprocessed structured or semi-structured data. If you're building your model using BigQuery ML, use the transformations built into BigQuery for preprocessing data. If you're using AutoML, use the transformations built into AutoML for preprocessing data. If you're building a custom model, using the BigQuery transformations may be the most cost-effective method.
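For instance, here is a minimal sketch of materializing preprocessed training data with a BigQuery transformation query through the Python client; the project, dataset, table, and column names are hypothetical.

```python
# Minimal sketch: preprocess data in BigQuery and materialize the result.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

destination = bigquery.TableReference.from_string(
    "my-project.my_dataset.training_features")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",
)
query = """
    SELECT
      user_id,
      IFNULL(age, 0) AS age,       -- impute missing values
      LOWER(country) AS country    -- normalize categorical text
    FROM `my-project.my_dataset.raw_events`
"""
client.query(query, job_config=job_config).result()  # wait for completion
```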

Use Dataflow to process data

With large volumes of data, consider using Dataflow, which uses the Apache Beam programming model. You can use Dataflow to convert the unstructured data into binary data formats like TFRecord, which can improve performance of data ingestion during the training process.
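Here is a minimal sketch of a Beam pipeline that writes sharded TFRecord files to Cloud Storage; the bucket paths are hypothetical, and you would pass Dataflow-specific options (runner, project, region, temp location) to run it on Dataflow rather than locally.

```python
# Minimal sketch: an Apache Beam pipeline that converts text records into
# sharded TFRecord files on Cloud Storage. Bucket paths are hypothetical;
# supply --runner=DataflowRunner (plus project, region, and temp_location)
# to execute the pipeline on Dataflow instead of locally.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # parses pipeline options from the command line
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadText" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "EncodeBytes" >> beam.Map(lambda line: line.encode("utf-8"))
        | "WriteTFRecord" >> beam.io.WriteToTFRecord(
            "gs://my-bucket/processed/data", file_name_suffix=".tfrecord")
    )
```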

Use Dataproc for serverless Spark data processing

Alternatively, if your organization has an investment in an Apache Spark codebase and skills, consider using Dataproc. Use one-off Python scripts for smaller datasets that fit into memory.

If you need to perform transformations that are not expressible in SQL, or that apply to streaming data, you can use a combination of Dataflow and the pandas library.

Use managed datasets with ML metadata

After your data is pre-processed for ML, you may want to consider using a managed dataset in Vertex AI. Managed datasets enable you to create a clear link between your data and custom-trained models, and provide descriptive statistics and automatic or manual splitting into train, test, and validation sets.

Managed datasets are not required; you may choose not to use them if you want more control over splitting your data in your training code, or if lineage between your data and model isn't critical to your application.

For more information, see Datasets and Using a managed dataset in a custom training application.
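As a minimal sketch, creating a managed tabular dataset from a BigQuery table looks like the following; the display name and BigQuery source table are hypothetical.

```python
# Minimal sketch: create a managed tabular dataset in Vertex AI from a
# BigQuery table. The display name and source table are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="training-features",
    bq_source="bq://my-project.my_dataset.training_features",
)
print(dataset.resource_name)  # reference this dataset in a custom training job
```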

Operationalized training