AI and ML perspective: Reliability

Last reviewed 2025-08-07 UTC

This document in the Google Cloud Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. To meet the unique demands of both predictive ML and generative AI, you need AI and ML systems that are robust, reliable, and adaptable. To handle the complexities of MLOps, from development to deployment and continuous improvement, you need a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with site reliability engineering (SRE) principles and that provides a powerful foundation for reliable AI and ML systems.

The recommendations in this document are mapped to the following core principles:

Ensure that ML infrastructure is scalable and highly available

Reliable AI and ML systems in the cloud require scalable and highly available infrastructure. These systems have dynamic demands, diverse resource needs, and critical dependencies on model availability. Scalable architectures adapt to fluctuating loads and variations in data volume or inference requests. High availability (HA) helps to ensure resilience against failures at the component, zone, or region level.

To build scalable and highly available ML infrastructure, consider the following recommendations.

Implement automatic and dynamic scaling capabilities

AI and ML workloads are dynamic, with demand that fluctuates based on data arrival rates, training frequency, and the volume of inference traffic. Automatic and dynamic scaling adapts infrastructure resources seamlessly to demand fluctuations. Scaling your workloads effectively helps to prevent downtime, maintain performance, and optimize costs.

To autoscale your AI and ML workloads, use the following products and features in Google Cloud:

  • Data processing pipelines: Create data pipelines in Dataflow. Configure the pipelines to use Dataflow's horizontal autoscaling feature, which dynamically adjusts the number of worker instances based on CPU utilization, pipeline parallelism, and pending data. You can configure autoscaling parameters through pipeline options when you launch jobs.
  • Training jobs: Automate the scaling of training jobs by using Vertex AI custom training. You can define worker pool specifications such as the machine type, the type and number of accelerators, and the number of worker pools. For jobs that can tolerate interruptions and for jobs where the training code implements checkpointing, you can reduce costs by using Spot VMs.
  • Online inference: For online inference, use Vertex AI endpoints. To enable autoscaling, configure the minimum and maximum replica count. Specify a minimum of two replicas for HA. Vertex AI automatically adjusts the number of replicas based on traffic and the configured autoscaling metrics, such as CPU utilization and replica utilization. For a deployment sketch, see the example after this list.
  • Containerized workloads in Google Kubernetes Engine: Configure autoscaling at the node and Pod levels. Configure the cluster autoscaler and node auto-provisioning to adjust the node count based on pending Pod resource requests like CPU, memory, GPU, and TPU. Use Horizontal Pod Autoscaler (HPA) for deployments to define scaling policies based on metrics like CPU and memory utilization. You can also scale based on custom AI and ML metrics, such as GPU or TPU utilization and prediction requests per second.
  • Serverless containerized services: Deploy the services in Cloud Run and configure autoscaling by specifying the minimum and maximum number of container instances. Use best practices to autoscale GPU-enabled instances by specifying the accelerator type. Cloud Run automatically scales instances between the configured minimum and maximum limits based on incoming requests. When there are no requests, it scales efficiently to zero instances. You can leverage the automatic, request-driven scaling of Cloud Run to deploy Vertex AI agents and to deploy third-party workloads like quantized models that use Ollama, LLM inference that uses vLLM, and Hugging Face Text Generation Inference (TGI).
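
For example, the following sketch shows one way to configure replica-based autoscaling for the online inference item in the preceding list by using the Vertex AI SDK for Python. The project ID, model resource name, and machine configuration are placeholder assumptions:

```python
from google.cloud import aiplatform

# Assumed values for illustration; replace with your own project, region, and model.
aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model("projects/your-project-id/locations/us-central1/models/MODEL_ID")
endpoint = aiplatform.Endpoint.create(display_name="online-inference-endpoint")

# Deploy with a minimum of two replicas for HA. Vertex AI scales the deployment
# between the minimum and maximum replica counts based on traffic and the
# configured autoscaling metrics.
endpoint.deploy(
    model=model,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=2,
    max_replica_count=10,
)
```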

Design for HA and fault tolerance

For production-grade AI and ML workloads, it's crucial that you ensure continuous operation and resilience against failures. To implement HA and fault tolerance, you need to build redundancy and replication into your architecture on Google Cloud. This approach helps to ensure that a failure of an individual component doesn't cause a failure of the complete system.

Implement redundancy for critical AI and ML components in Google Cloud. The following are examples of products and features that let you implement resource redundancy:

  • Deploy GKE regional clusters across multiple zones.
  • Ensure data redundancy for datasets and checkpoints by using Cloud Storage multi-regional or dual-region buckets.
  • Use Spanner for globally consistent, highly available storage of metadata.
  • Configure Cloud SQL read replicas for operational databases.
  • Ensure that vector databases for retrieval augmented generation (RAG) are highly available and multi-zonal or multi-regional.
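
As one illustration of the data-redundancy item above, the following sketch creates a dual-region Cloud Storage bucket with the Python client library. The bucket name, project ID, and the NAM4 dual-region are assumptions for this example:

```python
from google.cloud import storage

client = storage.Client(project="your-project-id")  # assumed project ID

# Objects in a dual-region bucket are stored redundantly in two separate
# regions, which provides resilience against a regional outage.
bucket = client.create_bucket("your-checkpoint-bucket", location="NAM4")
print(f"Created bucket {bucket.name} in {bucket.location}")
```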

Manage resources proactively and anticipate requirements

Effective resource management is important to help you optimize costs, performance, and reliability. AI and ML workloads are dynamic and there's high demand for specialized hardware like GPUs and TPUs. Therefore, it's crucial that you apply proactive resource management and ensure resource availability.

Plan for capacity based on historical monitoring data, such as GPU or TPU utilization and throughput rates, from Cloud Monitoring and logs in Cloud Logging. Analyze this telemetry data by using BigQuery or Looker Studio and forecast future demand for GPUs based on growth or new models. Analysis of resource usage patterns and trends helps you to predict when and where you need critical specialized accelerators.
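
For example, the following sketch pulls recent accelerator utilization time series from Cloud Monitoring by using the Python client library, so that you can export them for analysis in BigQuery or Looker Studio. The project ID and the Ops Agent metric type (agent.googleapis.com/gpu/utilization) are assumptions; substitute the GPU or TPU metric that your environment actually reports:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"  # assumed project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 7 * 24 * 3600},  # last 7 days
    }
)

# List GPU utilization time series; aggregate or export the points for
# capacity forecasting.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "agent.googleapis.com/gpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels.get("instance_id"), len(series.points))
```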

  • Validate capacity estimates through rigorous load testing. Simulate traffic on AI and ML services like serving and pipelines by using tools like Apache JMeter or LoadView.
  • Analyze system behavior under stress.
    • To anticipate and meet increased workload demands in production, proactively identify resource requirements. Monitor latency, throughput, errors, and resource utilization, especially GPU and TPU utilization. Increase resource quotas as necessary.
    • For generative AI serving, test under high concurrent loads and identify the level at which accelerator availability limits performance.
  • Perform continuous monitoring for model queries and set up proactive alerts for agents.
    • Use the model observability dashboard to view metrics that are collected by Cloud Monitoring, such as model queries per second (QPS), token throughput, and first token latencies.

Optimize resource availability and obtainability

Optimize costs and ensure resource availability by strategically selecting appropriate compute resources based on workload requirements.

  • For stable 24x7 inference or for training workloads with fixed or predictable capacity requirements, use committed use discounts (CUDs) for VMs and accelerators.
  • For GKE nodes and Compute Engine VMs, use Spot VMs and Dynamic Workload Scheduler (DWS) capabilities:

    • For fault-tolerant tasks such as evaluation and experimentation workloads, use Spot VMs. Spot VMs can be preempted, but they can help reduce your overall costs.
    • To manage preemption risk for high-demand accelerators, you can ensure better obtainability by using DWS.
      • For complex batch training that needs high-end GPUs to run up to seven days, use the DWS Flex-Start mode.
      • For longer-running workloads that run for up to three months, use the Calendar mode to reserve specific GPUs (H100 and H200) and TPUs (Trillium).
  • To optimize AI inference on GKE, you can run a vLLM engine that dynamically uses TPUs and GPUs to address fluctuating capacity and performance needs. For more information, see vLLM GPU/TPU Fungibility.

  • For advanced scenarios with complex resource and topology needs that involve accelerators, use tools to abstract resource management.

    • Cluster Director lets you deploy and manage accelerator groups with colocation and scheduling for multi-GPU training (A3 Ultra H200 and A4 B200). Cluster Director supports GKE and Slurm clusters.
    • Ray on Vertex AI abstracts distributed computing infrastructure. It enables applications to request resources for training and serving without the need for direct management of VMs and containers.

Distribute incoming traffic across multiple instances

Effective load balancing is crucial for AI applications that have fluctuating demands. Load balancing distributes traffic, optimizes resource utilization, provides HA and low latency, and helps to ensure a seamless user experience.

  • Inference with varying resource needs: Implement load balancing based on model metrics. GKE Inference Gateway lets you deploy models behind a load balancer with model-aware routing. The gateway prioritizes instances with GPU and TPU accelerators for compute-intensive tasks like generative AI and LLM inference. Configure detailed health checks to assess model status. Use serving frameworks like vLLM or Triton for LLM metrics and integrate the metrics into Cloud Monitoring by using Google Cloud Managed Service for Prometheus.
  • Inference workloads that need GPUs or TPUs: To ensure that critical AI and ML inference workloads consistently run on machines that are suitable to the workloads' requirements, particularly when GPU and TPU availability is constrained, use GKE custom compute classes. You can define specific compute profiles with fallback policies for autoscaling. For example, you can define a profile that specifies a higher priority for reserved GPU or TPU instances. The profile can include a fallback to use cost-efficient Spot VMs if the reserved resources are temporarily unavailable.
  • Generative AI on diverse orchestration platforms: Use a centralized load balancer. For example, for cost and management efficiency, you can route requests that have low GPU needs to Cloud Run and route more complex, GPU-intensive tasks to GKE. For inter-service communication and policy management, implement a service mesh by using Cloud Service Mesh. Ensure consistent logging and monitoring by using Cloud Logging and Cloud Monitoring.
  • Global load distribution: To load balance traffic from global users who need low latency, use a global external Application Load Balancer. Configure geolocation routing to the closest region and implement failover. Establish regional endpoint replication in Vertex AI or GKE. Configure Cloud CDN for static assets. Monitor global traffic and latency by using Cloud Monitoring.
  • Granular traffic management: For requests that have diverse data types or complexity and long-running requests, implement granular traffic management.
    • Configure content-based routing to direct requests to specialized backends based on attributes like URL paths and headers. For example, direct requests to GPU-enabled backends for image or video models and to CPU-optimized backends for text-based models.
    • For long-running generative AI requests or batch workloads, use WebSockets or gRPC. Implement traffic management to handle timeouts and buffering. Configure request timeouts and retries and implement rate limiting and quotas by using API Gateway or Apigee.

Use a modular and loosely coupled architecture

In a modular, loosely coupled AI and ML architecture, complex systems are divided into smaller, self-contained components that interact through well-defined interfaces. This architecture minimizes module dependencies, simplifies development and testing, enhances reproducibility, and improves fault tolerance by containing failures. The modular approach is crucial for managing complexity, accelerating innovation, and ensuring long-term maintainability.

To design a modular and loosely coupled architecture for AI and ML workloads, consider the following recommendations.

Implement small self-contained modules or components

Separate your end-to-end AI and ML system into small, self-contained modules or components. Each module or component is responsible for a specific function, such as data ingestion, feature transformation, model training, inference serving, or evaluation. A modular design provides several key benefits for AI and ML systems: improved maintainability, increased scalability, reusability, and greater flexibility and agility.

The following sections describe Google Cloud products, features, and tools that you can use to design a modular architecture for your AI and ML systems.

Containerized microservices on GKE

For complex AI and ML systems or intricate generative AI pipelines that need fine-grained orchestration, implement modules as microservices that are orchestrated by using GKE. Package each distinct stage as an individual microservice within Docker containers. These distinct stages include data ingestion that's tailored for diverse formats, specialized data preprocessing or feature engineering, distributed model training or fine-tuning of large foundation models, evaluation, or serving.

Deploy the containerized microservices on GKE and take advantage of automated scaling (based on CPU and memory utilization or on custom metrics like GPU utilization), rolling updates, and reproducible configurations in YAML manifests. Ensure efficient communication between the microservices by using GKE service discovery. For asynchronous patterns, use message queues like Pub/Sub.

The microservices-on-GKE approach helps you build scalable, resilient platforms for tasks like complex RAG applications where the stages can be designed as distinct services.

Serverless event-driven services

For event-driven tasks that can benefit from serverless, automatic scaling, use Cloud Run or Cloud Run functions. These services are ideal for asynchronous tasks like preprocessing or for smaller inference jobs. Trigger Cloud Run functions on events, such as a new data file that's created in Cloud Storage or model updates in Artifact Registry. For webhook tasks or services that need a container environment, use Cloud Run.

Cloud Run services and Cloud Run functions can scale up rapidly and scale down to zero, which helps to ensure cost efficiency for fluctuating workloads. These services are suitable for modular components in Vertex AI Agents workflows. You can orchestrate component sequences with Workflows or Application Integration.
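
As a minimal sketch of this event-driven pattern, the following Cloud Run function, written with the Functions Framework for Python, runs when a new object is finalized in a Cloud Storage bucket. The preprocessing logic itself is a placeholder:

```python
import functions_framework

@functions_framework.cloud_event
def preprocess_new_file(cloud_event):
    """Triggered by a Cloud Storage object-finalized event."""
    data = cloud_event.data
    bucket = data["bucket"]
    name = data["name"]

    # Placeholder: download the object, run lightweight preprocessing, and
    # write the result to a staging location or publish a Pub/Sub message.
    print(f"Preprocessing gs://{bucket}/{name}")
```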

Vertex AI managed services

Vertex AI services support modularity and help you simplify the development and deployment of your AI and ML systems. The services abstract the infrastructure complexities so that you can focus on the application logic.

  • To orchestrate workflows that are built from modular steps, use Vertex AI Pipelines.
  • To run custom AI and ML code, package the code in Docker containers that can run on managed services like Vertex AI custom training and Vertex AI prediction.
  • For modular feature engineering pipelines, use Vertex AI Feature Store.
  • For modular exploration and prototyping, use notebook environments like Vertex AI Workbench or Colab Enterprise. Organize your code into reusable functions, classes, and scripts.

Agentic applications

For AI agents, the Agent Development Kit (ADK) provides modular capabilities like Tools and State. To enable interoperability between frameworks like LangChain, LangGraph, LlamaIndex, and Vertex AI, you can combine the ADK with the Agent2Agent (A2A) protocol and the Model Context Protocol (MCP). This interoperability lets you compose agentic workflows by using diverse components.

You can deploy agents on Vertex AI Agent Engine, which is a managed runtime that's optimized for scalable agent deployment. To run containerized agents, you can leverage the autoscaling capabilities in Cloud Run.

Design well-defined interfaces

To build robust and maintainable software systems, it's crucial to ensure that the components of a system are loosely coupled and modularized. This approach offers significant advantages, because it minimizes the dependencies between different parts of the system. When modules are loosely coupled, changes in one module have minimal impact on other modules. This isolation enables independent updates and development workflows for individual modules.

The following sections provide guidance to help ensure seamless communication and integration between the modules of your AI and ML systems.

Protocol choice

  • For universal access, use HTTP APIs, adhere to RESTful principles, and use JSON for language-agnostic data exchange. Design the API endpoints to represent actions on resources.
  • For high-performance internal communication among microservices, use gRPC with Protocol Buffers (ProtoBuf) for efficient serialization and strict typing. Define data structures like ModelInput, PredictionResult, or ADK Tool data by using .proto files, and then generate language bindings.
  • For use cases where performance is critical, leverage gRPC streaming for large datasets or for continuous flows such as live text-to-speech or video applications. Deploy the gRPC services on GKE.

Standardized and comprehensive documentation

Regardless of the interface protocol that you choose, standardized documentation is crucial. The OpenAPI Specification describes RESTful APIs. Use OpenAPI to document your AI and ML APIs: paths, methods, parameters, request-response formats that are linked to JSON schemas, and security. Comprehensive API documentation helps to improve discoverability and client integration. For API authoring and visualization, use UI tools like Swagger Editor. To accelerate development and ensure consistency, you can generate client SDKs and server stubs by using AI-assisted coding tools like Gemini Code Assist. Integrate OpenAPI documentation into your CI/CD flow.

Interaction with Google Cloud managed services like Vertex AI

Choose between the higher abstraction of the Vertex AI SDK, which is preferred for development productivity, and the granular control that the REST API provides.

  • The Vertex AI SDK simplifies tasks and authentication. Use the SDK when you need to interact with Vertex AI.
  • The REST API is a powerful alternative, especially when interoperability is required between layers of your system. It's useful for tools in languages that don't have an SDK or when you need fine-grained control.
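
The following sketch contrasts the two approaches for a simple online prediction request. The project ID, region, endpoint ID, and instance payload are placeholders:

```python
from google.cloud import aiplatform

# Vertex AI SDK: higher-level abstraction that handles authentication, retries,
# and resource naming for you.
aiplatform.init(project="your-project-id", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # assumed endpoint ID
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": 2.5}])
print(response.predictions)

# REST API: finer-grained control, useful from languages without an SDK.
# POST https://us-central1-aiplatform.googleapis.com/v1/projects/your-project-id/locations/us-central1/endpoints/1234567890:predict
# Send an OAuth 2.0 bearer token and a JSON body such as {"instances": [...]}.
```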

Use APIs to isolate modules and abstract implementation details

For security, scalability, and visibility, it's crucial that you implement robust API management for your AI and ML services. To implement API management for your defined interfaces, use the following products:

  • API Gateway: For APIs that are externally exposed and managed, API Gateway provides a centralized, secure entry point. It simplifies access to serverless backend services, such as prediction, training, and data APIs. API Gateway helps to consolidate access points, enforce API contracts, and manage security capabilities like API keys and OAuth 2.0. To protect backends from overload and ensure reliability, implement rate limiting and usage quotas in API Gateway.
  • Cloud Endpoints: To streamline API development and deployment on GKE and Cloud Run, use Cloud Endpoints, which offers a developer-friendly solution for generating API keys. It also provides integrated monitoring and tracing for API calls and it automates the generation of OpenAPI specs, which simplifies documentation and client integration. You can use Cloud Endpoints to manage access to internal or controlled AI and ML APIs, such as to trigger training and manage feature stores.
  • Apigee: For enterprise-scale AI and ML, especially sophisticated generative AI APIs, Apigee provides advanced, comprehensive API management. Use Apigee for advanced security like threat protection and OAuth 2.0, for traffic management like caching, quotas, and mediation, and for analytics. Apigee can help you to gain deep insights into API usage patterns, performance, and engagement, which are crucial for understanding generative AI API usage.

Plan for graceful degradation

In production AI and ML systems, component failures are unavoidable, just like in other systems. Graceful degradation ensures that essential functions continue to operate, potentially with reduced performance. This approach prevents complete outages and improves overall availability. Graceful degradation is critical for latency-sensitive inference, distributed training, and generative AI.

The following sections describe techniques that you can use to plan for and implement graceful degradation.

Fault isolation

  • To isolate faulty components in distributed architectures, implement the circuit breaker pattern by using resilience libraries, such as Resilience4j in Java and CircuitBreaker in Python.
  • To prevent cascading failures, configure thresholds based on AI and ML workload metrics like error rates and latency and define fallbacks like simpler models and cached data.
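
A minimal sketch of the circuit breaker pattern, assuming the open-source circuitbreaker package for Python and a hypothetical downstream inference endpoint:

```python
import requests
from circuitbreaker import circuit

FALLBACK_RESPONSE = {"prediction": None, "source": "cached-fallback"}

@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=requests.RequestException)
def call_inference_service(payload: dict) -> dict:
    # After five consecutive failures, the circuit opens and calls fail fast
    # for 30 seconds instead of waiting on an unhealthy backend.
    response = requests.post(
        "https://inference.example.internal/predict",  # hypothetical backend
        json=payload,
        timeout=2,
    )
    response.raise_for_status()
    return response.json()

def predict_with_fallback(payload: dict) -> dict:
    try:
        return call_inference_service(payload)
    except Exception:
        # Fall back to a simpler model or a cached result while the circuit is open.
        return FALLBACK_RESPONSE
```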

Component redundancy

For critical components, implement redundancy and automatic failover. For example, use GKE multi-zone clusters or regional clusters and deploy Cloud Run services redundantly across different regions. To route traffic to healthy instances when unhealthy instances are detected, use Cloud Load Balancing.

Ensure data redundancy by using Cloud Storage multi-regional buckets. For distributed training, implement asynchronous checkpointing to resume after failures. For resilient and elastic training, use Pathways.

Proactive monitoring

Graceful degradation helps to ensure system availability during failure, but you must also implement proactive measures for continuous health checks and comprehensive monitoring. Collect metrics that are specific to AI and ML, such as latency, throughput, and GPU utilization. Also, collect model performance degradation metrics like model and data drift by using Cloud Monitoring and Vertex AI Model Monitoring.

Health checks can indicate the need to replace faulty nodes, deploy more capacity, or automatically trigger continuous retraining or fine-tuning pipelines that use updated data. This proactive approach helps to prevent both accuracy degradation and system-level degradation, and it helps to enhance overall reliability.

SRE practices

To monitor the health of your systems, consider adopting SRE practices to implement service level objectives (SLOs). Alerts on error budget loss and burn rate can be early indicators of reliability problems with the system. For more information about SRE practices, see the Google SRE book.

Build an automated end-to-end MLOps platform

A robust, scalable, and reliable AI and ML system on Google Cloud requires an automated end-to-end MLOps platform for the model development lifecycle. The development lifecycle includes initial data handling, continuous model training, deployment, and monitoring in production. By automating these stages on Google Cloud, you establish repeatable processes, reduce manual toil, minimize errors, and accelerate the pace of innovation.

An automated MLOps platform is essential for establishing production-grade reliability for your applications. Automation helps to ensure model quality, guarantee reproducibility, and enable continuous integration and delivery of AI and ML artifacts.

To build an automated end-to-end MLOps platform, consider the following recommendations.

Automate the model development lifecycle

A core element of an automated MLOps platform is the orchestration of the entire AI and ML workflow as a series of connected, automated steps: from data preparation and validation to model training, evaluation, deployment, and monitoring.

  • Use Vertex AI Pipelines as your central orchestrator:
    • Define end-to-end workflows with modular components for data processing, training, evaluation, and deployment.
    • Automate pipeline runs by using schedules or triggers like new data or code changes.
    • Implement automated parameterization and versioning for each pipeline run and create a version history.
    • Monitor pipeline progress and resource usage by using built-in logging and tracing, and integrate with Cloud Monitoring alerts.
  • Define your ML pipelines programmatically by using the Kubeflow Pipelines (KFP) SDK or TensorFlow Extended SDK. For more information, see Interfaces for Vertex AI Pipelines.
  • Orchestrate operations by using Google Cloud services like Dataflow, Vertex AI custom training, Vertex AI Model Registry, and Vertex AI endpoints.
  • For generative AI workflows, orchestrate the steps for prompt management, batched inference, human-in-the-loop (HITL) evaluation, and coordinating ADK components.
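
The following sketch shows a minimal two-step pipeline that's defined with the KFP SDK and submitted to Vertex AI Pipelines. The project ID, pipeline root bucket, and component logic are assumptions for illustration:

```python
from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component
def prepare_data(source_uri: str) -> str:
    # Placeholder: validate and transform raw data, then return the processed URI.
    return source_uri

@dsl.component
def train_model(dataset_uri: str) -> str:
    # Placeholder: train a model and return the artifact URI.
    return f"{dataset_uri}/model"

@dsl.pipeline(name="training-pipeline")
def training_pipeline(source_uri: str):
    data_task = prepare_data(source_uri=source_uri)
    train_model(dataset_uri=data_task.output)

compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")

aiplatform.init(project="your-project-id", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="training-pipeline",
    template_path="training_pipeline.yaml",
    pipeline_root="gs://your-pipeline-root",  # assumed bucket
    parameter_values={"source_uri": "gs://your-bucket/raw-data"},
)
job.submit()  # runs asynchronously; use job.run() to block until completion
```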

Manage infrastructure as code

Infrastructure as code (IaC) is crucial for managing AI and ML system infrastructure and for enabling reproducible, scalable, and maintainable deployments. The infrastructure needs of AI and ML systems are dynamic and complex. The systems often require specialized hardware like GPUs and TPUs. IaC helps to mitigate the risks of manual infrastructure management by ensuring consistency, enabling rollbacks, and making deployments repeatable.

To effectively manage your infrastructure resources as code, use the following techniques.

Automate resource provisioning

To effectively manage IaC on Google Cloud, define and provision your AI and ML infrastructure resources by using Terraform. The infrastructure might include resources such as the following:

  • GKE clusters that are configured with node pools. The node pools can be optimized based on workload requirements. For example, you can use A100, H100, H200, or B200 GPUs for training, and use L4 GPUs for inference.
  • Vertex AI endpoints that are configured for model serving, with defined machine types and scaling policies.
  • Cloud Storage buckets for data and artifacts.

Use configuration templates

Organize your Terraform configurations as modular templates. To accelerate the provisioning of AI and ML resources, you can use Cluster Toolkit. The toolkit provides example blueprints, which are Google-curated Terraform templates that you can use to deploy ready-to-use HPC, AI, and ML clusters in Slurm or GKE. You can customize the Terraform code and manage it in your version control system. To automate the resource provisioning and update workflow, you can integrate the code into your CI/CD pipelines by using Cloud Build.

Automate configuration changes

After you provision your infrastructure, manage the ongoing configuration changes declaratively:

  • In Kubernetes-centric environments, manage your Google Cloud resources as Kubernetes objects by using Config Connector.
  • Define and manage Vertex AI resources like datasets, models, and endpoints, Cloud SQL instances, Pub/Sub topics, and Cloud Storage buckets by using YAML manifests.
  • Deploy the manifests to your GKE cluster in order to integrate the application and infrastructure configuration.
  • Automate configuration updates by using CI/CD pipelines and use templating to handle environment differences.
  • Implement configurations for Identity and Access Management (IAM) policies and service accounts by using IaC.

Integrate with CI/CD

  • Automate the lifecycle of the Google Cloud infrastructure resources by integrating IaC into CI/CD pipelines by using tools like Cloud Build and Infrastructure Manager.
  • Define triggers for automatic updates on code commits.
  • Implement automated testing and validation within the pipeline. For example, you can create a script that automatically runs the terraform validate and terraform plan commands (see the sketch after this list).
  • Store the configurations as artifacts and enable versioning.
  • Define separate environments, such as dev, staging, and prod, with distinct configurations in version control and automate environment promotion.
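
The following sketch shows one way to script that validation step in Python, assuming that Terraform is installed on the build worker and that the configuration lives in an infra/ directory:

```python
import subprocess
import sys

def run(cmd: list[str]) -> None:
    print(f"Running: {' '.join(cmd)}")
    subprocess.run(cmd, cwd="infra", check=True)  # assumed Terraform directory

def main() -> None:
    run(["terraform", "init", "-input=false"])
    run(["terraform", "validate"])
    # Save the plan as an artifact so that the apply stage uses exactly the
    # reviewed changes.
    run(["terraform", "plan", "-input=false", "-out=tfplan"])

if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError as err:
        sys.exit(err.returncode)
```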

Validate model behavior

To maintain model accuracy and relevance over time, automate the training and evaluation process within your MLOps platform. This automation, coupled with rigorous validation, helps to ensure that the models behave as expected with relevant data before they're deployed to production.

  • Set up continuous training pipelines that are triggered by new data or by monitoring signals like data drift, or that run on a schedule.
    • To manage automated training jobs, such as hyperparameter tuning trials and distributed training configurations for larger models, use Vertex AI custom training.
    • For fine-tuning foundation models, automate the fine-tuning process and integrate the jobs into your pipelines.
  • Implement automated model versioning and securely store trained model artifacts after each successful training run. You can store the artifacts in Cloud Storage or register them in Model Registry.
  • Define evaluation metrics and set clear thresholds, such as minimum accuracy, maximum error rate, and minimum F1 score.
    • A model must meet the thresholds to automatically pass the evaluation and be considered for deployment (a gating sketch follows this list).
    • Automate evaluation by using services like model evaluation in Vertex AI.
    • Ensure that the evaluation includes metrics that are specific to the quality of generated output, factual accuracy, safety attributes, and adherence to specified style or format.
  • To automatically log and track the parameters, code versions, dataset versions, and results of each training and evaluation run, use Vertex AI Experiments. This approach provides a history that's useful for comparison, debugging, and reproducibility.
  • To optimize hyperparameter tuning and automate searching for optimal model configurations based on your defined objective, use Vertex AI Vizier.
  • To visualize training metrics and to debug during development, use Vertex AI TensorBoard.
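
As an illustration of the threshold-based gating described in the preceding list, the following sketch fails a pipeline step when a candidate model misses any gate. The metric names and threshold values are assumptions; the metrics might come from model evaluation in Vertex AI or from your own evaluation component:

```python
THRESHOLDS = {"min_accuracy": 0.92, "max_error_rate": 0.05, "min_f1_score": 0.90}  # assumed

def passes_evaluation(metrics: dict) -> bool:
    """Return True only if every evaluation gate is met."""
    return (
        metrics["accuracy"] >= THRESHOLDS["min_accuracy"]
        and metrics["error_rate"] <= THRESHOLDS["max_error_rate"]
        and metrics["f1_score"] >= THRESHOLDS["min_f1_score"]
    )

# Example: metrics produced by an upstream evaluation step.
metrics = {"accuracy": 0.95, "error_rate": 0.02, "f1_score": 0.93}
if not passes_evaluation(metrics):
    raise ValueError(f"Model failed evaluation gates: {metrics}")
print("Model passed evaluation gates; proceed to registration and deployment.")
```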

Validate inputs and outputs of AI and ML pipelines

To ensure the reliability and integrity of your AI and ML systems, you must validate data when it enters the systems and moves through the pipelines. You must also verify the inputs and outputs at the component boundaries. Robust validation of all inputs and outputs—raw data, processed data, configurations, arguments, and files—helps to prevent unexpected behavior and maintain model quality throughout the MLOps lifecycle. When you integrate this proactive approach into your MLOps platform, it helps detect errors before they are propagated throughout a system and it saves time and resources.

To effectively validate the inputs and outputs of your AI and ML pipelines, use the following techniques.

Automate data validation

  • Implement automated data validation in your data ingestion and preprocessing pipelines by using TensorFlow Data Validation (TFDV).
    • For large-scale, SQL-based data quality checks, leverage scalable processing services like BigQuery.
    • For complex, programmatic validation on streaming or batch data, use Dataflow.
  • Monitor data distributions over time with TFDV capabilities.
    • Visualize trends by using tools that are integrated with Cloud Monitoring to detect data drift. You can automatically trigger model retraining pipelines when data patterns change significantly.
  • Store validation results and metrics in BigQuery for analysis and historical tracking and archive validation artifacts in Cloud Storage.
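
A minimal sketch of schema-based validation with TFDV; the Cloud Storage paths and the curated schema are assumptions:

```python
import tensorflow_data_validation as tfdv

# Compute statistics for the newly ingested batch (placeholder path).
new_stats = tfdv.generate_statistics_from_csv("gs://your-bucket/data/new_batch.csv")

# Load the curated schema that describes expected types, domains, and ranges.
schema = tfdv.load_schema_text("gs://your-bucket/schemas/schema.pbtxt")

# Check the new batch against the schema and surface anomalies before training.
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
if anomalies.anomaly_info:
    raise ValueError(f"Data validation failed: {dict(anomalies.anomaly_info)}")

# Optionally, compare the new statistics against training-time statistics to
# detect drift or skew and trigger a retraining pipeline.
```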

Validate pipeline configurations and input data

To prevent pipeline failures or unexpected behavior caused by incorrect settings, implement strict validation for all pipeline configurations and command-line arguments:

  • Define clear schemas for your configuration files like YAML or JSON by using schema validation libraries like jsonschema for Python. Validate configuration objects against these schemas before a pipeline run starts and before a component executes.
  • Implement input validation for all command-line arguments and pipeline parameters by using argument-parsing libraries like argparse. Validation should check for correct data types, valid values, and required arguments.
  • Within Vertex AI Pipelines, define the expected types and properties of component parameters by using the built-in component input validation features.
  • To ensure reproducibility of pipeline runs and to maintain an audit trail, store validated, versioned configuration files in Cloud Storage or Artifact Registry.
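
A minimal sketch that combines schema validation and argument validation, with a hypothetical training-pipeline configuration:

```python
import argparse
import json
import jsonschema

# Hypothetical schema for a training configuration file.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["dataset_uri", "epochs", "learning_rate"],
    "properties": {
        "dataset_uri": {"type": "string", "pattern": "^gs://"},
        "epochs": {"type": "integer", "minimum": 1},
        "learning_rate": {"type": "number", "exclusiveMinimum": 0},
    },
}

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Training pipeline entry point")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    parser.add_argument("--machine-type", default="n1-standard-8")
    return parser.parse_args()

def load_config(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)
    # Fail fast before the pipeline starts if the configuration is invalid.
    jsonschema.validate(instance=config, schema=CONFIG_SCHEMA)
    return config

if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    print(f"Validated config: {config}")
```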

Validate input and output files

Validate input and output files such as datasets, model artifacts, and evaluation reports for integrity and format correctness:

  • Validate file formats like CSV, Parquet, and image types by using libraries.
  • For large files or critical artifacts, validate file sizes and checksums to detect corruption or incomplete transfers by using Cloud Storage data validation and change detection.
  • Perform file validation by using Cloud Run functions (for example, based on file upload events) or within Dataflow pipelines.
  • Store validation results in BigQuery for easier retrieval and analysis.
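
One sketch of checksum-based validation for the checksum item above, assuming the google-cloud-storage and google-crc32c packages; the bucket, object, and local paths are placeholders:

```python
import base64

import google_crc32c
from google.cloud import storage

def verify_download(bucket_name: str, blob_name: str, local_path: str) -> None:
    """Compare a local file's CRC32C checksum with the value that Cloud Storage stores."""
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    if blob is None:
        raise FileNotFoundError(f"gs://{bucket_name}/{blob_name} not found")

    checksum = google_crc32c.Checksum()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            checksum.update(chunk)
    local_crc32c = base64.b64encode(checksum.digest()).decode("utf-8")

    if local_crc32c != blob.crc32c:
        raise ValueError(f"Checksum mismatch for gs://{bucket_name}/{blob_name}")

# Example usage with placeholder names.
verify_download("your-artifact-bucket", "models/model.joblib", "/tmp/model.joblib")
```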

Automate deployment and implement continuous monitoring

Automated deployment and continuous monitoring of models in production help to ensure reliability, enable rapid updates, and detect issues promptly. This approach involves managing model versions, performing controlled deployments, automating deployment by using CI/CD, and monitoring comprehensively, as described in the following sections.

Manage model versions

Manage model iterations and associated artifacts by using versioning tools:

  • To track model versions and metadata and to link to underlying model artifacts, use Model Registry.
  • Implement a clear versioning scheme (such as semantic versioning). For each model version, attach comprehensive metadata such as training parameters, evaluation metrics from validation pipelines, and the dataset version.
  • Store model artifacts such as model files, pretrained weights, and serving container images in Artifact Registry and use its versioning and tagging features.
  • To meet security and governance requirements, define stringent access-control policies for Model Registry and Artifact Registry.
  • To programmatically register and manage versions and to integrate versions into automated CI/CD pipelines, use the Vertex AI SDK or API.
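
The following sketch registers a new version under an existing model in Model Registry by using the Vertex AI SDK. The resource names, alias, metadata, and serving container image are assumptions:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Upload a new version of an existing registered model. If parent_model is
# omitted, a new model entry is created instead of a new version.
model_version = aiplatform.Model.upload(
    display_name="churn-model",
    parent_model="projects/your-project-id/locations/us-central1/models/MODEL_ID",
    artifact_uri="gs://your-bucket/models/churn/v2",  # assumed artifact location
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",  # assumed image
    version_aliases=["candidate"],
    version_description="Trained on dataset v2025-01; metrics recorded by the validation pipeline",
    labels={"dataset_version": "v2025-01"},
)
print(model_version.resource_name, model_version.version_id)
```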

Perform controlled deployment

Control the deployment of model versions to endpoints by using your serving platform's traffic management capabilities.

  • Implement a rolling deployment by using the traffic splitting feature of Vertex AI endpoints (see the sketch after this list).
  • If you deploy your model to GKE, use advanced traffic management techniques like canary deployment:
    1. Route a small subset of the production traffic to a new model version.
    2. Continuously monitor performance and error rates through metrics.
    3. Establish that the model is reliable.
    4. Roll out the version to all traffic.
  • Perform A/B testing of AI agents:
    1. Deploy two different model-agent versions or entirely different models to the same endpoint.
    2. Split traffic across the deployments.
    3. Analyze the results against business objectives.
  • Implement automated rollback mechanisms that can quickly revert endpoint traffic to a previous stable model version if monitoring alerts are triggered or performance thresholds are missed.
  • Configure traffic splitting and deployment settings programmatically by using the Vertex AI SDK or API.
  • Use Cloud Monitoring to track performance and traffic across versions.
  • Automate deployment with CI/CD pipelines. You can use Cloud Build to build containers, version artifacts, and trigger deployment to Vertex AI endpoints.
  • Ensure that the CI/CD pipelines manage versions and pull from Artifact Registry.
  • Before you shift traffic, perform automated endpoint testing for prediction correctness, latency, throughput, and API function.
  • Store all configurations in version control.

Monitor continuously

  • Use Model Monitoring to automatically detect performance degradation, data drift (changes in input distribution compared to training), and prediction drift (changes in model outputs).
    • Configure drift detection jobs with thresholds and alerts.
    • Monitor real-time performance: prediction latency, throughput, error rates.
  • Define custom metrics in Cloud Monitoring for business KPIs.
  • Integrate Model Monitoring results and custom metrics with Cloud Monitoring for alerts and dashboards.
  • Configure notification channels like email, Slack, or PagerDuty and configure automated remediation.
  • To debug prediction logs, use Cloud Logging.
  • Integrate monitoring with incident management.

For generative AI endpoints, monitor output characteristics like toxicity and coherence:

  • Monitor feature serving for drift.
  • Implement granular prediction validation: validate outputs against expected ranges and formats by using custom logic.
  • Monitor prediction distributions for shifts.
  • Validate output schema.
  • Configure alerts for unexpected outputs and shifts.
  • Track and respond to real-time validation events by using Pub/Sub.

Ensure that the output of comprehensive monitoring feeds back into continuous training.

Maintain trust and control through data and model governance

AI and ML reliability extends beyond technical uptime. It includes trust and robust data and model governance. AI outputs might be inaccurate, biased, or outdated. Such issues erode trust and can cause harm. Comprehensive traceability, strong access control, automated validation, and transparent practices help to ensure that AI outputs are reliable, trustworthy, and meet ethics standards.

To maintain trust and control through data and model governance, consider the following recommendations.

Establish data and model catalogs for traceability

To facilitate comprehensive tracing, auditing, and understanding the lineage of your AI and ML assets, maintain a robust, centralized record of data and model versions throughout their lifecycle. A reliable data and model catalog serves as the single source of truth for all of the artifacts that are used and produced by your AI and ML pipelines, from raw data sources and processed datasets to trained model versions and deployed endpoints.

Use the following products, tools, and techniques to create and maintain catalogs for your data assets:

  • Build an enterprise-wide catalog of your data assets by using Dataplex Universal Catalog. To automatically discover and build inventories of the data assets, integrate Dataplex Universal Catalog with your storage systems, such as BigQuery, Cloud Storage, and Pub/Sub.
  • Ensure that your data is highly available and durable by storing it in Cloud Storage multi-region or dual-region buckets. Data that you upload to these buckets is stored redundantly across at least two separate geographic locations. This redundancy provides built-in resilience against regional outages and it helps to ensure data integrity.
  • Tag and annotate your datasets with relevant business metadata, ownership information, sensitivity levels, and lineage details. For example, link a processed dataset to its raw source and to the pipeline that created the dataset.
  • Create a central repository for model versions by using Model Registry. Register each trained model version and link it to the associated metadata. The metadata can include the following:
    • Training parameters.
    • Evaluation metrics from validation pipelines.
    • Dataset version that was used for training, with lineage traced back to the relevant Dataplex Universal Catalog entry.
    • Code version that produced the dataset.
    • Details about the framework or foundation model that was used.
  • Before you import a model into Model Registry, store model artifacts like model files and pretrained weights in a service like Cloud Storage. Store custom container images for serving or custom training jobs in a secure repository like Artifact Registry.
  • To ensure that data and model assets are automatically registered and updated in the respective catalogs upon creation or modification, implement automated processes within your MLOps pipelines. This comprehensive cataloging provides end-to-end traceability from raw data to prediction, which lets you audit the inputs and processes that led to a specific model version or prediction. The auditing capability is vital for debugging unexpected behavior, ensuring compliance with data usage policies, and understanding the impact of data or model changes over time.
  • For generative AI and foundation models, your catalog must also track details about the specific foundation model used, fine-tuning parameters, and evaluation results that are specific to the quality and safety of the generated output.

Implement robust access controls and audit trails

To maintain trust and control in your AI and ML systems, it's essential that you protect sensitive data and models from unauthorized access and ensure accountability for all changes.

  • Implement strict access controls and maintain detailed audit trails across all components of your AI and ML systems in Google Cloud.
  • Define granular permissions in IAM for users, groups, and service accounts that interact with your AI and ML resources.
  • Follow the principle of least privilege rigorously.
  • Grant only the minimum necessary permissions for specific tasks. For example, a training service account needs read access to training data and write access for model artifacts, but the service might not need write access to production serving endpoints.

Apply IAM policies consistently across all relevant assets and resources in your AI and ML systems, including the following:

  • Cloud Storage buckets that contain sensitive data or model artifacts.
  • BigQuery datasets.
  • Vertex AI resources, such as model repositories, endpoints, pipelines, and Feature Store resources.
  • Compute resources, such as GKE clusters and Cloud Run services.

Use auditing and logs to capture, monitor, and analyze access activity:

  • Enable Cloud Audit Logs for all of the Google Cloud services that are used by your AI and ML system.
  • Configure audit logs to capture detailed information about API calls, data access events, and configuration changes made to your resources. Monitor the logs for suspicious activity, unauthorized access attempts, or unexpected modifications to critical data or model assets.
  • For real-time analysis, alerting, and visualization, stream the audit logs to Cloud Logging.
  • For cost-effective long-term storage and retrospective security analysis or compliance audits, export the logs to BigQuery.
  • For centralized security monitoring, integrate audit logs with your security information and event management (SIEM) systems. Regularly review access policies and audit trails to ensure they align with your governance requirements and detect potential policy violations.
  • For applications that handle sensitive data, such as personally identifiable information (PII) for training or inference, use Sensitive Data Protection checks within pipelines or on data storage.
  • For generative AI and agentic solutions, use audit trails to help track who accessed specific models or tools, what data was used for fine-tuning or prompting, and what queries were sent to production endpoints. The audit trails help you to ensure accountability and they provide crucial data for you to investigate misuse of data or policy violations.

Address bias, transparency, and explainability

To build trustworthy AI and ML systems, you need to address potential biases that are inherent in data and models, strive for transparency in system behavior, and provide explainability for model outputs. It's especially crucial to build trustworthy systems in sensitive domains or when you use complex models like those that are typically used for generative AI applications.

  • Implement proactive practices to identify and mitigate bias throughout the MLOps lifecycle.
  • Analyze training data for bias by using tools that detect skew in feature distributions across different demographic groups or sensitive attributes.
  • Evaluate the overall model performance and the performance across predefined slices of data. Such evaluation helps you to identify disparate performance or bias that affects specific subgroups.

For model transparency and explainability, use tools that help users and developers understand why a model made a particular prediction or produced a specific output.

  • For tabular models that are deployed on Vertex AI endpoints, generate feature attributions by using Vertex Explainable AI. Feature attributions indicate the input features that contributed most to the prediction.
  • Interactively explore model behavior and potential biases on a dataset by using model-agnostic tools like the What-If Tool, which integrates with TensorBoard.
  • Integrate explainability into your monitoring dashboards. In situations where understanding the model's reasoning is important for trust or decision-making, provide explainability data directly to end users through your application interfaces.
  • For complex models like the LLMs that are used in generative AI applications, explain the process that an agent followed, such as by using trace logs. Explainability is relatively challenging for such models, but it's still vital.
  • In RAG applications, provide citations for retrieved information. You can also use techniques like prompt engineering to guide the model to provide explanations or show its reasoning steps.
  • Detect shifts in model behavior or outputs that might indicate emerging bias or unfairness by implementing continuous monitoring in production. Document model limitations, intended use cases, and known potential biases as part of the model's metadata in the Model Registry.

Implement holistic AI and ML observability and reliability practices

Holistic observability is essential for managing complex AI and ML systems in production and for measuring their reliability. This is especially true for generative AI, because of its complexity, resource intensity, and potential for unpredictable outputs. Holistic observability involves observing infrastructure, application code, data, and model behavior to gain insights for proactive issue detection, diagnosis, and response. This observability ultimately leads to high-performance, reliable systems. To achieve holistic observability, you need to do the following:

  • Adopt SRE principles.
  • Define clear reliability goals.
  • Track metrics across system layers.
  • Use insights from observability for continuous improvement and proactive management.

To implement holistic observability and reliability practices for AI and ML workloads in Google Cloud, consider the following recommendations.

Establish reliability goals and business metrics

Identify the key performance indicators (KPIs) that your AI and ML system directly affects. The KPIs might include revenue that's influenced by AI recommendations, customer churn that the AI systems predicted or mitigated, and user engagement and conversion rates that are driven by generative AI features.

For each KPI, define the corresponding technical reliability metrics that affect the KPI. For example, if the KPI is "customer satisfaction with a conversational AI assistant," then the corresponding reliability metrics can include the following:

  • The success rate of user requests.
  • The latency of responses: time to first token (TTFT) and token streaming for LLMs.
  • The rate of irrelevant or harmful responses.
  • The rate of successful task completion by the agent.

For AI and ML training, reliability metrics can include model FLOPS utilization (MFU), iterations per second, tokens per second, and tokens per device.

To effectively measure and improve AI and ML reliability, begin by setting clear reliability goals that are aligned with the overarching business objectives. Adopt the SRE approach by defining SLOs that quantify acceptable levels of reliability and performance for your AI and ML services from the users' perspective. Quantify these technical reliability metrics with specific SLO targets.

The following are examples of SLO targets:

  • 99.9% of API calls must return a successful response.
  • 95th percentile inference latency must be below 300 ms.
  • TTFT must be below 500 ms for 99% of requests.
  • Rate of harmful output must be below 0.1%.

Aligning SLOs directly with business needs ensures that reliability efforts are focused on the most critical system behavior that affects users and the business. This approach helps to transform reliability into a measurable and actionable engineering property.
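
As a simple illustration of how an SLO target translates into an error budget and a burn rate, the following sketch uses the 99.9% availability example above with assumed request counts:

```python
# SLO target from the example above: 99.9% of API calls return a successful response.
slo_target = 0.999
error_budget = 1 - slo_target  # 0.1% of requests are allowed to fail

# Observed over the last hour (assumed numbers for illustration).
total_requests = 120_000
failed_requests = 360
observed_error_rate = failed_requests / total_requests  # 0.003, or 0.3%

# A burn rate above 1 means that the error budget is being consumed faster
# than the SLO allows, which is an early indicator of a reliability problem.
burn_rate = observed_error_rate / error_budget  # 3.0
print(f"Error budget: {error_budget:.4f}, burn rate: {burn_rate:.1f}")
```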

Monitor infrastructure and application performance

Track infrastructure metrics across all of the resources that are used by your AI and ML systems. The metrics include processor usage (CPU, GPU, and TPU), memory usage, network throughput and latency, and disk I/O. Track the metrics for managed environments like Vertex AI training and serving and for self-managed resources like GKE nodes and Cloud Run instances.

Monitor the four golden signals for your AI and ML applications:

  • Latency: Time to respond to requests.
  • Traffic: Volume of requests or workload.
  • Error rate: Rate of failed requests or operations.
  • Saturation: Utilization of critical resources like CPU, memory, and GPU or TPU accelerators, which indicates how close your system is to capacity limits.

Perform monitoring by using the following techniques:

  • Collect, store, and visualize the infrastructure and application metrics by using Cloud Monitoring. You can use pre-built dashboards for Google Cloud services and create custom dashboards that are tailored based on your workload's specific performance indicators and infrastructure health.
  • Collect detailed logs from your AI and ML applications and the underlying infrastructure by using Cloud Logging. These logs are essential for troubleshooting and performance analysis. They provide context around events and errors.
  • Pinpoint latency issues and understand request flows across distributed AI and ML microservices by using Cloud Trace. This capability is crucial for debugging complex Vertex AI Agents interactions or multi-component inference pipelines.
  • Identify performance bottlenecks within function blocks in application code by using Cloud Profiler. Identifying performance bottlenecks can help you optimize resource usage and execution time.
  • Gather specific accelerator-related metrics like detailed GPU utilization per process, memory usage per process, and temperature, by using tools like NVIDIA Data Center GPU Manager (DCGM).
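
The following sketch writes a custom AI and ML metric, such as tokens per second reported by a serving process, to Cloud Monitoring so that it can appear on dashboards and drive alerts. The project ID, metric name, labels, and value are assumptions:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"  # assumed project ID

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/llm/tokens_per_second"  # assumed metric name
series.metric.labels["model_version"] = "churn-model-v2"
series.resource.type = "global"
series.resource.labels["project_id"] = "your-project-id"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 10**9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 1450.0}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```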

Implement data and model observability

Reliable generative AI systems require robust data and model observability, which starts with end-to-end pipeline monitoring.

  • Track data ingestion rates, processed volumes, and transformation latencies by using services like Dataflow.
  • Monitor job success and failure rates within your MLOps pipelines, including pipelines that are managed by Vertex AI Pipelines.

Continuous assessment of data quality is crucial.

  • Manage and govern data by using Dataplex Universal Catalog:
    • Evaluate accuracy by validating against ground truth or by tracking outlier detection rates.
    • Monitor freshness based on the age of data and frequency of updates against SLAs.
    • Assess completeness by tracking null-value percentages and required field-fill rates.
    • Ensure validity and consistency through checks for schema-adherence and duplication.
  • Proactively detect anomalies by using Cloud Monitoring alerting and through clear data lineage for traceability.
  • For RAG systems, examine the relevance of the retrieved context and the groundedness (attribution to source) of the responses.
  • Monitor the throughput of vector database queries.

Key model observability metrics include input-output token counts and model-specific error rates, such as hallucination or query resolution failures. To track these metrics, use Model Monitoring.

  • Continuously monitor the toxicity scores of the output and user-feedback ratings.
  • Automate the assessment of model outputs against defined criteria by using the Gen AI evaluation service.
  • Ensure sustained performance by systematically monitoring for data and concept drift with comprehensive error-rate metrics.
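
A heavily simplified sketch of automated output assessment, assuming the Gen AI evaluation service's EvalTask interface in the Vertex AI SDK for Python (vertexai.evaluation); check the current SDK documentation for the exact metric names and parameters:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # assumed project

# Prompts and model responses sampled from production logs (placeholder data).
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize our refund policy in one sentence."],
        "response": ["Refunds are available within 30 days of purchase."],
    }
)

# Assumed pre-built metric names; the service also supports custom metrics.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "safety"],
    experiment="genai-output-quality",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```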

To track model metrics, you can use TensorBoard or MLflow. For deep analysis and profiling to troubleshoot performance issues, you can use PyTorch XLA profiling or NVIDIA Nsight.
