Well-Architected Framework: AI and ML perspective

This document in the Google Cloud Well-Architected Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.

The following pages describe principles and recommendations that are specific to AI and ML, for each pillar of the Well-Architected Framework: operational excellence, security, reliability, cost optimization, and performance optimization.

AI and ML perspective: Operational excellence

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to build and operate robust AI and ML systems on Google Cloud. These recommendations help you set up foundational elements like observability, automation, and scalability. The recommendations in this document align with the operational excellence pillar of the Google Cloud Well-Architected Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the AI and ML systems and pipelines that help drive your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that your operations remain aligned with business goals.

The recommendations in this document are mapped to the following core principles:

  • Build a robust foundation for model development
  • Automate the model-development lifecycle
  • Implement observability
  • Build a culture of operational excellence
  • Design for scalability

Build a robust foundation for model development

To develop and deploy scalable, reliable AI systems that help you achieve your business goals, a robust model-development foundation is essential. Such a foundation enables consistent workflows, automates critical steps to reduce errors, and ensures that the models can scale with demand. A strong foundation also lets you update, improve, and retrain your models seamlessly, align their performance with business needs, deploy impactful AI solutions quickly, and adapt to changing requirements.

To build a robust foundation to develop your AI models, consider the following recommendations.

Define the problems and the required outcomes

Before you start any AI or ML project, you must have a clear understanding of the business problems to be solved and the required outcomes. Start with an outline of the business objectives and break the objectives down into measurable key performance indicators (KPIs). To organize and document your problem definitions and hypotheses in a Jupyter notebook environment, use tools like Vertex AI Workbench. To implement versioning for code and documents and to document your projects, goals, and assumptions, use tools like Git. To develop and manage prompts for generative AI applications, you can use Vertex AI Studio.

Collect and preprocess the necessary data

To implement data preprocessing and transformation, you can use Dataflow (for Apache Beam), Dataproc (for Apache Spark), or BigQuery if an SQL-based process is appropriate. To validate schemas and detect anomalies, use TensorFlow Data Validation (TFDV) and take advantage of automated data quality scans in BigQuery where applicable.
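
For example, the following minimal sketch shows how you might use TFDV to infer a baseline schema from training data and then check a newer split for anomalies. The Cloud Storage paths are illustrative.

```python
import tensorflow_data_validation as tfdv

# Generate descriptive statistics from the training data.
# The CSV path is illustrative; TFDV can also read TFRecord files.
train_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/data/train.csv"
)

# Infer a baseline schema from the statistics, then review and adjust it.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a newer split (for example, evaluation data) against the schema.
eval_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/data/eval.csv"
)
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# Surface detected anomalies, such as missing values or type mismatches.
tfdv.display_anomalies(anomalies)
```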

For generative AI, data quality includes accuracy, relevance, diversity, and alignment with the required output characteristics. In cases where real-world data is insufficient or imbalanced, you can generate synthetic data to help improve model robustness and generalization. To create synthetic datasets based on existing patterns or to augment training data for better model performance, use BigQuery DataFrames and Gemini. Synthetic data is particularly valuable for generative AI because it can help improve prompt diversity and overall model robustness. When you build datasets for fine-tuning generative AI models, consider using the synthetic data generation capabilities in Vertex AI.
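
As a hedged sketch of this approach, the following code uses BigQuery DataFrames to ask Gemini for synthetic examples. The prompts, column names, and model defaults are illustrative, and the API surface can vary by bigframes version.

```python
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator

# Prompts that ask Gemini to produce synthetic training examples.
# The prompt text is illustrative; tailor it to your domain.
prompts = bpd.DataFrame(
    {
        "prompt": [
            "Generate a realistic customer support message about a billing error.",
            "Generate a realistic customer support message about a login problem.",
        ]
    }
)

# Call Gemini through BigQuery DataFrames to generate synthetic text.
model = GeminiTextGenerator()
synthetic = model.predict(prompts)

# The result column name follows the BigQuery ML ML.GENERATE_TEXT
# convention; it may differ across library versions.
print(synthetic["ml_generate_text_llm_result"].head())
```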

For generative AI tasks like fine-tuning or reinforcement learning from human feedback (RLHF), ensure that labels accurately reflect the quality, relevance, and safety of the generated outputs.

Select an appropriate ML approach

When you design your model and parameters, consider the model's complexity and computational needs. Depending on the task (such as classification, regression, or generation), consider using Vertex AI custom training for custom model building or AutoML for simpler ML tasks. For common applications, you can also access pretrained models through Vertex AI Model Garden. You can experiment with state-of-the-art foundation models for use cases such as generating text, images, and code.

You might want to fine-tune a pretrained foundation model to achieve optimal performance for your specific use case. For high-performance requirements in custom training, configure Cloud Tensor Processing Units (TPUs) or GPU resources to accelerate the training and inference of deep-learning models, like large language models (LLMs) and diffusion models.
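
The following sketch shows one way to configure GPU-backed custom training with the Vertex AI SDK for Python. The project, script, container image, and accelerator values are assumptions for illustration.

```python
from google.cloud import aiplatform

# Project, region, and bucket values are illustrative.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

# Define a custom training job from a local training script and a
# prebuilt training container (image tag is illustrative).
job = aiplatform.CustomTrainingJob(
    display_name="churn-model-training",
    script_path="train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1:latest",
    requirements=["pandas", "scikit-learn"],
)

# Run the job on GPU-accelerated hardware to speed up deep-learning training.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```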

Set up version control for code, models, and data

To manage and deploy code versions effectively, use tools like GitHub or GitLab. These tools provide robust collaboration features, branching strategies, and integration with CI/CD pipelines to ensure a streamlined development process.

Use appropriate solutions to manage each artifact of your ML system, like the following examples:

  • For code artifacts like container images and pipeline components, Artifact Registry provides a scalable storage solution that can help improve security. Artifact Registry also includes versioning and can integrate with Cloud Build and Cloud Deploy.
  • To manage data artifacts, like datasets used for training and evaluation, use solutions like BigQuery or Cloud Storage for storage and versioning.
  • To store metadata and pointers to data locations, use your version control system or a separate data catalog.

To maintain the consistency and versioning of your feature data, use Vertex AI Feature Store. To track and manage model artifacts, including binaries and metadata, use Vertex AI Model Registry, which lets you store, organize, and deploy model versions seamlessly.
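
For example, you might register model versions as in the following hedged sketch, where parent_model groups a new upload under an existing Model Registry entry. The artifact paths and serving container are illustrative.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload a trained model to Vertex AI Model Registry.
model_v1 = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/v1/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

# Register a new version under the same model by passing parent_model,
# which keeps the version history in one place.
model_v2 = aiplatform.Model.upload(
    parent_model=model_v1.resource_name,
    artifact_uri="gs://my-bucket/models/churn/v2/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)
```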

To ensure model reliability, implement Vertex AI Model Monitoring to detect data drift, track performance, and identify anomalies in production. For generative AI systems, monitor shifts in output quality and safety compliance.

Automate the model-development lifecycle

Automation helps you to streamline every stage of the AI and ML lifecycle. Automation reduces manual effort and standardizes processes, which leads to enhanced operational efficiency and a lower risk of errors. Automated workflows enable faster iteration, consistent deployment across environments, and more reliable outcomes, so your systems can scale and adapt seamlessly.

To automate the development lifecycle of your AI and ML systems, consider the following recommendations.

Use a managed pipeline orchestration system

Use Vertex AI Pipelines to automate every step of the ML lifecycle—from data preparation to model training, evaluation, and deployment. To accelerate deployment and promote consistency across projects, automate recurring tasks with scheduled pipeline runs, monitor workflows with execution metrics, and develop reusable pipeline templates for standardized workflows. These capabilities extend to generative AI models, which often require specialized steps like prompt engineering, response filtering, and human-in-the-loop evaluation. For generative AI, Vertex AI Pipelines can automate these steps, including the evaluation of generated outputs against quality metrics and safety guidelines. To improve prompt diversity and model robustness, automated workflows can also include data augmentation techniques.
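
A minimal sketch of this pattern, assuming the Kubeflow Pipelines (KFP) v2 SDK: two placeholder components are compiled into a template and submitted as a Vertex AI pipeline run. The component logic, names, and paths are illustrative.

```python
from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component
def validate_data(source_uri: str) -> str:
    # Placeholder step; a real component would check schemas and statistics.
    return source_uri

@dsl.component
def train_model(dataset_uri: str) -> str:
    # Placeholder step; a real component would launch a training job.
    return f"trained-on-{dataset_uri}"

@dsl.pipeline(name="ml-lifecycle-pipeline")
def ml_pipeline(source_uri: str):
    validated = validate_data(source_uri=source_uri)
    train_model(dataset_uri=validated.output)

# Compile the pipeline definition and submit it to Vertex AI Pipelines.
compiler.Compiler().compile(pipeline_func=ml_pipeline, package_path="pipeline.json")

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="ml-lifecycle-run",
    template_path="pipeline.json",
    parameter_values={"source_uri": "gs://my-bucket/data/train.csv"},
)
job.run()
```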

Implement CI/CD pipelines

To automate the building, testing, and deployment of ML models, use Cloud Build. This service is particularly effective when you run test suites for application code, which ensures that the infrastructure, dependencies, and model packaging meet your deployment requirements.

ML systems often need additional steps beyond code testing. For example, you need to stress test the models under varying loads, perform bulk evaluations to assess model performance across diverse datasets, and validate data integrity before retraining. To simulate realistic workloads for stress tests, you can use tools like Locust, Grafana k6, or Apache JMeter. To identify bottlenecks, monitor key metrics like latency, error rate, and resource utilization through Cloud Monitoring. For generative AI, the testing must also include evaluations that are specific to the type of generated content, such as text quality, image fidelity, or code functionality. These evaluations can involve automated metrics like perplexity for language models or human-in-the-loop evaluation for more nuanced aspects like creativity and safety.
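
For example, a Locust stress test might look like the following sketch. The endpoint path and payload are assumptions; point the test at a staging deployment rather than at production.

```python
from locust import HttpUser, between, task

class PredictionUser(HttpUser):
    # Wait 1-3 seconds between requests to mimic realistic traffic.
    wait_time = between(1, 3)

    @task
    def predict(self):
        # The path and payload are illustrative; adapt them to your
        # prediction service's request format.
        self.client.post(
            "/v1/predict",
            json={"instances": [{"feature_a": 1.0, "feature_b": "web"}]},
        )
```

You can then run the test with a command like locust -f locustfile.py --host=https://your-test-host, and watch latency and error-rate metrics in Cloud Monitoring while the load runs.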

To implement testing and evaluation tasks, you can integrate Cloud Build with other Google Cloud services. For example, you can use Vertex AI Pipelines for automated model evaluation, BigQuery for large-scale data analysis, and Dataflow for feature validation in your pipelines.

You can further enhance your CI/CD pipeline by using Vertex AI for continuous training to enable automated retraining of models on new data. Specifically for generative AI, to keep the generated outputs relevant and diverse, the retraining might involve automatically updating the models with new training data or prompts. You can use Vertex AI Model Garden to select the latest base models that are available for tuning. This practice ensures that the models remain current and optimized for your evolving business needs.

Implement safe and controlled model releases

To minimize risks and ensure reliable deployments, implement a model release approach that lets you detect issues early, validate performance, and roll back quickly when required.

To package your ML models and applications into container images and deploy them, use Cloud Deploy. You can deploy your models to Vertex AI endpoints.

Implement controlled releases for your AI applications and systems by using strategies like canary releases. For applications that use managed models like Gemini, we recommend that you gradually release new application versions to a subset of users before the full deployment. This approach lets you detect potential issues early, especially when you use generative AI models where outputs can vary.

To release fine-tuned models, you can use Cloud Deploy to manage the deployment of the model versions, and use the canary release strategy to minimize risk. With managed models and fine-tuned models, the goal of controlled releases is to test changes with a limited audience before you release the applications and models to all users.
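
For models that are served on Vertex AI endpoints, one way to implement a canary-style release is an endpoint traffic split, as in the following sketch. The resource IDs and machine type are illustrative.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)
new_model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/0987654321"
)

# Route 10% of traffic to the new model version; the remaining 90% stays
# with the versions that are already deployed on the endpoint.
endpoint.deploy(
    model=new_model,
    traffic_percentage=10,
    machine_type="n1-standard-4",
    min_replica_count=1,
)
```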

For robust validation, use Vertex AI Experiments to compare new models against existing ones, and use Vertex AI model evaluation to assess model performance. Specifically for generative AI, define evaluation metrics that align with the intended use case and the potential risks. You can use the Gen AI evaluation service in Vertex AI to assess metrics like toxicity, coherence, factual accuracy, and adherence to safety guidelines.

To ensure deployment reliability, you need a robust rollback plan. For traditional ML systems, use Vertex AI Model Monitoring to detect data drift and performance degradation. For generative AI models, you can track relevant metrics and set up alerts for shifts in output quality or the emergence of harmful content by using Vertex AI model evaluation along with Cloud Logging and Cloud Monitoring. Configure alerts based on generative AI-specific metrics to trigger rollback procedures when necessary. To track model lineage and revert to the most recent stable version, use insights from Vertex AI Model Registry.

Implement observability

The behavior of AI and ML systems can change over time due to changes in the data or environment and updates to the models. This dynamic nature makes observability crucial to detect performance issues, biases, or unexpected behavior. This is especially true for generative AI models because the outputs can be highly variable and subjective. Observability lets you proactively address unexpected behavior and ensure that your AI and ML systems remain reliable, accurate, and fair.

To implement observability for your AI and ML systems, consider the following recommendations.

Monitor performance continuously

Use metrics and success criteria for ongoing evaluation of models after deployment.

You can use Vertex AI Model Monitoring to proactively track model performance, identify training-serving skew and prediction drift, and receive alerts to trigger necessary model retraining or other interventions. To effectively monitor for training-serving skew, construct a golden dataset that represents the ideal data distribution, and use TFDV to analyze your training data and establish a baseline schema.

Configure Model Monitoring to compare the distribution of input data against the golden dataset for automatic skew detection. For traditional ML models, focus on metrics like accuracy, precision, recall, F1-score, AUC-ROC, and log loss. Define custom thresholds for alerts in Model Monitoring. For generative AI, use the Gen AI evaluation service to continuously monitor model output in production. You can also enable automatic evaluation metrics for response quality, safety, instruction adherence, grounding, writing style, and verbosity. To assess the generated outputs for quality, relevance, safety, and adherence to guidelines, you can incorporate human-in-the-loop evaluation.

Create feedback loops to automatically retrain models with Vertex AI Pipelines when Model Monitoring triggers an alert. Use these insights to improve your models continuously.

Evaluate models during development

Before you deploy your LLMs and other generative AI models, thoroughly evaluate them during the development phase. Use Vertex AI model evaluation to achieve optimal performance and to mitigate risk. Use Vertex AI rapid evaluation to let Google Cloud automatically run evaluations based on the dataset and prompts that you provide.

You can also define and integrate custom metrics that are specific to your use case. For feedback on generated content, integrate human-in-the-loop workflows by using Vertex AI model evaluation.
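
The following hedged sketch uses the Gen AI evaluation service through the Vertex AI SDK in a bring-your-own-response mode. The metric names and the import path can differ across SDK versions (earlier releases exposed the service under vertexai.preview.evaluation).

```python
import pandas as pd
from vertexai.evaluation import EvalTask

# A small evaluation dataset; in practice, use a representative sample
# of prompts and model responses.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the refund policy in one sentence."],
        "response": ["Refunds are available within 30 days of purchase."],
    }
)

# The metric names are assumed bundled metrics; choose a set that
# matches your use case and risk profile.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency", "safety"],
    experiment="genai-model-eval",
)

result = eval_task.evaluate()
print(result.summary_metrics)
```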

Use adversarial testing to identify vulnerabilities and potential failure modes. To identify and mitigate potential biases, use techniques like subgroup analysis and counterfactual generation. Use the insights gathered from the evaluations that were completed during the development phase to define your model monitoring strategy in production. Prepare your solution for continuous monitoring as described in the Monitor performance continuously section of this document.

Monitor for availability

To gain visibility into the health and performance of your deployed endpoints and infrastructure, use Cloud Monitoring. For your Vertex AI endpoints, track key metrics like request rate, error rate, latency, and resource utilization, and set up alerts for anomalies. For more information, see Cloud Monitoring metrics for Vertex AI.

Monitor the health of the underlying infrastructure, which can include Compute Engine instances, Google Kubernetes Engine (GKE) clusters, and TPUs and GPUs. Get automated optimization recommendations from Active Assist. If you use autoscaling, monitor the scaling behavior to ensure that autoscaling responds appropriately to changes in traffic patterns.

Track the status of model deployments, including canary releases and rollbacks, by integrating Cloud Deploy with Cloud Monitoring. In addition, monitor for potential security threats and vulnerabilities by using Security Command Center.

Set up custom alerts for business-specific thresholds

For timely identification and rectification of anomalies and issues, set up custom alerting based on thresholds that are specific to your business objectives. Examples of Google Cloud products that you can use to implement a custom alerting system include the following (a minimal alerting sketch follows the list):

  • Cloud Logging: Collect, store, and analyze logs from all components of your AI and ML system.
  • Cloud Monitoring: Create custom dashboards to visualize key metrics and trends, and define custom metrics based on your needs. Configure alerts to get notifications about critical issues, and integrate the alerts with your incident management tools like PagerDuty or Slack.
  • Error Reporting: Automatically capture and analyze errors and exceptions.
  • Cloud Trace: Analyze the performance of distributed systems and identify bottlenecks. Tracing is particularly useful for understanding latency between different components of your AI and ML pipeline.
  • Cloud Profiler: Continuously analyze the performance of your code in production and identify performance bottlenecks in CPU or memory usage.
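
As a minimal example of the Cloud Monitoring option, the following sketch creates an alert policy for online prediction latency on a Vertex AI endpoint. The threshold, duration, and metric filter are assumptions to adapt to your own objectives.

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"

# Alert when online prediction latency stays above a business-specific
# threshold (here 500 ms) for five minutes.
policy = monitoring_v3.AlertPolicy(
    display_name="Prediction latency above target",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Latency > 500 ms",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'resource.type="aiplatform.googleapis.com/Endpoint" AND '
                    'metric.type="aiplatform.googleapis.com/prediction/online/'
                    'prediction_latencies"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=500.0,
                duration={"seconds": 300},
            ),
        )
    ],
)

created_policy = client.create_alert_policy(
    name=project_name, alert_policy=policy
)
print(created_policy.name)
```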

Build a culture of operational excellence

Shift the focus from just building models to building sustainable, reliable, and impactful AI solutions. Empower teams to continuously learn, innovate, and improve, which leads to faster development cycles, reduced errors, and increased efficiency. By prioritizing automation, standardization, and ethical considerations, you can ensure that your AI and ML initiatives consistently deliver value, mitigate risks, and promote responsible AI development.

To build a culture of operational excellence for your AI and ML systems, consider the following recommendations.

Champion automation and standardization

To emphasize efficiency and consistency, embed automation and standardized practices into every stage of the AI and ML lifecycle. Automation reduces manual errors and frees teams to focus on innovation. Standardization ensures that processes are repeatable and scalable across teams and projects.

Prioritize continuous learning and improvement

Foster an environment where ongoing education and experimentation are core principles. Encourage teams to stay up-to-date with AI and ML advancements, and provide opportunities to learn from past projects. A culture of curiosity and adaptation drives innovation and ensures that teams are equipped to meet new challenges.

Cultivate accountability and ownership

Build trust and alignment with clearly defined roles, responsibilities, and metrics for success. Empower teams to make informed decisions within these boundaries, and establish transparent ways to measure progress. A sense of ownership motivates teams and ensures collective responsibility for outcomes.

Embed AI ethics and safety considerations

Prioritize considerations for ethics in every stage of development. Encourage teams to think critically about the impact of their AI solutions, and foster discussions on fairness, bias, and societal impact. Clear principles and accountability mechanisms ensure that your AI systems align with organizational values and promote trust.

Design for scalability

To accommodate growing data volumes and user demands and to maximize the value of AI investments, your AI and ML systems need to be scalable. The systems must adapt and perform optimally to avoid performance bottlenecks that hinder effectiveness. When you design for scalability, you ensure that the AI infrastructure can handle growth and maintain responsiveness. Use scalable infrastructure, plan for capacity, and employ strategies like horizontal scaling and managed services.

To design your AI and ML systems for scalability, consider the following recommendations.

Plan for capacity and quotas

Assess future growth, and plan your infrastructure capacity and resource quotas accordingly. Work with business stakeholders to understand the projected growth and then define the infrastructure requirements accordingly.

Use Cloud Monitoring to analyze historical resource utilization, identify trends, and project future needs. Conduct regular load testing to simulate workloads and identify bottlenecks.

Familiarize yourself with Google Cloud quotas for the services that you use, such as Compute Engine, Vertex AI, and Cloud Storage. Proactively request quota increases through the Google Cloud console, and justify the increases with data from forecasting and load testing. Monitor quota usage and set up alerts to get notifications when the usage approaches the quota limits.

To optimize resource usage based on demand, rightsize your resources, use Spot VMs for fault-tolerant batch workloads, and implement autoscaling.

Prepare for peak events

Ensure that your system can handle sudden spikes in traffic or workload during peak events. Document your peak event strategy and conduct regular drills to test your system's ability to handle increased load.

To aggressively scale up resources when the demand spikes, configure autoscaling policies in Compute Engine and GKE. For predictable peak patterns, consider using predictive autoscaling. To trigger autoscaling based on application-specific signals, use custom metrics in Cloud Monitoring.

Distribute traffic across multiple application instances by using Cloud Load Balancing. Choose an appropriate load balancer type based on your application's needs. For geographically distributed users, you can use global load balancing to route traffic to the nearest available instance. For complex microservices-based architectures, consider using Cloud Service Mesh.

Cache static content at the edge of Google's network by using Cloud CDN. To cache frequently accessed data, you can use Memorystore, which offers a fully managed in-memory service for Redis, Valkey, or Memcached.

Decouple the components of your system by using Pub/Sub for real-time messaging and Cloud Tasks for asynchronous task execution.
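
For example, the following sketch decouples request producers from model-serving workers with Pub/Sub. The project, topic, and subscription names are illustrative.

```python
from google.cloud import pubsub_v1

# Publish prediction requests to a topic so that producers and the
# model-serving workers can scale independently.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "prediction-requests")

future = publisher.publish(topic_path, data=b'{"user_id": "42"}')
print(f"Published message {future.result()}")

# A separate worker pulls messages and runs inference at its own pace.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-project", "prediction-workers"
)

def handle_message(message):
    # Run inference here, then acknowledge the message.
    message.ack()

# In a real worker, block on streaming_pull.result() with error handling.
streaming_pull = subscriber.subscribe(subscription_path, callback=handle_message)
```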

Scale applications for production

To ensure scalable serving in production, you can use managed services like Vertex AI distributed training and Vertex AI Prediction. Vertex AI Prediction lets you configure the machine types for your prediction nodes when you deploy a model to an endpoint or request batch predictions. For some configurations, you can add GPUs. Choose the appropriate machine type and accelerators to optimize latency, throughput, and cost.
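
The following sketch shows a hedged example of these deployment options with the Vertex AI SDK. The model ID, machine type, accelerator, and replica counts are illustrative values to adapt to your latency, throughput, and cost targets.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# Choose the machine type, accelerators, and replica bounds for the
# prediction nodes; max_replica_count enables scaling under load.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5,
)
```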

To scale complex AI and Python applications and custom workloads across distributed computing resources, you can use Ray on Vertex AI. This feature can help optimize performance and enables seamless integration with Google Cloud services. Ray on Vertex AI simplifies distributed computing by handling cluster management, task scheduling, and data transfer. It integrates with other Vertex AI services like training, prediction, and pipelines. Ray provides fault tolerance and autoscaling, and helps you adapt the infrastructure to changing workloads. It offers a unified framework for distributed training, hyperparameter tuning, reinforcement learning, and model serving. Use Ray for distributed data preprocessing with Dataflow or Dataproc, accelerated model training, scalable hyperparameter tuning, reinforcement learning, and parallelized batch prediction.

AI and ML perspective: Security

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Google Cloud Well-Architected Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.

Define clear goals and requirements

It's easier to integrate the required security and compliance controls early in your design and development process than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.

Consider the following recommendations:

  • Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
  • Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.

Consider the following recommendations:

  • Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data (see the sketch after this list).
  • Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
  • Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
  • Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
  • Guard against data poisoning for your ML training systems.
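
As one hedged example of anonymization, the following sketch uses the Sensitive Data Protection (Cloud DLP) API to replace detected identifiers before data enters an ML pipeline. The info types and sample text are illustrative.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"

# De-identify free text by replacing detected identifiers with their
# info-type names (for example, [EMAIL_ADDRESS]).
response = client.deidentify_content(
    request={
        "parent": parent,
        "item": {"value": "Contact Jane Doe at jane@example.com"},
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        }
                    }
                ]
            }
        },
    }
)

print(response.item.value)  # "Contact [PERSON_NAME] at [EMAIL_ADDRESS]"
```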

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.

Consider the following recommendations:

  • Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
  • Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
  • Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.

Consider the following recommendations:

  • Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
  • Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
  • Prefer using validated prebuilt container images that are specifically designed for AI workloads.

Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.

Consider the following recommendations:

  • Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent (see the sketch after this list).
  • Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
  • Ensure that only the intended users of a deployed system can use it.
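
For example, the following sketch applies configurable safety settings when calling a Gemini model through the Vertex AI SDK and then inspects the safety ratings of the response. The model name and thresholds are illustrative.

```python
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

# The model name is illustrative; use the version that you have validated.
model = GenerativeModel("gemini-1.5-flash")

# Block content in selected harm categories at a low threshold.
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
]

response = model.generate_content(
    "User-supplied prompt goes here",
    safety_settings=safety_settings,
)

# Inspect safety ratings to log or reject risky interactions.
for rating in response.candidates[0].safety_ratings:
    print(rating.category, rating.probability)
```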

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.

Consider the following recommendations:

  • Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
  • Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
  • Implement robust alerting and incident response procedures to address any potential issues.

AI and ML perspective: Reliability

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.

Ensure that infrastructure is scalable and highly available

By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. Your AI services then remain available to users during infrastructure outages and periods of very high traffic.

Consider the following recommendations:

  • Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
  • Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
  • Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
  • Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.

Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.

Consider the following recommendations:

  • Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
  • Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.
  • Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
  • Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.

Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.

Consider the following recommendations:

  • Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
  • Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
  • Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
  • Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
  • Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.

Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be well-formed, yet incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.

Consider the following recommendations:

  • Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
  • Implement strict access controls and audit trails to protect sensitive data and models.
  • Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
  • Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.

Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.

Consider the following recommendations:

  • Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
  • Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
  • Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
  • Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).

AI and ML perspective: Cost optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

Define and measure costs and returns

To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.

Consider the following recommendations:

  • Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
  • Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities.
  • Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.

Consider the following recommendations:

  • Use autoscaling to dynamically adjust resources for training and inference.
  • Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
  • Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
  • Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.

Enforce data management and governance practices

Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high-quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.

Consider the following recommendations:

  • Establish and adopt a well-defined data governance framework.
  • Apply labels and relevant metadata to datasets at the point of data ingestion.
  • Ensure that datasets are discoverable and accessible across the organization.
  • Make your datasets and features reusable throughout the ML lifecycle wherever possible.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs, both from a technology perspective and in terms of personnel activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.

Consider the following recommendations:

  • Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
  • Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
  • Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.

Use managed services and pre-trained or existing models

There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development. This helps you to avoid excessive costs that are associated with starting fresh every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.

Consider the following recommendations:

  • Enable exploratory and quick ML experiments by using notebook environments.
  • Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
  • Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.

Foster a culture of cost awareness and continuous optimization

Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.

Consider the following recommendations:

  • Adopt FinOps principles across your ML lifecycle.
  • Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.

AI and ML perspective: Performance optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

  • Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
  • Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
  • Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

  • Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
  • Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

  • Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
  • Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

  • Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
  • Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
  • Choose appropriate training strategies: Many pre-trained and foundation models are available, and new models are released often. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
  • Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.

Link performance metrics to design and configuration choices

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

  • Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
  • Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
