Well-Architected Framework: AI and ML perspective

This document in the Google Cloud Well-Architected Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.

The following pages describe principles and recommendations that are specific to AI and ML, for each pillar of the Well-Architected Framework: operational excellence, security, reliability, cost optimization, and performance optimization.

AI and ML perspective: Operational excellence

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to build and operate robust AI and ML systems on Google Cloud. These recommendations help you set up foundational elements like observability, automation, and scalability. The recommendations in this document align with the operational excellence pillar of the Google Cloud Well-Architected Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the AI and ML systems and pipelines that help drive your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that your operations remain aligned with business goals.

The recommendations in this document are mapped to the following core principles:

  • Build a robust foundation for model development
  • Automate the model-development lifecycle
  • Implement observability
  • Build a culture of operational excellence
  • Design for scalability

Build a robust foundation for model development

To develop and deploy scalable, reliable AI systems that help you achieve your business goals, a robust model-development foundation is essential. Such a foundation enables consistent workflows, automates critical steps to reduce errors, and ensures that the models can scale with demand. A strong foundation also lets you update, improve, and retrain your models seamlessly, align their performance with business needs, deploy impactful AI solutions quickly, and adapt to changing requirements.

To build a robust foundation to develop your AI models, consider the following recommendations.

Define the problems and the required outcomes

Before you start any AI or ML project, you must have a clear understanding of the business problems to be solved and the required outcomes. Start with an outline of the business objectives and break the objectives down into measurable key performance indicators (KPIs). To organize and document your problem definitions and hypotheses in a Jupyter notebook environment, use tools like Vertex AI Workbench. To implement versioning for code and documents and to document your projects, goals, and assumptions, use tools like Git. To develop and manage prompts for generative AI applications, you can use Vertex AI Studio.

Collect and preprocess the necessary data

To implement data preprocessing and transformation, you can use Dataflow (for Apache Beam), Dataproc (for Apache Spark), or BigQuery if an SQL-based process is appropriate. To validate schemas and detect anomalies, use TensorFlow Data Validation (TFDV) and take advantage of automated data quality scans in BigQuery where applicable.
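
For example, the following minimal sketch shows how you might use TFDV to infer a baseline schema from training data and then check a newer split for anomalies. The Cloud Storage paths are illustrative.

```python
import tensorflow_data_validation as tfdv

# Generate descriptive statistics from the training data.
# The CSV path is illustrative; TFDV can also read TFRecord files.
train_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/data/train.csv"
)

# Infer a baseline schema from the statistics, then review and adjust it.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a newer split (for example, evaluation data) against the schema.
eval_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/data/eval.csv"
)
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# Surface detected anomalies, such as missing values or type mismatches.
tfdv.display_anomalies(anomalies)
```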

For generative AI, data quality includes accuracy, relevance, diversity, and alignment with the required output characteristics. In cases where real-world data is insufficient or imbalanced, you can generate synthetic data to help improve model robustness and generalization. To create synthetic datasets based on existing patterns or to augment training data for better model performance, use BigQuery DataFrames and Gemini. Synthetic data is particularly valuable for generative AI because it can help improve prompt diversity and overall model robustness. When you build datasets for fine-tuning generative AI models, consider using the synthetic data generation capabilities in Vertex AI.
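
As a hedged sketch of this approach, the following code uses BigQuery DataFrames to ask Gemini for synthetic examples. The prompts, column names, and model defaults are illustrative, and the API surface can vary by bigframes version.

```python
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator

# Prompts that ask Gemini to produce synthetic training examples.
# The prompt text is illustrative; tailor it to your domain.
prompts = bpd.DataFrame(
    {
        "prompt": [
            "Generate a realistic customer support message about a billing error.",
            "Generate a realistic customer support message about a login problem.",
        ]
    }
)

# Call Gemini through BigQuery DataFrames to generate synthetic text.
model = GeminiTextGenerator()
synthetic = model.predict(prompts)

# The result column name follows the BigQuery ML ML.GENERATE_TEXT
# convention; it may differ across library versions.
print(synthetic["ml_generate_text_llm_result"].head())
```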

For generative AI tasks like fine-tuning or reinforcement learning from human feedback (RLHF), ensure that labels accurately reflect the quality, relevance, and safety of the generated outputs.

Select an appropriate ML approach

When you design your model and parameters, consider the model's complexity and computational needs. Depending on the task (such as classification, regression, or generation), consider using Vertex AI custom training for custom model building or AutoML for simpler ML tasks. For common applications, you can also access pretrained models through Vertex AI Model Garden. You can experiment with state-of-the-art foundation models for use cases such as generating text, images, and code.

You might want to fine-tune a pretrained foundation model to achieve optimal performance for your specific use case. For high-performance requirements in custom training, configure Cloud Tensor Processing Units (TPUs) or GPU resources to accelerate the training and inference of deep-learning models, like large language models (LLMs) and diffusion models.
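
The following sketch shows one way to configure GPU-backed custom training with the Vertex AI SDK for Python. The project, script, container image, and accelerator values are assumptions for illustration.

```python
from google.cloud import aiplatform

# Project, region, and bucket values are illustrative.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

# Define a custom training job from a local training script and a
# prebuilt training container (image tag is illustrative).
job = aiplatform.CustomTrainingJob(
    display_name="churn-model-training",
    script_path="train.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-1:latest",
    requirements=["pandas", "scikit-learn"],
)

# Run the job on GPU-accelerated hardware to speed up deep-learning training.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```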

Set up version control for code, models, and data

To manage and deploy code versions effectively, use tools like GitHub or GitLab. These tools provide robust collaboration features, branching strategies, and integration with CI/CD pipelines to ensure a streamlined development process.

Use appropriate solutions to manage each artifact of your ML system, like the following examples:

  • For code artifacts like container images and pipeline components, Artifact Registry provides a scalable storage solution that can help improve security. Artifact Registry also includes versioning and can integrate with Cloud Build and Cloud Deploy.
  • To manage data artifacts, like datasets used for training and evaluation, use solutions like BigQuery or Cloud Storage for storage and versioning.
  • To store metadata and pointers to data locations, use your version control system or a separate data catalog.

To maintain the consistency and versioning of your feature data, use Vertex AI Feature Store. To track and manage model artifacts, including binaries and metadata, use Vertex AI Model Registry, which lets you store, organize, and deploy model versions seamlessly.
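
For example, you might register model versions as in the following hedged sketch, where parent_model groups a new upload under an existing Model Registry entry. The artifact paths and serving container are illustrative.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload a trained model to Vertex AI Model Registry.
model_v1 = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/v1/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

# Register a new version under the same model by passing parent_model,
# which keeps the version history in one place.
model_v2 = aiplatform.Model.upload(
    parent_model=model_v1.resource_name,
    artifact_uri="gs://my-bucket/models/churn/v2/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)
```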

To ensure model reliability, implement Vertex AI Model Monitoring to detect data drift, track performance, and identify anomalies in production. For generative AI systems, monitor shifts in output quality and safety compliance.

Automate the model-development lifecycle

Automation helps you to streamline every stage of the AI and ML lifecycle. Automation reduces manual effort and standardizes processes, which leads to enhanced operational efficiency and a lower risk of errors. Automated workflows enable faster iteration, consistent deployment across environments, and more reliable outcomes, so your systems can scale and adapt seamlessly.

To automate the development lifecycle of your AI and ML systems, consider the following recommendations.

Use a managed pipeline orchestration system

Use Vertex AI Pipelines to automate every step of the ML lifecycle—from data preparation to model training, evaluation, and deployment. To accelerate deployment and promote consistency across projects, automate recurring tasks with scheduled pipeline runs, monitor workflows with execution metrics, and develop reusable pipeline templates for standardized workflows. These capabilities extend to generative AI models, which often require specialized steps like prompt engineering, response filtering, and human-in-the-loop evaluation. For generative AI, Vertex AI Pipelines can automate these steps, including the evaluation of generated outputs against quality metrics and safety guidelines. To improve prompt diversity and model robustness, automated workflows can also include data augmentation techniques.
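
A minimal sketch of this pattern, assuming the Kubeflow Pipelines (KFP) v2 SDK: two placeholder components are compiled into a template and submitted as a Vertex AI pipeline run. The component logic, names, and paths are illustrative.

```python
from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component
def validate_data(source_uri: str) -> str:
    # Placeholder step; a real component would check schemas and statistics.
    return source_uri

@dsl.component
def train_model(dataset_uri: str) -> str:
    # Placeholder step; a real component would launch a training job.
    return f"trained-on-{dataset_uri}"

@dsl.pipeline(name="ml-lifecycle-pipeline")
def ml_pipeline(source_uri: str):
    validated = validate_data(source_uri=source_uri)
    train_model(dataset_uri=validated.output)

# Compile the pipeline definition and submit it to Vertex AI Pipelines.
compiler.Compiler().compile(pipeline_func=ml_pipeline, package_path="pipeline.json")

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="ml-lifecycle-run",
    template_path="pipeline.json",
    parameter_values={"source_uri": "gs://my-bucket/data/train.csv"},
)
job.run()
```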

Implement CI/CD pipelines

To automate the building, testing, and deployment of ML models, use Cloud Build. This service is particularly effective when you run test suites for application code, which ensures that the infrastructure, dependencies, and model packaging meet your deployment requirements.

ML systems often need additional steps beyond code testing. For example, you need to stress test the models under varying loads, perform bulk evaluations to assess model performance across diverse datasets, and validate data integrity before retraining. To simulate realistic workloads for stress tests, you can use tools like Locust, Grafana k6, or Apache JMeter. To identify bottlenecks, monitor key metrics like latency, error rate, and resource utilization through Cloud Monitoring. For generative AI, the testing must also include evaluations that are specific to the type of generated content, such as text quality, image fidelity, or code functionality. These evaluations can involve automated metrics like perplexity for language models or human-in-the-loop evaluation for more nuanced aspects like creativity and safety.
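
For example, a Locust stress test might look like the following sketch. The endpoint path and payload are assumptions; point the test at a staging deployment rather than at production.

```python
from locust import HttpUser, between, task

class PredictionUser(HttpUser):
    # Wait 1-3 seconds between requests to mimic realistic traffic.
    wait_time = between(1, 3)

    @task
    def predict(self):
        # The path and payload are illustrative; adapt them to your
        # prediction service's request format.
        self.client.post(
            "/v1/predict",
            json={"instances": [{"feature_a": 1.0, "feature_b": "web"}]},
        )
```

You can then run the test with a command like locust -f locustfile.py --host=https://your-test-host, and watch latency and error-rate metrics in Cloud Monitoring while the load runs.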

To implement testing and evaluation tasks, you can integrate Cloud Build with other Google Cloud services. For example, you can use Vertex AI Pipelines for automated model evaluation, BigQuery for large-scale data analysis, and Dataflow for feature validation in your pipelines.

You can further enhance your CI/CD pipeline by using Vertex AI for continuous training to enable automated retraining of models on new data. Specifically for generative AI, to keep the generated outputs relevant and diverse, the retraining might involve automatically updating the models with new training data or prompts. You can use Vertex AI Model Garden to select the latest base models that are available for tuning. This practice ensures that the models remain current and optimized for your evolving business needs.

Implement safe and controlled model releases

To minimize risks and ensure reliable deployments, implement a model release approach that lets you detect issues early, validate performance, and roll back quickly when required.

To package your ML models and applications into container images and deploy them, use Cloud Deploy. You can deploy your models to Vertex AI endpoints.

Implement controlled releases for your AI applications and systems by using strategies like canary releases. For applications that use managed models like Gemini, we recommend that you gradually release new application versions to a subset of users before the full deployment. This approach lets you detect potential issues early, especially when you use generative AI models where outputs can vary.

To release fine-tuned models, you can use Cloud Deploy to manage the deployment of the model versions, and use the canary release strategy to minimize risk. With managed models and fine-tuned models, the goal of controlled releases is to test changes with a limited audience before you release the applications and models to all users.
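
For models that are served on Vertex AI endpoints, one way to implement a canary-style release is an endpoint traffic split, as in the following sketch. The resource IDs and machine type are illustrative.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)
new_model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/0987654321"
)

# Route 10% of traffic to the new model version; the remaining 90% stays
# with the versions that are already deployed on the endpoint.
endpoint.deploy(
    model=new_model,
    traffic_percentage=10,
    machine_type="n1-standard-4",
    min_replica_count=1,
)
```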

For robust validation, use Vertex AI Experiments to compare new models against existing ones, and use Vertex AI model evaluation to assess model performance. Specifically for generative AI, define evaluation metrics that align with the intended use case and the potential risks. You can use the Gen AI evaluation service in Vertex AI to assess metrics like toxicity, coherence, factual accuracy, and adherence to safety guidelines.

To ensure deployment reliability, you need a robust rollback plan. For traditional ML systems, use Vertex AI Model Monitoring to detect data drift and performance degradation. For generative AI models, you can track relevant metrics and set up alerts for shifts in output quality or the emergence of harmful content by using Vertex AI model evaluation along with Cloud Logging and Cloud Monitoring. Configure alerts based on generative AI-specific metrics to trigger rollback procedures when necessary. To track model lineage and revert to the most recent stable version, use insights from Vertex AI Model Registry.

Implement observability

The behavior of AI and ML systems can change over time due to changes in the data or environment and updates to the models. This dynamic nature makes observability crucial to detect performance issues, biases, or unexpected behavior. This is especially true for generative AI models because the outputs can be highly variable and subjective. Observability lets you proactively address unexpected behavior and ensure that your AI and ML systems remain reliable, accurate, and fair.

To implement observability for your AI and ML systems, consider the following recommendations.

Monitor performance continuously

Use metrics and success criteria for ongoing evaluation of models after deployment.

You can use Vertex AI Model Monitoring to proactively track model performance, identify training-serving skew and prediction drift, and receive alerts to trigger necessary model retraining or other interventions. To effectively monitor for training-serving skew, construct a golden dataset that represents the ideal data distribution, and use TFDV to analyze your training data and establish a baseline schema.

Configure Model Monitoring to compare the distribution of input data against the golden dataset for automatic skew detection. For traditional ML models, focus on metrics like accuracy, precision, recall, F1-score, AUC-ROC, and log loss. Define custom thresholds for alerts in Model Monitoring. For generative AI, use the Gen AI evaluation service to continuously monitor model output in production. You can also enable automatic evaluation metrics for response quality, safety, instruction adherence, grounding, writing style, and verbosity. To assess the generated outputs for quality, relevance, safety, and adherence to guidelines, you can incorporate human-in-the-loop evaluation.

Create feedback loops to automatically retrain models with Vertex AI Pipelines when Model Monitoring triggers an alert. Use these insights to improve your models continuously.

Evaluate models during development

Before you deploy your LLMs and other generative AI models, thoroughly evaluate them during the development phase. Use Vertex AI model evaluation to achieve optimal performance and to mitigate risk. Use Vertex AI rapid evaluation to let Google Cloud automatically run evaluations based on the dataset and prompts that you provide.

You can also define and integrate custom metrics that are specific to your use case. For feedback on generated content, integrate human-in-the-loop workflows by using Vertex AI model evaluation.
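
The following hedged sketch uses the Gen AI evaluation service through the Vertex AI SDK in a bring-your-own-response mode. The metric names and the import path can differ across SDK versions (earlier releases exposed the service under vertexai.preview.evaluation).

```python
import pandas as pd
from vertexai.evaluation import EvalTask

# A small evaluation dataset; in practice, use a representative sample
# of prompts and model responses.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the refund policy in one sentence."],
        "response": ["Refunds are available within 30 days of purchase."],
    }
)

# The metric names are assumed bundled metrics; choose a set that
# matches your use case and risk profile.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency", "safety"],
    experiment="genai-model-eval",
)

result = eval_task.evaluate()
print(result.summary_metrics)
```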

Use adversarial testing to identify vulnerabilities and potential failure modes. To identify and mitigate potential biases, use techniques like subgroup analysis and counterfactual generation. Use the insights gathered from the evaluations that were completed during the development phase to define your model monitoring strategy in production. Prepare your solution for continuous monitoring as described in the Monitor performance continuously section of this document.

Monitor for availability

To gain visibility into the health and performance of your deployed endpoints and infrastructure, use Cloud Monitoring. For your Vertex AI endpoints, track key metrics like request rate, error rate, latency, and resource utilization, and set up alerts for anomalies. For more information, see Cloud Monitoring metrics for Vertex AI.

Monitor the health of the underlying infrastructure, which can include Compute Engine instances, Google Kubernetes Engine (GKE) clusters, and TPUs and GPUs. Get automated optimization recommendations from Active Assist. If you use autoscaling, monitor the scaling behavior to ensure that autoscaling responds appropriately to changes in traffic patterns.

Track the status of model deployments, including canary releases and rollbacks, by integrating Cloud Deploy with Cloud Monitoring. In addition, monitor for potential security threats and vulnerabilities by using Security Command Center.

Set up custom alerts for business-specific thresholds

For timely identification and rectification of anomalies and issues, set up custom alerting based on thresholds that are specific to your business objectives. Examples of Google Cloud products that you can use to implement a custom alerting system include the following (a minimal alerting sketch follows the list):

  • Cloud Logging: Collect, store, and analyze logs from all components of your AI and ML system.
  • Cloud Monitoring: Create custom dashboards to visualize key metrics and trends, and define custom metrics based on your needs. Configure alerts to get notifications about critical issues, and integrate the alerts with your incident management tools like PagerDuty or Slack.
  • Error Reporting: Automatically capture and analyze errors and exceptions.
  • Cloud Trace: Analyze the performance of distributed systems and identify bottlenecks. Tracing is particularly useful for understanding latency between different components of your AI and ML pipeline.
  • Cloud Profiler: Continuously analyze the performance of your code in production and identify performance bottlenecks in CPU or memory usage.
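
As a minimal example of the Cloud Monitoring option, the following sketch creates an alert policy for online prediction latency on a Vertex AI endpoint. The threshold, duration, and metric filter are assumptions to adapt to your own objectives.

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"

# Alert when online prediction latency stays above a business-specific
# threshold (here 500 ms) for five minutes.
policy = monitoring_v3.AlertPolicy(
    display_name="Prediction latency above target",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Latency > 500 ms",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'resource.type="aiplatform.googleapis.com/Endpoint" AND '
                    'metric.type="aiplatform.googleapis.com/prediction/online/'
                    'prediction_latencies"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=500.0,
                duration={"seconds": 300},
            ),
        )
    ],
)

created_policy = client.create_alert_policy(
    name=project_name, alert_policy=policy
)
print(created_policy.name)
```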

Build a culture of operational excellence

Shift the focus from just building models to building sustainable, reliable, and impactful AI solutions. Empower teams to continuously learn, innovate, and improve, which leads to faster development cycles, reduced errors, and increased efficiency. By prioritizing automation, standardization, and ethical considerations, you can ensure that your AI and ML initiatives consistently deliver value, mitigate risks, and promote responsible AI development.

To build a culture of operational excellence for your AI and ML systems, consider the following recommendations.

Champion automation and standardization

To emphasize efficiency and consistency, embed automation and standardized practices into every stage of the AI and ML lifecycle. Automation reduces manual errors and frees teams to focus on innovation. Standardization ensures that processes are repeatable and scalable across teams and projects.

Prioritize continuous learning and improvement

Foster an environment where ongoing education and experimentation are core principles. Encourage teams to stay up-to-date with AI and ML advancements, and provide opportunities to learn from past projects. A culture of curiosity and adaptation drives innovation and ensures that teams are equipped to meet new challenges.

Cultivate accountability and ownership

Build trust and alignment with clearly defined roles, responsibilities, and metrics for success. Empower teams to make informed decisions within these boundaries, and establish transparent ways to measure progress. A sense of ownership motivates teams and ensures collective responsibility for outcomes.

Embed AI ethics and safety considerations

Prioritize considerations for ethics in every stage of development. Encourage teams to think critically about the impact of their AI solutions, and foster discussions on fairness, bias, and societal impact. Clear principles and accountability mechanisms ensure that your AI systems align with organizational values and promote trust.

Design for scalability

To accommodate growing data volumes and user demands and to maximize the value of AI investments, your AI and ML systems need to be scalable. The systems must adapt and perform optimally to avoid performance bottlenecks that hinder effectiveness. When you design for scalability, you ensure that the AI infrastructure can handle growth and maintain responsiveness. Use scalable infrastructure, plan for capacity, and employ strategies like horizontal scaling and managed services.

To design your AI and ML systems for scalability, consider the following recommendations.

Plan for capacity and quotas

Assess future growth, and plan your infrastructure capacity and resource quotas accordingly. Work with business stakeholders to understand the projected growth and then define the infrastructure requirements accordingly.

Use Cloud Monitoring to analyze historical resource utilization, identify trends, and project future needs. Conduct regular load testing to simulate workloads and identify bottlenecks.

Familiarize yourself with Google Cloud quotas for the services that you use, such as Compute Engine, Vertex AI, and Cloud Storage. Proactively request quota increases through the Google Cloud console, and justify the increases with data from forecasting and load testing. Monitor quota usage and set up alerts to get notifications when the usage approaches the quota limits.

To optimize resource usage based on demand, rightsize your resources, use Spot VMs for fault-tolerant batch workloads, and implement autoscaling.

Prepare for peak events

Ensure that your system can handle sudden spikes in traffic or workload during peak events. Document your peak event strategy and conduct regular drills to test your system's ability to handle increased load.

To aggressively scale up resources when the demand spikes, configure autoscaling policies in Compute Engine and GKE. For predictable peak patterns, consider using predictive autoscaling. To trigger autoscaling based on application-specific signals, use custom metrics in Cloud Monitoring.

Distribute traffic across multiple application instances by using Cloud Load Balancing. Choose an appropriate load balancer type based on your application's needs. For geographically distributed users, you can use global load balancing to route traffic to the nearest available instance. For complex microservices-based architectures, consider using Cloud Service Mesh.

Cache static content at the edge of Google's network by using Cloud CDN. To cache frequently accessed data, you can use Memorystore, which offers a fully managed in-memory service for Redis, Valkey, or Memcached.

Decouple the components of your system by using Pub/Sub for real-time messaging and Cloud Tasks for asynchronous task execution.
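
For example, the following sketch decouples request producers from model-serving workers with Pub/Sub. The project, topic, and subscription names are illustrative.

```python
from google.cloud import pubsub_v1

# Publish prediction requests to a topic so that producers and the
# model-serving workers can scale independently.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "prediction-requests")

future = publisher.publish(topic_path, data=b'{"user_id": "42"}')
print(f"Published message {future.result()}")

# A separate worker pulls messages and runs inference at its own pace.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-project", "prediction-workers"
)

def handle_message(message):
    # Run inference here, then acknowledge the message.
    message.ack()

# In a real worker, block on streaming_pull.result() with error handling.
streaming_pull = subscriber.subscribe(subscription_path, callback=handle_message)
```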

Scale applications for production

To ensure scalable serving in production, you can use managed services like Vertex AI distributed training and Vertex AI Prediction. Vertex AI Prediction lets you configure the machine types for your prediction nodes when you deploy a model to an endpoint or request batch predictions. For some configurations, you can add GPUs. Choose the appropriate machine type and accelerators to optimize latency, throughput, and cost.
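
The following sketch shows a hedged example of these deployment options with the Vertex AI SDK. The model ID, machine type, accelerator, and replica counts are illustrative values to adapt to your latency, throughput, and cost targets.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# Choose the machine type, accelerators, and replica bounds for the
# prediction nodes; max_replica_count enables scaling under load.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5,
)
```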

To scale complex AI and Python applications and custom workloads across distributed computing resources, you can use Ray on Vertex AI. This feature can help optimize performance and enables seamless integration with Google Cloud services. Ray on Vertex AI simplifies distributed computing by handling cluster management, task scheduling, and data transfer. It integrates with other Vertex AI services like training, prediction, and pipelines. Ray provides fault tolerance and autoscaling, and helps you adapt the infrastructure to changing workloads. It offers a unified framework for distributed training, hyperparameter tuning, reinforcement learning, and model serving. Use Ray for distributed data preprocessing with Dataflow or Dataproc, accelerated model training, scalable hyperparameter tuning, reinforcement learning, and parallelized batch prediction.

AI and ML perspective: Security

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Google Cloud Well-Architected Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.

Define clear goals and requirements

It's easier to integrate the required security and compliance controls early in your design and development process than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.

Consider the following recommendations:

  • Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
  • Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.

Consider the following recommendations:

  • Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data (see the sketch after this list).
  • Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
  • Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
  • Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
  • Guard against data poisoning for your ML training systems.
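
As one hedged example of anonymization, the following sketch uses the Sensitive Data Protection (Cloud DLP) API to replace detected identifiers before data enters an ML pipeline. The info types and sample text are illustrative.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"

# De-identify free text by replacing detected identifiers with their
# info-type names (for example, [EMAIL_ADDRESS]).
response = client.deidentify_content(
    request={
        "parent": parent,
        "item": {"value": "Contact Jane Doe at jane@example.com"},
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        }
                    }
                ]
            }
        },
    }
)

print(response.item.value)  # "Contact [PERSON_NAME] at [EMAIL_ADDRESS]"
```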

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.

Consider the following recommendations:

  • Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
  • Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
  • Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.

Consider the following recommendations:

  • Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
  • Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
  • Prefer using validated prebuilt container images that are specifically designed for AI workloads.

Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.

Consider the following recommendations:

  • Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent (see the sketch after this list).
  • Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
  • Ensure that only the intended users of a deployed system can use it.
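
For example, the following sketch applies configurable safety settings when calling a Gemini model through the Vertex AI SDK and then inspects the safety ratings of the response. The model name and thresholds are illustrative.

```python
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

# The model name is illustrative; use the version that you have validated.
model = GenerativeModel("gemini-1.5-flash")

# Block content in selected harm categories at a low threshold.
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
]

response = model.generate_content(
    "User-supplied prompt goes here",
    safety_settings=safety_settings,
)

# Inspect safety ratings to log or reject risky interactions.
for rating in response.candidates[0].safety_ratings:
    print(rating.category, rating.probability)
```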

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.

Consider the following recommendations:

  • Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
  • Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
  • Implement robust alerting and incident response procedures to address any potential issues.

AI and ML perspective: Reliability

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.

Ensure that infrastructure is scalable and highly available

By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. Your AI services then remain available to users during infrastructure outages and periods of very high traffic.

Consider the following recommendations:

  • Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
  • Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
  • Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
  • Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.

Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.

Consider the following recommendations:

  • Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
  • Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.
  • Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
  • Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.

Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.

Consider the following recommendations:

  • Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
  • Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
  • Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
  • Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
  • Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.

Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be well-formed, yet incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.

Consider the following recommendations:

  • Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
  • Implement strict access controls and audit trails to protect sensitive data and models.
  • Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
  • Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.

Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.

Consider the following recommendations:

  • Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
  • Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
  • Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
  • Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).

AI and ML perspective: Cost optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

Define and measure costs and returns

To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.

Consider the following recommendations:

  • Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
  • Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities.
  • Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.

Consider the following recommendations:

  • Use autoscaling to dynamically adjust resources for training and inference.
  • Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
  • Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
  • Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.

Enforce data management and governance practices

Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high-quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.

Consider the following recommendations:

  • Establish and adopt a well-defined data governance framework.
  • Apply labels and relevant metadata to datasets at the point of data ingestion.
  • Ensure that datasets are discoverable and accessible across the organization.
  • Make your datasets and features reusable throughout the ML lifecycle wherever possible.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs, both from a technology perspective and in terms of personnel activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.

Consider the following recommendations:

  • Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
  • Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
  • Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.

Use managed services and pre-trained or existing models

There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development. This helps you to avoid excessive costs that are associated with starting fresh every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.

Consider the following recommendations:

  • Enable exploratory and quick ML experiments by using notebook environments.
  • Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
  • Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.

Foster a culture of cost awareness and continuous optimization

Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.

Consider the following recommendations:

  • Adopt FinOps principles across your ML lifecycle.
  • Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.

AI and ML perspective: Performance optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

  • Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
  • Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
  • Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

  • Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
  • Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

  • Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
  • Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

  • Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
  • Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
  • Choose appropriate training strategies: Many pre-trained and foundation models are available, and new models are released often. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
  • Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.

Link performance metrics to design and configuration choices

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

  • Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
  • Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
