Well-Architected Framework: AI and ML perspective

Last reviewed 2025-02-14 UTC

This document in the Google Cloud Well-Architected Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.

The following sections describe principles and recommendations that are specific to AI and ML for each pillar of the Well-Architected Framework: operational excellence, security, reliability, cost optimization, and performance optimization.

Contributors

Authors:

Other contributors:

AI and ML perspective: Operational excellence

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to help you to build and operate robust AI and ML systems on Google Cloud. These recommendations help you to set up foundational elements like observability, automation, and scalability. This document's recommendations align with the operational excellence pillar of the Google Cloud Well-Architected Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the intricate AI and ML systems and pipelines that power your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that operations remain aligned with business goals.

Build a robust foundation for model development

Establish a robust foundation to streamline model development, from problem definition to deployment. Such a foundation ensures that your AI solutions are built on reliable and efficient components and choices. This kind of foundation helps you to release changes and improvements quickly and easily.

Consider the following recommendations:

  • Define the problem that the AI system solves and the outcome that you want.
  • Identify and gather relevant data that's required to train and evaluate your models. Then, clean and preprocess the raw data. Implement data validation checks to ensure data quality and integrity.
  • Choose the appropriate ML approach for the task. When you design the structure and parameters of the model, consider the model's complexity and computational requirements.
  • Adopt a version control system for code, models, and data.
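
The data validation checks in the preceding list can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the schema, the field names (age, label), the allowed ranges, and both helper functions are hypothetical.

```python
from typing import Any

# Hypothetical schema: field name -> (expected type, allowed min, allowed max).
SCHEMA = {
    "age": (int, 0, 120),
    "label": (int, 0, 1),
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, (expected_type, low, high) in SCHEMA.items():
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
            continue
        if not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


def split_valid_invalid(records):
    """Partition records so that only valid rows reach training."""
    valid, invalid = [], []
    for record in records:
        (invalid if validate_record(record) else valid).append(record)
    return valid, invalid
```

Routing invalid rows to a separate set, rather than dropping them silently, keeps data quality problems visible.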

Automate the model-development lifecycle

From data preparation and training to deployment and monitoring, automation helps you to improve the quality and efficiency of your operations. Automation makes model development and deployment repeatable and less error-prone. It also minimizes manual intervention, speeds up release cycles, and keeps environments consistent.

Consider the following recommendations:

  • Use a managed pipeline orchestration system to orchestrate and automate the ML workflow. The pipeline must handle the major steps of your development lifecycle: preparation, training, deployment, and evaluation.
  • Implement CI/CD pipelines for the model-development lifecycle. These pipelines should automate the building, testing, and deployment of models. The pipelines should also include continuous training to retrain models on new data as needed.
  • Implement phased release approaches, such as canary deployments or A/B testing, for safe and controlled model releases.
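
As an illustration of a phased release, the following sketch routes a fixed percentage of requests to a canary model version. It assumes that each request carries a stable identifier; the route_request function and the "canary" and "stable" names are hypothetical, and a managed serving platform would typically handle the traffic split for you.

```python
import hashlib


def route_request(request_id: str, canary_percent: int) -> str:
    """Deterministically assign a request to the 'canary' or 'stable' model.

    Hashing the request ID (rather than drawing a random number) keeps the
    assignment sticky: the same ID always reaches the same model version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Raising canary_percent in small steps, while you watch the model metrics, widens the release gradually. Setting it to 0 acts as a rollback.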

Implement observability

When you implement observability, you can gain deep insights into model performance, data drift, and system health. Implement continuous monitoring, alerting, and logging mechanisms to proactively identify issues, trigger timely responses, and ensure operational continuity.

Consider the following recommendations:

  • Implement continuous, automated performance monitoring for your models. Use metrics and success criteria for ongoing evaluation of the model after deployment.
  • Monitor your deployment endpoints and infrastructure to ensure service availability.
  • Set up custom alerting based on business-specific thresholds and anomalies to ensure that issues are identified and resolved in a timely manner.
  • Use explainable AI techniques to understand and interpret model outputs.
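
One common way to quantify data drift for a numeric feature is the population stability index (PSI), which compares the binned distribution of a baseline sample with a production sample. The sketch below is illustrative only. The bin edges are supplied by the caller, and the 0.2 alert threshold is an assumed rule of thumb, not a value from this document.

```python
import math


def population_stability_index(baseline, production, bin_edges, eps=1e-6):
    """Compute PSI for one numeric feature.

    bin_edges are the inner cut points, so values fall into
    len(bin_edges) + 1 bins.
    """
    if not baseline or not production:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for value in values:
            # The bin index is the number of cut points at or below the value.
            counts[sum(value >= edge for edge in bin_edges)] += 1
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(production)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


PSI_ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold


def drift_alert(baseline, production, bin_edges):
    """Return True when the drift score exceeds the alert threshold."""
    score = population_stability_index(baseline, production, bin_edges)
    return score > PSI_ALERT_THRESHOLD
```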

Build a culture of operational excellence

Operational excellence is built on a foundation of people, culture, and professional practices. The success of your team and business depends on how effectively your organization implements methodologies that enable the reliable and rapid development of AI capabilities.

Consider the following recommendations:

  • Champion automation and standardization as core development methodologies. Streamline your workflows and manage the ML lifecycle efficiently by using MLOps techniques. Automate tasks to free up time for innovation, and standardize processes to support consistency and easier troubleshooting.
  • Prioritize continuous learning and improvement. Promote learning opportunities that team members can use to enhance their skills and stay current with AI and ML advancements. Encourage experimentation and conduct regular retrospectives to identify areas for improvement.
  • Cultivate a culture of accountability and ownership. Define clear roles so that everyone understands their contributions. Empower teams to make decisions within boundaries and track progress by using transparent metrics.
  • Embed AI ethics and safety into the culture. Prioritize responsible systems by integrating ethics considerations into every stage of the ML lifecycle. Establish clear ethics principles and foster open discussions about ethics-related challenges.

Design for scalability

Architect your AI solutions to handle growing data volumes and user demands. Use scalable infrastructure so that your models can adapt and perform optimally as your project expands.

Consider the following recommendations:

  • Plan for capacity and quotas. Anticipate future growth, and plan your infrastructure capacity and resource quotas accordingly.
  • Prepare for peak events. Ensure that your system can handle sudden spikes in traffic or workload during peak events.
  • Scale AI applications for production. Design for horizontal scaling to accommodate increases in the workload. Use frameworks like Ray on Vertex AI to parallelize tasks across multiple machines.
  • Use managed services where appropriate. Use services that help you to scale while minimizing the operational overhead and complexity of manual interventions.

AI and ML perspective: Security

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Google Cloud Well-Architected Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.

Define clear goals and requirements

It's easier to integrate the required security and compliance controls early in your design and development process than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.

Consider the following recommendations:

  • Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
  • Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.

Consider the following recommendations:

  • Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data.
  • Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
  • Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
  • Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
  • Guard against data poisoning for your ML training systems.
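
The access levels and the principle of least privilege described above can be sketched as a deny-by-default check. The role names and the mapping are hypothetical. In practice, you would express these rules in your cloud provider's identity and access management system rather than in application code.

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical role-to-access mapping for a single training dataset.
ROLE_ACCESS = {
    "data_engineer": Access.WRITE,
    "ml_engineer": Access.READ,
    "analyst": Access.READ,
}


def is_allowed(role: str, required: Access) -> bool:
    """Deny by default: roles that aren't listed get Access.NONE."""
    granted = ROLE_ACCESS.get(role, Access.NONE)
    return granted >= required
```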

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.

Consider the following recommendations:

  • Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
  • Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
  • Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.
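
One simple tamper-detection measure for model artifacts is to record a cryptographic digest when an artifact is produced and to verify it before deployment. The following sketch assumes that the expected digest is stored somewhere trustworthy, such as a signed manifest. The function names are hypothetical.

```python
import hashlib
import hmac


def artifact_digest(data: bytes) -> str:
    """Return the SHA-256 digest of an artifact as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Check an artifact against the digest recorded at build time."""
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(artifact_digest(data), expected_digest)
```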

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.

Consider the following recommendations:

  • Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
  • Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
  • Prefer using validated prebuilt container images that are specifically designed for AI workloads.

Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.

Consider the following recommendations:

  • Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent.
  • Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
  • Ensure that only the intended users of a deployed system can use it.
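
To illustrate input monitoring, the following sketch combines a token-bucket rate limiter, which guards against overloaded endpoints, with a prompt-size check. The character limit, the rate values, and the function names are assumptions for illustration. Screening prompts for harmful intent requires dedicated tooling that isn't shown here.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


MAX_PROMPT_CHARS = 4000  # hypothetical limit


def screen_request(prompt: str, bucket: TokenBucket, now: float) -> str:
    """Reject oversized prompts first, then apply the rate limit."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return "rejected: prompt too long"
    if not bucket.allow(now):
        return "rejected: rate limit exceeded"
    return "accepted"
```

Passing the current time as an argument, instead of reading a clock inside the class, makes the limiter straightforward to test.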

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.

Consider the following recommendations:

  • Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
  • Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
  • Implement robust alerting and incident response procedures to address any potential issues.
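
For predictive models, a basic output check is to measure how many predictions fall outside an expected range and to alert when that share grows. The sketch below assumes numeric predictions. The range and the 5% alert threshold are hypothetical values that you would set from your own requirements.

```python
def out_of_range_fraction(predictions, low, high):
    """Return the share of predictions outside [low, high]."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    # NaN fails both comparisons, so it is counted as out of range.
    flagged = sum(not (low <= p <= high) for p in predictions)
    return flagged / len(predictions)


def should_alert(predictions, low, high, max_fraction=0.05):
    """Return True when too many predictions are extreme."""
    return out_of_range_fraction(predictions, low, high) > max_fraction
```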

AI and ML perspective: Reliability

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.

Ensure that infrastructure is scalable and highly available

By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. This means that your AI services are still available to users during infrastructure outages and when traffic is very high.

Consider the following recommendations:

  • Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
  • Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
  • Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
  • Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.
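
Dynamic scaling is often driven by a proportional rule: scale the replica count by the ratio of observed load to target load per replica, within fixed bounds. The following sketch shows that arithmetic. The function name, the load metric, and the default bounds are assumptions, and a managed autoscaler would normally apply an equivalent rule for you.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Return the replica count that brings per-replica load near the target.

    observed_load and target_load are per-replica values of the same metric,
    for example requests per second or accelerator utilization.
    """
    if current_replicas < 1 or target_load <= 0 or observed_load < 0:
        raise ValueError("invalid scaling inputs")
    raw = math.ceil(current_replicas * observed_load / target_load)
    # Clamp so that scaling never drops below or exceeds the planned capacity.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas that each observe 90 requests per second against a target of 60 would scale to six replicas.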

Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.

Consider the following recommendations:

  • Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
  • Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.
  • Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
  • Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.
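
Graceful degradation can be sketched as a wrapper that serves a simpler answer when the primary model fails. The function and the callables that it receives are hypothetical. A real system would also log the failure and emit a metric.

```python
def predict_with_fallback(features, primary, fallback):
    """Call the primary model; on failure, serve a simpler fallback answer."""
    try:
        return {"prediction": primary(features), "degraded": False}
    except Exception:
        # The 'degraded' flag lets callers and monitoring see the fallback.
        return {"prediction": fallback(features), "degraded": True}


def unavailable_model(features):
    """Stand-in for a model endpoint that is currently down."""
    raise TimeoutError("model endpoint unavailable")


# Usage: the fallback here is a constant default score.
result = predict_with_fallback([1, 2], unavailable_model, lambda f: 0)
```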

Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.

Consider the following recommendations:

  • Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
  • Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
  • Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
  • Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
  • Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.

Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be formally consistent but still incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.

Consider the following recommendations:

  • Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
  • Implement strict access controls and audit trails to protect sensitive data and models.
  • Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
  • Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.

Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.

Consider the following recommendations:

  • Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
  • Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
  • Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
  • Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).
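
As a worked example of an SLO, the following sketch computes availability and the share of the error budget that has been consumed in a measurement window. The function name and the request counts are hypothetical. The error budget is the number of failures that the SLO target permits.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize availability against an SLO for one measurement window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")
    allowed_failures = (1 - slo_target) * total_requests
    availability = 1 - failed_requests / total_requests
    return {
        "availability": availability,
        "slo_met": availability >= slo_target,
        # Values above 1.0 mean that the error budget is exhausted.
        "budget_consumed": failed_requests / allowed_failures,
    }
```

With a 99.9% target and 1,000,000 requests, the budget is about 1,000 failed requests, so 400 failures consume roughly 40% of it.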

AI and ML perspective: Cost optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

Define and measure costs and returns

To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.

Consider the following recommendations:

  • Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
  • Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities.
  • Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.
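
Cost attribution can be sketched as a grouping of billing rows by a resource label. The row structure and the ml-stage label below are hypothetical and don't reflect the schema of any actual billing export. They only illustrate the aggregation.

```python
from collections import defaultdict


def cost_by_label(billing_rows, label_key):
    """Sum cost per value of one label; unlabeled spend is kept visible."""
    totals = defaultdict(float)
    for row in billing_rows:
        value = row.get("labels", {}).get(label_key, "unlabeled")
        totals[value] += row["cost"]
    return dict(totals)


# Hypothetical billing rows for illustration.
rows = [
    {"cost": 120.0, "labels": {"ml-stage": "training"}},
    {"cost": 30.0, "labels": {"ml-stage": "serving"}},
    {"cost": 50.0, "labels": {"ml-stage": "training"}},
    {"cost": 5.0, "labels": {}},
]
totals = cost_by_label(rows, "ml-stage")
```

Reporting an explicit "unlabeled" bucket makes gaps in labeling easy to spot and fix.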

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.

Consider the following recommendations:

  • Use autoscaling to dynamically adjust resources for training and inference.
  • Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
  • Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
  • Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.

Enforce data management and governance practices

Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.

Consider the following recommendations:

  • Establish and adopt a well-defined data governance framework.
  • Apply labels and relevant metadata to datasets at the point of data ingestion.
  • Ensure that datasets are discoverable and accessible across the organization.
  • Make your datasets and features reusable throughout the ML lifecycle wherever possible.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs, both from a technology perspective and in terms of personnel activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.

Consider the following recommendations:

  • Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
  • Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
  • Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.
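
One way for a pipeline to reuse existing assets is to cache each step's result under a key derived from the step name, its parameters, and the digests of its inputs. The sketch below keeps the cache in memory and uses hypothetical names. A managed pipeline service might provide comparable caching as a built-in feature.

```python
import hashlib
import json


def cache_key(step_name, params, input_digests):
    """Derive a stable key from everything that determines a step's output."""
    payload = json.dumps(
        {"step": step_name, "params": params, "inputs": sorted(input_digests)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """In-memory cache that skips a step when an identical run already exists."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def run(self, step_name, params, input_digests, fn):
        key = cache_key(step_name, params, input_digests)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = fn()  # Only executed on a cache miss.
        self._store[key] = result
        return result
```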

Use managed services and pre-trained or existing models

There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development. This helps you to avoid excessive costs that are associated with starting fresh every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.

Consider the following recommendations:

  • Enable exploratory and quick ML experiments by using notebook environments.
  • Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
  • Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.

Foster a culture of cost awareness and continuous optimization

Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.

Consider the following recommendations:

  • Adopt FinOps principles across your ML lifecycle.
  • Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.

AI and ML perspective: Performance optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

  • Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
  • Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
  • Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.
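
For the customer churn example above, the mapping from business objectives to technical requirements can be sketched as an automated check of accuracy, recall, and 95th-percentile latency. The threshold values are hypothetical, and the percentile uses the nearest-rank method for simplicity.

```python
import math


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical requirements derived from business objectives.
REQUIREMENTS = {"min_accuracy": 0.80, "min_recall": 0.70, "max_p95_latency_ms": 200}


def evaluate_against_requirements(y_true, y_pred, latencies_ms, req=REQUIREMENTS):
    """Return a pass or fail flag for each requirement."""
    return {
        "accuracy_ok": accuracy(y_true, y_pred) >= req["min_accuracy"],
        "recall_ok": recall(y_true, y_pred) >= req["min_recall"],
        "latency_ok": percentile(latencies_ms, 95) <= req["max_p95_latency_ms"],
    }
```

Running the same check in every experiment and after every deployment keeps the evaluation reproducible and comparable.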

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

  • Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
  • Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

  • Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
  • Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

  • Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
  • Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
  • Choose appropriate training strategies: Several pre-trained and foundation models are available, and more such models are released often. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
  • Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.

Link design choices to performance outcomes

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

  • Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
  • Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
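
A lineage record can be as simple as a small immutable structure that is attached to every model output. The sketch below is illustrative. The field names, the digest helper, and the example values are hypothetical, and a managed metadata store would normally persist such records.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    model_version: str
    training_data_digest: str
    config_digest: str
    code_revision: str


def digest_of(obj) -> str:
    """Digest a JSON-serializable object independent of key order."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def tag_output(prediction, record: LineageRecord) -> dict:
    """Attach lineage to an output so that it can be traced back later."""
    return {"prediction": prediction, "lineage": asdict(record)}


# Usage with hypothetical values.
record = LineageRecord(
    model_version="churn-model-v3",
    training_data_digest=digest_of({"rows": 1000, "snapshot": "example"}),
    config_digest=digest_of({"learning_rate": 0.01, "epochs": 5}),
    code_revision="example-revision",
)
tagged = tag_output(0.87, record)
```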
