AI and ML perspective: Operational excellence

Last reviewed 2024-10-11 UTC

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations for building and operating robust AI and ML systems on Google Cloud. These recommendations help you set up foundational elements like observability, automation, and scalability. The recommendations in this document align with the operational excellence pillar of the Architecture Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the intricate AI and ML systems and pipelines that power your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that operations remain aligned with business goals.

Build a robust foundation for model development

Establish a robust foundation to streamline model development, from problem definition to deployment. Such a foundation ensures that your AI solutions are built on reliable, efficient components and sound design choices, and it helps you release changes and improvements quickly and easily.

Consider the following recommendations:

  • Define the problem that the AI system solves and the outcome that you want.
  • Identify and gather relevant data that's required to train and evaluate your models. Then, clean and preprocess the raw data. Implement data validation checks to ensure data quality and integrity.
  • Choose the appropriate ML approach for the task. When you design the structure and parameters of the model, consider the model's complexity and computational requirements.
  • Adopt a version control system for code, models, and data.
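The data-validation check in the second recommendation can be sketched as follows. This is an illustrative, minimal example; the schema fields, ranges, and function names are hypothetical and not part of any Google Cloud API:

```python
# Illustrative data-validation sketch. The required fields and value ranges
# below are hypothetical examples of business-specific quality rules.

REQUIRED_FIELDS = {"user_id", "age", "purchase_amount"}


def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "age" in record and not (0 <= record["age"] <= 120):
        errors.append(f"age out of range: {record['age']}")
    if "purchase_amount" in record and record["purchase_amount"] < 0:
        errors.append(f"negative purchase_amount: {record['purchase_amount']}")
    return errors


def filter_valid(records: list[dict]):
    """Split raw records into (valid, rejected-with-reasons) before training."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected
```

Rejected records can be logged or quarantined for inspection rather than silently dropped, which preserves an audit trail for data-quality issues.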

Automate the model-development lifecycle

From data preparation and training to deployment and monitoring, automation improves the quality and efficiency of your operations. It makes model development and deployment seamless and repeatable, reduces errors, minimizes manual intervention, speeds up release cycles, and ensures consistency across environments.

Consider the following recommendations:

  • Use a managed pipeline orchestration system to orchestrate and automate the ML workflow. The pipeline must handle the major steps of your development lifecycle: data preparation, training, evaluation, and deployment.
  • Implement CI/CD pipelines for the model-development lifecycle. These pipelines should automate the building, testing, and deployment of models. The pipelines should also include continuous training to retrain models on new data as needed.
  • Implement phased release approaches, such as canary deployments or A/B testing, for safe and controlled model releases.
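The canary-release idea in the last recommendation can be sketched with a deterministic traffic splitter. This is a minimal illustration, not a managed-service API; in practice a serving platform's traffic-split configuration would play this role:

```python
import hashlib


def pick_variant(request_id: str, canary_fraction: float = 0.1) -> str:
    """Deterministically route a request to the 'canary' or 'stable' model.

    Hashing the request ID gives a stable, uniform bucket in [0, 1), so the
    same request always hits the same variant, which keeps experiments and
    debugging consistent. The 10% default fraction is an arbitrary example.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "canary" if bucket < canary_fraction else "stable"
```

If monitored metrics for the canary stay healthy, the fraction can be raised gradually until the new model serves all traffic; if they degrade, setting the fraction to zero rolls back instantly.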

Implement observability

When you implement observability, you can gain deep insights into model performance, data drift, and system health. Implement continuous monitoring, alerting, and logging mechanisms to proactively identify issues, trigger timely responses, and ensure operational continuity.

Consider the following recommendations:

  • Implement continuous, automated performance monitoring for your models. Define metrics and success criteria for ongoing evaluation of the model after deployment.
  • Monitor your deployment endpoints and infrastructure to ensure service availability.
  • Set up custom alerting based on business-specific thresholds and anomalies to ensure that issues are identified and resolved in a timely manner.
  • Use explainable AI techniques to understand and interpret model outputs.
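The threshold-based alerting in the recommendations above can be sketched as follows. The metric names and threshold values are hypothetical stand-ins for your business-specific criteria:

```python
# Illustrative alerting sketch. Metric names and limits below are
# hypothetical examples of business-specific thresholds.

THRESHOLDS = {
    "accuracy": {"min": 0.90},          # alert if accuracy drops below 90%
    "p99_latency_ms": {"max": 250.0},   # alert if tail latency exceeds 250 ms
    "prediction_drift": {"max": 0.2},   # alert if the drift score exceeds 0.2
}


def check_metrics(metrics: dict) -> list[str]:
    """Compare current metrics against thresholds and return alert messages."""
    alerts = []
    for name, limits in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: metric missing")
            continue
        if "min" in limits and value < limits["min"]:
            alerts.append(f"{name}={value} below minimum {limits['min']}")
        if "max" in limits and value > limits["max"]:
            alerts.append(f"{name}={value} above maximum {limits['max']}")
    return alerts
```

In a production system, a non-empty alert list would be forwarded to a paging or incident-management channel, and could also trigger automated responses such as retraining or rollback.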

Build a culture of operational excellence

Operational excellence is built on a foundation of people, culture, and professional practices. The success of your team and business depends on how effectively your organization implements methodologies that enable the reliable and rapid development of AI capabilities.

Consider the following recommendations:

  • Champion automation and standardization as core development methodologies. Streamline your workflows and manage the ML lifecycle efficiently by using MLOps techniques. Automate tasks to free up time for innovation, and standardize processes to support consistency and easier troubleshooting.
  • Prioritize continuous learning and improvement. Promote learning opportunities that team members can use to enhance their skills and stay current with AI and ML advancements. Encourage experimentation and conduct regular retrospectives to identify areas for improvement.
  • Cultivate a culture of accountability and ownership. Define clear roles so that everyone understands their contributions. Empower teams to make decisions within boundaries and track progress by using transparent metrics.
  • Embed AI ethics and safety into the culture. Prioritize responsible systems by integrating ethics considerations into every stage of the ML lifecycle. Establish clear ethics principles and foster open discussions about ethics-related challenges.

Design for scalability

Architect your AI solutions to handle growing data volumes and user demands. Use scalable infrastructure so that your models can adapt and perform optimally as your project expands.

Consider the following recommendations:

  • Plan for capacity and quotas. Anticipate future growth, and plan your infrastructure capacity and resource quotas accordingly.
  • Prepare for peak events. Ensure that your system can handle sudden spikes in traffic or workload during peak events.
  • Scale AI applications for production. Design for horizontal scaling to accommodate increases in the workload. Use frameworks like Ray on Vertex AI to parallelize tasks across multiple machines.
  • Use managed services where appropriate. Use services that help you to scale while minimizing the operational overhead and complexity of manual interventions.
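The horizontal-scaling recommendation can be illustrated in miniature with a thread pool that fans batches out across workers. This is a single-machine sketch with a stand-in `predict` function; at cluster scale, a framework such as Ray on Vertex AI distributes the same pattern across multiple machines:

```python
from concurrent.futures import ThreadPoolExecutor


def predict(batch: list[float]) -> list[float]:
    """Stand-in for model inference on one batch (hypothetical scoring)."""
    return [x * 2 for x in batch]


def parallel_predict(batches: list[list[float]], max_workers: int = 4):
    """Fan batches out across worker threads and collect results in order.

    The key design point is that each batch is independent, so throughput
    scales by adding workers; a distributed framework applies the same idea
    across machines instead of threads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict, batches))
```

Because `map` preserves input order, results line up with the original batches even though they may complete out of order.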
