# AI and ML perspective: Performance optimization

Last updated (UTC): 2024-10-11

This document in the
[Well-Architected Framework: AI and ML perspective](/architecture/framework/perspectives/ai-ml)
provides an overview of principles and recommendations to help you optimize
the performance of your AI and ML workloads on Google Cloud.
The recommendations in this document align with the
[performance optimization pillar](/architecture/framework/performance-optimization)
of the Google Cloud Well-Architected Framework.

AI and ML systems enable new automation and decision-making capabilities for
your organization. The performance of these systems can directly affect your
business drivers like revenue, costs, and customer satisfaction. To realize the
full potential of your AI and ML systems, you need to optimize their performance
based on your business goals and technical requirements. The performance
optimization process often involves certain trade-offs. For example, a design
choice that provides the required performance might lead to higher costs. The
recommendations in this document prioritize performance over other
considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors
like the model architecture, parameters, and training strategy. When you make
these decisions, consider the entire lifecycle of the AI and ML systems and
their deployment environment. For example, very large LLMs can be highly
performant on massive training infrastructure, but they might not perform well
in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives
--------------------------------------------------

To make architectural decisions that optimize performance, start with a clear
set of business goals. Design AI and ML systems that provide the technical
performance that's required to support your business goals and priorities.
Your technical teams must understand the mapping between performance objectives
and business goals.

Consider the following recommendations:

- **Translate business objectives into technical requirements**: Translate the
  business objectives of your AI and ML systems into specific technical
  performance requirements and assess the effects of not meeting those
  requirements. For example, for an application that predicts customer churn,
  the ML model should perform well on standard metrics, like accuracy and
  recall, *and* the application should meet operational requirements like low
  latency.
- **Monitor performance at all stages of the model lifecycle**: During
  experimentation and training, and after model deployment, monitor your key
  performance indicators (KPIs) and observe any deviations from business
  objectives.
- **Automate evaluation to make it reproducible and standardized**: With a
  standardized and comparable platform and methodology for experiment
  evaluation, your engineers can increase the pace of performance improvement.

Run and track frequent experiments
----------------------------------

To transform innovation and creativity into performance improvements, you need
a culture and a platform that supports experimentation. Performance improvement
is an ongoing process because AI and ML technologies are developing
continuously and quickly. To maintain a fast-paced, iterative process, you need
to separate the experimentation space from your training and serving platforms.
A standardized and robust experimentation process is important.

Consider the following recommendations:

- **Build an experimentation environment**: Performance improvements require a
  dedicated, powerful, and interactive environment that supports the
  experimentation and collaborative development of ML pipelines.
- **Embed experimentation as a culture**: Run experiments before any production
  deployment. Release new versions iteratively and always collect performance
  data.
  Experiment with different data types, feature transformations, algorithms,
  and [hyperparameters](https://developers.google.com/machine-learning/glossary#hyperparameter).

Build and automate training and serving services
------------------------------------------------

Training and serving AI models are core components of your AI services. You
need robust platforms and practices that support fast and reliable creation,
deployment, and serving of AI models. Invest time and effort to create
foundational platforms for your core AI training and serving tasks. These
foundational platforms help to reduce time and effort for your teams and
improve the quality of outputs in the medium and long term.

Consider the following recommendations:

- **Use AI-specialized components of a training service**: Such components
  include high-performance compute and MLOps components like feature stores,
  model registries, metadata stores, and model performance-evaluation services.
- **Use AI-specialized components of a prediction service**: Such components
  provide high-performance and scalable resources, support feature monitoring,
  and enable model performance monitoring. To prevent and manage performance
  degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements
------------------------------------------------

When you make design choices to improve performance, carefully assess whether
the choices support your business requirements or are wasteful and
counterproductive. To choose the appropriate infrastructure, models, or
configurations, identify performance bottlenecks and assess how they're linked
to your performance measures.
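As a concrete illustration of linking a bottleneck to a performance measure, you can time the data-loading and compute phases of each training step separately and compare their shares of the total step time. The following is a minimal, framework-agnostic sketch; the function names are illustrative and the phase durations are simulated, so this is not part of any Google Cloud API:

```python
import time

def profile_training_step(load_batch, train_on_batch, num_steps=10):
    """Time data loading and compute separately to locate the bottleneck."""
    load_time = 0.0
    compute_time = 0.0
    for _ in range(num_steps):
        start = time.perf_counter()
        batch = load_batch()            # storage / input-pipeline phase
        load_time += time.perf_counter() - start

        start = time.perf_counter()
        train_on_batch(batch)           # accelerator compute phase
        compute_time += time.perf_counter() - start

    total = load_time + compute_time
    return {
        "load_fraction": load_time / total,
        "compute_fraction": compute_time / total,
    }

# Simulated phases: slow I/O, fast compute => an input-pipeline bottleneck.
stats = profile_training_step(
    load_batch=lambda: time.sleep(0.02) or [0] * 8,
    train_on_batch=lambda batch: time.sleep(0.005),
    num_steps=5,
)
print(f"data loading: {stats['load_fraction']:.0%} of step time")
```

If the load fraction dominates, faster accelerators won't improve throughput; the fix belongs in the storage layer or input pipeline instead.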
For example, even on very powerful GPU accelerators, your training tasks can
experience performance bottlenecks due to data I/O issues from the storage
layer or due to performance limitations of the model itself.

Consider the following recommendations:

- **Optimize hardware consumption based on performance goals**: To train and
  serve ML models that meet your performance requirements, you need to optimize
  infrastructure at the compute, storage, and network layers. You must measure
  and understand the variables that affect your performance goals. These
  variables are different for training and inference.
- **Focus on workload-specific requirements**: Focus your performance
  optimization efforts on the unique requirements of your AI and ML workloads.
  Rely on managed services for the performance of the underlying
  infrastructure.
- **Choose appropriate training strategies**: Several pre-trained and
  foundational models are available, and more such models are released often.
  Choose a training strategy that can deliver optimal performance for your
  task. Decide whether you should build your own model, tune a pre-trained
  model on your data, or use a pre-trained model API.
- **Recognize that performance-optimization strategies can have diminishing
  returns**: When a particular performance-optimization strategy doesn't
  provide incremental business value that's measurable, stop pursuing that
  strategy.

Link performance metrics to design and configuration choices
------------------------------------------------------------

To innovate, troubleshoot, and investigate performance issues, establish a
clear link between design choices and performance outcomes.
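A lightweight way to make this link concrete is to record, for every training run, the configuration and data inputs alongside the resulting metrics, so that any metric can be traced back to the choices that produced it. The sketch below uses a plain in-memory registry with hypothetical field names and an example bucket path; managed experiment trackers (for example, Vertex AI Experiments) provide the same idea as a service:

```python
import hashlib
import json

class RunRegistry:
    """Minimal lineage record: config + data inputs -> metrics."""

    def __init__(self):
        self.runs = []

    def log_run(self, config, data_uri, metrics):
        # Hash the config so identical setups are easy to spot and compare.
        config_id = hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.runs.append(
            {"config_id": config_id, "config": config,
             "data_uri": data_uri, "metrics": metrics}
        )
        return config_id

    def best_run(self, metric):
        # Trace the best value of a metric back to its design choices.
        return max(self.runs, key=lambda r: r["metrics"][metric])

registry = RunRegistry()
registry.log_run(
    config={"learning_rate": 0.001, "layers": 4},
    data_uri="gs://example-bucket/train-v1",  # hypothetical path
    metrics={"recall": 0.81, "latency_ms": 45},
)
registry.log_run(
    config={"learning_rate": 0.01, "layers": 2},
    data_uri="gs://example-bucket/train-v1",
    metrics={"recall": 0.74, "latency_ms": 30},
)

best = registry.best_run("recall")
print(best["config"])  # the design choices behind the best recall
```

Hashing the sorted config gives each distinct setup a stable identifier, which makes it easy to group repeated runs of the same design and compare their metrics.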
In addition to experimentation, you must reliably record the lineage of your
assets, deployments, model outputs, and the configurations and inputs that
produced the outputs.

Consider the following recommendations:

- **Build a data and model lineage system**: All of your deployed assets and
  their performance metrics must be linked back to the data, configurations,
  code, and the choices that resulted in the deployed systems. In addition,
  model outputs must be linked to specific model versions and how the outputs
  were produced.
- **Use explainability tools to improve model performance**: Adopt and
  standardize tools and benchmarks for model exploration and explainability.
  These tools help your ML engineers understand model behavior and improve
  performance or remove biases.

Contributors
------------

Authors:

- [Benjamin Sadik](https://www.linkedin.com/in/benjaminhaimsadik) | AI and ML Specialist Customer Engineer
- [Filipe Gracio, PhD](https://www.linkedin.com/in/filipegracio) | Customer Engineer, AI/ML Specialist

Other contributors:

- [Kumar Dhanagopal](https://www.linkedin.com/in/kumardhanagopal) | Cross-Product Solution Developer
- [Marwan Al Shawi](https://www.linkedin.com/in/marwanalshawi) | Partner Customer Engineer
- [Zach Seils](https://www.linkedin.com/in/zachseils) | Networking Specialist