The operational excellence pillar in the Google Cloud Well-Architected Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.
The operational excellence pillar is relevant to the following audiences:
- Managers and leaders: A framework to establish and maintain operational excellence in the cloud and to ensure that cloud investments deliver value and support business objectives.
- Cloud operations teams: Guidance to manage incidents and problems, plan capacity, optimize performance, and manage change.
- Site reliability engineers (SREs): Best practices that help you to achieve high levels of service reliability, including monitoring, incident response, and automation.
- Cloud architects and engineers: Operational requirements and best practices for the design and implementation phases, to help ensure that solutions are designed for operational efficiency and scalability.
- DevOps teams: Guidance about automation, CI/CD pipelines, and change management, to help enable faster and more reliable software delivery.
To achieve operational excellence, you should embrace automation, orchestration, and data-driven insights. Automation helps to eliminate toil. It also streamlines and builds guardrails around repetitive tasks. Orchestration helps to coordinate complex processes. Data-driven insights enable evidence-based decision-making. By using these practices, you can optimize cloud operations, reduce costs, improve service availability, and enhance security.
Operational excellence in the cloud goes beyond technical proficiency in cloud operations. It includes a cultural shift that encourages continuous learning and experimentation. Teams must be empowered to innovate, iterate, and adopt a growth mindset. A culture of operational excellence fosters a collaborative environment where individuals are encouraged to share ideas, challenge assumptions, and drive improvement.
For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Well-Architected Framework.
Core principles
The recommendations in the operational excellence pillar of the Well-Architected Framework are mapped to the following core principles:
- Ensure operational readiness and performance using CloudOps: Ensure that cloud solutions meet operational and performance requirements by defining service level objectives (SLOs) and by performing comprehensive monitoring, performance testing, and capacity planning.
- Manage incidents and problems: Minimize the impact of cloud incidents and prevent recurrence through comprehensive observability, clear incident response procedures, thorough retrospectives, and preventive measures.
- Manage and optimize cloud resources: Optimize and manage cloud resources through strategies like right-sizing, autoscaling, and by using effective cost monitoring tools.
- Automate and manage change: Automate processes, streamline change management, and alleviate the burden of manual labor.
- Continuously improve and innovate: Focus on ongoing enhancements and the introduction of new solutions to stay competitive.
Contributors
Authors:
- Ryan Cox | Principal Architect
- Hadrian Knotz | Enterprise Architect
Other contributors:
- Daniel Lees | Cloud Security Architect
- Filipe Gracio, PhD | Customer Engineer
- Gary Harmson | Customer Engineer
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Zach Seils | Networking Specialist
- Wade Holmes | Global Solutions Director
Ensure operational readiness and performance using CloudOps
This principle in the operational excellence pillar of the Google Cloud Well-Architected Framework helps you to ensure operational readiness and performance of your cloud workloads. It emphasizes establishing clear expectations and commitments for service performance, implementing robust monitoring and alerting, conducting performance testing, and proactively planning for capacity needs.
Principle overview
Different organizations might interpret operational readiness differently. Operational readiness is how your organization prepares to successfully operate workloads on Google Cloud. Preparing to operate a complex, multilayered cloud workload requires careful planning for both go-live and day-2 operations. These operations are often called CloudOps.
Focus areas of operational readiness
Operational readiness consists of four focus areas. Each focus area consists of a set of activities and components that are necessary to prepare to operate a complex application or environment in Google Cloud. The following table lists the components and activities of each focus area:
| Focus area of operational readiness | Activities and components |
|---|---|
| Workforce | |
| Processes | |
| Tooling | Tools that are required to support CloudOps processes. |
| Governance | |
Recommendations
To ensure operational readiness and performance by using CloudOps, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Define SLOs and SLAs
A core responsibility of the cloud operations team is to define service level objectives (SLOs) and service level agreements (SLAs) for all of your critical workloads. This recommendation is relevant to the governance focus area of operational readiness.
SLOs must be specific, measurable, achievable, relevant, and time-bound (SMART), and they must reflect the level of service and performance that you want.
- Specific: Clearly articulates the required level of service and performance.
- Measurable: Quantifiable and trackable.
- Achievable: Attainable within the limits of your organization's capabilities and resources.
- Relevant: Aligned with business goals and priorities.
- Time-bound: Has a defined timeframe for measurement and evaluation.
For example, an SLO for a web application might be "99.9% availability" or "average response time less than 200 ms." Such SLOs clearly define the required level of service and performance for the web application, and the SLOs can be measured and tracked over time.
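The availability SLO in this example can be expressed as a simple computation over request counts. The following is a minimal sketch, assuming hypothetical sample counts rather than data from a real monitoring system:

```python
# Sketch: compute an availability SLI and the remaining error budget
# against the example "99.9% availability" SLO. The request counts
# below are made-up sample data, not real monitoring output.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the SLO as met
    return good_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative if blown)."""
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

In practice, Cloud Monitoring computes SLIs like this from service metrics; the sketch only illustrates the arithmetic behind the SLO.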
SLAs outline the commitments to customers regarding service availability, performance, and support, including any penalties or remedies for noncompliance. SLAs must include specific details about the services that are provided, the level of service that can be expected, the responsibilities of both the service provider and the customer, and any penalties or remedies for noncompliance. SLAs serve as a contractual agreement between the two parties, ensuring that both have a clear understanding of the expectations and obligations that are associated with the cloud service.
Google Cloud provides tools like Cloud Monitoring and service level indicators (SLIs) to help you define and track SLOs. Cloud Monitoring provides comprehensive monitoring and observability capabilities that enable your organization to collect and analyze metrics that are related to the availability, performance, and latency of cloud-based applications and services. SLIs are specific metrics that you can use to measure and track SLOs over time. By using these tools, you can effectively monitor and manage cloud services, and ensure that they meet the SLOs and SLAs.
Clearly defining and communicating SLOs and SLAs for all of your critical cloud services helps to ensure reliability and performance of your deployed applications and services.
Implement comprehensive observability
To get real-time visibility into the health and performance of your cloud environment, we recommend that you use a combination of Google Cloud Observability tools and third-party solutions. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Implementing a combination of observability solutions provides you with a comprehensive observability strategy that covers various aspects of your cloud infrastructure and applications. Google Cloud Observability is a unified platform for collecting, analyzing, and visualizing metrics, logs, and traces from various Google Cloud services, applications, and external sources. By using Cloud Monitoring, you can gain insights into resource utilization, performance characteristics, and overall health of your resources.
To ensure comprehensive coverage, monitor important metrics that align with system health indicators, such as CPU utilization, memory usage, network traffic, disk I/O, and application response times. You must also consider business-specific metrics. By tracking these metrics, you can identify potential bottlenecks, performance issues, and resource constraints. Additionally, you can set up alerts to proactively notify relevant teams about potential issues or anomalies.
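The threshold-based alerting described above can be sketched as a small evaluation function. The metric names and threshold values here are illustrative assumptions; in practice, Cloud Monitoring alerting policies perform this evaluation server-side:

```python
# Sketch: evaluate collected metric samples against alert thresholds.
# Metric names and limits are hypothetical examples, not defaults of
# any monitoring product.
THRESHOLDS = {
    "cpu_utilization": 0.80,      # fraction of vCPU in use
    "memory_usage": 0.85,         # fraction of RAM in use
    "p95_response_time_ms": 200,  # application response time
}

def check_metrics(samples: dict) -> list:
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = samples.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric} at {value} exceeds threshold {limit}")
    return alerts

alerts = check_metrics({"cpu_utilization": 0.91, "p95_response_time_ms": 150})
```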
To enhance your monitoring capabilities further, you can integrate third-party solutions with Google Cloud Observability. These solutions can provide additional functionality, such as advanced analytics, machine learning-powered anomaly detection, and incident management capabilities. This combination of Google Cloud Observability tools and third-party solutions lets you create a robust and customizable monitoring ecosystem that's tailored to your specific needs. By using this combination approach, you can proactively identify and address issues, optimize resource utilization, and ensure the overall reliability and availability of your cloud applications and services.
Implement performance and load testing
Performing regular performance testing helps you to ensure that your cloud-based applications and infrastructure can handle peak loads and maintain optimal performance. Load testing simulates realistic traffic patterns. Stress testing pushes the system to its limits to identify potential bottlenecks and performance limitations. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Tools like Cloud Load Balancing and load testing services can help you to simulate real-world traffic patterns and stress-test your applications. These tools provide valuable insights into how your system behaves under various load conditions, and can help you to identify areas that require optimization.
Based on the results of performance testing, you can make decisions to optimize your cloud infrastructure and applications for optimal performance and scalability. This optimization might involve adjusting resource allocation, tuning configurations, or implementing caching mechanisms.
For example, if you find that your application is experiencing slowdowns during periods of high traffic, you might need to increase the number of virtual machines or containers that are allocated to the application. Alternatively, you might need to adjust the configuration of your web server or database to improve performance.
By regularly conducting performance testing and implementing the necessary optimizations, you can ensure that your cloud-based applications and infrastructure always run at peak performance, and deliver a seamless and responsive experience for your users. Doing so can help you to maintain a competitive advantage and build trust with your customers.
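A load test at its core fires concurrent requests and summarizes the latency distribution. The following sketch targets a local stub handler as a stand-in for a real endpoint; a real test would use a load testing service against your deployed application:

```python
# Sketch: a tiny concurrent load generator that reports latency
# percentiles. The handler is a local stub (10 ms of simulated work),
# not a real service.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def handler() -> None:
    """Stub for the system under test."""
    time.sleep(0.01)  # simulate 10 ms of work per request

def timed_call(_):
    start = time.perf_counter()
    handler()
    return (time.perf_counter() - start) * 1000  # latency in ms

def run_load_test(concurrency: int, requests: int) -> dict:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": statistics.quantiles(latencies, n=20)[-1],
    }

results = run_load_test(concurrency=8, requests=40)
```

Comparing p50 and p95 under increasing concurrency is one way to spot the bottlenecks that the preceding paragraphs describe.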
Plan and manage capacity
Proactively planning for future capacity needs, both organic and inorganic, helps you to ensure the smooth operation and scalability of your cloud-based systems. This recommendation is relevant to the processes focus area of operational readiness.
Planning for future capacity includes understanding and managing quotas for various resources like compute instances, storage, and API requests. By analyzing historical usage patterns, growth projections, and business requirements, you can accurately anticipate future capacity requirements. You can use tools like Cloud Monitoring and BigQuery to collect and analyze usage data, identify trends, and forecast future demand.
Historical usage patterns provide valuable insights into resource utilization over time. By examining metrics like CPU utilization, memory usage, and network traffic, you can identify periods of high demand and potential bottlenecks. Additionally, you can help to estimate future capacity needs by making growth projections based on factors like growth in the user base, new products and features, and marketing campaigns. When you assess capacity needs, you should also consider business requirements like SLAs and performance targets.
When you determine the resource sizing for a workload, consider factors that can affect utilization of resources. Seasonal variations like holiday shopping periods or end-of-quarter sales can lead to temporary spikes in demand. Planned events like product launches or marketing campaigns can also significantly increase traffic. To make sure that your primary and disaster recovery (DR) system can handle unexpected surges in demand, plan for capacity that can support graceful failover during disruptions like natural disasters and cyberattacks.
Autoscaling is an important strategy for dynamically adjusting your cloud resources based on workload fluctuations. By using autoscaling policies, you can automatically scale compute instances, storage, and other resources in response to changing demand. This ensures optimal performance during peak periods while minimizing costs when resource utilization is low. Autoscaling algorithms use metrics like CPU utilization, memory usage, and queue depth to determine when to scale resources.
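The forecasting step described above can be sketched with a simple linear trend over historical usage. Real capacity planning would use richer data (seasonality, launches, campaigns) exported from Cloud Monitoring or BigQuery; the numbers below are made up:

```python
# Sketch: project future capacity needs by extrapolating a
# least-squares linear trend over historical monthly usage.
def linear_forecast(usage: list, months_ahead: int) -> float:
    """Fit y = a + b*x to the history and extrapolate months_ahead."""
    n = len(usage)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + months_ahead)

history = [120, 135, 150, 165, 180]  # hypothetical vCPUs used per month
needed = linear_forecast(history, months_ahead=3)
```

A straight-line fit is deliberately naive; it illustrates the idea of turning historical usage into a forward-looking capacity estimate, not a production forecasting method.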
Continuously monitor and optimize
To manage and optimize cloud workloads, you must establish a process for continuously monitoring and analyzing performance metrics. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
To establish a process for continuous monitoring and analysis, you track, collect, and evaluate data that's related to various aspects of your cloud environment. By using this data, you can proactively identify areas for improvement, optimize resource utilization, and ensure that your cloud infrastructure consistently meets or exceeds your performance expectations.
An important aspect of performance monitoring is regularly reviewing logs and traces. Logs provide valuable insights into system events, errors, and warnings. Traces provide detailed information about the flow of requests through your application. By analyzing logs and traces, you can detect potential issues, identify the root causes of problems, and get a better understanding of how your applications behave under different conditions. Metrics like the round-trip time between services can help you to identify and understand bottlenecks in your workloads.
Further, you can use performance-tuning techniques to significantly enhance application response times and overall efficiency. The following are examples of techniques that you can use:
- Caching: Store frequently accessed data in memory to reduce the need for repeated database queries or API calls.
- Database optimization: Use techniques like indexing and query optimization to improve the performance of database operations.
- Code profiling: Identify areas of your code that consume excessive resources or cause performance issues.
By applying these techniques, you can optimize your applications and ensure that they run efficiently in the cloud.
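The first technique in the list, caching, can be illustrated with a few lines of code. This is a minimal sketch in which the "database read" is a hypothetical stub; the call counter exists only to show that repeated calls are served from the cache:

```python
# Sketch: memoize an expensive lookup so repeated calls skip the
# (hypothetical) database query. The backend here is a stub.
from functools import lru_cache

CALLS = {"count": 0}  # tracks how often the "backend" is actually hit

@lru_cache(maxsize=1024)
def get_user_profile(user_id: int) -> dict:
    """Stand-in for a costly database or API read."""
    CALLS["count"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}

get_user_profile(42)
get_user_profile(42)  # served from the cache; no second backend call
```

The same pattern applies at larger scales with services like Memorystore, where the cache lives outside the application process.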
Manage incidents and problems
This principle in the operational excellence pillar of the Google Cloud Well-Architected Framework provides recommendations to help you manage incidents and problems related to your cloud workloads. It involves implementing comprehensive monitoring and observability, establishing clear incident response procedures, conducting thorough root cause analysis, and implementing preventive measures. Many of the topics that are discussed in this principle are covered in detail in the Reliability pillar.
Principle overview
Incident management and problem management are important components of a functional operations environment. How you respond to, categorize, and solve incidents of differing severity can significantly affect your operations. You must also proactively and continuously make adjustments to optimize reliability and performance. An efficient process for incident and problem management relies on the following foundational elements:
- Continuous monitoring: Identify and resolve issues quickly.
- Automation: Streamline tasks and improve efficiency.
- Orchestration: Coordinate and manage cloud resources effectively.
- Data-driven insights: Optimize cloud operations and make informed decisions.
These elements help you to build a resilient cloud environment that can handle a wide range of challenges and disruptions. These elements can also help to reduce the risk of costly incidents and downtime, and they can help you to achieve greater business agility and success. These foundational elements are spread across the four focus areas of operational readiness: workforce, processes, tooling, and governance.
Recommendations
To manage incidents and problems effectively, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Establish clear incident response procedures
Clear roles and responsibilities are essential to ensure effective and coordinated response to incidents. Additionally, clear communication protocols and escalation paths help to ensure that information is shared promptly and effectively during an incident. This recommendation is relevant to these focus areas of operational readiness: workforce, processes, and tooling.
To establish incident response procedures, you need to define the roles and expectations of each team member, such as incident commanders, investigators, communicators, and technical experts. Establishing communication and escalation paths includes identifying important contacts, setting up communication channels, and defining the process for escalating incidents to higher levels of management when necessary. Regular training and preparation help to ensure that teams are equipped with the knowledge and skills to respond to incidents effectively.
By documenting incident response procedures in a runbook or playbook, you can provide a standardized reference guide for teams to follow during an incident. The runbook must outline the steps to be taken at each stage of the incident response process, including communication, triage, investigation, and resolution. It must also include information about relevant tools and resources and contact information for important personnel. You must regularly review and update the runbook to ensure that it remains current and effective.
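A runbook's stages can also be encoded as data so that tooling can display or track the response checklist. The stage names, owners, and actions below are illustrative placeholders, not a prescribed structure:

```python
# Sketch: represent the runbook stages described above as data, so a
# tool can surface the next open step during an incident. All names
# and actions are hypothetical examples.
RUNBOOK = [
    {"stage": "communication", "owner": "communicator",
     "action": "Open the incident channel and notify stakeholders."},
    {"stage": "triage", "owner": "incident commander",
     "action": "Assess severity and assign an investigator."},
    {"stage": "investigation", "owner": "investigator",
     "action": "Gather logs and traces; identify probable cause."},
    {"stage": "resolution", "owner": "technical expert",
     "action": "Apply the fix and verify service recovery."},
]

def next_step(completed):
    """Return the first stage not yet completed, or None when done."""
    for step in RUNBOOK:
        if step["stage"] not in completed:
            return step
    return None
```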
Centralize incident management
For effective tracking and management throughout the incident lifecycle, consider using a centralized incident management system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
A centralized incident management system provides the following advantages:
- Improved visibility: By consolidating all incident-related data in a single location, you eliminate the need for teams to search in various channels or systems for context. This approach saves time and reduces confusion, and it gives stakeholders a comprehensive view of the incident, including its status, impact, and progress.
- Better coordination and collaboration: A centralized system provides a unified platform for communication and task management. It promotes seamless collaboration between the different departments and functions that are involved in incident response. This approach ensures that everyone has access to up-to-date information and it reduces the risk of miscommunication and misalignment.
- Enhanced accountability and ownership: A centralized incident management system enables your organization to allocate tasks to specific individuals or teams and it ensures that responsibilities are clearly defined and tracked. This approach promotes accountability and encourages proactive problem-solving because team members can easily monitor their progress and contributions.
A centralized incident management system must offer robust features for incident tracking, task assignment, and communication management. These features let you customize workflows, set priorities, and integrate with other systems, such as monitoring tools and ticketing systems.
By implementing a centralized incident management system, you can optimize your organization's incident response processes, improve collaboration, and enhance visibility. Doing so leads to faster incident resolution times, reduced downtime, and improved customer satisfaction. It also helps foster a culture of continuous improvement because you can learn from past incidents and identify areas for improvement.
Conduct thorough post-incident reviews
After an incident occurs, you must conduct a detailed post-incident review (PIR), which is also known as a postmortem, to identify the root cause, contributing factors, and lessons learned. This thorough review helps you to prevent similar incidents in the future. This recommendation is relevant to these focus areas of operational readiness: processes and governance.
The PIR process must involve a multidisciplinary team that has expertise in various aspects of the incident. The team must gather all of the relevant information through interviews, documentation review, and site inspections. A timeline of events must be created to establish the sequence of actions that led up to the incident.
After the team gathers the required information, they must conduct a root cause analysis to determine the factors that led to the incident. This analysis must identify both the immediate cause and the systemic issues that contributed to the incident.
Along with identifying the root cause, the PIR team must identify any other factors that contributed to the incident. These factors could include human error, equipment failure, or organizational factors like communication breakdowns and lack of training.
The PIR report must document the findings of the investigation, including the timeline of events, root cause analysis, and recommended actions. The report is a valuable resource for implementing corrective actions and preventing recurrence. The report must be shared with all of the relevant stakeholders and it must be used to develop safety training and procedures.
To ensure a successful PIR process, your organization must foster a blameless culture that focuses on learning and improvement rather than assigning blame. This culture encourages individuals to report incidents without fear of retribution, and it lets you address systemic issues and make meaningful improvements.
By conducting thorough PIRs and implementing corrective measures based on the findings, you can significantly reduce the risk of similar incidents occurring in the future. This proactive approach to incident investigation and prevention helps to create a safer and more efficient work environment for everyone involved.
Maintain a knowledge base
A knowledge base of known issues, solutions, and troubleshooting guides is essential for incident management and resolution. Team members can use the knowledge base to quickly identify and address common problems. Implementing a knowledge base helps to reduce the need for escalation and it improves overall efficiency. This recommendation is relevant to these focus areas of operational readiness: workforce and processes.
A primary benefit of a knowledge base is that it lets teams learn from past experiences and avoid repeating mistakes. By capturing and sharing solutions to known issues, teams can build a collective understanding of how to resolve common problems and best practices for incident management. Use of a knowledge base saves time and effort, and helps to standardize processes and ensure consistency in incident resolution.
Along with helping to improve incident resolution times, a knowledge base promotes knowledge sharing and collaboration across teams. With a central repository of information, teams can easily access and contribute to the knowledge base, which promotes a culture of continuous learning and improvement. This culture encourages teams to share their expertise and experiences, leading to a more comprehensive and valuable knowledge base.
To create and manage a knowledge base effectively, use appropriate tools and technologies. Collaboration platforms like Google Workspace are well-suited for this purpose because they let you easily create, edit, and share documents collaboratively. These tools also support version control and change tracking, which ensures that the knowledge base remains up-to-date and accurate.
Make the knowledge base easily accessible to all relevant teams. You can achieve this by integrating the knowledge base with existing incident management systems or by providing a dedicated portal or intranet site. A knowledge base that's readily available lets teams quickly access the information that they need to resolve incidents efficiently. This availability helps to reduce downtime and minimize the impact on business operations.
Regularly review and update the knowledge base to ensure that it remains relevant and useful. Monitor incident reports, identify common issues and trends, and incorporate new solutions and troubleshooting guides into the knowledge base. An up-to-date knowledge base helps your teams resolve incidents faster and more effectively.
Automate incident response
Automation helps to streamline your incident response and remediation processes. It lets you address security breaches and system failures promptly and efficiently. By using Google Cloud products like Cloud Run functions or Cloud Run, you can automate various tasks that are typically manual and time-consuming. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Automated incident response provides the following benefits:
- Reduction in incident detection and resolution times: Automated tools can continuously monitor systems and applications, detect suspicious or anomalous activities in real time, and notify stakeholders or respond without human intervention. This automation lets you identify potential threats or issues before they escalate into major incidents. When an incident is detected, automated tools can trigger predefined remediation actions, such as isolating affected systems, quarantining malicious files, or rolling back changes to restore the system to a known good state.
- Reduced burden on security and operations teams: Automated incident response lets the security and operations teams focus on more strategic tasks. By automating routine and repetitive tasks, such as collecting diagnostic information or triggering alerts, your organization can free up personnel to handle more complex and critical incidents. This automation can lead to improved overall incident response effectiveness and efficiency.
- Enhanced consistency and accuracy of the remediation process: Automated tools can ensure that remediation actions are applied uniformly across all affected systems, minimizing the risk of human error or inconsistency. This standardization of the remediation process helps to minimize the impact of incidents on users and the business.
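The detect-then-remediate pattern described above can be sketched as a small dispatcher in the style of a Cloud Run function handler. The incident types and action names are hypothetical; a real handler would call Google Cloud APIs to carry out each action:

```python
# Sketch: map a detected incident type to a predefined remediation
# action. Incident types and actions are illustrative placeholders.
REMEDIATIONS = {
    "anomalous_login": "isolate_affected_system",
    "malware_detected": "quarantine_file",
    "bad_deployment": "roll_back_release",
}

def handle_incident(event: dict) -> dict:
    """Entry point that a monitoring alert could invoke with a payload."""
    action = REMEDIATIONS.get(event.get("type", ""), "page_oncall")
    # A real handler would execute the action here and record the outcome
    # in the centralized incident management system.
    return {"incident": event.get("type"), "action": action}

result = handle_incident({"type": "bad_deployment", "severity": "high"})
```

Unknown incident types fall through to paging the on-call engineer, which keeps a human in the loop for anything the automation does not recognize.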
Manage and optimize cloud resources
This principle in the operational excellence pillar of the Google Cloud Well-Architected Framework provides recommendations to help you manage and optimize the resources that are used by your cloud workloads. It involves right-sizing resources based on actual usage and demand, using autoscaling for dynamic resource allocation, implementing cost optimization strategies, and regularly reviewing resource utilization and costs. Many of the topics that are discussed in this principle are covered in detail in the Cost optimization pillar.
Principle overview
Cloud resource management and optimization play a vital role in optimizing cloud spending, resource usage, and infrastructure efficiency. These practices include various strategies and best practices aimed at maximizing the value and return from your cloud spending.
This pillar's focus on optimization extends beyond cost reduction. It emphasizes the following goals:
- Efficiency: Using automation and data analytics to achieve peak performance and cost savings.
- Performance: Scaling resources effortlessly to meet fluctuating demands and deliver optimal results.
- Scalability: Adapting infrastructure and processes to accommodate rapid growth and diverse workloads.
By focusing on these goals, you achieve a balance between cost and functionality. You can make informed decisions regarding resource provisioning, scaling, and migration. Additionally, you gain valuable insights into resource consumption patterns, which lets you proactively identify and address potential issues before they escalate.
Recommendations
To manage and optimize resources, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Right-size resources
Continuously monitoring resource utilization and adjusting resource allocation to match actual demand are essential for efficient cloud resource management. Over-provisioning resources can lead to unnecessary costs, and under-provisioning can cause performance bottlenecks that affect application performance and user experience. To achieve an optimal balance, you must adopt a proactive approach to right-sizing cloud resources. This recommendation is relevant to the governance focus area of operational readiness.
Cloud Monitoring and Recommender can help you to identify opportunities for right-sizing. Cloud Monitoring provides real-time visibility into resource utilization metrics. This visibility lets you track resource usage patterns and identify potential inefficiencies. Recommender analyzes resource utilization data to make intelligent recommendations for optimizing resource allocation. By using these tools, you can gain insights into resource usage and make informed decisions about right-sizing the resources.
In addition to Cloud Monitoring and Recommender, consider using custom metrics to trigger automated right-sizing actions. Custom metrics let you track specific resource utilization metrics that are relevant to your applications and workloads. You can also configure alerts to notify administrators when predefined thresholds are met. The administrators can then take necessary actions to adjust resource allocation. This proactive approach ensures that resources are scaled in a timely manner, which helps to optimize cloud costs and prevent performance issues.
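A right-sizing decision boils down to comparing observed utilization against a target band. The thresholds and doubling/halving steps below are illustrative assumptions, not Recommender's actual algorithm:

```python
# Sketch: suggest a new vCPU count from recent average CPU utilization.
# Scale up above 70% average utilization, scale down below 20%; both
# thresholds are hypothetical examples.
def rightsize(avg_cpu: float, current_vcpus: int) -> int:
    """Return a suggested vCPU count for the observed utilization."""
    if avg_cpu > 0.70:                        # under-provisioned
        return current_vcpus * 2
    if avg_cpu < 0.20 and current_vcpus > 1:  # over-provisioned
        return max(1, current_vcpus // 2)
    return current_vcpus                      # within the target band

suggested = rightsize(avg_cpu=0.12, current_vcpus=8)  # lightly loaded VM
```

Logic like this could run on a custom metric threshold, as the preceding paragraph describes, to turn an alert into a concrete resize suggestion for an administrator to review.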
Use autoscaling
Autoscaling compute and other resources helps to ensure optimal performance and cost efficiency of your cloud-based applications. Autoscaling lets you dynamically adjust the capacity of your resources based on workload fluctuations, so that you have the resources that you need when you need them and you can avoid over-provisioning and unnecessary costs. This recommendation is relevant to the processes focus area of operational readiness.
To meet the diverse needs of different applications and workloads, Google Cloud offers various autoscaling options, including the following:
- Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. With MIGs, you can define autoscaling policies that specify the minimum and maximum number of VMs to maintain in the group, and the conditions that trigger autoscaling. For example, you can configure a policy to add VMs in a MIG when the CPU utilization reaches a certain threshold and to remove VMs when the utilization drops below a different threshold.
- Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application's needs. It offers the following tools:
  - Cluster Autoscaler adds or removes nodes based on Pod resource demands.
  - Horizontal Pod Autoscaler changes the number of Pod replicas based on CPU, memory, or custom metrics.
  - Vertical Pod Autoscaler fine-tunes Pod resource requests and limits based on usage patterns.
  - Node Auto-Provisioning automatically creates optimized node pools for your workloads.

  These tools work together to optimize resource utilization, ensure application performance, and simplify cluster management.
- Cloud Run is a serverless platform that lets you run code without having to manage infrastructure. Cloud Run offers built-in autoscaling, which automatically adjusts the number of instances based on the incoming traffic. When the volume of traffic increases, Cloud Run scales up the number of instances to handle the load. When traffic decreases, Cloud Run scales down the number of instances to reduce costs.
By using these autoscaling options, you can ensure that your cloud-based applications have the resources that they need to handle varying workloads, while avoiding over-provisioning and unnecessary costs. Using autoscaling can lead to improved performance, cost savings, and more efficient use of cloud resources.
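The scale-out and scale-in behavior described above can be modeled as target tracking: scale the group so that average utilization moves toward a target. The following Python sketch is a simplified illustration in the spirit of MIG CPU-based autoscaling; the formula, parameter names, and default values are assumptions, not the exact algorithm that Compute Engine uses.

```python
import math

# Simplified target-tracking autoscaler model. Parameter values are
# illustrative; real MIG autoscaling also applies cooldowns and smoothing.
def desired_replicas(current, cpu_util, target_util=0.6, min_r=2, max_r=10):
    """Return the replica count that moves average CPU toward target_util."""
    want = math.ceil(current * cpu_util / target_util)
    return max(min_r, min(max_r, want))
```

For example, a group of 4 VMs at 90% average CPU with a 60% target would grow to 6 replicas, while the minimum and maximum bounds prevent the group from collapsing to zero or growing without limit.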
Use cost-optimization strategies
Optimizing cloud spending helps you to effectively manage your organization's IT budgets. This recommendation is relevant to the governance focus area of operational readiness.
Google Cloud offers several tools and techniques to help you optimize cloud costs and get the best value from your cloud spending. These tools help you to identify areas where costs can be reduced, such as underutilized resources, and they can recommend more cost-effective instance types. Google Cloud options to help optimize cloud costs include the following:
- Committed use discounts (CUDs) are discounts for committing to a certain level of usage over a period of time.
- Sustained use discounts in Compute Engine are automatic discounts that you receive for running specific resources for a significant portion of the billing month.
- Spot VMs provide access to unused VM capacity at a lower cost compared to regular VMs.
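The effect of these options on a monthly bill can be sketched with simple arithmetic. In the following Python example, the hourly price and the discount percentages are purely hypothetical: actual CUD, sustained use, and Spot discounts vary by machine family, region, and commitment term, so always check current Google Cloud pricing.

```python
# Illustrative comparison only. The hourly price and discount rates below
# are assumptions for the example, not published Google Cloud pricing.
ON_DEMAND_HOURLY = 0.10  # hypothetical VM price (USD per hour)
DISCOUNTS = {"on_demand": 0.0, "1yr_cud": 0.37, "spot": 0.60}  # assumed rates

def monthly_cost(hours, option):
    """Cost for the given usage hours under one pricing option."""
    return round(ON_DEMAND_HOURLY * hours * (1 - DISCOUNTS[option]), 2)
```

Under these assumed rates, a VM that runs for a full month (about 730 hours) would cost 73.00 on demand, 45.99 with a one-year commitment, and 29.20 on Spot capacity, which illustrates why matching the pricing model to the workload's interruption tolerance matters.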
Pricing models might change over time, and new features might be introduced that offer better performance or lower cost compared to existing options. Therefore, you should regularly review pricing models and consider alternative features. By staying informed about the latest pricing models and features, you can make informed decisions about your cloud architecture to minimize costs.
Google Cloud's Cost Management tools, such as budgets and alerts, provide valuable insights into cloud spending. These tools let you set spending thresholds and receive notifications when actual or forecasted costs approach or exceed those thresholds, which helps you to track your cloud spending and identify areas where costs can be reduced.
Track resource usage and costs
You can use tagging and labeling to track resource usage and costs. By assigning tags and labels to your cloud resources, you can categorize and organize them by project, department, or other relevant dimensions. This lets you monitor and analyze spending patterns for specific resources and identify areas of high usage or potential cost savings. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.
Tools like Cloud Billing and Cost Management help you to get a comprehensive understanding of your spending patterns. These tools provide detailed insights into your cloud usage and they let you identify trends, forecast costs, and make informed decisions. By analyzing historical data and current spending patterns, you can identify the focus areas for your cost-optimization efforts.
Custom dashboards and reports help you to visualize cost data and gain deeper insights into spending trends. By customizing dashboards with relevant metrics and dimensions, you can monitor key performance indicators (KPIs) and track progress towards your cost optimization goals. Reports offer deeper analyses of cost data. Reports let you filter the data by specific time periods or resource types to understand the underlying factors that contribute to your cloud spending.
Regularly review and update your tags, labels, and cost analysis tools to ensure that you have the most up-to-date information on your cloud usage and costs. By staying informed and conducting cost postmortems or proactive cost reviews, you can promptly identify any unexpected increases in spending. Doing so lets you make proactive decisions to optimize cloud resources and control costs.
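Label-based cost tracking ultimately reduces to grouping billing rows by a label key. The following Python sketch uses rows shaped loosely like a Cloud Billing export; the row structure, label keys, and cost figures are hypothetical and exist only to show the aggregation pattern.

```python
from collections import defaultdict

# Hypothetical rows shaped loosely like a Cloud Billing export: each row
# carries a cost and the resource labels that you assigned.
rows = [
    {"cost": 120.0, "labels": {"team": "search", "env": "prod"}},
    {"cost": 30.0,  "labels": {"team": "search", "env": "dev"}},
    {"cost": 75.0,  "labels": {"team": "ads",    "env": "prod"}},
]

def cost_by_label(rows, key):
    """Sum costs grouped by the value of one label key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["labels"].get(key, "unlabeled")] += row["cost"]
    return dict(totals)
```

In practice you would run this kind of grouping as a SQL query over a billing export in BigQuery, but the logic is the same: one label key becomes one cost-reporting dimension.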
Establish cost allocation and budgeting
Accountability and transparency in cloud cost management are crucial for optimizing resource utilization and ensuring financial control. This recommendation is relevant to the governance focus area of operational readiness.
To ensure accountability and transparency, you need to have clear mechanisms for cost allocation and chargeback. By allocating costs to specific teams, projects, or individuals, your organization can ensure that each of these entities is responsible for its cloud usage. This practice fosters a sense of ownership and encourages responsible resource management. Additionally, chargeback mechanisms enable your organization to recover cloud costs from internal customers, align incentives with performance, and promote fiscal discipline.
Establishing budgets for different teams or projects is another essential aspect of cloud cost management. Budgets enable your organization to define spending limits and track actual expenses against those limits. This approach lets you make proactive decisions to prevent uncontrolled spending. By setting realistic and achievable budgets, you can ensure that cloud resources are used efficiently and aligned with business objectives. Regular monitoring of actual spending against budgets helps you to identify variances and address potential overruns promptly.
To monitor budgets, you can use tools like Cloud Billing budgets and alerts. These tools provide real-time insights into cloud spending and they notify stakeholders of potential overruns. By using these capabilities, you can track cloud costs and take corrective actions before significant deviations occur. This proactive approach helps to prevent financial surprises and ensures that cloud resources are used responsibly.
Automate and manage change
This principle in the operational excellence pillar of the Google Cloud Well-Architected Framework provides recommendations to help you automate and manage change for your cloud workloads. It involves implementing infrastructure as code (IaC), establishing standard operating procedures, implementing a structured change management process, and using automation and orchestration.
Principle overview
Change management and automation play a crucial role in ensuring smooth and controlled transitions within cloud environments. For effective change management, you need to use strategies and best practices that minimize disruptions and ensure that changes are integrated seamlessly with existing systems.
Effective change management and automation include the following foundational elements:
- Change governance: Establish clear policies and procedures for change management, including approval processes and communication plans.
- Risk assessment: Identify potential risks associated with changes and mitigate them through risk management techniques.
- Testing and validation: Thoroughly test changes to ensure that they meet functional and performance requirements and mitigate potential regressions.
- Controlled deployment: Implement changes in a controlled manner, ensuring that users are transitioned to the new environment seamlessly, with mechanisms to roll back if needed.
These foundational elements help to minimize the impact of changes and ensure that changes have a positive effect on business operations. These elements are represented by the processes, tooling, and governance focus areas of operational readiness.
Recommendations
To automate and manage change, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Adopt IaC
Infrastructure as code (IaC) is a transformative approach for managing cloud infrastructure. You can define and manage cloud infrastructure declaratively by using tools like Terraform. IaC helps you achieve consistency, repeatability, and simplified change management. It also enables faster and more reliable deployments. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
The following are the main benefits of adopting the IaC approach for your cloud deployments:
- Human-readable resource configurations: With the IaC approach, you can declare your cloud infrastructure resources in a human-readable format, like HCL, JSON, or YAML. Infrastructure administrators and operators can easily understand and modify the infrastructure and collaborate with others.
- Consistency and repeatability: IaC enables consistency and repeatability in your infrastructure deployments. You can ensure that your infrastructure is provisioned and configured the same way every time, regardless of who is performing the deployment. This approach helps to reduce errors and ensures that your infrastructure is always in a known state.
- Accountability and simplified troubleshooting: The IaC approach helps to improve accountability and makes it easier to troubleshoot issues. By storing your IaC code in a version control system, you can track changes, and identify when changes were made and by whom. If necessary, you can easily roll back to previous versions.
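As a minimal sketch of the declarative style that these benefits rest on, the following Terraform configuration defines a single VM. The resource name, zone, machine type, and image are placeholders, and the sketch assumes that the Terraform Google provider is already configured for your project.

```hcl
# Minimal illustrative Terraform configuration. All names and values are
# placeholders for your own project and environment.
resource "google_compute_instance" "web" {
  name         = "web-1"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```

Because the desired state lives in a file, every change to the VM is a reviewable diff in version control rather than an undocumented console edit.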
Implement version control
A version control system like Git is a key component of the IaC process. It provides robust change management and risk mitigation capabilities, whether you host the system yourself or use a SaaS offering. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.
By tracking changes to IaC code and configurations, version control provides visibility into the evolution of the code, making it easier to understand the impact of changes and identify potential issues. This enhanced visibility fosters collaboration among team members who work on the same IaC project.
Most version control systems let you easily roll back changes if needed. This capability helps to mitigate the risk of unintended consequences or errors. By using tools like Git in your IaC workflow, you can significantly improve change management processes, foster collaboration, and mitigate risks, which leads to a more efficient and reliable IaC implementation.
Build CI/CD pipelines
Continuous integration and continuous delivery (CI/CD) pipelines streamline the process of developing and deploying cloud applications. CI/CD pipelines automate the building, testing, and deployment stages, which enables faster and more frequent releases with improved quality control. This recommendation is relevant to the tooling focus area of operational readiness.
CI/CD pipelines ensure that code changes are continuously integrated into a central repository, typically a version control system like Git. Continuous integration facilitates early detection and resolution of issues, and it reduces the likelihood of bugs or compatibility problems.
To create and manage CI/CD pipelines for cloud applications, you can use tools like Cloud Build and Cloud Deploy.
- Cloud Build is a fully managed build service that lets developers define and execute build steps in a declarative manner. It integrates seamlessly with popular source-code management platforms and it can be triggered by events like code pushes and pull requests.
- Cloud Deploy is a serverless deployment service that automates the process of deploying applications to various environments, such as testing, staging, and production. It provides features like blue-green deployments, traffic splitting, and rollback capabilities, making it easier to manage and monitor application deployments.
Integrating CI/CD pipelines with version control systems and testing frameworks helps to ensure the quality and reliability of your cloud applications. By running automated tests as part of the CI/CD process, development teams can quickly identify and fix any issues before the code is deployed to the production environment. This integration helps to improve the overall stability and performance of your cloud applications.
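The test-then-build flow described above can be expressed as a Cloud Build configuration. The following `cloudbuild.yaml` sketch is illustrative: the builder images, test command, and Artifact Registry path are placeholder assumptions that you would replace with your own.

```yaml
# Illustrative cloudbuild.yaml. Image names, paths, and the test command
# are placeholders for your own repository and registry.
steps:
  # Run the automated tests first so that a failing test stops the build.
  - name: 'python:3.12'
    entrypoint: 'python'
    args: ['-m', 'pytest', 'tests/']
  # Build the container image only after the tests pass.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA', '.']
images:
  - 'us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA'
```

Because steps run in order and a non-zero exit code fails the build, the test step acts as a quality gate before any image is published.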
Use configuration management tools
Tools like Puppet, Chef, Ansible, and VM Manager help you to automate the configuration and management of cloud resources. Using these tools, you can ensure resource consistency and compliance across your cloud environments. This recommendation is relevant to the tooling focus area of operational readiness.
Automating the configuration and management of cloud resources provides the following benefits:
- Significant reduction in the risk of manual errors: When manual processes are involved, there is a higher likelihood of mistakes due to human error. Configuration management tools reduce this risk by automating processes, so that configurations are applied consistently and accurately across all cloud resources. This automation can lead to improved reliability and stability of the cloud environment.
- Improvement in operational efficiency: By automating repetitive tasks, your organization can free up IT staff to focus on more strategic initiatives. This automation can lead to increased productivity and cost savings and improved responsiveness to changing business needs.
- Simplified management of complex cloud infrastructure: As cloud environments grow in size and complexity, managing the resources can become increasingly difficult. Configuration management tools provide a centralized platform for managing cloud resources. The tools make it easier to track configurations, identify issues, and implement changes. Using these tools can lead to improved visibility, control, and security of your cloud environment.
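To illustrate the declarative, repeatable style that these tools share, here is a minimal Ansible playbook sketch. The host group, package, and service names are placeholders; the point is that the playbook states the desired end state, and Ansible makes every targeted machine converge to it.

```yaml
# Illustrative Ansible playbook. The host group and package names are
# placeholders; adapt them to your own inventory and OS.
- name: Enforce a consistent time-sync baseline on web servers
  hosts: web
  become: true
  tasks:
    - name: Ensure the chrony package is installed
      ansible.builtin.package:
        name: chrony
        state: present
    - name: Ensure the time-sync service is running and enabled
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true
```

Running the same playbook twice is safe: tasks that are already in the desired state report no change, which is what makes drift detection and remediation practical at scale.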
Automate testing
Integrating automated testing into your CI/CD pipelines helps to ensure the quality and reliability of your cloud applications. By validating changes before deployment, you can significantly reduce the risk of errors and regressions, which leads to a more stable and robust software system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
The following are the main benefits of incorporating automated testing into your CI/CD pipelines:
- Early detection of bugs and defects: Automated testing helps to detect bugs and defects early in the development process, before they can cause major problems in production. This capability saves time and resources by preventing the need for costly rework and bug fixes at later stages in the development process.
- High quality and standards-based code: Automated testing can help improve the overall quality of your code by ensuring that the code meets certain standards and best practices. This capability leads to more maintainable and reliable applications that are less prone to errors.
You can use various types of testing techniques in CI/CD pipelines. Each test type serves a specific purpose:
- Unit testing focuses on testing individual units of code, such as functions or methods, to ensure that they work as expected.
- Integration testing tests the interactions between different components or modules of your application to verify that they work properly together.
- End-to-end testing is often used along with unit and integration testing. End-to-end testing simulates real-world scenarios to test the application as a whole, and helps to ensure that the application meets the requirements of your end users.
To effectively integrate automated testing into your CI/CD pipelines, you must choose appropriate testing tools and frameworks. There are many different options, each with its own strengths and weaknesses. You must also establish a clear testing strategy that outlines the types of tests to be performed, the frequency of testing, and the criteria for passing or failing a test. By following these recommendations, you can ensure that your automated testing process is efficient and effective. Such a process provides valuable insights into the quality and reliability of your cloud applications.
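As a minimal sketch of the unit-testing layer, the following pytest-style example shows the kind of test that a CI step (for example, one that runs `pytest`) could execute on every commit. The function under test is a hypothetical helper, not part of any Google library.

```python
# Pytest-style unit tests. The function under test is illustrative and
# would normally live in your application code, not the test file.

def normalize_region(region: str) -> str:
    """Normalize a region string like ' US-Central1 ' to 'us-central1'."""
    return region.strip().lower()

def test_strips_and_lowercases():
    assert normalize_region(" US-Central1 ") == "us-central1"

def test_idempotent():
    # Normalizing twice must give the same result as normalizing once.
    assert normalize_region("europe-west1") == normalize_region(
        normalize_region("europe-west1"))
```

Because the tests are plain functions with assertions, the same files run identically on a developer's machine and in the pipeline, which keeps local and CI results consistent.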
Continuously improve and innovate
This principle in the operational excellence pillar of the Google Cloud Well-Architected Framework provides recommendations to help you continuously optimize cloud operations and drive innovation.
Principle overview
To continuously improve and innovate in the cloud, you need to focus on continuous learning, experimentation, and adaptation. This focus helps you to explore new technologies and optimize existing processes, and it promotes a culture of excellence that enables your organization to achieve and maintain industry leadership.
Through continuous improvement and innovation, you can achieve the following goals:
- Accelerate innovation: Explore new technologies and services to enhance capabilities and drive differentiation.
- Reduce costs: Identify and eliminate inefficiencies through process-improvement initiatives.
- Enhance agility: Adapt rapidly to changing market demands and customer needs.
- Improve decision making: Gain valuable insights from data and analytics to make data-driven decisions.
Organizations that embrace the continuous improvement and innovation principle can unlock the full potential of the cloud environment and achieve sustainable growth. This principle maps primarily to the workforce focus area of operational readiness. A culture of innovation lets teams experiment with new tools and technologies to expand capabilities and reduce costs.
Recommendations
To continuously improve and innovate your cloud workloads, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Foster a culture of learning
Encourage teams to experiment, share knowledge, and learn continuously. Adopt a blameless culture where failures are viewed as opportunities for growth and improvement. This recommendation is relevant to the workforce focus area of operational readiness.
When you foster a culture of learning, teams can learn from mistakes and iterate quickly. This approach encourages team members to take risks, experiment with new ideas, and expand the boundaries of their work. It also creates a psychologically safe environment where individuals feel comfortable sharing failures and learning from them. Sharing in this way leads to a more open and collaborative environment.
To facilitate knowledge sharing and continuous learning, create opportunities for teams to share knowledge and learn from each other. You can do this through informal and formal learning sessions and conferences.
By fostering a culture of experimentation, knowledge sharing, and continuous learning, you can create an environment where teams are empowered to take risks, innovate, and grow. This environment can lead to increased productivity, improved problem-solving, and a more engaged and motivated workforce. Further, by promoting a blameless culture, you can create a safe space for employees to learn from mistakes and contribute to the collective knowledge of the team. This culture ultimately leads to a more resilient and adaptable workforce that is better equipped to handle challenges and drive success in the long run.
Conduct regular retrospectives
Retrospectives give teams an opportunity to reflect on their experiences, identify what went well, and identify what can be improved. By conducting retrospectives after projects or major incidents, teams can learn from successes and failures, and continuously improve their processes and practices. This recommendation is relevant to these focus areas of operational readiness: processes and governance.
An effective way to structure a retrospective is to use the Start-Stop-Continue model:
- Start: In the Start phase of the retrospective, team members identify new practices, processes, and behaviors that they believe can enhance their work. They discuss why the changes are needed and how they can be implemented.
- Stop: In the Stop phase, team members identify and eliminate practices, processes, and behaviors that are no longer effective or that hinder progress. They discuss why these changes are necessary and how they can be implemented.
- Continue: In the Continue phase, team members identify practices, processes, and behaviors that work well and must be continued. They discuss why these elements are important and how they can be reinforced.
By using a structured format like the Start-Stop-Continue model, teams can ensure that retrospectives are productive and focused. This model helps to facilitate discussion, identify the main takeaways, and identify actionable steps for future enhancements.
Stay up to date with cloud technologies
To maximize the potential of Google Cloud services, you must keep up with the latest advancements, features, and best practices. This recommendation is relevant to the workforce focus area of operational readiness.
Participating in relevant conferences, webinars, and training sessions is a valuable way to expand your knowledge. These events provide opportunities to learn from Google Cloud experts, understand new capabilities, and engage with industry peers who might face similar challenges. By attending these sessions, you can gain insights into how to use new features effectively, optimize your cloud operations, and drive innovation within your organization.
To ensure that your team members keep up with cloud technologies, encourage them to obtain certifications and attend training courses. Google Cloud offers a wide range of certifications that validate skills and knowledge in specific cloud domains. Earning these certifications demonstrates commitment to excellence and provides tangible evidence of proficiency in cloud technologies. The training courses that are offered by Google Cloud and our partners delve deeper into specific topics. They provide hands-on experience and practical skills that can be immediately applied to real-world projects. By investing in the professional development of your team, you can foster a culture of continuous learning and ensure that everyone has the necessary skills to succeed in the cloud.
Actively seek and incorporate feedback
Collect feedback from users, stakeholders, and team members. Use the feedback to identify opportunities to improve your cloud solutions. This recommendation is relevant to the workforce focus area of operational readiness.
The feedback that you collect can help you to understand the evolving needs, issues, and expectations of the users of your solutions. This feedback serves as a valuable input to drive improvements and prioritize future enhancements. You can use various mechanisms to collect feedback:
- Surveys are an effective way to gather quantitative data from a large number of users and stakeholders.
- User interviews provide an opportunity for in-depth qualitative data collection. Interviews let you understand the specific challenges and experiences of individual users.
- Feedback forms that are placed within the cloud solutions offer a convenient way for users to provide immediate feedback on their experience.
- Regular meetings with team members can facilitate the collection of feedback on technical aspects and implementation challenges.
The feedback that you collect through these mechanisms must be analyzed and synthesized to identify common themes and patterns. This analysis can help you prioritize future enhancements based on the impact and feasibility of the suggested improvements. By addressing the needs and issues that are identified through feedback, you can ensure that your cloud solutions continue to meet the evolving requirements of your users and stakeholders.
Measure and track progress
Key performance indicators (KPIs) and metrics are crucial for tracking progress and measuring the effectiveness of your cloud operations. KPIs are quantifiable measurements that reflect the overall performance. Metrics are specific data points that contribute to the calculation of KPIs. Review the metrics regularly and use them to identify opportunities for improvement and measure progress. Doing so helps you to continuously improve and optimize your cloud environment. This recommendation is relevant to these focus areas of operational readiness: governance and processes.
A primary benefit of using KPIs and metrics is that they enable your organization to adopt a data-driven approach to cloud operations. By tracking and analyzing operational data, you can make informed decisions about how to improve the cloud environment. This data-driven approach helps you to identify trends, patterns, and anomalies that might not be visible without the use of systematic metrics.
To collect and analyze operational data, you can use tools like Cloud Monitoring and BigQuery. Cloud Monitoring enables real-time monitoring of cloud resources and services. BigQuery lets you store and analyze the data that you gather through monitoring. Using these tools together, you can create custom dashboards to visualize important metrics and trends.
Operational dashboards can provide a centralized view of the most important metrics, which lets you quickly identify any areas that need attention. For example, a dashboard might include metrics like CPU utilization, memory usage, network traffic, and latency for a particular application or service. By monitoring these metrics, you can quickly identify any potential issues and take steps to resolve them.
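A dashboard metric like latency is usually summarized as a percentile rather than an average, so that outliers are visible. The following Python sketch computes a percentile with the nearest-rank method, one of several common percentile definitions; the sample values are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (one of several definitions); adequate
    for a dashboard sketch, though monitoring tools may interpolate."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds for one service.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 15, 12, 300]
p95 = percentile(latencies_ms, 95)
```

Here the median is 14 ms but the 95th percentile is 300 ms, which is exactly the kind of tail-latency signal that an average would hide and a KPI dashboard should surface.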