Google Cloud infrastructure reliability guide
Last reviewed 2024-11-20 UTC
Reliable infrastructure is a critical requirement for workloads in the cloud.
As a cloud architect, to design reliable infrastructure for your workloads, you
need a good understanding of the reliability capabilities of your cloud provider
of choice. This document describes the building blocks of reliability in
Google Cloud (zones, regions, and location-scoped resources) and the
availability levels that they provide. This document also provides guidelines
for assessing the reliability requirements of your workloads, and presents
architectural recommendations for building and managing reliable infrastructure
in Google Cloud.
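This document is divided into the following parts:
Overview of reliability (this part)
Building blocks of reliability in Google Cloud
Assess the reliability requirements for your cloud workloads
Design reliable infrastructure for your workloads in Google Cloud
Manage traffic and load for your workloads in Google Cloud
Manage and monitor your Google Cloud infrastructure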
If you've read this guide previously and want to see what's changed, see the
Release notes.
Overview of reliability
An application or workload is reliable when it meets your current objectives
for availability and resilience to failures.
Availability (or uptime) is the percentage of time that an application is
usable. For example, for an application that has an availability target of
99.99%, the total downtime must not exceed 8.64 seconds during a 24-hour period.
Sometimes, availability is measured as the proportion of requests that the
application serves successfully during a given period. For example, for an
application that has an availability target of 99.99%, for every 100,000
requests received, not more than ten requests can fail. Availability is often
expressed as the number of nines in the percentage. For example, 99.99%
availability is expressed as "4 nines".
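The arithmetic behind these two examples can be checked with a short sketch. The function names below are illustrative assumptions, not part of any Google Cloud API. The target is passed as a string and handled with exact fractions so that floating-point rounding doesn't turn a budget of ten failed requests into nine.

```python
from fractions import Fraction

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour period


def error_budget(target_pct: str) -> Fraction:
    """Return the allowed failure fraction for a target such as "99.99"."""
    return 1 - Fraction(target_pct) / 100


def max_downtime_seconds(target_pct: str,
                         period_seconds: int = SECONDS_PER_DAY) -> float:
    """Time-based view: the most downtime the target allows in the period."""
    return float(error_budget(target_pct) * period_seconds)


def max_failed_requests(target_pct: str, total_requests: int) -> int:
    """Request-based view: the most requests that can fail, rounded down."""
    return int(error_budget(target_pct) * total_requests)


if __name__ == "__main__":
    print(max_downtime_seconds("99.99"))         # downtime budget per day, in seconds
    print(max_failed_requests("99.99", 100_000)) # failed-request budget
```

For a 99.99% target, the sketch reproduces the figures in the text: 8.64 seconds of downtime per 24 hours, and ten failed requests per 100,000.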
Depending on the purpose of the application, you might have different sets of
indicators for how reliable the application is. The following are examples of
such reliability indicators:
For applications that serve content, availability, latency, and
throughput are important reliability indicators. They indicate whether the
application can respond to requests, how long the application takes to
respond to requests, and how many requests the application can process
successfully in a given period.
For databases and storage systems, latency, throughput, availability,
and durability (how well data is protected against loss or corruption) are
indicators of reliability. They indicate how long the system takes to read
or write data, how much data it can read or write in a given period, whether
data can be accessed on demand, and whether stored data remains intact.
For big data and analytics workloads such as data processing pipelines,
consistent pipeline performance (throughput and latency) is essential to
ensure freshness of the data products, and is an important reliability
indicator. It indicates how much data can be processed, and how long it
takes for the pipeline to progress from data ingestion to data processing.
Most applications have data correctness as an essential reliability
indicator.
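As a rough illustration of the first set of indicators, the following sketch derives availability, latency, and throughput from a list of request records. Everything in it is an assumption made for the example: the `Request` record, the `summarize` function, the choice of the 95th percentile as the latency figure, and the nearest-rank percentile method. Real monitoring systems typically compute these values from their own metrics pipelines.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute three example indicators over one observation window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    total = len(requests)
    succeeded = sum(1 for r in requests if r.ok)

    # Nearest-rank 95th percentile: rank = ceil(0.95 * total),
    # computed with integers to avoid floating-point rounding.
    latencies = sorted(r.latency_ms for r in requests)
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1]

    return {
        "availability": succeeded / total,           # share of requests served successfully
        "p95_latency_ms": p95,                       # how long responses take
        "throughput_rps": succeeded / window_seconds,  # successful requests per second
    }


if __name__ == "__main__":
    # 20 requests observed over 10 seconds; one of them failed.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
    print(summarize(sample, window_seconds=10))
```

Throughput here counts only successful requests, which matches the description of "how many requests the application can process successfully in a given period."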
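For further guidelines to define the reliability objectives for your
applications, see
Assess the reliability requirements for your cloud workloads.
Note: Planning for disaster recovery (DR) is related to reliability, and DR is
essential for business continuity. For detailed guidance about DR planning, see
the Disaster recovery planning guide.
Factors that affect application reliability
The reliability of an application that's deployed in Google Cloud depends
on the following factors: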
The internal design of the application.
The secondary applications or components that the application depends on.
Google Cloud infrastructure resources such as compute, networking,
storage, databases, and security that the application runs on, and how the
application uses the infrastructure.
Infrastructure capacity that you provision, and how the capacity scales.
The DevOps processes and tools that you use to build, deploy, and
maintain the application, its dependencies, and the Google Cloud
infrastructure.
These factors are summarized in the following diagram:
As shown in the preceding diagram, the reliability of an application that's
deployed in Google Cloud depends on multiple factors. The focus of this
guide is the reliability of the Google Cloud infrastructure.
What's next
Building blocks of reliability in Google Cloud (/architecture/infra-reliability-guide/building-blocks)
Assess the reliability requirements for your cloud workloads (/architecture/infra-reliability-guide/requirements)
Design reliable infrastructure for your workloads in Google Cloud (/architecture/infra-reliability-guide/design)
Manage traffic and load for your workloads in Google Cloud (/architecture/infra-reliability-guide/traffic-load)
Manage and monitor your Google Cloud infrastructure (/architecture/infra-reliability-guide/manage-and-monitor)
Contributors
Authors:
Nir Tarcic (https://www.linkedin.com/in/nirtarcic) | Cloud Lifecycle SRE UTL
Kumar Dhanagopal (https://www.linkedin.com/in/kumardhanagopal) | Cross-Product Solution Developer
Other contributors:
Alok Kumar (https://www.linkedin.com/in/alok-kumar-0a51159) | Distinguished Engineer
Andrew Fikes (https://www.linkedin.com/in/andrew-fikes) | Engineering Fellow, Reliability
Chris Heiser (https://www.linkedin.com/in/christopher-heiser) | SRE TL
David Ferguson (https://www.linkedin.com/in/davidsferguson) | Director, Site Reliability Engineering
Joe Tan (https://www.linkedin.com/in/joe-tan-378a55a8) | Senior Product Counsel
Krzysztof Duleba (https://www.linkedin.com/in/kduleba) | Principal Engineer
Narayan Desai (https://www.linkedin.com/in/nldesai) | Principal SRE
Sailesh Krishnamurthy (https://www.linkedin.com/in/saileshkrishnamurthy) | VP, Engineering
Steve McGhee (https://www.linkedin.com/in/stevemcghee) | Reliability Advocate
Sudhanshu Jain (https://www.linkedin.com/in/sudhanshujain) | Product Manager
Yaniv Aknin (https://www.linkedin.com/in/yanivaknin) | Software Engineer