[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2024-12-30."],[[["\u003cp\u003eThis content outlines how to design and run tests to ensure system recovery from failures, including regional failovers, release rollbacks, and data restoration.\u003c/p\u003e\n"],["\u003cp\u003eTesting should involve defining clear objectives, such as validating RTO and RPO, assessing system resilience, and testing automated failover mechanisms.\u003c/p\u003e\n"],["\u003cp\u003eIt is recommended to prepare the test environment, simulate various failure scenarios, including those involving human errors and misconfigurations, and monitor system behavior using tools like Cloud Monitoring and Cloud Logging.\u003c/p\u003e\n"],["\u003cp\u003eVerify recovery by measuring the time it takes to resume normal operations, ensuring that data integrity aligns with the RPO, and confirming that all services are restored to a functional state with minimal user disruption.\u003c/p\u003e\n"],["\u003cp\u003eDocument each step of the testing process, analyze the results, suggest improvements, and conduct periodic tests to validate ongoing reliability and resilience.\u003c/p\u003e\n"]]],[],null,["# Perform testing for recovery from failures\n\nThis principle in the reliability pillar of the\n[Google Cloud Well-Architected Framework](/architecture/framework)\nprovides recommendations to help you design and run tests for recovery in the\nevent of failures.\n\nThis principle is relevant to the *learning*\n[focus area](/architecture/framework/reliability#focus-areas)\nof reliability.\n\nPrinciple overview\n------------------\n\nTo be sure that your system can recover 
from failures, you must periodically\nrun tests that include regional failovers, release rollbacks, and data\nrestoration from backups.\n\nThis testing helps you to practice responses to events that pose major risks to\nreliability, such as the outage of an entire\n[region](/docs/geography-and-regions#regions_and_zones).\nThis testing also helps you verify that your system behaves as intended during a\ndisruption.\n\nIn the unlikely event of an entire region going down, you need to fail over all\ntraffic to another region. During normal operation of your workload, when data\nis modified, it needs to be synchronized from the primary region to the failover\nregion. You need to verify that the replicated data is always very recent, so\nthat users don't experience data loss or session breakage. The load balancing\nsystem must also be able to shift traffic to the failover region at any time\nwithout service interruptions. To minimize downtime after a regional outage,\noperations engineers also need to be able to manually and efficiently shift user\ntraffic away from a region, in as little time as possible. This operation is\nsometimes called *draining a region*, which means you stop the inbound traffic\nto the region and move all the traffic elsewhere.\n\nRecommendations\n---------------\n\nWhen you design and run tests for failure recovery, consider the\nrecommendations in the following subsections.\n\n### Define the testing objectives and scope\n\nClearly define what you want to achieve from the testing. For example, your\nobjectives can include the following:\n\n- Validate the recovery time objective (RTO) and the recovery point objective (RPO). For details, see [Basics of DR planning](/architecture/dr-scenarios-planning-guide#basics_of_dr_planning).\n- Assess system resilience and fault tolerance under various failure scenarios.\n- Test the effectiveness of automated failover mechanisms.\n\nDecide which components, services, or regions are in the testing scope. 
The\nscope can include specific application tiers like the frontend, backend, and\ndatabase, or it can include specific Google Cloud resources like\nCloud SQL instances or GKE clusters. The scope must also specify any\nexternal dependencies, such as third-party APIs or cloud interconnections.\n\n### Prepare the environment for testing\n\nChoose an appropriate environment, preferably a staging or sandbox environment\nthat replicates your production setup. If you conduct the test in production,\nensure that you have safety measures ready, like automated monitoring and manual\nrollback procedures.\n\nCreate a backup plan. Take snapshots or backups of critical databases and\nservices to prevent data loss during the test. Ensure that your team is prepared\nto do manual interventions if the automated failover mechanisms fail.\n\nTo prevent test disruptions, ensure that your IAM roles,\npolicies, and failover configurations are correctly set up. Verify that the\nnecessary permissions are in place for the test tools and scripts.\n\nInform stakeholders, including operations, DevOps, and application owners,\nabout the test schedule, scope, and potential impact. Provide stakeholders with\nan estimated timeline and the expected behaviors during the test.\n\n### Simulate failure scenarios\n\nPlan and execute failures by using tools like\n[Chaos Monkey](https://netflix.github.io/chaosmonkey/).\nYou can use custom scripts to simulate failures of critical services, such as a\nshutdown of a primary node in a multi-zone GKE cluster or a disabled\nCloud SQL instance. You can also use scripts to simulate a region-wide\nnetwork outage by using firewall rules or API restrictions, based on the scope\nof your test. Gradually escalate the failure scenarios to observe system behavior\nunder various conditions.\n\nIntroduce load testing alongside failure scenarios to replicate real-world\nusage during outages. 
Test cascading failure impacts, such as how frontend\nsystems behave when backend services are unavailable.\n\nTo validate configuration changes and to assess the system's resilience against\nhuman errors, test scenarios that involve misconfigurations. For example, run\ntests with incorrect DNS failover settings or incorrect IAM\npermissions.\n\n### Monitor system behavior\n\nMonitor how load balancers, health checks, and other mechanisms reroute\ntraffic. Use Google Cloud tools like Cloud Monitoring and\nCloud Logging to capture metrics and events during the test.\n\nObserve changes in latency, error rates, and throughput during and after the\nfailure simulation, and monitor the overall performance impact. Identify any\ndegradation or inconsistencies in the user experience.\n\nEnsure that logs are generated and alerts are triggered for key events, such as\nservice outages or failovers. Use this data to verify the effectiveness of your\nalerting and incident response systems.\n\n### Verify recovery against your RTO and RPO\n\nMeasure how long it takes for the system to resume normal operations after a\nfailure, and then compare this data with the defined RTO and document any\ngaps.\n\nEnsure that data integrity and availability align with the RPO. To test\ndatabase consistency, compare snapshots or backups of the database before and\nafter a failure.\n\nEvaluate service restoration and confirm that all services are restored to a\nfunctional state with minimal user disruption.\n\n### Document and analyze results\n\nDocument each test step, failure scenario, and corresponding system behavior.\nInclude timestamps, logs, and metrics for detailed analyses.\n\nHighlight bottlenecks, single points of failure, or unexpected behaviors\nobserved during the test. To help prioritize fixes, categorize issues by\nseverity and impact.\n\nSuggest improvements to the system architecture, failover mechanisms, or\nmonitoring setups. 
Based on test findings, update any relevant failover policies\nand playbooks. Present a postmortem report to stakeholders. The report should\nsummarize the outcomes, lessons learned, and next steps. For more information,\nsee\n[Conduct thorough postmortems](/architecture/framework/reliability/conduct-postmortems).\n\n### Iterate and improve\n\nTo validate ongoing reliability and resilience, plan periodic testing (for\nexample, quarterly).\n\nRun tests under different scenarios, including infrastructure changes, software\nupdates, and increased traffic loads.\n\nAutomate failover tests by using CI/CD pipelines to integrate reliability\ntesting into your development lifecycle.\n\nDuring the postmortem, use feedback from stakeholders and end users to improve\nthe test process and system resilience."]]