[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2025-08-11."],[],[],null,["# Lifecycle of an incident\n\nThis document explains how the Google Cloud Support team and\nproduct engineering team work together to resolve an incident and provide you\nwith updates.\n\nThe following diagram shows the responsibilities of the product engineering and\nsupport teams.\n\nThe following sections explain these responsibilities.\n\nDetection\n---------\n\nGoogle Cloud uses internal and synthetic monitoring to detect incidents.\nFor more information, see\n[Chapter 6 of the Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/).\n\nInitial response\n----------------\n\nWhen an incident is detected, the Google Cloud Service Health team manages customer\ncommunications. Initial notification of an incident is often sparse,\nfrequently only mentioning the product in question. This is because we\nprioritize fast notification over detail. Subsequent updates provide more\ndetail.\n\nTo provide you with as much information as possible without overwhelming you\nwith issues that don't affect you, different communication channels are used\ndepending on the scope and severity of an issue.\n\nInvestigate\n-----------\n\nProduct engineering teams are responsible for investigating the root cause of\nincidents. Incident management is often done by Site Reliability Engineers but\nmight be done by software engineers or others, depending on the situation and\nproduct. 
For more information, see\n[Chapter 12 of the Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/).\n\nMitigation and fix\n------------------\n\nAn issue is considered *fixed* only when changes have been made that Google is\nconfident will end the impact indefinitely. For example, the fix could be rolling\nback a change that triggered an incident.\n\nWhile an incident is in progress, Service Health and the product team\nattempt to *mitigate* the issue. Mitigation reduces the impact or scope of an\nissue, for example, by temporarily providing additional resources\nto a product suffering overload.\n\nIf no mitigation has been found, the Service Health team finds and\ncommunicates *workarounds* when possible. Workarounds are steps that you can take to meet\nthe underlying need despite the incident. For example, a workaround might be to use different\nsettings for an API call to avoid a problematic code path.\n\nFollow up\n---------\n\nWhile an incident is ongoing, the Service Health team provides regular\nupdates. Updates typically provide:\n\n- More information about the incident, such as error messages, zones or\n regions affected, which features are affected, or percentages of impact.\n\n- Progress towards mitigation, including any workarounds.\n\n- Timelines for communication, tailored to the incident.\n\n- Changes in status, such as when an incident is fixed.\n\nRetrospective\n-------------\n\nAll incidents undergo an internal retrospective to fully understand the incident\nand identify reliability improvements that Google can make. These improvements\nare then tracked and implemented. 
For more information, see\n[Chapter 15 of the Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/chapters/postmortem-culture/).\n\nIncident report\n---------------\n\nWhen incidents have a very wide and serious impact, Google provides incident\nreports that outline the symptoms, impact, root cause, remediation, and future\nprevention of incidents. As with retrospectives, we pay particular attention to\nthe steps that we take to learn from the issue and improve reliability. Google's\ngoal in writing and releasing incident reports is to be transparent and\ndemonstrate our commitment to building stable products for our customers."]]