[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2024-12-30."],[[["\u003cp\u003ePostmortems are written records of incidents, detailing the impact, resolution, root causes, and follow-up actions to prevent recurrence, focusing on learning rather than assigning blame.\u003c/p\u003e\n"],["\u003cp\u003ePostmortems should be conducted after major events like user-visible downtimes, data losses, and long resolution times, as well as non-major incidents like interventions from on-call engineers and monitoring failures.\u003c/p\u003e\n"],["\u003cp\u003eEffective postmortems are blameless, focusing on process, tool, and technology improvements rather than individual or team fault, with feedback structured to avoid blame.\u003c/p\u003e\n"],["\u003cp\u003ePostmortem reports should be readable and relevant to the audience, avoiding unnecessary complexity, with supplementary data placed in an appendix.\u003c/p\u003e\n"],["\u003cp\u003eSharing the outcomes of the postmortem as widely as possible will increase learning, thereby reducing the likelihood of similar failures in the future.\u003c/p\u003e\n"]]],[],null,["# Conduct thorough postmortems\n\nThis principle in the reliability pillar of the\n[Google Cloud Well-Architected Framework](/architecture/framework)\nprovides recommendations to help you conduct effective postmortems after\nfailures and incidents.\n\nThis principle is relevant to the *learning*\n[focus area](/architecture/framework/reliability#focus-areas)\nof reliability.\n\nPrinciple overview\n------------------\n\nA postmortem is a written record of an incident, its impact, the actions taken\nto mitigate or resolve the incident, the root causes, and 
the follow-up actions\nto prevent the incident from recurring. The goal of a postmortem is to learn\nfrom mistakes, not to assign blame.\n\nThe following diagram shows the workflow of a postmortem:\n\nThe workflow of a postmortem includes the following steps:\n\n- Create the postmortem\n- Capture the facts\n- Identify and analyze the root causes\n- Plan for the future\n- Execute the plan\n\nConduct postmortem analyses after major events and non-major events like the\nfollowing:\n\n- User-visible downtimes or degradations beyond a certain threshold.\n- Data losses of any kind.\n- Interventions from on-call engineers, such as a release rollback or rerouting of traffic.\n- Resolution times above a defined threshold.\n- Monitoring failures, which usually mean that the incident was discovered manually.\n\nRecommendations\n---------------\n\nDefine postmortem criteria before an incident occurs so that everyone knows\nwhen a postmortem is necessary.\n\nTo conduct effective postmortems, consider the recommendations in the following\nsubsections.\n\n### Conduct blameless postmortems\n\nEffective postmortems focus on processes, tools, and technologies, and don't\nplace blame on individuals or teams. The purpose of a postmortem analysis is to\nimprove your processes, tools, and technology for the future, not to find who is\nguilty. Everyone makes mistakes. The goal should be to analyze the mistakes and\nlearn from them.\n\nThe following examples show the difference between feedback that assigns blame\nand blameless feedback:\n\n- **Feedback that assigns blame**: \"We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things piecemeal. Seriously, if I get paged one more time I'll rewrite it myself...\"\n- **Blameless feedback**: \"An action item to rewrite the entire backend system might actually prevent these pages from continuing to happen. 
The maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-call engineers will thank us!\"\n\n### Make the postmortem report readable by all the intended audiences\n\nFor each piece of information that you plan to include in the report, assess\nwhether that information is important and necessary to help the audience\nunderstand what happened. You can move supplementary data and explanations to an\nappendix of the report. Reviewers who need more information can request it.\n\n### Avoid complex or over-engineered solutions\n\nBefore you start to explore solutions for a problem, evaluate the importance of\nthe problem and the likelihood of a recurrence. Adding complexity to the system\nto solve problems that are unlikely to occur again can lead to increased\ninstability.\n\n### Share the postmortem as widely as possible\n\nTo ensure that issues don't remain unresolved, publish the outcome of the\npostmortem to a wide audience and get support from management. The value of a\npostmortem is proportional to the learning that occurs after the postmortem.\nWhen more people learn from incidents, the likelihood of similar failures\nrecurring is reduced."]]