Last updated (UTC): 2024-12-30.

- This document focuses on the observation aspect of reliability within the Google Cloud Well-Architected Framework, offering recommendations to proactively identify potential errors and failures.
- Effective observability in Google Cloud requires the use of metrics, which are numerical measurements; logs, which are time-stamped records of events; and traces, which track user journeys or transactions through applications.
- Utilizing Cloud Monitoring and Cloud Logging provides comprehensive insights into key metrics like response times and error rates, allowing data-driven decisions about workload performance and component dependencies.
- Proactive troubleshooting involves implementing error handling and logging across all workload components, as well as optimizing resource utilization through monitoring CPU, network I/O, and disk I/O metrics.
- Effective alerting focuses on critical metrics with appropriate thresholds to reduce alert fatigue and ensure timely responses, contributing to maintaining workload reliability.

# Detect potential failures by using observability

This principle in the reliability pillar of the
[Google Cloud Well-Architected Framework](/architecture/framework)
provides recommendations to help you proactively identify areas where errors and
failures might occur.

This principle is relevant to the *observation*
[focus area](/architecture/framework/reliability#focus-areas)
of reliability.

Principle overview
------------------

To maintain and improve the reliability of your workloads in
Google Cloud, you need to implement effective observability by using
metrics, logs, and traces.

- Metrics are numerical measurements of activities that you want to track for your application at specific time intervals. For example, you might want to track technical metrics like request rate and error rate, which can be used as service-level indicators (SLIs). You might also need to track application-specific business metrics like orders placed and payments received.
- Logs are time-stamped records of discrete events that occur within an application or system. The event could be a failure, an error, or a change in state. Logs might include metrics, and you can also use logs for SLIs.
- A trace represents the journey of a single user or transaction through a number of separate applications or the components of an application. For example, these components could be microservices. Traces help you to track what components were used in the journeys, where bottlenecks exist, and how long the journeys took.

Metrics, logs, and traces help you monitor your system continuously.
Comprehensive monitoring helps you find out where and why errors occurred. You
can also detect potential failures before errors occur.

Recommendations
---------------

To detect potential failures efficiently, consider the recommendations in the
following subsections.

### Gain comprehensive insights

To track key metrics like response times and error rates, use
[Cloud Monitoring](/monitoring/docs/monitoring-overview)
and
[Cloud Logging](/logging/docs/overview).
These tools also help you to ensure that the metrics consistently meet the needs
of your workload.

To make data-driven decisions, analyze default service metrics to understand
component dependencies and their impact on overall workload performance.

To customize your monitoring strategy, create and publish your own metrics by
using the Google Cloud SDK.

### Perform proactive troubleshooting

Implement robust error handling and enable logging across all of the components
of your workloads in Google Cloud. Activate logs like
[Cloud Storage access logs](/storage/docs/access-logs)
and
[VPC Flow Logs](/vpc/docs/flow-logs).

When you configure logging, consider the associated
[costs](/stackdriver/pricing#logging-costs).
To control logging costs, you can configure
[exclusion filters](/logging/docs/routing/overview#exclusions)
on the log sinks to exclude certain logs from being stored.

### Optimize resource utilization

Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect
under-provisioned and over-provisioned resources in services like
GKE, Compute Engine, and Dataproc. For a
complete list of supported services, see
[Cloud Monitoring overview](/monitoring/docs/monitoring-overview).

### Prioritize alerts

For alerts, focus on critical metrics, set appropriate thresholds to minimize
alert fatigue, and ensure timely responses to significant issues. This targeted
approach lets you proactively maintain workload reliability. For more
information, see
[Alerting overview](/monitoring/alerts).
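
The alert-prioritization advice above can be illustrated with a minimal sketch. It is plain Python and does not call Cloud Monitoring or any other Google Cloud API. The `AlertRule` class, the `should_fire` function, the 5% error-rate threshold, and the sample values are all hypothetical, chosen only to show how requiring a sustained breach (rather than a single sample over the threshold) can reduce alert fatigue.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire only when a metric stays above a threshold."""

    metric: str
    threshold: float
    min_consecutive: int  # number of most recent samples that must all breach

    def __post_init__(self) -> None:
        if self.min_consecutive < 1:
            raise ValueError("min_consecutive must be at least 1")


def should_fire(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the latest `min_consecutive` samples all exceed the threshold."""
    if len(samples) < rule.min_consecutive:
        # Not enough data yet to call the breach sustained.
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.threshold for value in recent)


# Assumed example: alert on an error rate above 5% for three samples in a row.
error_rate_rule = AlertRule(metric="error_rate", threshold=0.05, min_consecutive=3)

brief_spike = [0.01, 0.02, 0.09, 0.01, 0.02]  # one noisy sample, then recovery
sustained = [0.01, 0.06, 0.07, 0.08]          # three breaching samples in a row

print(should_fire(error_rate_rule, brief_spike))  # False
print(should_fire(error_rate_rule, sustained))    # True
```

In this sketch, raising `min_consecutive` trades detection speed for fewer noisy alerts, which is the same balance the recommendation describes when it asks for appropriate thresholds and timely responses.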