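Slow worker. On a slow worker, work items run more slowly than usual. Often, the processing speed of a slow worker is less than the processing speed of workers doing similar work at the same stage. Many factors can cause worker slowness, including CPU starvation, thrashing, machine architecture, and stuck worker processes. When slowness occurs, Dataflow attempts to mitigate the issue automatically. For more information, see "Automatically mitigate stragglers caused by slow workers" in this document.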
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2025-08-18."],[[["\u003cp\u003eStragglers are work items in Dataflow batch pipelines that take significantly longer to complete, reduce parallelism, and block new work, potentially causing overall job delays.\u003c/p\u003e\n"],["\u003cp\u003eThe Google Cloud console allows users to view stragglers by stage or by worker, helping to identify which stages or workers are experiencing delays.\u003c/p\u003e\n"],["\u003cp\u003eDataflow can detect the causes of stragglers, including hot keys, which are keys with disproportionately more elements than others, and slow workers, where work items run more slowly than usual.\u003c/p\u003e\n"],["\u003cp\u003eDataflow automatically attempts to mitigate slow worker issues by simulating a host maintenance event, potentially live migrating or restarting the affected worker.\u003c/p\u003e\n"],["\u003cp\u003eHot keys can be addressed by rekeying data, utilizing specific combine transforms (Java/Python), or enabling Dataflow Shuffle.\u003c/p\u003e\n"]]],[],null,["# Troubleshoot stragglers in batch jobs\n\n*Stragglers* are work items that slow down your Dataflow jobs by\npreventing work from being done in parallel.\n\nFor batch pipelines, a straggler is defined as a work item with the\nfollowing characteristics:\n\n- It takes significantly longer to complete than other work items in the same stage.\n- It reduces parallelism within the stage.\n- It blocks new work from starting.\n\nIn the worst case, a straggler blocks a stage from completion because a small\npercentage of the work is in progress, causing overall delays in a job.\n\nDataflow detects stragglers that occur 
during batch jobs. If\nDataflow detects a straggler, it also tries to determine the cause\nof the straggler.\n| **Note:** For information about troubleshooting stragglers in streaming jobs, see [Troubleshoot stragglers in streaming jobs](/dataflow/docs/guides/troubleshoot-streaming-stragglers).\n\nView stragglers in the Google Cloud console\n-------------------------------------------\n\nAfter you start a Dataflow job, you can use the Google Cloud console\nto view any detected stragglers.\n\nYou can view stragglers either by stage or by worker. Use these views to find\nwhich stages have stragglers, and then pinpoint the workers where stragglers\noccurred within each stage.\n\n### View stragglers by stage\n\nTo view stragglers by stage:\n\n1. In the Google Cloud console, go to the Dataflow **Jobs**\n page.\n\n Go to [Jobs](https://console.cloud.google.com/dataflow/jobs)\n2. Click the name of the job.\n\n3. In the job details page, click the **Execution details** tab.\n\n4. In the **Graph view** list, select **Stage progress**. The progress graph\n shows aggregated counts of all stragglers detected within each stage.\n\n5. To see details for a stage, hold the pointer over the bar for a stage. To\n view the workers for the stage, click **View workers** in the details panel.\n\n### View stragglers by worker\n\nTo view stragglers by worker:\n\n1. In the Google Cloud console, go to the Dataflow **Jobs**\n page.\n\n Go to [Jobs](https://console.cloud.google.com/dataflow/jobs)\n2. Click the name of the job.\n\n3. In the job details page, click the **Execution details** tab.\n\n4. In the **Graph view** list, select **Worker progress**.\n\n5. In the **Filter workers by stage** list, select the stage. The progress graph\n shows any stragglers detected for that stage. The bar has darker shading at\n the point where the straggler was first detected.\n\n6. 
To see details for a worker, hold the pointer over the bar for that worker.\n\nIn the **Stage info** panel, the **Straggler details** section lists the\nstragglers for all workers shown on the page, with the following information:\n\n- The start time when the straggler was detected.\n- The worker that experienced the straggler.\n- The cause, if known.\n\nTroubleshoot batch stragglers\n-----------------------------\n\nDataflow detects the following causes of stragglers in batch\npipelines:\n\n- **Hot key**. A *hot key* is a key that represents significantly more elements\n than other keys in the same `PCollection`. For more information, see\n [Troubleshoot stragglers caused by hot keys](#troubleshoot_stragglers_caused_by_hot_keys)\n in this document.\n\n- **Slow Worker**. On a *slow worker*, work items run\n more slowly than usual. Often, the processing speed of a slow worker is\n less than the processing speed of workers doing similar work at the same stage.\n Many factors can cause worker slowness, including CPU starvation, thrashing,\n machine architecture, and stuck worker processes.\n When slowness occurs, Dataflow attempts to mitigate the issue\n automatically. For more information, see\n [Automatically mitigate stragglers caused by slow workers](#slow-workers) in this\n document.\n\n- **Undetermined cause**. For stragglers with undetermined cause, see the\n general troubleshooting steps for\n [slow batch jobs](/dataflow/docs/guides/troubleshoot-slow-batch-jobs)\n in \"Troubleshoot slow or stuck jobs.\"\n\n### Troubleshoot stragglers caused by hot keys\n\nVarious factors can cause stragglers, but one common cause is the existence of\na *hot key*. A hot key is a key that represents significantly more elements than\nother keys in the same `PCollection`. 
Hot keys can create stragglers because\nthey limit Dataflow's ability to process elements in parallel.\n\nIf Dataflow detects a straggler caused by a hot key, the\n**Straggler Details** panel lists `Hot Key` as the cause.\n\nBy default, Dataflow does not display the key value of the\nhot key. To display the key value, set the\n[`hotKeyLoggingEnabled`](/dataflow/docs/reference/pipeline-options#debugging)\npipeline option to `true` when you run the job.\n\nTo resolve this issue, check that your data is evenly distributed. If a key has\ndisproportionately many values, consider the following courses of action:\n\n- Rekey your data. Apply a [`ParDo`](https://beam.apache.org/documentation/programming-guide/#pardo) transform to output new key-value pairs.\n- For Java jobs, use the [`Combine.PerKey.withHotKeyFanout`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Combine.PerKey.html) transform.\n- For Python jobs, use the [`CombinePerKey.with_hot_key_fanout`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html?#apache_beam.transforms.core.CombinePerKey.with_hot_key_fanout) transform.\n- Enable [Dataflow Shuffle](/dataflow/docs/shuffle-for-batch).\n\n| **Note:** Apache Beam SQL cannot reshuffle data that contains hot keys during sub-transforms.\n\nFor example, if a pipeline performs a `JOIN` operation as part of a SQL\ntransform, then a given key is likely to contain a disproportionate amount of\ndata when it is fed into the `GroupByKey` that is performed as part of the\nexpanded `JOIN` operation.\n\nFor more information, see the following feature request:\n[beam-issue/28186](https://github.com/apache/beam/issues/28186).\n\n### Automatically mitigate stragglers caused by slow workers\n\nSlow workers are uncommon on Dataflow but can impact\njob performance. 
To prevent performance issues, when Dataflow detects\nslow workers, it tries to mitigate the problem before the workers cause\nstragglers.\n\nThe automatic mitigation\n[simulates a host maintenance event](/compute/docs/instances/simulating-host-maintenance).\nThe event is a Compute Engine maintenance mechanism that happens regularly.\nDepending on the worker's\n[host maintenance policy](/compute/docs/instances/host-maintenance-overview#schedulingoptions),\nthe worker is either live migrated or restarted. If a live migration occurs, the workload isn't interrupted.\nIf the worker is restarted, the ongoing work from the slow worker is lost,\nand processing restarts.\n\nIf a slow worker is detected and successfully mitigated,\nthe following message displays in the **job-message** logs: \n\n Slow worker ... detected and automatically remediated ...\n\nBecause slow workers are not stragglers, you don't need to take further action.\n\nIf mitigation is unsuccessful, the slow worker causes a straggler that\ndisplays in the Dataflow monitoring interface.\n\nAutomatic mitigation might fail if your project runs out of quota for\ninstance simulate maintenance event requests. For more information\nabout the default quota, see\n[API rate limits for regional metrics](/compute/resource-usage#api-rate-limits-regional)\nin \"Resource usage quotas and permission management.\"\nTo request a higher quota limit, see\n[Requesting a quota adjustment](/docs/quotas/help/request_increase)\nin \"View and manage quotas.\"\n\nWhat's next\n-----------\n\n- Learn to use the [Dataflow monitoring interface](/dataflow/docs/guides/using-monitoring-intf).\n- Understand the [**Execution details**](/dataflow/docs/concepts/execution-details) tab in the monitoring interface."]]
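
The "rekey your data" suggestion in the hot-key section can be illustrated with a small sketch of key salting. The code below is plain Python, not Apache Beam code, and it is not how Dataflow implements fanout. The names `NUM_SHARDS`, `add_salt`, `sum_per_key`, `drop_salt`, and `salted_sum` are hypothetical and exist only for this illustration. In a real pipeline, the salting step would roughly correspond to a `ParDo` that emits new key-value pairs, and each aggregation step to a per-key combine.

```python
import random
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical fanout; an assumption for this sketch


def add_salt(key, value, num_shards=NUM_SHARDS, rng=random):
    """Rekey one element: attach a random shard number to its key."""
    return ((key, rng.randrange(num_shards)), value)


def sum_per_key(pairs):
    """Stand-in for a per-key combine: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def drop_salt(salted_key, partial):
    """Strip the shard number so partial results share the original key."""
    key, _shard = salted_key
    return (key, partial)


def salted_sum(pairs, num_shards=NUM_SHARDS, rng=random):
    """Two-stage sum: combine per (key, shard), then combine per key."""
    salted = [add_salt(k, v, num_shards, rng) for k, v in pairs]
    partials = sum_per_key(salted)  # at most num_shards partials per key
    unsalted = [drop_salt(sk, p) for sk, p in partials.items()]
    return sum_per_key(unsalted)  # final pass sees few elements per key


if __name__ == "__main__":
    events = [("hot", 1)] * 1000 + [("cold", 1)] * 10
    print(salted_sum(events))
```

Splitting the aggregation this way gives a correct result only for combines that are associative and commutative, such as sums or counts. For other operations, the two-stage result can differ from a single pass.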