# Horizontal Autoscaling

Last updated (UTC): 2025-08-18.

Key points:

- Horizontal Autoscaling in Dataflow dynamically adjusts the number of worker instances for both batch and streaming pipelines, based on CPU utilization and pipeline parallelism.
- Batch pipelines have Horizontal Autoscaling enabled by default. Every 30 seconds, Dataflow re-evaluates the work and scales the number of workers, sublinearly to the amount of work needed, with limitations that save idle resources.
- For streaming jobs, Horizontal Autoscaling is enabled by default for pipelines that use Streaming Engine. For pipelines that don't, set the `--autoscalingAlgorithm=THROUGHPUT_BASED` and `--maxNumWorkers` pipeline options to enable the feature.
- Custom data sources can improve Horizontal Autoscaling performance by implementing methods that give the Dataflow service information about estimated size, progress, or backlog.
- To disable Horizontal Autoscaling, use the `--autoscalingAlgorithm=NONE` pipeline option; the number of workers is then based on the `--numWorkers` option. Limitations apply to Dataflow Prime, Dataflow Shuffle, and `PeriodicImpulse`.

Horizontal Autoscaling enables Dataflow to choose the appropriate number of worker instances for your job, adding or removing workers as needed. Dataflow scales
based on the average CPU utilization of the workers and on the parallelism of a pipeline. The parallelism of a pipeline is an estimate of the number of threads needed to most efficiently process data at any given time.

Horizontal Autoscaling is supported in both batch and streaming pipelines.

Batch autoscaling
-----------------

Horizontal Autoscaling is enabled by default on all batch pipelines. Dataflow automatically chooses the number of workers based on the estimated total amount of work in each stage of your pipeline. This estimate depends on the input size and the current throughput. Every 30 seconds, Dataflow re-evaluates the amount of work according to the execution progress. As the estimated total amount of work increases or decreases, Dataflow dynamically scales the number of workers up or down.

The number of workers is sublinear to the amount of work. For example, a job with twice the work has fewer than twice the workers.

If any of the following conditions occur, Dataflow either maintains or decreases the number of workers, in order to save idle resources:

- The average worker CPU usage is lower than 5%.
- Parallelism is limited due to unparallelizable work, such as un-splittable data caused by compressed files or I/O modules that don't split.
- The degree of parallelism is fixed, for example when writing to existing files in Cloud Storage.

To set an upper bound on the number of workers, set the [`--maxNumWorkers` pipeline option](/dataflow/docs/reference/pipeline-options). The default value is `2,000`. To set a lower bound on the number of workers, set the [`--dataflow-service-options=min_num_workers` service option](/dataflow/docs/reference/service-options). These flags are optional.

Streaming autoscaling
---------------------

For streaming jobs, Horizontal Autoscaling allows Dataflow to adaptively change the number of workers in response to changes in load and resource utilization.

Horizontal
Autoscaling is enabled by default for streaming jobs that use [Streaming Engine](/dataflow/docs/streaming-engine). To enable Horizontal Autoscaling for streaming jobs that don't use Streaming Engine, set the following [pipeline options](/dataflow/docs/reference/pipeline-options) when you start your pipeline:

### Java

    --autoscalingAlgorithm=THROUGHPUT_BASED
    --maxNumWorkers=MAX_WORKERS

Replace `MAX_WORKERS` with the maximum number of worker instances.

### Python

    --autoscaling_algorithm=THROUGHPUT_BASED
    --max_num_workers=MAX_WORKERS

Replace `MAX_WORKERS` with the maximum number of worker instances.

### Go

    --autoscaling_algorithm=THROUGHPUT_BASED
    --max_num_workers=MAX_WORKERS

Replace `MAX_WORKERS` with the maximum number of worker instances.

To set a lower bound on the number of workers, set the [`--dataflow-service-options=min_num_workers` service option](/dataflow/docs/reference/service-options). When you set this value, Horizontal Autoscaling doesn't scale below the specified number of workers. This flag is optional.

While a streaming job is running, you can update the minimum and maximum number of workers by using an [in-flight job update](/dataflow/docs/guides/updating-a-pipeline#in-flight-updates). To adjust the settings, set the `min-num-workers` and `max-num-workers` flags. For more information, see [Update the autoscaling range](/dataflow/docs/guides/tune-horizontal-autoscaling#update-range).

Disable Horizontal Autoscaling
------------------------------

To disable Horizontal Autoscaling, set the following [pipeline option](/dataflow/docs/reference/pipeline-options) when you run the job.
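For instance, a job launched from the command line with the Python SDK and autoscaling disabled might look like the following sketch. The script name, region, and worker count are illustrative placeholders, not values from this page:

```shell
# Hypothetical launch: disable Horizontal Autoscaling and pin the job
# to a fixed number of workers (Python SDK flag spellings).
python my_pipeline.py \
  --runner=DataflowRunner \
  --region=us-central1 \
  --autoscaling_algorithm=NONE \
  --num_workers=10
```

With `NONE`, the job keeps the worker count given by `--num_workers` for its entire lifetime.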
\n\n### Java\n\n --autoscalingAlgorithm=NONE\n\nIf you disable Horizontal Autoscaling, then Dataflow sets\nthe number of workers based on the `--numWorkers` option.\n\n### Python\n\n --autoscaling_algorithm=NONE\n\nIf you disable Horizontal Autoscaling, then Dataflow sets\nthe number of workers based on the `--num_workers` option.\n\n### Go\n\n --autoscaling_algorithm=NONE\n\nIf you disable Horizontal Autoscaling, then Dataflow sets\nthe number of workers based on the `--num_workers` option.\n\nCustom sources\n--------------\n\nIf you create a custom data source, you can potentially improve performance by\nimplementing methods that provide more information to the Horizontal Autoscaling\nalgorithm: \n\n### Java\n\n#### Bounded sources\n\n- In your `BoundedSource` subclass, implement the method `getEstimatedSizeBytes`. The Dataflow service uses `getEstimatedSizeBytes` when calculating the initial number of workers to use for your pipeline.\n- In your `BoundedReader` subclass, implement the method `getFractionConsumed`. The Dataflow service uses `getFractionConsumed` to track read progress and converge on the correct number of workers to use during a read.\n\n#### Unbounded sources\n\nThe source must inform the Dataflow service about backlog.\nBacklog is an estimate of the input in bytes that has not yet been processed\nby the source. To inform the service about backlog, implement either one of\nthe following methods in your `UnboundedReader` class.\n\n- `getSplitBacklogBytes()` - Backlog for the current split of the source. The service aggregates backlog across all the splits.\n- `getTotalBacklogBytes()` - The global backlog across all the splits. In some cases the backlog is not available for each split and can only be calculated across all the splits. 
Only the first split (split ID `0`) needs to provide the total backlog.

The Apache Beam repository contains several [examples](https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java) of custom sources that implement the `UnboundedReader` class.

### Python

#### Bounded sources

- In your `BoundedSource` subclass, implement the method `estimate_size`. The Dataflow service uses `estimate_size` when calculating the initial number of workers to use for your pipeline.
- In your `RangeTracker` subclass, implement the method `fraction_consumed`. The Dataflow service uses `fraction_consumed` to track read progress and converge on the correct number of workers to use during a read.

### Go

#### Bounded sources

- In your `RangeTracker`, implement the method `GetProgress()`. The Dataflow service uses `GetProgress` to track read progress and converge on the correct number of workers to use during a read.

Limitations
-----------

- In jobs running Dataflow Prime, Horizontal Autoscaling is deactivated during, and for up to 10 minutes after, Vertical Autoscaling. For more information, see [Effect on Horizontal Autoscaling](/dataflow/docs/vertical-autoscaling#horizontal-autoscaling).
- For pipelines not using [Dataflow Shuffle](/dataflow/docs/shuffle-for-batch), Dataflow might not be able to scale down the workers effectively because the workers might have shuffled data stored on local disks.
- The [PeriodicImpulse](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/PeriodicImpulse.html) transform is supported with streaming autoscaling in Apache Beam SDK versions 2.60.0 and later.
If your pipeline uses `PeriodicImpulse` with an earlier SDK version, Dataflow workers don't scale down as expected.

What's next
-----------

- [Tune Horizontal Autoscaling for streaming pipelines](/dataflow/docs/guides/tune-horizontal-autoscaling)
- [Monitor Dataflow autoscaling](/dataflow/docs/guides/autoscaling-metrics)
- [Troubleshoot Dataflow autoscaling](/dataflow/docs/guides/troubleshoot-autoscaling)