This page describes flex-start provisioning mode in Google Kubernetes Engine (GKE). Flex-start, powered by Dynamic Workload Scheduler, provides a flexible and cost-effective way to obtain GPUs when you need to run AI/ML workloads.
Flex-start lets you dynamically provision accelerators as needed, for up to seven days, without being bound to a specific start time and without having to manage long-term reservations. Flex-start therefore works well for small to medium-sized workloads with fluctuating demand or short durations, for example small model pre-training, model fine-tuning, or scalable model serving.
The information on this page can help you to do the following:
- Understand how flex-start in GKE works.
- Decide whether flex-start is right for your workload.
- Decide which flex-start configuration is right for your workload.
- Manage disruptions when using flex-start.
- Understand the limitations of flex-start in GKE.
This page is intended for Platform admins and operators and Machine learning (ML) engineers who want to optimize accelerator infrastructure for their workloads.
When to use flex-start
We recommend that you use flex-start if your workloads meet all of the following conditions:
- Your workloads require GPU resources.
- You have limited or no reserved GPU capacity and you need more reliable access to GPUs.
- Your workload is time-flexible and your use case can afford to wait to get all the requested capacity, for example, when GKE allocates the GPU resources outside of the busiest hours.
Requirements
To use flex-start in GKE, your cluster must use version 1.32.2-gke.1652000 or later.
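For example, you can check the control plane version of an existing cluster with the gcloud CLI. The cluster and location names in this sketch are placeholders:

```
# Print the current control plane version of the cluster.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(currentMasterVersion)"
```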
How flex-start provisioning mode works
With flex-start, you specify the required GPU capacity in your workloads. Additionally, with Standard clusters, you configure flex-start on specific node pools. When capacity becomes available, GKE automatically provisions VMs by completing the following process:
- The workload requests GPU resources that aren't immediately available. This request can be made directly in the workload specification or through scheduling tools such as custom compute classes or Kueue (see the example after this list).
- GKE identifies that your node has flex-start enabled and that the workload can wait for an indeterminate amount of time.
- The cluster autoscaler accepts your request and calculates the number of necessary nodes, treating them as a single unit.
- The scheduler waits until all needed resources are available in a single zone.
- The cluster autoscaler provisions the necessary nodes when they become available. These nodes run for a maximum of seven days, or for a shorter duration if you specify a value in the `maxRunDurationSeconds` parameter. This parameter is available in GKE version 1.28.5-gke.1355000 and later. If you don't specify a value for the `maxRunDurationSeconds` parameter, the default is seven days.
- After the running time that you defined in the `maxRunDurationSeconds` parameter ends, the nodes and the Pods are preempted.
- If the Pods finish sooner and the nodes are no longer utilized, the cluster autoscaler removes them according to the autoscaling profile.
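As a minimal sketch of the first step, the following Job requests one GPU and targets a flex-start node pool by its name (the node-by-node configuration). The node pool name, Job name, and container image are placeholders; with queued provisioning, the request typically goes through a tool such as Kueue instead:

```
# Minimal sketch: a Job that requests one GPU from a flex-start node pool.
# FLEX_START_POOL_NAME and IMAGE_URI are placeholders.
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: flex-start-sample-job
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # Standard GKE node label that pins the Pods to the flex-start node pool.
        cloud.google.com/gke-nodepool: FLEX_START_POOL_NAME
      containers:
      - name: worker
        image: IMAGE_URI
        resources:
          limits:
            nvidia.com/gpu: "1"
EOF
```

The Pods stay in Pending while the cluster autoscaler waits for capacity and provisions the flex-start nodes.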
GKE counts the duration for each flex-start request separately, at the node level. The time available for running Pods might be slightly shorter because of delays during node startup. Pod retries share this duration, which means that less time is available for Pods after a retry.
Flex-start configurations
GKE supports the following flex-start configurations:
- Flex-start, where GKE allocates resources node by node. This configuration only requires you to set the `--flex-start` flag during node pool creation.
- Flex-start with queued provisioning, where GKE allocates all requested resources at the same time. All Pods of the workload can run together on newly provisioned nodes. The provisioned nodes aren't reused between workload executions. To use this configuration, you have to add both the `--flex-start` and `--enable-queued-provisioning` flags when you create the node pool. Command sketches for both configurations follow this list.
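The following commands are a sketch of how you might create node pools for the two configurations with the gcloud CLI. The cluster name, location, machine type, accelerator settings, and node counts are placeholders, and the exact set of flags depends on your environment; the limitations section notes that these node pools also require the --reservation-affinity=none flag and the ANY location policy:

```
# Sketch: node pool for flex-start (node-by-node allocation).
gcloud container node-pools create flex-start-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT \
    --flex-start \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=4 \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair

# Sketch: node pool for flex-start with queued provisioning.
gcloud container node-pools create queued-provisioning-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=MACHINE_TYPE \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT \
    --flex-start \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=8 \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair
```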
The following table compares the flex-start configurations:
| | Flex-start | Flex-start with queued provisioning |
|---|---|---|
| Availability | Preview | Generally Available (GA). The `flex-start` and `enable-queued-provisioning` flags are in Preview. To enroll in the Preview for these flags, fill out the request form. To give feedback or request support for this feature, contact dws-flex-start-feedback@google.com. |
| Recommended workload size | Small to medium, which means that the workload can run on a single node. For example, this configuration works well if you are running small training jobs, offline inference, or batch jobs. | Medium to large, which means that the workload can run on multiple nodes. Your workload requires multiple resources and can't start running until all GPU nodes are provisioned and ready at the same time. For example, this configuration works well if you are running distributed ML training workloads. |
| Provisioning type | GKE provisions one node at a time when resources are available. | GKE allocates all requested resources simultaneously. |
| Setup complexity | Less complex. This configuration is similar to on-demand and Spot VMs. | More complex. We strongly recommend that you use a quota management tool, such as Kueue. |
| Custom compute classes support | Yes | No |
| Node recycling | Yes | No |
| Price | Flex Start SKU | Flex Start SKU |
| Quota | Preemptible | Preemptible |
| Node upgrade strategy | Short-lived upgrades | Short-lived upgrades |
| `gcloud container node-pools create` flags | `--flex-start` | `--flex-start` and `--enable-queued-provisioning` |
| Get started | | Run a large-scale workload with flex-start with queued provisioning |
Optimize flex-start configuration
To create robust and cost-optimized AI/ML infrastructure, you can combine flex-start configurations with other available GKE features. We recommend that you use custom compute classes to define a prioritized list of node configurations based on your workload requirements. GKE then selects the most suitable configuration based on availability and your defined priority.
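The following manifest is a minimal sketch of that approach; the class name, machine type, and GPU settings are placeholders, and you should confirm the exact field names against the ComputeClass reference for your GKE version. It asks GKE to try Spot capacity first and to fall back to flex-start capacity:

```
# Sketch: a custom compute class that prefers Spot VMs and falls back to
# flex-start capacity. MACHINE_TYPE and GPU_TYPE are placeholders.
kubectl apply -f - <<EOF
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: gpu-fallback-class
spec:
  priorities:
  - machineType: MACHINE_TYPE
    spot: true
    gpu:
      type: GPU_TYPE
      count: 1
  - machineType: MACHINE_TYPE
    flexStart:
      enabled: true
    gpu:
      type: GPU_TYPE
      count: 1
  nodePoolAutoCreation:
    enabled: true
EOF
```

Workloads can then select this class in their node selector (for example, with the cloud.google.com/compute-class label).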
Manage disruptions in workloads that use Dynamic Workload Scheduler
Workloads that require the availability of all nodes, or most nodes, in a node pool are sensitive to evictions. In addition, nodes that are provisioned by using Dynamic Workload Scheduler requests don't support automatic repair. Automatic repair removes all workloads from a node, and thus prevents them from running.
All nodes using flex-start, queued provisioning, or both, use short-lived upgrades when the cluster control plane runs the minimum version for flex-start, 1.32.2-gke.1652000 or later.
Short-lived upgrades update a Standard node pool or group of nodes in an Autopilot cluster without disrupting running nodes. New nodes are created with the new configuration, gradually replacing existing nodes with the old configuration over time. Earlier versions of GKE, which don't support flex-start or short-lived upgrades, require different best practices.
Best practices to minimize workload disruptions for nodes using short-lived upgrades
Flex-start nodes and nodes which use queued provisioning are automatically configured to use short-lived upgrades when the cluster runs version 1.32.2-gke.1652000 or later.
To minimize disruptions to workloads running on nodes that use short-lived upgrades, perform the following tasks (example commands follow this list):
- Configure maintenance windows and exclusions to set when GKE should and shouldn't perform update operations, such as node upgrades, while ensuring that GKE still has the time to do automatic maintenance.
- Disable node auto-repair.
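The following commands sketch both tasks; the cluster and node pool names, and the maintenance window times, are placeholders:

```
# Sketch: define a recurring weekend maintenance window for automatic maintenance.
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --maintenance-window-start="2026-01-03T01:00:00Z" \
    --maintenance-window-end="2026-01-03T09:00:00Z" \
    --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"

# Sketch: disable node auto-repair on the flex-start node pool.
gcloud container node-pools update FLEX_START_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --no-enable-autorepair
```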
For nodes on clusters running versions earlier than 1.32.2-gke.1652000, and thus not using short-lived upgrades, refer to the specific guidance for those nodes.
Best practices to minimize workload disruption for queued provisioning nodes without short-lived upgrades
Nodes using queued provisioning on a cluster running a GKE version earlier than 1.32.2-gke.1652000 don't use short-lived upgrades. Clusters upgraded to 1.32.2-gke.1652000 or later with existing queued provisioning nodes are automatically updated to use short-lived upgrades.
For nodes running these earlier versions, refer to the following guidance (example commands follow this list):
- Depending on your cluster's release channel enrollment, use the following best practices to prevent node auto-upgrades from disrupting your workloads:
  - If your cluster is enrolled in a release channel, use maintenance windows and exclusions to prevent GKE from automatically upgrading your nodes while your workload is running.
  - If your cluster isn't enrolled in a release channel, disable node auto-upgrades. However, we recommend using release channels, where you can use maintenance exclusions with more granular scopes.
- Disable node auto-repair.
- Use maintenance windows and exclusions to minimize disruption to running workloads, while ensuring that GKE still has the time to do automatic maintenance. Be sure to schedule that time for when no workloads are running.
- To ensure that your node pool remains up to date, manually upgrade your node pool when no Dynamic Workload Scheduler requests are active and the node pool is empty.
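As a sketch of the last two items in this list, the following commands disable node auto-upgrades for a queued provisioning node pool and later upgrade it manually while it's empty; all names are placeholders:

```
# Sketch: turn off node auto-upgrades for the node pool.
gcloud container node-pools update QUEUED_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --no-enable-autoupgrade

# Sketch: manually upgrade the node pool while no Dynamic Workload Scheduler
# requests are active and the node pool is empty.
gcloud container clusters upgrade CLUSTER_NAME \
    --location=LOCATION \
    --node-pool=QUEUED_POOL_NAME
```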
Considerations for when your cluster migrates to short-lived upgrades
GKE updates existing nodes using queued provisioning to use short-lived upgrades when the cluster is upgraded to version 1.32.2-gke.1652000 or later. GKE doesn't update other settings, such as enabling node auto-upgrades if you disabled them for a specific node pool.
We recommend that you consider implementing the following best practices now that your node pools use short-lived upgrades:
- If you disabled node auto-upgrades by using the `--no-enable-autoupgrade` flag, this migration doesn't re-enable node auto-upgrades for the node pool. We recommend that you enable node auto-upgrades, because short-lived upgrades are not disruptive to existing nodes and the workloads that run on them. For more information, see Short-lived upgrades.
- If your cluster wasn't already enrolled in a release channel, we recommend that you enroll it so that you can use more granular maintenance exclusion scopes. Example commands for both recommendations follow this list.
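The following commands sketch both recommendations with placeholder names; the release channel shown is only an example:

```
# Sketch: re-enable node auto-upgrades on the node pool.
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --enable-autoupgrade

# Sketch: enroll the cluster in a release channel.
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --release-channel=regular
```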
Node recycling in flex-start
To help ensure a smooth transition of nodes and prevent downtime for your running jobs, flex-start supports node recycling. When a node reaches the end of its seven-day duration, GKE can automatically replace the node with a new one to preserve your running workloads.
To use node recycling, you must create a custom compute class profile and include the `nodeRecycling` field in the `flexStart` specification with the following parameters (see the manifest sketch after this list):
- `leadTimeSeconds`: a configurable parameter that lets you balance resource availability and cost efficiency. The `leadTimeSeconds` parameter specifies how many seconds before a node reaches the end of its seven-day duration that a new node provisioning process should start to substitute it. A longer lead time increases the probability that the new node is ready before the old one is removed, but might incur additional costs.
- `maxRunDurationSeconds`: a configurable parameter that lets you set the maximum duration for which a node can be active.
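A minimal sketch of such a compute class profile follows, assuming the flexStart and nodeRecycling fields take the form described above; the class name, machine type, GPU settings, and lead time are placeholders, and you should confirm the schema against the ComputeClass reference for your GKE version:

```
# Sketch: a compute class priority rule that enables node recycling with a
# one-hour lead time. MACHINE_TYPE and GPU_TYPE are placeholders.
kubectl apply -f - <<EOF
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: flex-start-recycling-class
spec:
  priorities:
  - machineType: MACHINE_TYPE
    gpu:
      type: GPU_TYPE
      count: 1
    flexStart:
      enabled: true
      nodeRecycling:
        # Start provisioning a replacement node one hour before the old node
        # reaches the end of its run duration.
        leadTimeSeconds: 3600
  nodePoolAutoCreation:
    enabled: true
EOF
```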
The node recycling process consists of the following steps:
- Recycling phase: GKE validates that a flex-start-provisioned node has the `nodeRecycling` flag set to `true`. If so, GKE starts the node recycling phase when the current date is greater than or equal to the difference between the values of the following fields:
  - `creationTimestamp` plus `maxRunDurationSeconds`
  - `leadTimeSeconds`

  The `creationTimestamp` field includes the time when the node was created. For example, with the default seven-day duration (a `maxRunDurationSeconds` value of 604,800) and a `leadTimeSeconds` value of 3,600, recycling starts one hour before the node reaches the end of its seven-day duration.
- Node creation: the creation process for the new node begins, proceeding through queueing and provisioning phases. The duration of the queueing phase can vary dynamically depending on the zone and specific accelerator capacity.
- Cordon the node that's reaching the end of its seven-day duration: after the new node is running, the old node is cordoned. This action prevents any new Pods from being scheduled on it. Existing Pods on that node continue to run.
- Node deprovisioning: the node that's reaching the end of its seven-day duration is eventually deprovisioned after a suitable period, which helps ensure that running workloads have migrated to the new node.
For more information about how to use node recycling, try the Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy tutorial.
Limitations
- Inter-pod anti-affinity is not supported. Cluster autoscaler doesn't consider inter-pod anti-affinity rules during node provisioning, which might lead to unschedulable workloads. This situation might happen when nodes for two or more Dynamic Workload Scheduler objects were provisioned in the same node pool.
- Only GPU nodes are supported.
- Reservations aren't supported with Dynamic Workload Scheduler. You have to specify the `--reservation-affinity=none` flag when you create the node pool. Dynamic Workload Scheduler requires and supports only the `ANY` location policy for cluster autoscaling.
- A single Dynamic Workload Scheduler request can create up to 1,000 virtual machines (VMs), which is the maximum number of nodes per zone for a single node pool.
- GKE uses the Compute Engine `ACTIVE_RESIZE_REQUESTS` quota to control the number of Dynamic Workload Scheduler requests that are pending in a queue. By default, this quota has a limit of 100 requests per Google Cloud project. If you attempt to create a Dynamic Workload Scheduler request that exceeds this quota, the new request fails.
- Node pools that use Dynamic Workload Scheduler are sensitive to disruption because the nodes are provisioned together. To learn more, see Manage disruptions in workloads that use Dynamic Workload Scheduler.
- You might see additional short-lived VMs listed in the Google Cloud console. This behavior is intended because Compute Engine might create and then promptly remove VMs until the capacity to provision all of the required machines is available.
- Spot VMs aren't supported.
- Dynamic Workload Scheduler doesn't support ephemeral volumes. You must use persistent volumes for storage. To select the best storage type that uses persistent volumes, see Storage for GKE clusters overview.