About node auto-provisioning


This page explains how node auto-provisioning works in Standard Google Kubernetes Engine (GKE) clusters. With node auto-provisioning, nodes are automatically scaled to meet the requirements of your workloads.

With Autopilot clusters, you don't need to manually provision nodes or manage node pools because GKE automatically manages node scaling and provisioning.

Why use node auto-provisioning

Node auto-provisioning automatically manages and scales a set of node pools on the user's behalf. Without node auto-provisioning, the GKE cluster autoscaler creates nodes only from user-created node pools. With node auto-provisioning, GKE automatically creates and deletes node pools.

Unsupported features

Node auto-provisioning doesn't create node pools that use any of the following features. However, the cluster autoscaler scales nodes in existing node pools with these features:

How node auto-provisioning works

Node auto-provisioning is a mechanism of the cluster autoscaler. On its own, the cluster autoscaler only scales existing node pools. With node auto-provisioning enabled, the cluster autoscaler can also create node pools automatically based on the specifications of unschedulable Pods.

Node auto-provisioning creates node pools based on the following information:

Resource limits

Node auto-provisioning and the cluster autoscaler have limits at the following levels:

  • Node pool level: Auto-provisioned node pools are limited to 1000 nodes.
  • Cluster level:
    • Any auto-provisioning limits that you define are enforced based on the total CPU and memory resources used across all node pools, not just auto-provisioned pools.
    • The cluster autoscaler does not create new nodes if doing so would exceed one of the defined limits. If limits are already exceeded, GKE doesn't delete the nodes.
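
As an illustration of cluster-level limits, the following is a minimal sketch of an auto-provisioning configuration file. It assumes the file format that gcloud accepts through the --autoprovisioning-config-file flag. The numbers are arbitrary and not recommendations:

```yaml
# Cluster-wide limits. Per the text above, they count resources across
# ALL node pools, not only auto-provisioned ones.
resourceLimits:
  - resourceType: 'cpu'      # total vCPU cores in the cluster
    minimum: 4
    maximum: 64
  - resourceType: 'memory'   # total memory (assumed unit: GB)
    minimum: 16
    maximum: 256
```

With limits like these, a scale-up that would push the cluster past 64 cores or 256 GB of memory is not performed, even if Pods remain pending.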

Workload separation

If pending Pods have node affinities and tolerations, node auto-provisioning can provision nodes with matching labels and taints.

Node auto-provisioning might create node pools with labels and taints if all of the following conditions are met:

  • A pending Pod requires a node with a specific label key and value.
  • The Pod has a toleration for a taint with the same key.
  • The toleration is for the NoSchedule effect, NoExecute effect, or all effects.

For instructions, refer to Configure workload separation in GKE.
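
The three conditions above can be sketched as a single Pod manifest. The label key group, the value batch, and the Pod name are hypothetical placeholders, not names that GKE defines:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: separated-workload   # hypothetical name
spec:
  # Condition 1: the Pod requires a node with a specific label key and value.
  nodeSelector:
    group: batch
  # Conditions 2 and 3: a toleration with the same key, for the
  # NoSchedule effect.
  tolerations:
  - key: group
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.14.2
```

If no existing node can run this Pod, node auto-provisioning might create a node pool whose nodes carry the group=batch label and a matching group=batch:NoSchedule taint.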

Limitations of using labels for workload separation

Node auto-provisioning triggers new node pool creation when you use labels supported by node auto-provisioning, like cloud.google.com/gke-spot or machine families. You can use other labels in your Pod manifests to narrow down the nodes on which GKE places Pods, but GKE won't use these labels to provision new node pools. For the list of labels that don't explicitly trigger node pool creation, see Limitations of workload separation with taints and tolerations.

Deletion of auto-provisioned node pools

When there are no nodes in an auto-provisioned node pool, GKE deletes the node pool. GKE does not delete node pools that are not auto-provisioned.

Supported machine types

Node auto-provisioning considers the Pod requirements in your cluster to determine what type of nodes would best fit those Pods.

By default, GKE uses the E2 machine series unless any of the following conditions apply:

  • The workload requests a feature that is not available in the E2 machine series. For example, if a GPU is requested by the workload, the N1 machine series is used for the new node pool.
  • The workload requests TPU resources. To learn more about TPUs, see the Introduction to Cloud TPU.
  • The workload uses the machine-family label. For more information, see Using a custom machine family.
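
As a sketch of the last condition, a Pod might request a machine family through a node selector. The full label key cloud.google.com/machine-family and the value n2 are assumptions for illustration. Check Using a custom machine family for the values that your cluster supports:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: n2-workload   # hypothetical name
spec:
  nodeSelector:
    # Assumed label key and value; overrides the default E2 series.
    cloud.google.com/machine-family: n2
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
```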

If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.

Supported node images

Node auto-provisioning creates node pools using one of the following node images:

  • Container-Optimized OS (cos_containerd).
  • Ubuntu (ubuntu_containerd).

Supported machine learning accelerators

Node auto-provisioning can create node pools with hardware accelerators such as GPUs and Cloud TPUs. Node auto-provisioning supports TPUs in GKE version 1.28 and later.

GPUs

If the Pod requests GPUs, node auto-provisioning assigns a machine type sufficiently large to support the number of GPUs that the Pod requests. The number of GPUs restricts the CPU and memory that the node can have. For more information, see GPU platforms.
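
A minimal sketch of a Pod that requests one GPU follows. The cloud.google.com/gke-accelerator node selector and the nvidia-tesla-t4 model are assumptions chosen for illustration, and the container image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload   # hypothetical name
spec:
  nodeSelector:
    # Assumed label and GPU model; pick a model available in your zone.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: app
    image: nginx:1.14.2   # placeholder image
    resources:
      limits:
        # The GPU count drives the machine type that is selected.
        nvidia.com/gpu: 1
```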

Cloud TPUs

GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.

With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meet the requirements of pending workloads.

When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:

  • Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool can contain any number of TPU nodes between zero and the maximum size of the node pool, as determined by the --max-nodes and --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn how to create a single-host TPU slice node pool, see Create a node pool.

  • Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, a TPU node pool with the machine type ct5lp-hightpu-4t and a topology of 16x16 contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn how to create a multi-host TPU slice node pool, see Create a node pool.

If a specific TPU slice has no Pods that are running or are pending to be scheduled, GKE scales down the node pool. Multi-host TPU slice node pools are scaled down atomically. Single-host TPU slice node pools are scaled down by removing individual single-host TPU slices.

When you enable node auto-provisioning with TPUs, GKE makes scaling decisions based on the values defined in the Pod request. The following manifest is an example of a Deployment specification that results in one node pool that contains a TPU v4 slice with a 2x2x2 topology and two ct4p-hightpu-4t machines:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tpu-workload
      labels:
        app: tpu-workload
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx-tpu
      template:
        metadata:
          labels:
            app: nginx-tpu
        spec:
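          nodeSelector:
            # TPU version and type for the slice.
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            # 2x2x2 = 8 chips, which is two 4-chip ct4p-hightpu-4t machines.
            cloud.google.com/gke-tpu-topology: 2x2x2
            # Reservation to consume; omit to run without a reservation.
            cloud.google.com/reservation-name: my-reservation
          containers:
          - name: nginx
            image: nginx:1.14.2
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4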
            ports:
            - containerPort: 80

Where:

  • cloud.google.com/gke-tpu-accelerator: The TPU version and type. For example, you can use any of the following:
    • TPU v4 with tpu-v4-podslice.
    • TPU v5e with tpu-v5-lite-podslice.
    • TPU v6e with tpu-v6e-slice.
  • cloud.google.com/gke-tpu-topology: The number and physical arrangement of TPU chips within a TPU slice. When creating a node pool and enabling node auto-provisioning, you select the TPU topology. For more information about Cloud TPU topologies, see TPU configurations.
  • google.com/tpu (in the limits field): The number of TPU chips on the TPU VM. Most configurations have just one correct value. However, tpu-v5-lite-podslice with a 2x4 topology accepts two values:
    • If you specify google.com/tpu = 8, node auto-provisioning scales up a single-host TPU slice node pool by adding one ct5lp-hightpu-8t machine.
    • If you specify google.com/tpu = 4, node auto-provisioning creates a multi-host TPU slice node pool with two ct5lp-hightpu-4t machines.
  • cloud.google.com/reservation-name: The name of the reservation that the workload uses. If omitted, the workload doesn't use any reservation.
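
To make the 2x4 special case concrete, the following sketch shows a Pod that selects the single-host variant by requesting eight chips. The Pod name and container image are hypothetical placeholders. Changing both google.com/tpu values to 4 would select the multi-host variant with two ct5lp-hightpu-4t machines instead:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v5e-single-host   # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: trainer
    image: nginx:1.14.2   # placeholder image
    resources:
      requests:
        google.com/tpu: 8   # 8 chips: one ct5lp-hightpu-8t machine
      limits:
        google.com/tpu: 8
```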

If you set tpu-v6e-slice, node auto-provisioning makes the following decisions:

The first two columns are values that you set in the Pod manifest. The last three columns are decided by node auto-provisioning.

| gke-tpu-topology | google.com/tpu limit | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct6e-standard-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct6e-standard-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct6e-standard-8t |
| 2x4 | 4 | Multi-host TPU slice | 2 | ct6e-standard-4t |
| 4x4 | 4 | Multi-host TPU slice | 4 | ct6e-standard-4t |
| 4x8 | 4 | Multi-host TPU slice | 8 | ct6e-standard-4t |
| 8x8 | 4 | Multi-host TPU slice | 16 | ct6e-standard-4t |
| 8x16 | 4 | Multi-host TPU slice | 32 | ct6e-standard-4t |
| 16x16 | 4 | Multi-host TPU slice | 64 | ct6e-standard-4t |

If you set tpu-v4-podslice, node auto-provisioning makes the following decisions:

The first two columns are values that you set in the Pod manifest. The last three columns are decided by node auto-provisioning.

| gke-tpu-topology | google.com/tpu limit | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 2x2x1 | 4 | Single-host TPU slice | Flexible | ct4p-hightpu-4t |
| {A}x{B}x{C} | 4 | Multi-host TPU slice | ({A}x{B}x{C})/4 | ct4p-hightpu-4t |

The product {A}x{B}x{C} is the number of chips in the node pool. For example, 4x4x4 defines a topology of 64 chips. If you use topologies larger than 64 chips, the values that you assign to {A}, {B}, and {C} must meet the following conditions:

  • {A}, {B}, and {C} are either all less than or equal to four, or all multiples of four.
  • The largest topology supported is 12x16x16.
  • The assigned values follow the pattern A ≤ B ≤ C. For example, 2x2x4 or 2x4x4 for small topologies.
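
As a worked check of the multi-host node pool size rule for tpu-v4-podslice, assuming each ct4p-hightpu-4t machine contributes four chips (which is what the division by 4 implies), the sizes for two topologies are:

```latex
\begin{aligned}
\text{nodes}(A \times B \times C) &= \frac{A \cdot B \cdot C}{4} \\
\text{nodes}(2 \times 2 \times 2) &= \frac{8}{4} = 2 \\
\text{nodes}(4 \times 4 \times 4) &= \frac{64}{4} = 16
\end{aligned}
```

The 2x2x2 result matches the two ct4p-hightpu-4t machines in the earlier Deployment example.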

If you set tpu-v5-lite-podslice, node auto-provisioning makes the following decisions:

The first two columns are values that you set in the Pod manifest. The last three columns are decided by node auto-provisioning.

| gke-tpu-topology | google.com/tpu limit | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct5lp-hightpu-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct5lp-hightpu-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct5lp-hightpu-8t |
| 2x4¹ | 4 | Multi-host TPU slice | 2 (8/4) | ct5lp-hightpu-4t |
| 4x4 | 4 | Multi-host TPU slice | 4 (16/4) | ct5lp-hightpu-4t |
| 4x8 | 4 | Multi-host TPU slice | 8 (32/4) | ct5lp-hightpu-4t |
| 8x8 | 4 | Multi-host TPU slice | 16 (64/4) | ct5lp-hightpu-4t |
| 8x16 | 4 | Multi-host TPU slice | 32 (128/4) | ct5lp-hightpu-4t |
| 16x16 | 4 | Multi-host TPU slice | 64 (256/4) | ct5lp-hightpu-4t |

  ¹ Special case where the machine type depends on the value that you define in the google.com/tpu limits field.

If you set the accelerator type to tpu-v5-lite-device, node auto-provisioning makes the following decisions:

The first two columns are values that you set in the Pod manifest. The last three columns are decided by node auto-provisioning.

| gke-tpu-topology | google.com/tpu limit | Type of node pool | Node pool size | Machine type |
|---|---|---|---|---|
| 1x1 | 1 | Single-host TPU slice | Flexible | ct5l-hightpu-1t |
| 2x2 | 4 | Single-host TPU slice | Flexible | ct5l-hightpu-4t |
| 2x4 | 8 | Single-host TPU slice | Flexible | ct5l-hightpu-8t |

To learn how to set up node auto-provisioning, see Configuring TPUs.

Support for Spot VMs

Node auto-provisioning supports creating node pools based on Spot VMs.

Node auto-provisioning considers creating node pools based on Spot VMs only if unschedulable Pods exist that have a toleration for the cloud.google.com/gke-spot="true":NoSchedule taint. GKE automatically applies this taint to nodes in auto-provisioned node pools that are based on Spot VMs.

To ensure that your workloads run only on node pools based on Spot VMs, combine the toleration with a nodeSelector or node affinity rule for one of the following node labels: cloud.google.com/gke-spot="true", or cloud.google.com/gke-provisioning=spot for nodes running GKE version 1.25.5-gke.2500 or later.
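
A minimal sketch of a Pod that both tolerates the Spot taint and selects Spot nodes follows. The Pod name is a hypothetical placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-workload   # hypothetical name
spec:
  # Restricts the Pod to nodes based on Spot VMs.
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  # Lets the Pod schedule despite the taint on Spot nodes.
  tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.14.2
```

The toleration alone only permits the Pod to run on Spot nodes. The node selector is what keeps it off other nodes.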

Support for Pods requesting ephemeral storage

Node auto-provisioning supports creating node pools when Pods request ephemeral storage. The boot disk size is the same for all new auto-provisioned node pools, and you can customize it.

The default boot disk size is 100 GiB. Ephemeral storage backed by local SSDs is not supported.

Node auto-provisioning provisions a node pool only if the allocatable ephemeral storage of a node with the specified boot disk size is greater than or equal to the ephemeral storage request of a pending Pod. If the request is higher than the allocatable amount, node auto-provisioning doesn't provision a node pool. GKE doesn't dynamically adjust node disk sizes based on the ephemeral storage requests of pending Pods.
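
As an illustration, the following sketch shows a Pod that requests ephemeral storage. The Pod name and the 10Gi figure are arbitrary. Whether the request fits depends on the allocatable ephemeral storage that remains on the boot disk after system overhead:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-workload   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        # Must not exceed the node's allocatable ephemeral storage,
        # or no node pool is provisioned for this Pod.
        ephemeral-storage: 10Gi
      limits:
        ephemeral-storage: 10Gi
```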

Scalability limitations

Node auto-provisioning has the same limitations as the cluster autoscaler, as well as the following additional limitations:

  • Limit on number of separated workloads: Node auto-provisioning supports a maximum of 100 distinct separated workloads.
  • Limit on number of node pools: Node auto-provisioning de-prioritizes creating new node pools when the number of pools in the cluster approaches 100. Creating over 100 node pools is possible, but only when creating a node pool is the only option to schedule a pending Pod.

What's next