Use preemptible VMs to run fault-tolerant workloads


This page shows you how to use preemptible VMs in Google Kubernetes Engine (GKE).

Overview

Preemptible VMs are Compute Engine VM instances that are priced lower than standard VMs and provide no guarantee of availability. Preemptible VMs offer similar functionality to Spot VMs, but only last up to 24 hours after creation.

In some cases, a preemptible VM might appear to last longer than 24 hours. This can occur when the replacement Compute Engine instance starts so quickly that Kubernetes doesn't recognize that a different Compute Engine VM was created. The underlying Compute Engine instance still has a maximum duration of 24 hours and follows the expected preemptible VM behavior.

Comparison to Spot VMs

Preemptible VMs share many similarities with Spot VMs, including the following:

  • Terminated when Compute Engine requires the resources to run standard VMs.
  • Useful for running stateless, batch, or fault-tolerant workloads.
  • Lower pricing than standard VMs.
  • On clusters running GKE version 1.20 and later, graceful node shutdown is enabled by default.
  • Works with the cluster autoscaler and node auto-provisioning.

In contrast to Spot VMs, which have no maximum expiration time, preemptible VMs only last for up to 24 hours after creation.

You can enable preemptible VMs on new clusters and node pools, use nodeSelector or node affinity to control scheduling, and use taints and tolerations to avoid issues with system workloads when nodes are preempted.

Termination and graceful shutdown of preemptible VMs

When Compute Engine needs to reclaim the resources used by preemptible VMs, a termination notice is sent to GKE. Preemptible VMs terminate 30 seconds after receiving the termination notice.

By default, clusters use graceful node shutdown. The kubelet detects the termination notice and gracefully terminates the Pods that are running on the node. If the Pods are part of a managed workload, such as a Deployment, the controller creates and schedules new Pods to replace the terminated Pods.

On a best-effort basis, the kubelet grants a graceful termination period of 15 seconds for non-system Pods, after which system Pods (with the system-cluster-critical or system-node-critical priorityClasses) have 15 seconds to gracefully terminate. During graceful node termination, the kubelet updates the status of the Pods and assigns a Failed phase and a Terminated reason to the terminated Pods.

The VM shuts down 30 seconds after the termination notice is sent even if you specify a value greater than 15 seconds in the terminationGracePeriodSeconds field of your Pod manifest.
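
Because non-system Pods get roughly 15 seconds, a workload that stops promptly on SIGTERM (the signal the kubelet sends to containers during graceful termination) tends to fare better than one that relies on a long grace period. The following is a minimal sketch of a container entrypoint, not a GKE-provided script. The function names, the CHECKPOINT_FILE, MAX_ITEMS, and WORK_SECONDS variables, and the file-based checkpoint are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical entrypoint for a batch worker on a preemptible node.
# Assumption: each unit of work is short, so the loop can stop well
# inside the roughly 15-second window described above.

checkpoint() {
  # Placeholder persistence: record the last finished item in a file.
  echo "$1" > "${CHECKPOINT_FILE:-/tmp/worker.checkpoint}"
}

run_worker() {
  local item=0
  STOP_REQUESTED=0
  # SIGTERM only sets a flag; the loop exits after the current item.
  trap 'STOP_REQUESTED=1' TERM
  while [ "$STOP_REQUESTED" -eq 0 ] && [ "$item" -lt "${MAX_ITEMS:-100}" ]; do
    item=$((item + 1))
    sleep "${WORK_SECONDS:-1}"   # stands in for one unit of real work
    checkpoint "$item"
  done
  trap - TERM
  echo "stopped after item $item"
}

# Usage (hypothetical): call run_worker as the container's main process.
# run_worker
```

The shell runs the trap only after the current foreground command returns, so this pattern assumes that a single unit of work takes far less than 15 seconds.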

When the number of terminated Pods reaches a threshold of 1000 for clusters with fewer than 100 nodes or 5000 for clusters with 100 nodes or more, garbage collection cleans up the Pods.

You can also delete terminated Pods manually using the following commands:

  kubectl get pods --all-namespaces | grep -i NodeShutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
  kubectl get pods --all-namespaces | grep -i Terminated | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
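
Before deleting anything, it can help to preview which Pods match. The following is a read-only sketch that assumes kubectl is already configured for the cluster. The function name is hypothetical, and the filter relies on the Failed phase and the Terminated or NodeShutdown reason described in this section.

```shell
# Print "NAMESPACE NAME REASON" for Failed Pods whose status reason
# suggests a node shutdown. Read-only: nothing is deleted.
list_shutdown_pods() {
  kubectl get pods --all-namespaces \
    --field-selector=status.phase=Failed \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.reason}{"\n"}{end}' |
    awk '$3 == "Terminated" || $3 == "NodeShutdown"'
}

# Usage:
# list_shutdown_pods
```

Pods that failed for other reasons, or that have no reason set, are left out of the listing.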

Modifications to Kubernetes behavior

Using preemptible VMs on GKE modifies the guarantees that are provided by Kubernetes PodDisruptionBudgets. The reclamation of preemptible VMs is involuntary and is not covered by the guarantees of PodDisruptionBudgets. You might experience greater unavailability than your configured PodDisruptionBudget.

Limitations

  • The kubelet graceful node shutdown feature is only enabled on clusters running GKE version 1.20 and later. For GKE versions prior to 1.20, you can use the Kubernetes on GCP Node Termination Event Handler to gracefully terminate your Pods when preemptible VMs are terminated.
  • Preemptible VMs do not support Windows Server node pools.
  • In GKE, you can't change the duration of the grace period for node shutdown. The shutdownGracePeriod and the shutdownGracePeriodCriticalPods kubelet configuration fields are immutable.

Create a cluster or node pool with preemptible VMs

You can use the Google Cloud CLI to create a cluster or node pool with preemptible VMs.

To create a cluster with preemptible VMs, run the following command:

gcloud container clusters create CLUSTER_NAME \
    --preemptible

Replace CLUSTER_NAME with the name of your new cluster.

To create a node pool with preemptible VMs, run the following command:

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --preemptible

Replace POOL_NAME with the name of your new node pool and CLUSTER_NAME with the name of the existing cluster.

Use nodeSelector to schedule Pods on preemptible VMs

GKE adds the cloud.google.com/gke-preemptible=true label to nodes that use preemptible VMs. Nodes running GKE version 1.25.5-gke.2500 or later also get the cloud.google.com/gke-provisioning=preemptible label. You can use a nodeSelector in your Deployments to tell GKE to schedule Pods onto preemptible VMs.

For example, the following Deployment filters for preemptible VMs using the cloud.google.com/gke-preemptible label:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
      - name: hello-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        resources:
          requests:
            cpu: 200m
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
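
After you apply the Deployment, you may want to confirm where the Pods landed. The following sketch assumes kubectl is configured for the cluster and that the Deployment runs in the current namespace. The function names are hypothetical; the label and the app=hello-app selector come from the example above.

```shell
# List nodes that carry the preemptible label, one "node/NAME" per line.
list_preemptible_nodes() {
  kubectl get nodes -l cloud.google.com/gke-preemptible=true -o name
}

# Print "POD NODE" for each hello-app Pod.
list_hello_app_placement() {
  kubectl get pods -l app=hello-app \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}'
}

# Report, per Pod, whether its node is in the preemptible list.
check_hello_app_placement() {
  preemptible_nodes="$(list_preemptible_nodes | sed 's|^node/||')"
  list_hello_app_placement | while read -r pod node; do
    if [ -n "$node" ] && printf '%s\n' "$preemptible_nodes" | grep -qxF -- "$node"; then
      echo "$pod: on preemptible node $node"
    else
      echo "$pod: NOT on a preemptible node (${node:-unscheduled})"
    fi
  done
}

# Usage:
# check_hello_app_placement
```

A Pod reported as unscheduled usually means that no node with the label currently has room for it.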

Use node taints for preemptible VMs

You can taint nodes that use preemptible VMs so that GKE can only place Pods with the corresponding toleration on those nodes.

To add a node taint to a node pool that uses preemptible VMs, use the --node-taints flag when creating the node pool, similar to the following command:

gcloud container node-pools create POOL2_NAME \
    --cluster=CLUSTER_NAME \
    --preemptible \
    --node-taints=cloud.google.com/gke-preemptible="true":NoSchedule

Now, only Pods that tolerate the node taint are scheduled onto nodes in that node pool.

To add the relevant toleration to your Pods, modify your Deployments and add the following to your Pod specification:

tolerations:
- key: cloud.google.com/gke-preemptible
  operator: Equal
  value: "true"
  effect: NoSchedule
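
To check that the taint is actually present on the nodes, you can list each node's taints. The following sketch assumes kubectl is configured for the cluster. The function names are hypothetical; the taint string matches the one passed to --node-taints above.

```shell
# Print one line per node: the node name, a tab, then its taints
# formatted as key=value:effect and separated by spaces.
list_node_taints() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}={.value}:{.effect}{" "}{end}{"\n"}{end}'
}

# Keep only the nodes that carry the preemptible taint.
list_preemptible_tainted_nodes() {
  list_node_taints | grep -F 'cloud.google.com/gke-preemptible=true:NoSchedule'
}

# Usage:
# list_preemptible_tainted_nodes
```

Nodes without any taints still appear in list_node_taints, with an empty second column.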

Node taints for GPU preemptible VMs

Preemptible VMs support using GPUs. You should create at least one other node pool in your cluster that doesn't use preemptible VMs before adding a GPU node pool that uses preemptible VMs. Having a standard node pool ensures that GKE can safely place system components like DNS.

If you create a new cluster with GPU node pools that use preemptible VMs, or if you add a new GPU node pool that uses preemptible VMs to a cluster that does not already have a standard node pool, GKE does not automatically add the nvidia.com/gpu=present:NoSchedule taint to the nodes. GKE might schedule system Pods onto the preemptible VMs, which can lead to disruptions. This behavior also increases your resource consumption, because GPU nodes are more expensive than non-GPU nodes.
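
Before adding a GPU node pool that uses preemptible VMs, you can check whether the cluster already has a standard node pool. The following is a sketch under stated assumptions: the function name is hypothetical, the gcloud CLI is assumed to have a default location configured, and the value() output format is assumed to print a node pool's config.preemptible and config.spot fields as "True" when set and as empty when unset.

```shell
# Succeed (exit status 0) if at least one node pool in the cluster is
# neither preemptible nor Spot; fail (exit status 1) otherwise.
# Assumption: unset boolean fields print as empty columns.
has_standard_node_pool() {
  gcloud container node-pools list --cluster="$1" \
    --format='value(name,config.preemptible,config.spot)' |
    awk -F'\t' '
      tolower($2) != "true" && tolower($3) != "true" { found = 1 }
      END { exit (found ? 0 : 1) }
    '
}

# Usage (CLUSTER_NAME is a placeholder):
# has_standard_node_pool CLUSTER_NAME || echo "Create a standard node pool first."
```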
