Restore from a Pod snapshot

Google Kubernetes Engine (GKE) Pod snapshots help improve workload startup latency by restoring snapshots of running Pods. A Pod snapshot saves the entire Pod state, including memory and changes to the root file system. When new replicas are created, instead of initializing the Pod from a fresh state, the snapshot is restored. The Pod then resumes execution from the point the snapshot was taken.

This document explains how to enable and configure GKE Pod snapshots for your workloads.

For more information about how Pod snapshots work, see About Pod snapshots.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
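
For example, you can get the latest version of the gcloud CLI by running the following command:

gcloud components update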

Enable Pod snapshots

To enable Pod snapshots, first create or update a cluster with the Pod snapshot feature enabled. Then, create or update a node pool to run in GKE Sandbox.

  1. To enable the feature on a cluster, complete one of the following steps:

    • To enable Pod snapshots on a new cluster, run the following command:

      gcloud beta container clusters create CLUSTER_NAME \
          --enable-pod-snapshots \
          --cluster-version=CLUSTER_VERSION \
          --workload-pool=PROJECT_ID.svc.id.goog \
          --workload-metadata=GKE_METADATA
      

      Replace the following:

      • CLUSTER_NAME: the name of your cluster.
      • CLUSTER_VERSION: the version of your new cluster, which must be 1.34.1-gke.3084001 or later.
      • PROJECT_ID: your project ID.
    • To enable Pod snapshots on an existing cluster, complete the following steps:

      1. Update the cluster to version 1.34.1-gke.3084001 or later:

        gcloud container clusters upgrade CLUSTER_NAME \
            --node-pool=NODEPOOL_NAME \
            --cluster-version=CLUSTER_VERSION
        

        Replace the following:

        • CLUSTER_NAME: the name of your cluster.
        • NODEPOOL_NAME: the name of your node pool.
        • CLUSTER_VERSION: the version to upgrade your cluster to, which must be 1.34.1-gke.3084001 or later.
      2. Enable Pod snapshots on your cluster:

        gcloud container clusters update CLUSTER_NAME \
           --workload-pool=PROJECT_ID.svc.id.goog \
           --enable-pod-snapshots
        

        Replace PROJECT_ID with your project ID.

  2. Enable GKE Sandbox on your Standard cluster:

    gcloud container node-pools create NODE_POOL_NAME \
      --cluster=CLUSTER_NAME \
      --node-version=NODE_VERSION \
      --machine-type=MACHINE_TYPE \
      --image-type=cos_containerd \
      --sandbox type=gvisor
    

    Replace the following:

    • NODE_POOL_NAME: the name of your new node pool.
    • NODE_VERSION: the version to use for the node pool.
    • MACHINE_TYPE: the type of machine to use for the nodes.

    For more information about using gVisor, see Isolate your workloads using GKE Sandbox.
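
    To verify that the nodes in the new node pool run in the sandbox, you can list them by the sandbox.gke.io/runtime=gvisor label, which GKE applies to GKE Sandbox node pools:

    kubectl get nodes -l sandbox.gke.io/runtime=gvisor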

Store snapshots

Pod snapshots are stored in a Cloud Storage bucket, which contains the memory and (optionally) GPU state. Pod snapshots require Workload Identity Federation for GKE so that the Pod's service account can authenticate to Cloud Storage.

Pod snapshots require the following configuration for the bucket:

  • Hierarchical namespaces: must be enabled to allow for higher read and write queries per second. Hierarchical namespaces also require that uniform bucket-level access is enabled.
  • Soft delete: because Pod snapshots use parallel composite uploads, you should disable data protection features like soft delete. If left enabled, the deletions of the temporary objects can increase your storage bill significantly.
  • Location: the Cloud Storage bucket must be in the same location as the GKE cluster, because transferring snapshots across regions can degrade performance.

Create a Cloud Storage bucket

To create the bucket and the permissions required, complete the following steps:

  1. Create a Cloud Storage bucket. The following command creates a bucket with the required configuration:

    gcloud storage buckets create "gs://BUCKET_NAME" \
       --uniform-bucket-level-access \
       --enable-hierarchical-namespace \
       --soft-delete-duration=0d \
       --location="LOCATION"
    

    Replace the following:

    • BUCKET_NAME: the name of your bucket.
    • LOCATION: the location of your bucket.

    For a complete list of options for bucket creation, see buckets create options.
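
  2. Optionally, verify the bucket configuration:

    gcloud storage buckets describe "gs://BUCKET_NAME"

    The output should show the hierarchical namespace, uniform bucket-level access, and soft delete settings, which you can check against the requirements in the preceding section.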

Grant workloads permission to access the Cloud Storage bucket

By default, GKE doesn't have permissions to access Cloud Storage. To read and write snapshot files, you must grant IAM permissions to the Kubernetes service account (KSA) used by your workload Pods.

  1. Get credentials so that you can communicate with your cluster with kubectl commands:

    gcloud container clusters get-credentials "CLUSTER_NAME"
    
  2. For each Pod, complete the following steps:

    1. Create a KSA:

      kubectl create serviceaccount "KSA_NAME" \
          --namespace "NAMESPACE"
      

      Replace the following:

      • KSA_NAME: the name of your KSA.
      • NAMESPACE: the namespace for your Pods.
    2. Grant the KSA permission to access the bucket:

      gcloud storage buckets add-iam-policy-binding "gs://BUCKET_NAME" \
          --member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
          --role="roles/storage.bucketViewer"
      
      gcloud storage buckets add-iam-policy-binding "gs://BUCKET_NAME" \
          --member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
          --role="roles/storage.objectUser"
      

      Replace the following:

      • PROJECT_NUMBER: your project number.
      • PROJECT_ID: your project ID.
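
  3. Optionally, verify the bindings by viewing the bucket's IAM policy:

    gcloud storage buckets get-iam-policy "gs://BUCKET_NAME"

    The output should list the KSA principal with the roles/storage.bucketViewer and roles/storage.objectUser roles that you granted in the previous step.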

(Optional) Create managed folders for the Cloud Storage bucket

Creating managed folders lets you isolate snapshot permissions between mutually untrusted Pods, which is useful in multi-tenant use cases. To set up managed folders, complete the following steps:

  1. Create a custom IAM role that contains only the necessary permissions for Pod snapshots:

    gcloud iam roles create podSnapshotGcsReadWriter \
        --project="PROJECT_ID" \
        --permissions="storage.objects.get,storage.objects.create,storage.objects.delete,storage.folders.create"
    
  2. Grant the roles/storage.bucketViewer role to all KSAs in the target namespace. This role lets KSAs read bucket metadata, but does not grant read or write permissions to objects in the bucket.

    gcloud storage buckets add-iam-policy-binding "gs://BUCKET_NAME" \
        --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/namespace/NAMESPACE" \
        --role="roles/storage.bucketViewer"
    

    Replace the following:

    • PROJECT_NUMBER: your project number.
    • PROJECT_ID: your project ID.
  3. For each KSA that needs to store Pod snapshots, complete the following steps:

    1. Create a managed folder for the KSA:

      gcloud storage managed-folders create "gs://BUCKET_NAME/FOLDER_PATH/"
      

      Replace FOLDER_PATH with the path for the managed folder, for example my-app-snapshots.

    2. Grant the KSA the custom podSnapshotGcsReadWriter role on the managed folder:

      gcloud storage managed-folders add-iam-policy-binding "gs://BUCKET_NAME/FOLDER_PATH/" \
          --member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
          --role="projects/PROJECT_ID/roles/podSnapshotGcsReadWriter"
      

      Replace KSA_NAME with the name of the KSA.
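
  4. Optionally, confirm that the managed folders were created:

    gcloud storage managed-folders list "gs://BUCKET_NAME"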

Configure storage for snapshots

To specify where to store snapshot files, create a PodSnapshotStorageConfig resource.

  1. The following example configures GKE to store Pod snapshots in the FOLDER_PATH/ path inside the Cloud Storage bucket BUCKET_NAME. Save the following manifest as example-pod-snapshot-storage-config.yaml:

    apiVersion: podsnapshot.gke.io/v1alpha1
    kind: PodSnapshotStorageConfig
    metadata:
      name: example-pod-snapshot-storage-config
      namespace: NAMESPACE
    spec:
      snapshotStorageConfig:
        gcs:
          bucket: "BUCKET_NAME"
          path: "FOLDER_PATH"
    

    Replace the following:

    • NAMESPACE: the namespace for your Pods. By default, this is default.
    • BUCKET_NAME: the name of your Cloud Storage bucket.
    • FOLDER_PATH: the path for the Cloud Storage managed folder.
  2. Apply the manifest:

    kubectl apply -f example-pod-snapshot-storage-config.yaml
    
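
  3. Optionally, confirm that the resource was created. Because the manifest sets the namespace, you can query the resource directly from the file:

    kubectl get -f example-pod-snapshot-storage-config.yaml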

Create a snapshot policy

To enable snapshots for a Pod, create a PodSnapshotPolicy resource with a selector that matches the Pod's labels.

  1. The following example creates a policy that applies to Pods with the app: my-app label and uses the example-pod-snapshot-storage-config storage configuration. Save the following manifest as example-pod-snapshot-policy.yaml:

    apiVersion: podsnapshot.gke.io/v1alpha1
    kind: PodSnapshotPolicy
    metadata:
      name: example-pod-snapshot-policy
      namespace: NAMESPACE
    spec:
      storageConfigName: example-pod-snapshot-storage-config
      selector:
        matchLabels:
          app: my-app
      triggerConfig:
        type: workload
        postCheckpoint: resume
    
  2. Apply the manifest:

    kubectl apply -f example-pod-snapshot-policy.yaml --namespace NAMESPACE
    
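
  3. Optionally, verify that the policy exists in the namespace:

    kubectl get podsnapshotpolicies.podsnapshot.gke.io --namespace NAMESPACE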

Optimize snapshot size

When a Pod snapshot is triggered, gVisor captures the entire state of all containers, including:

  • Application state, such as memory and registers
  • Changes to the root file system and tmpfs (including emptyDir volumes)
  • Kernel state, such as open file descriptors, threads, and sockets

The size of the snapshot is determined by these factors. Larger snapshots take longer to save and restore. To optimize performance, before triggering a snapshot, you should clean up any application state or files that aren't required after the Pod is restored from the snapshot.

Optimizing snapshot size is particularly important for workloads like large language models (LLMs). LLM servers often download model weights into local storage (rootfs or tmpfs) before loading them into the GPU. When a snapshot is taken, both the GPU state and the model weight files are saved. In this scenario, if the model is 100 GB, the resulting snapshot is roughly 200 GB (100 GB of model files, plus 100 GB representing the GPU state). After the model weights are loaded into the GPU, the files on the file system are often not needed for the application to run. By deleting these model files before you trigger the snapshot, you can reduce the snapshot size by half and restore the application with significantly lower latency.
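
For example, the following minimal sketch shows this cleanup pattern from inside the container, assuming the hypothetical path /models/weights for the staged model files:

# /models/weights is a hypothetical path; replace it with wherever your
# server stages model files. Delete the files only after the weights are
# loaded into the GPU.
rm -rf /models/weights

# Then trigger the snapshot, as described in the next section.
echo 1 > /proc/gvisor/checkpoint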

Trigger a snapshot from a workload

To trigger a snapshot from within your application code, configure your application to send a signal when it's ready for a snapshot. To signal readiness, write 1 to the /proc/gvisor/checkpoint file, for example echo 1 > /proc/gvisor/checkpoint. The write operation starts the snapshot process asynchronously and returns immediately. Reading from the same file descriptor blocks the reading process until both the snapshot and the restore are complete and the workload is ready to resume.

The exact usage varies depending on your application, but the following example shows a snapshot trigger for a Python application. To trigger a snapshot from this example workload, complete the following steps:

  1. Save the following manifest as my-app.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
      namespace: NAMESPACE
      labels:
        app: my-app
    spec:
      serviceAccountName: KSA_NAME
      runtimeClassName: gvisor
      containers:
      - name: my-container
        image: python:3.10-slim
        command: ["python3", "-c"]
        args:
          - |
            import time
            def trigger_snapshot():
              try:
                with open("/proc/gvisor/checkpoint", "r+") as f:
                  f.write("1")
                  res = f.read().rstrip()
                  print(f"GKE Pod Snapshot: {res}")
              except FileNotFoundError:
                print("GKE Pod Snapshot file does not exist -- Pod Snapshots is disabled")
                return
              except OSError as e:
                return e
            i = 0
            while True:
              print(f"Count: {i}", flush=True)
              if i == 20:  # Simulate the application being ready to snapshot at the 20th count
                trigger_snapshot()
              i += 1
              time.sleep(1)
        resources:
          limits:
            cpu: "500m"
            memory: "512Mi"
          requests:
            cpu: "250m"
            memory: "256Mi"
    
  2. Deploy the application:

    kubectl apply -f my-app.yaml
    
  3. To verify that a snapshot was taken, check the event history for the GKEPodSnapshotting event:

    kubectl get events -o \
    custom-columns=NAME:.involvedObject.name,CREATIONTIME:.metadata.creationTimestamp,REASON:.reason,MESSAGE:.message \
    --namespace NAMESPACE \
    --field-selector involvedObject.name=my-app,reason=GKEPodSnapshotting
    

    The output resembles the following:

    NAME                                    CREATIONTIME           REASON               MESSAGE
    my-app                                  2025-11-05T16:25:11Z   GKEPodSnapshotting   Successfully checkpointed the pod to PodSnapshot
    

Manage snapshots

When you create a Pod snapshot, a PodSnapshot custom resource is created to store the Pod's state at that time.

To view all PodSnapshot resources in a namespace, run the following command:

kubectl get podsnapshots.podsnapshot.gke.io --namespace NAMESPACE

The output resembles the following:

NAME                                   STATUS                  POLICY           AGE
de334898-1e7a-4cdb-9f2e-7cc2181c29e4   AllSnapshotsAvailable   example-policy   47h
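
To inspect the details of a specific snapshot, you can describe it by name:

kubectl describe podsnapshots.podsnapshot.gke.io POD_SNAPSHOT_NAME --namespace NAMESPACE

Replace POD_SNAPSHOT_NAME with the name of a snapshot from the previous output.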

Restore a workload from a snapshot

To restore your workload from the latest snapshot, you can delete the existing Pod after a snapshot is taken, and then re-deploy the Pod. Alternatively, you can deploy a new Pod with an identical specification. GKE automatically restores the Pod from the matching snapshot.

The following steps show how a Pod is restored from a matching snapshot by deleting and re-deploying the Pod:

  1. Delete the Pod:

    kubectl delete -f POD_NAME.yaml
    

    Replace POD_NAME with the name of your Pod, for example my-app.

  2. Re-apply the Pod:

    kubectl apply -f POD_NAME.yaml
    
  3. View the logs to confirm snapshot restore:

    kubectl logs my-app --namespace NAMESPACE
    

    The output depends on how you've configured your application. In the example application, the logs show GKE Pod Snapshot: restore when a restore operation occurs.
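
    For example, with the example application from this document, you can filter the logs for the restore marker:

    kubectl logs my-app --namespace NAMESPACE | grep "GKE Pod Snapshot: restore"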

Disable snapshots

Deleting a PodSnapshotPolicy resource prevents Pods from being snapshotted and restored. Running Pods are unaffected by the deletion. However, if you delete the policy while a Pod is being saved or restored, the Pod might enter a failed state.

To disable snapshotting and restoration for new Pods governed by a policy, delete the PodSnapshotPolicy by running the following command:

kubectl delete podsnapshotpolicies.podsnapshot.gke.io SNAPSHOT_POLICY --namespace=NAMESPACE

Replace SNAPSHOT_POLICY with the name of the PodSnapshotPolicy that you want to delete, for example example-pod-snapshot-policy.

You can also delete a specific PodSnapshot resource so that Pods are no longer restored from that specific snapshot. Deleting the PodSnapshot resource also removes the files stored in Cloud Storage.

To prevent a specific snapshot from being used for future restorations, delete the PodSnapshot object by running the following command:

kubectl delete podsnapshots.podsnapshot.gke.io POD_SNAPSHOT_NAME --namespace=NAMESPACE

Replace POD_SNAPSHOT_NAME with the name of the snapshot that you want to delete, for example example-podsnapshot.

What's next