You can encrypt GPU workload data in use by running the workloads on encrypted Confidential Google Kubernetes Engine (GKE) Nodes. This page shows security engineers and operators how to improve security for the data in accelerated workloads, such as AI/ML tasks. Before you proceed, you should be familiar with Confidential GKE Nodes and with running GPU workloads in GKE.
About running GPU workloads on Confidential GKE Nodes
You can request Confidential GKE Nodes for your GPU workloads by using one of the following methods:
- Automatically provision Confidential GKE Nodes for your GPU workloads by using GKE ComputeClasses. You can use this method in Autopilot clusters and in Standard clusters. For more information, see the Use ComputeClasses to run confidential GPU workloads section.
- Manually configure Confidential GKE Nodes for your Standard clusters or node pools. For more information, see the Manually configure Confidential GKE Nodes in GKE Standard section.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`.
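If you use the gcloud CLI, you can also enable the Google Kubernetes Engine API from the command line instead of the console. The following sketch uses the standard service-enable command; the project ID is a placeholder.

```bash
# Enable the GKE API in your project (PROJECT_ID is a placeholder).
gcloud services enable container.googleapis.com --project=PROJECT_ID
```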
Requirements and limitations
Regardless of the Confidential GKE Nodes configuration method that you choose, you must meet all of the following requirements:
- The nodes must be in a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
- The nodes must use only one NVIDIA H100 80 GB GPU and the `a3-highgpu-1g` machine type.
- The nodes must use the Intel TDX Confidential Computing technology.
- You must have quota for preemptible H100 80 GB GPUs (`compute.googleapis.com/preemptible_nvidia_h100_gpus`) in your node locations. For more information about managing your quota, see View and manage quotas.
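To confirm that quota is available before you create node pools, you can inspect the regional quota with the gcloud CLI. The following sketch assumes that the regional quota metric is named `PREEMPTIBLE_NVIDIA_H100_GPUS`; check the command output for the exact metric name in your project, and replace the example region with one of your node locations.

```bash
# Show the preemptible H100 GPU quota for a region (us-central1 is only an example).
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --filter="quotas.metric=PREEMPTIBLE_NVIDIA_H100_GPUS" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)"
```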
In addition to these requirements, you must meet specific conditions depending on the Confidential GKE Nodes configuration method that you choose, as described in the following table:
Configuration method | Requirements | Limitations
---|---|---
ComputeClasses | An Autopilot or Standard cluster that runs GKE version 1.33.3-gke.1392000 or later. | Nodes are provisioned on Spot VMs.
Manual configuration in Standard mode | GKE version 1.32.2-gke.1297000 or later for manual GPU driver installation, or 1.33.3-gke.1392000 or later for automatic GPU driver installation. | Nodes must run on Spot VMs or use flex-start with queued provisioning.
Required roles
To get the permissions that you need to create Confidential GKE Nodes, ask your administrator to grant you the following IAM roles on the Google Cloud project:
- Create Confidential GKE Nodes: Kubernetes Engine Cluster Admin (`roles/container.clusterAdmin`)
- Deploy GPU workloads: Kubernetes Engine Developer (`roles/container.developer`)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
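As one way to grant these roles, an administrator can use a standard IAM policy binding command in the gcloud CLI. In the sketch below, the project ID and user email are placeholders.

```bash
# Grant the Kubernetes Engine Cluster Admin role (PROJECT_ID and USER_EMAIL are placeholders).
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.clusterAdmin"

# Grant the Kubernetes Engine Developer role for deploying GPU workloads.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.developer"
```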
Use ComputeClasses to run confidential GPU workloads
You can define your Confidential GKE Nodes configuration in a ComputeClass. ComputeClasses are Kubernetes custom resources that let you declaratively set node configurations for GKE autoscaling and scheduling. You can follow the steps in this section in any Autopilot or Standard cluster that runs GKE version 1.33.3-gke.1392000 or later.
To use a ComputeClass to run GPU workloads on Confidential GKE Nodes, follow these steps:
Save the following ComputeClass manifest as a YAML file:
```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: COMPUTECLASS_NAME
spec:
  nodePoolConfig:
    confidentialNodeType: TDX
  priorityDefaults:
    location:
      zones: ['ZONE1','ZONE2']
  priorities:
  - gpu:
      type: nvidia-h100-80gb
      count: 1
      driverVersion: default
    spot: true
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
  whenUnsatisfiable: DoNotScaleUp
```
Replace the following:
- `COMPUTECLASS_NAME`: a name for the ComputeClass.
- `ZONE1,ZONE2`: a comma-separated list of zones to create nodes in, such as `['us-central1-a','us-central1-b']`. Specify zones that support the Intel TDX Confidential Computing technology. For more information, see View supported zones.
Create the ComputeClass:
```bash
kubectl apply -f PATH_TO_MANIFEST
```

Replace `PATH_TO_MANIFEST` with the path to the ComputeClass manifest file.

To run your GPU workload on Confidential GKE Nodes, select the ComputeClass in the workload manifest. For example, save the following Deployment manifest, which selects a ComputeClass and GPUs, as a YAML file:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: confidential-gpu-deployment
  labels:
    app: conf-gpu
spec:
  selector:
    matchLabels:
      app: conf-gpu
  replicas: 1
  template:
    metadata:
      labels:
        app: conf-gpu
    spec:
      nodeSelector:
        cloud.google.com/compute-class: COMPUTECLASS_NAME
      containers:
      - name: example-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        resources:
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
```
Replace `COMPUTECLASS_NAME` with the name of the ComputeClass that you created.

Create the Deployment:
```bash
kubectl apply -f PATH_TO_DEPLOYMENT_MANIFEST
```

Replace `PATH_TO_DEPLOYMENT_MANIFEST` with the path to the Deployment manifest.
When you create your GPU workload, GKE uses the configuration in the ComputeClass to create Confidential GKE Nodes with attached GPUs.
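To confirm where the workload landed, you can check which node the Pod was scheduled on and whether that node carries the compute class label. This is an optional verification sketch; the `app=conf-gpu` label matches the example Deployment above.

```bash
# List the example Pod and the node it was scheduled on.
kubectl get pods -l app=conf-gpu -o wide

# Show nodes with their compute class label to confirm the node belongs to your ComputeClass.
kubectl get nodes -L cloud.google.com/compute-class
```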
Manually configure Confidential GKE Nodes in GKE Standard
You can run GPU workloads on Confidential GKE Nodes in Standard mode clusters or node pools. For GPU workloads, your Confidential GKE Nodes must use the Intel TDX Confidential Computing technology.
Enable Confidential GKE Nodes in new Standard clusters
You can enable Confidential GKE Nodes for your entire Standard cluster, so that every GPU node pool that you create uses the same Confidential Computing technology. When you create a new Standard mode cluster that uses Confidential GKE Nodes for GPU workloads, ensure that you specify the following cluster settings:
- Location: a region or a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
- Confidential Computing type: Intel TDX
- Cluster version: one of the following versions, depending on how you want to install your GPU drivers:
  - Manual GPU driver installation: 1.32.2-gke.1297000 or later.
  - Automatic GPU driver installation: 1.33.3-gke.1392000 or later.
You can optionally configure GPUs for the default node pool that GKE creates in your cluster. However, we recommend that you use a separate node pool for your GPUs, so that at least one node pool in the cluster can run any workload.
For more information, see Enable Confidential GKE Nodes on Standard clusters.
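As a rough sketch of those settings combined, a cluster-creation command might look like the following. This assumes that the cluster-level `--confidential-node-type` flag mirrors the node pool flag used later on this page; the cluster name, location, and version are placeholders, and the linked page is the authoritative procedure.

```bash
# Create a Standard cluster whose nodes use Intel TDX Confidential Computing
# (CLUSTER_NAME and CONTROL_PLANE_LOCATION are placeholders; the flag is assumed to exist at the cluster level).
gcloud container clusters create CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --confidential-node-type=tdx \
    --cluster-version=1.33.3-gke.1392000
```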
Use Confidential GKE Nodes with GPUs in Standard node pools
If your cluster doesn't have Confidential GKE Nodes enabled, you can enable Confidential GKE Nodes in specific new or existing GPU node pools. The control plane and node pools must meet the requirements in the Requirements and limitations section. When you configure the node pool, you can choose to install GPU drivers automatically or manually.
To create a new GPU node pool that uses Confidential GKE Nodes, select one of the following options:
Console
In the Google Cloud console, go to the Kubernetes clusters page:
Click the name of the Standard mode cluster to modify.
Click Add node pool. The Add a node pool page opens.

On the Node pool details pane, do the following:
- Select Specify node locations.
- Select only zones that support NVIDIA Confidential Computing. For more information, see View supported zones.
- Ensure that the control plane version is one of the versions listed in the Requirements and limitations section.
In the navigation menu, click Nodes.
On the Configure node settings pane, do the following:
- In the Machine configuration section, click GPUs.
- In the GPU type menu, select NVIDIA H100 80GB.
- In the Number of GPUs menu, select 1.
- Ensure that Enable GPU sharing isn't selected.
In the GPU Driver installation section, select one of the following options:
Google-managed: GKE automatically installs a driver. If you select this option, in the Version drop-down list, select one of the following driver versions:
- Default: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
- Latest: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
User-managed: skip automatic driver installation. If you select this option, you must manually install a compatible GPU driver. Requires 1.32.2-gke.1297000 or later.
In the Machine type section, ensure that the machine type is `a3-highgpu-1g`.

Select Enable nodes on spot VMs or configure flex-start with queued provisioning.
When you're ready to create the node pool, click Create.
gcloud
You can create GPU node pools that run Confidential GKE Nodes on Spot VMs or by using flex-start with queued provisioning.
Create a GPU node pool that runs Confidential GKE Nodes on Spot VMs:
```bash
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --confidential-node-type=tdx \
    --location=LOCATION \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --spot \
    --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION \
    --machine-type=a3-highgpu-1g
```
Replace the following:
- `NODE_POOL_NAME`: a name for your new node pool.
- `CLUSTER_NAME`: the name of your existing cluster.
- `LOCATION`: the location for your new node pool. The location must support using GPUs in Confidential GKE Nodes.
- `NODE_LOCATION1,NODE_LOCATION2,...`: a comma-separated list of zones to run the nodes in. These zones must support using NVIDIA Confidential Computing. For more information, see View supported zones.
- `DRIVER_VERSION`: the GPU driver version to install. Specify one of the following values:
  - `default`: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
  - `latest`: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
  - `disabled`: skip automatic driver installation. If you specify this value, you must manually install a compatible GPU driver. Requires GKE version 1.32.2-gke.1297000 or later.
Create a GPU node pool that runs Confidential GKE Nodes by using flex-start with queued provisioning:
```bash
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --machine-type=a3-highgpu-1g \
    --confidential-node-type=tdx \
    --location=LOCATION \
    --flex-start \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=TOTAL_MAX_NODES \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair \
    --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION
```
Replace `TOTAL_MAX_NODES` with the maximum number of nodes that the node pool can automatically scale to.

For more information about the configuration options in flex-start with queued provisioning, see Run a large-scale workload with flex-start with queued provisioning.
To update your existing node pools to use the Intel TDX Confidential Computing technology, see Update an existing node pool.
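That page covers the full update procedure. As a rough sketch, the change is driven by the same flag that the create commands on this page use; whether `gcloud container node-pools update` accepts `--confidential-node-type` in your gcloud CLI version is an assumption here, so verify it against the linked page before relying on it.

```bash
# Switch an existing node pool to Intel TDX (flag on the update command is assumed; see Update an existing node pool).
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --confidential-node-type=tdx
```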
Manually install GPU drivers that support Confidential GKE Nodes
If you didn't enable automatic driver installation when you created or updated your node pools, you must manually install a GPU driver that supports Confidential GKE Nodes.
This change requires recreating the nodes, which can cause disruption to your running workloads. For details about this specific change, find the corresponding row in the manual changes that recreate the nodes using a node upgrade strategy without respecting maintenance policies table. To learn more about node updates, see Planning for node update disruptions.
For instructions, see the "COS" tab in Manually install NVIDIA GPU drivers.
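For reference, manual installation on Container-Optimized OS nodes is typically done by applying NVIDIA's driver installer DaemonSet. The manifest URL below is the standard COS installer from the container-engine-accelerators repository and is shown only as an illustration; follow the linked page for the exact manifest and driver version that support Confidential GKE Nodes.

```bash
# Apply the standard COS NVIDIA driver installer DaemonSet (verify the correct manifest on the linked page).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```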
Troubleshoot
For troubleshooting information, see Troubleshoot GPUs in GKE.
What's next
- Verify that your GPU nodes use Confidential GKE Nodes
- Deploy a workload on your GPU nodes
- Learn about the methods to run large-scale workloads with GPUs