You can encrypt GPU workload data in use by running the workloads on encrypted Confidential Google Kubernetes Engine (GKE) Nodes. This page shows security engineers and operators how to improve security for the data in accelerated workloads, such as AI/ML tasks. Before you proceed, you should be familiar with Confidential GKE Nodes and with running GPU workloads in GKE.
About running GPU workloads on Confidential GKE Nodes
You can request Confidential GKE Nodes for your GPU workloads by using one of the following methods:
- Automatically provision Confidential GKE Nodes for your GPU workloads by using GKE ComputeClasses. You can use this method in Autopilot clusters and in Standard clusters. For more information, see the Use ComputeClasses to run confidential GPU workloads section.
- Manually configure Confidential GKE Nodes for your Standard clusters or node pools. For more information, see the Manually configure Confidential GKE Nodes in GKE Standard section.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`.
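If you use the gcloud CLI, you can also enable the Google Kubernetes Engine API from the command line instead of the console. The following sketch uses the standard service-enable command; the project ID is a placeholder.

```bash
# Enable the GKE API in your project (PROJECT_ID is a placeholder).
gcloud services enable container.googleapis.com --project=PROJECT_ID
```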
Requirements and limitations
Regardless of the Confidential GKE Nodes configuration method that you choose, you must meet all of the following requirements:
- The nodes must be in a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
- The nodes must use only one NVIDIA H100 80 GB GPU and the `a3-highgpu-1g` machine type.
- The nodes must use the Intel TDX Confidential Computing technology.
- You must have quota for preemptible H100 80 GB GPUs (`compute.googleapis.com/preemptible_nvidia_h100_gpus`) in your node locations. For more information about managing your quota, see View and manage quotas.
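To confirm that quota is available before you create node pools, you can inspect the regional quota with the gcloud CLI. The following sketch assumes that the regional quota metric is named `PREEMPTIBLE_NVIDIA_H100_GPUS`; check the command output for the exact metric name in your project, and replace the example region with one of your node locations.

```bash
# Show the preemptible H100 GPU quota for a region (us-central1 is only an example).
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --filter="quotas.metric=PREEMPTIBLE_NVIDIA_H100_GPUS" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)"
```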
In addition to these requirements, you must meet specific conditions depending on the Confidential GKE Nodes configuration method that you choose, as described in the following table:
Configuration method | Requirements | Limitations
---|---|---
ComputeClasses | An Autopilot or Standard cluster that runs GKE version 1.33.3-gke.1392000 or later. | Nodes are provisioned on Spot VMs.
Manual configuration in Standard mode | GKE version 1.32.2-gke.1297000 or later for manual GPU driver installation, or 1.33.3-gke.1392000 or later for automatic GPU driver installation. | Nodes must run on Spot VMs or use flex-start with queued provisioning.
Required roles
To get the permissions that you need to create Confidential GKE Nodes, ask your administrator to grant you the following IAM roles on the Google Cloud project:
- Create Confidential GKE Nodes: Kubernetes Engine Cluster Admin (`roles/container.clusterAdmin`)
- Deploy GPU workloads: Kubernetes Engine Developer (`roles/container.developer`)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
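As one way to grant these roles, an administrator can use a standard IAM policy binding command in the gcloud CLI. In the sketch below, the project ID and user email are placeholders.

```bash
# Grant the Kubernetes Engine Cluster Admin role (PROJECT_ID and USER_EMAIL are placeholders).
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.clusterAdmin"

# Grant the Kubernetes Engine Developer role for deploying GPU workloads.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.developer"
```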
Use ComputeClasses to run confidential GPU workloads
You can define your Confidential GKE Nodes configuration in a ComputeClass. ComputeClasses are Kubernetes custom resources that let you declaratively set node configurations for GKE autoscaling and scheduling. You can follow the steps in this section in any Autopilot or Standard cluster that runs GKE version 1.33.3-gke.1392000 or later.
To use a ComputeClass to run GPU workloads on Confidential GKE Nodes, follow these steps:
Save the following ComputeClass manifest as a YAML file:
```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: COMPUTECLASS_NAME
spec:
  nodePoolConfig:
    confidentialNodeType: TDX
  priorityDefaults:
    location:
      zones: ['ZONE1','ZONE2']
  priorities:
  - gpu:
      type: nvidia-h100-80gb
      count: 1
      driverVersion: default
    spot: true
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
  whenUnsatisfiable: DoNotScaleUp
```
Replace the following:
- `COMPUTECLASS_NAME`: a name for the ComputeClass.
- `ZONE1,ZONE2`: a comma-separated list of zones to create nodes in, such as `['us-central1-a','us-central1-b']`. Specify zones that support the Intel TDX Confidential Computing technology. For more information, see View supported zones.
Create the ComputeClass:
```bash
kubectl apply -f PATH_TO_MANIFEST
```

Replace `PATH_TO_MANIFEST` with the path to the ComputeClass manifest file.

To run your GPU workload on Confidential GKE Nodes, select the ComputeClass in the workload manifest. For example, save the following Deployment manifest, which selects a ComputeClass and GPUs, as a YAML file:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: confidential-gpu-deployment
  labels:
    app: conf-gpu
spec:
  selector:
    matchLabels:
      app: conf-gpu
  replicas: 1
  template:
    metadata:
      labels:
        app: conf-gpu
    spec:
      nodeSelector:
        cloud.google.com/compute-class: COMPUTECLASS_NAME
      containers:
      - name: example-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        resources:
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
```
Replace `COMPUTECLASS_NAME` with the name of the ComputeClass that you created.

Create the Deployment:
```bash
kubectl apply -f PATH_TO_DEPLOYMENT_MANIFEST
```

Replace `PATH_TO_DEPLOYMENT_MANIFEST` with the path to the Deployment manifest.
When you create your GPU workload, GKE uses the configuration in the ComputeClass to create Confidential GKE Nodes with attached GPUs.
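To confirm where the workload landed, you can check which node the Pod was scheduled on and whether that node carries the compute class label. This is an optional verification sketch; the `app=conf-gpu` label matches the example Deployment above.

```bash
# List the example Pod and the node it was scheduled on.
kubectl get pods -l app=conf-gpu -o wide

# Show nodes with their compute class label to confirm the node belongs to your ComputeClass.
kubectl get nodes -L cloud.google.com/compute-class
```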
Manually configure Confidential GKE Nodes in GKE Standard
You can run GPU workloads on Confidential GKE Nodes in Standard mode clusters or node pools. For GPU workloads, your Confidential GKE Nodes must use the Intel TDX Confidential Computing technology.
Enable Confidential GKE Nodes in new Standard clusters
You can enable Confidential GKE Nodes for your entire Standard cluster, so that every GPU node pool that you create uses the same Confidential Computing technology. When you create a new Standard mode cluster that uses Confidential GKE Nodes for GPU workloads, ensure that you specify the following cluster settings:
- Location: a region or a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
- Confidential Computing type: Intel TDX
- Cluster version: one of the following versions, depending on how you want to install your GPU drivers:
  - Manual GPU driver installation: 1.32.2-gke.1297000 or later.
  - Automatic GPU driver installation: 1.33.3-gke.1392000 or later.
You can optionally configure GPUs for the default node pool that GKE creates in your cluster. However, we recommend that you use a separate node pool for your GPUs, so that at least one node pool in the cluster can run any workload.
For more information, see Enable Confidential GKE Nodes on Standard clusters.
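As a rough sketch of those settings combined, a cluster-creation command might look like the following. This assumes that the cluster-level `--confidential-node-type` flag mirrors the node pool flag used later on this page; the cluster name, location, and version are placeholders, and the linked page is the authoritative procedure.

```bash
# Create a Standard cluster whose nodes use Intel TDX Confidential Computing
# (CLUSTER_NAME and CONTROL_PLANE_LOCATION are placeholders; the flag is assumed to exist at the cluster level).
gcloud container clusters create CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --confidential-node-type=tdx \
    --cluster-version=1.33.3-gke.1392000
```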
Use Confidential GKE Nodes with GPUs in Standard node pools
If your cluster doesn't have Confidential GKE Nodes enabled, you can enable Confidential GKE Nodes in specific new or existing GPU node pools. The control plane and node pools must meet the requirements in the Requirements and limitations section. When you configure the node pool, you can choose to install GPU drivers automatically or manually.
To create a new GPU node pool that uses Confidential GKE Nodes, select one of the following options:
Console
In the Google Cloud console, go to the Kubernetes clusters page:
Click the name of the Standard mode cluster to modify.
Click Add node pool. The Add a node pool page opens.

On the Node pool details pane, do the following:
- Select Specify node locations.
- Select only zones that support NVIDIA Confidential Computing. For more information, see View supported zones.
- Ensure that the control plane version is one of the versions listed in the Requirements and limitations section.
In the navigation menu, click Nodes.
On the Configure node settings pane, do the following:
- In the Machine configuration section, click GPUs.
- In the GPU type menu, select NVIDIA H100 80GB.
- In the Number of GPUs menu, select 1.
- Ensure that Enable GPU sharing isn't selected.
In the GPU Driver installation section, select one of the following options:
Google-managed: GKE automatically installs a driver. If you select this option, in the Version drop-down list, select one of the following driver versions:
- Default: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
- Latest: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
User-managed: skip automatic driver installation. If you select this option, you must manually install a compatible GPU driver. Requires 1.32.2-gke.1297000 or later.
In the Machine type section, ensure that the machine type is `a3-highgpu-1g`.

Select Enable nodes on spot VMs or configure flex-start with queued provisioning.
When you're ready to create the node pool, click Create.
gcloud
You can create GPU node pools that run Confidential GKE Nodes on Spot VMs or by using flex-start with queued provisioning.
Create a GPU node pool that runs Confidential GKE Nodes on Spot VMs:
```bash
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --confidential-node-type=tdx \
    --location=LOCATION \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --spot \
    --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION \
    --machine-type=a3-highgpu-1g
```
Replace the following:
- `NODE_POOL_NAME`: a name for your new node pool.
- `CLUSTER_NAME`: the name of your existing cluster.
- `LOCATION`: the location for your new node pool. The location must support using GPUs in Confidential GKE Nodes.
- `NODE_LOCATION1,NODE_LOCATION2,...`: a comma-separated list of zones to run the nodes in. These zones must support using NVIDIA Confidential Computing. For more information, see View supported zones.
- `DRIVER_VERSION`: the GPU driver version to install. Specify one of the following values:
  - `default`: install the default driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
  - `latest`: install the latest driver version for the node GKE version. Requires GKE version 1.33.3-gke.1392000 or later.
  - `disabled`: skip automatic driver installation. If you specify this value, you must manually install a compatible GPU driver. Requires GKE version 1.32.2-gke.1297000 or later.
Create a GPU node pool that runs Confidential GKE Nodes by using flex-start with queued provisioning:
```bash
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --machine-type=a3-highgpu-1g \
    --confidential-node-type=tdx \
    --location=LOCATION \
    --flex-start \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=TOTAL_MAX_NODES \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair \
    --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=DRIVER_VERSION
```
Replace `TOTAL_MAX_NODES` with the maximum number of nodes that the node pool can automatically scale to.

For more information about the configuration options in flex-start with queued provisioning, see Run a large-scale workload with flex-start with queued provisioning.
To update your existing node pools to use the Intel TDX Confidential Computing technology, see Update an existing node pool.
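That page covers the full update procedure. As a rough sketch, the change is driven by the same flag that the create commands on this page use; whether `gcloud container node-pools update` accepts `--confidential-node-type` in your gcloud CLI version is an assumption here, so verify it against the linked page before relying on it.

```bash
# Switch an existing node pool to Intel TDX (flag on the update command is assumed; see Update an existing node pool).
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --confidential-node-type=tdx
```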
Manually install GPU drivers that support Confidential GKE Nodes
If you didn't enable automatic driver installation when you created or updated your node pools, you must manually install a GPU driver that supports Confidential GKE Nodes.
This change requires recreating the nodes, which can cause disruption to your running workloads. For details about this specific change, find the corresponding row in the manual changes that recreate the nodes using a node upgrade strategy without respecting maintenance policies table. To learn more about node updates, see Planning for node update disruptions.
For instructions, see the "COS" tab in Manually install NVIDIA GPU drivers.
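For reference, manual installation on Container-Optimized OS nodes is typically done by applying NVIDIA's driver installer DaemonSet. The manifest URL below is the standard COS installer from the container-engine-accelerators repository and is shown only as an illustration; follow the linked page for the exact manifest and driver version that support Confidential GKE Nodes.

```bash
# Apply the standard COS NVIDIA driver installer DaemonSet (verify the correct manifest on the linked page).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```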
Troubleshoot
For troubleshooting information, see Troubleshoot GPUs in GKE.
What's next
- Verify that your GPU nodes use Confidential GKE Nodes
- Deploy a workload on your GPU nodes
- Learn about the methods to run large-scale workloads with GPUs