This page explains how to set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). On this page, you'll create clusters that can deploy GPU or TPU workloads, and manually install the drivers that you need to enable DRA.
This page is intended for platform administrators who want to reduce the complexity and overhead of setting up infrastructure with specialized hardware devices.
About DRA
DRA is a built-in Kubernetes feature that lets you flexibly request, allocate, and share hardware in your cluster among Pods and containers. For more information, see About dynamic resource allocation.
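For example, instead of requesting devices through container resource limits, a workload references a ResourceClaim that Kubernetes generates from a ResourceClaimTemplate. The following sketch shows the general shape of the v1beta1 API; the object names are illustrative, and the deviceClassName depends on which DRA driver you install later on this page:
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu          # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        # DeviceClass registered by a DRA driver; for example, the NVIDIA GPU
        # DRA driver typically registers gpu.nvidia.com
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example         # illustrative name
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: ubuntu
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: gpu-claim
Kubernetes allocates a device that matches the claim and makes it available to the container that references the claim.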
Limitations
- Node auto-provisioning isn't supported.
- Autopilot clusters don't support DRA.
- You can't use the following GPU sharing features:
- Time-sharing GPUs
- Multi-instance GPUs
- Multi-process Service (MPS)
Requirements
To use DRA, your GKE cluster must run version 1.32.1-gke.1489001 or later.
You should also be familiar with the requirements and limitations for the type of hardware that you want to use, such as GPUs or TPUs.
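If you already have a cluster that you plan to reuse, you can check its control plane version with a command like the following before you continue. Replace the placeholders with your cluster name and location:
gcloud container clusters describe CLUSTER_NAME \
--location LOCATION \
--format="value(currentMasterVersion)"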
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
If you're not using the Cloud Shell, install the Helm CLI:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
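To verify that the Helm CLI installed correctly, you can check the client version:
helm version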
Create a GKE Standard cluster
Create a Standard mode cluster that enables the Kubernetes beta APIs for DRA:
gcloud container clusters create CLUSTER_NAME \
--quiet \
--enable-kubernetes-unstable-apis="resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices" \
--release-channel=rapid \
--enable-autoupgrade \
--location LOCATION \
--num-nodes "1" \
--cluster-version GKE_VERSION \
--workload-pool="PROJECT_ID.svc.id.goog"
Replace the following:
- CLUSTER_NAME: a name for your cluster.
- LOCATION: the location for your cluster, such as us-central1.
- GKE_VERSION: the GKE version to use for the cluster and nodes. Must be 1.32.1-gke.1489001 or later.
- PROJECT_ID: your project ID.
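After the cluster is created, you can fetch credentials so that the kubectl and helm commands later on this page run against the new cluster:
gcloud container clusters get-credentials CLUSTER_NAME \
--location LOCATION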
Prepare your GKE environment to support DRA
On GKE, you can use DRA with both GPUs and TPUs. When you create your node pools, you must use the following settings that work with DRA during the Preview:
- For GPUs, disable automatic GPU driver installation.
- Add required access scopes for the nodes.
- Add node labels to run only DRA workloads on the nodes.
- Enable cluster autoscaling.
All other node pool configuration settings, such as machine type, accelerator type and count, node operating system, and node locations, depend on your requirements.
Prepare your environment for GPUs
Create a node pool with the required hardware:
gcloud beta container node-pools create "gpu-pool" \
--quiet \
--project PROJECT_ID \
--cluster CLUSTER_NAME \
--zone ZONE \
--node-version KUBERNETES_VERSION \
--machine-type "n1-standard-8" \
--accelerator "type=nvidia-tesla-t4,count=2,gpu-driver-version=disabled" \
--image-type "UBUNTU_CONTAINERD" \
--disk-type "pd-standard" \
--disk-size "100" \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" \
--enable-autoscaling \
--min-nodes "1" \
--max-nodes "6" \
--location-policy "ANY" \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--node-locations ZONE \
--node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true
Replace ZONE with a zone where the GPUs that you specify are available, and KUBERNETES_VERSION with a node version that meets the DRA version requirement.
Manually install the drivers on your Container-Optimized OS or Ubuntu nodes. For detailed instructions, refer to Manually install NVIDIA GPU drivers.
If you use Container-Optimized OS, run the following command to deploy the installation DaemonSet and install the default GPU driver version:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
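Before you continue, you can optionally watch the driver installation roll out. The DaemonSet name in the following command comes from the preceding manifest; adjust it if the manifest changes:
kubectl rollout status daemonset/nvidia-driver-installer -n kube-system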
If you use Ubuntu, the installation DaemonSet that you deploy depends on the GPU type and the GKE node version, as described in the Ubuntu section of those instructions.
Add the NVIDIA Helm repository, which contains the GPU operator chart, and update your local repository index:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
Create a namespace for the GPU operator:
kubectl create namespace gpu-operator
Create a ResourceQuota in the gpu-operator namespace. The ResourceQuota lets the GPU operator deploy Pods that have the same priority as Kubernetes system Pods.
kubectl apply -n gpu-operator -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF
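You can confirm that the quota exists before you install the operator:
kubectl get resourcequota gpu-operator-quota -n gpu-operator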
Install the GPU operator with Helm:
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator \
--set driver.enabled=false \
--set operator.repository=ghcr.io/nvidia \
--set operator.version=6171a52d \
--set validator.repository=ghcr.io/nvidia/gpu-operator \
--set validator.version=6171a52d \
--set toolkit.repository=ghcr.io/nvidia \
--set toolkit.version=5d9b27f1-ubuntu20.04 \
--set gfd.repository=ghcr.io/nvidia \
--set gfd.version=f171c926-ubi9 \
--set cdi.enabled=true \
--set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
--set toolkit.installDir=/home/kubernetes/bin/nvidia
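After the installation finishes, you can confirm that the GPU operator components are running:
kubectl get pods -n gpu-operator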
Prepare your environment for TPUs
Create a node pool that uses TPUs. The following example creates a TPU Trillium node pool:
gcloud container node-pools create NODEPOOL_NAME \
--cluster CLUSTER_NAME --num-nodes 1 \
--region REGION \
--node-labels "gke-no-default-tpu-device-plugin=true,gke-no-default-tpu-dra-plugin=true" \
--machine-type=ct6e-standard-8t \
--enable-autoupgrade
Replace NODEPOOL_NAME with a name for your node pool, and REGION with a region where the TPU machine type that you specify is available.
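Optionally, confirm that the new nodes registered with the cluster. GKE sets the cloud.google.com/gke-nodepool label on every node:
kubectl get nodes -l cloud.google.com/gke-nodepool=NODEPOOL_NAME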
Access and install DRA drivers
The following sections show you how to install DRA drivers for GPUs and TPUs. The DRA drivers let Kubernetes dynamically allocate the attached devices to workloads. You can install the DRA drivers for GPUs and TPUs with the provided Helm charts. To get access to the Helm charts, complete the following steps:
Clone the ai-on-gke repository to access the Helm charts that contain the DRA drivers for GPUs and TPUs:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke.git
Navigate to the directory that contains the charts:
cd ai-on-gke/charts
Install DRA drivers on GPUs
After you have access to the Helm chart that contains the DRA drivers, install the DRA driver for GPUs by running the command that corresponds to your node image type:
COS
helm upgrade -i --create-namespace --namespace nvidia nvidia-dra-driver-gpu nvidia-dra-driver-gpu/ \
--set image.repository=ghcr.io/nvidia/k8s-dra-driver-gpu \
--set image.tag=d1fad7ed-ubi9 \
--set image.pullPolicy=Always \
--set controller.priorityClassName="" \
--set kubeletPlugin.priorityClassName="" \
--set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
--set nvidiaCtkPath=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk \
--set deviceClasses="{gpu}" \
--set gpuResourcesEnabledOverride=true \
--set resources.computeDomains.enabled=false \
--set kubeletPlugin.tolerations[0].key=nvidia.com/gpu \
--set kubeletPlugin.tolerations[0].operator=Exists \
--set kubeletPlugin.tolerations[0].effect=NoSchedule \
--set kubeletPlugin.tolerations[1].key=cloud.google.com/compute-class \
--set kubeletPlugin.tolerations[1].operator=Exists \
--set kubeletPlugin.tolerations[1].effect=NoSchedule
Ubuntu
helm upgrade -i --create-namespace --namespace nvidia nvidia-dra-driver-gpu nvidia-dra-driver-gpu/ \
--set image.repository=ghcr.io/nvidia/k8s-dra-driver-gpu \
--set image.tag=d1fad7ed-ubi9 \
--set image.pullPolicy=Always \
--set controller.priorityClassName="" \
--set kubeletPlugin.priorityClassName="" \
--set nvidiaDriverRoot="/opt/nvidia" \
--set nvidiaCtkPath=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk \
--set deviceClasses="{gpu}" \
--set gpuResourcesEnabledOverride=true \
--set resources.computeDomains.enabled=false \
--set kubeletPlugin.tolerations[0].key=nvidia.com/gpu \
--set kubeletPlugin.tolerations[0].operator=Exists \
--set kubeletPlugin.tolerations[0].effect=NoSchedule \
--set kubeletPlugin.tolerations[1].key=cloud.google.com/compute-class \
--set kubeletPlugin.tolerations[1].operator=Exists \
--set kubeletPlugin.tolerations[1].effect=NoSchedule
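For either node image, you can confirm that the DRA driver Pods are running in the nvidia namespace that the chart uses:
kubectl get pods -n nvidia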
Install DRA drivers on TPUs
After you have access to the Helm chart that contains the drivers, install the TPU driver by running the provided installation script:
./tpu-dra-driver/install-tpu-dra-driver.sh
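The script reports where it installs the driver components. As a rough check, you can list Pods across namespaces and filter for the TPU DRA driver, assuming the chart keeps tpu-dra in the component names:
kubectl get pods --all-namespaces | grep tpu-dra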
Verify that your infrastructure is ready for DRA
Confirm that the ResourceSlice lists the hardware devices that you added:
kubectl get resourceslices -o yaml
If you used the example in the previous section, the ResourceSlice resembles the following, depending on the type of hardware you used:
GPU
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    # lines omitted for clarity
  spec:
    devices:
    - basic:
        attributes:
          architecture:
            string: Turing
          brand:
            string: Nvidia
          cudaComputeCapability:
            version: 7.5.0
          cudaDriverVersion:
            version: 12.2.0
          driverVersion:
            version: 535.230.2
          index:
            int: 0
          minor:
            int: 0
          productName:
            string: Tesla T4
          type:
            string: gpu
          uuid:
            string: GPU-2087ac7a-f781-8cd7-eb6b-b00943cc13ef
        capacity:
          memory:
            value: 15Gi
      name: gpu-0
TPU
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta1
  kind: ResourceSlice
  metadata:
    # lines omitted for clarity
  spec:
    devices:
    - basic:
        attributes:
          index:
            int: 0
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "0"
    - basic:
        attributes:
          index:
            int: 1
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "1"
    - basic:
        attributes:
          index:
            int: 2
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "2"
    - basic:
        attributes:
          index:
            int: 3
          tpuGen:
            string: v6e
          uuid:
            string: tpu-54de4859-dd8d-f67e-6f91-cf904d965454
      name: "3"
    driver: tpu.google.com
    nodeName: gke-tpu-b4d4b61b-fwbg
    pool:
      generation: 1
      name: gke-tpu-b4d4b61b-fwbg
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
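You can also list the DeviceClasses that the installed DRA drivers registered. The exact names depend on the drivers you installed; for example, the NVIDIA driver typically registers a class such as gpu.nvidia.com, and the TPU driver registers its own class:
kubectl get deviceclasses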