This page describes how to deploy GPU container workloads on the Google Distributed Cloud (GDC) Sandbox AI Optimized SKU.
Deploy GPU container workloads
The GDC Sandbox AI Optimized SKU includes four NVIDIA A100 SXM4 80GB GPUs within the org-infra cluster. These GPUs are accessible using the resource name nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB.
This section describes how to update a container configuration to use these GPUs.
The GPUs in the GDC Sandbox AI Optimized SKU are associated with a pre-configured project, sandbox-gpu-project. You must deploy your container in this project to make use of the GPUs.
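Before deploying, you can optionally confirm that the cluster advertises the GPU resource and that the project is reachable from your kubeconfig. The commands below are a minimal sketch; they assume KUBECONFIG already points at the org infrastructure cluster and that the project surfaces as a Kubernetes namespace of the same name, as the commands later on this page suggest.

# Optional check: confirm the nodes advertise the A100 GPU resource.
kubectl --kubeconfig ${KUBECONFIG} describe nodes | grep nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB
# Optional check: confirm the pre-configured project is reachable as a namespace.
kubectl --kubeconfig ${KUBECONFIG} get namespace sandbox-gpu-project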
Before you begin
To run commands against the org infrastructure cluster, make sure that you have the kubeconfig of the org-1-infra cluster, as described in Work with clusters:
- Configure and authenticate with the gdcloud command line.
- Generate the kubeconfig file for the org infrastructure cluster, and assign its path to the environment variable KUBECONFIG, as shown in the example after this list.
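For example, if the generated kubeconfig file was saved locally, you might set the variable as follows; the path is only a placeholder, not a value defined on this page.

# Example only: replace the path with the location of your generated kubeconfig file.
export KUBECONFIG="${HOME}/org-1-infra-kubeconfig.yaml"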
To run the workloads, you must have the sandbox-gpu-admin role assigned. By default, the role is assigned to the platform-admin user. You can assign the role to other users by signing in as platform-admin and running the following command:

kubectl --kubeconfig ${KUBECONFIG} create rolebinding ${NAME} --role=sandbox-gpu-admin \
  --user=${USER} --namespace=sandbox-gpu-project
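As an illustration, the following binds the role to a hypothetical user and then lists the role bindings in the project to confirm the binding exists; the binding name and user are placeholders, not values defined on this page.

# Example only: "gpu-admin-example" and "user@example.com" are placeholders.
kubectl --kubeconfig ${KUBECONFIG} create rolebinding gpu-admin-example \
  --role=sandbox-gpu-admin --user=user@example.com --namespace=sandbox-gpu-project
# Confirm the binding was created.
kubectl --kubeconfig ${KUBECONFIG} get rolebindings -n sandbox-gpu-project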
Configure a container to use GPU resources
Add the .containers.resources.requests and .containers.resources.limits fields to your container specification to request GPUs for the workload. Together, all containers in the sandbox-gpu-project can request up to a total of four GPUs across the entire project. The following example requests one GPU as part of the container specification.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: sandbox-gpu-project
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1
          limits:
            nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1
Containers also require additional permissions to access GPUs. For each container that requests GPUs, add the following permissions to your container spec:
securityContext:
  seLinuxOptions:
    type: unconfined_t
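For clarity, the snippet below sketches one way this securityContext block can sit alongside the GPU resource requests in the container entry of the earlier Deployment example; it shows placement only and is not a separate required manifest.

# Placement sketch: securityContext sits at the same level as resources
# inside each container that requests GPUs.
containers:
- name: nginx
  image: nginx:latest
  securityContext:
    seLinuxOptions:
      type: unconfined_t
  resources:
    requests:
      nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1
    limits:
      nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1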
Apply your container manifest file:
kubectl apply -f ${CONTAINER_MANIFEST_FILE_PATH} \
  -n sandbox-gpu-project \
  --kubeconfig ${KUBECONFIG}
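To verify the deployment, one possible check, assuming the example Deployment above, is to confirm that the pod is running and that the GPU resource appears in its requests and limits:

# Confirm the pod is running in the project.
kubectl --kubeconfig ${KUBECONFIG} get pods -n sandbox-gpu-project
# Confirm the GPU resource was allocated to the pod.
kubectl --kubeconfig ${KUBECONFIG} describe pods -n sandbox-gpu-project -l app=nginx | grep -i nvidia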