Deploy GPU container workloads

This page describes how to deploy GPU container workloads on the Google Distributed Cloud (GDC) Sandbox AI Optimized SKU.

Deploy GPU container workloads

The GDC Sandbox AI Optimized SKU includes four NVIDIA A100 SXM4 80GB GPUs within the org-infra cluster. These GPUs are accessible using the resource name nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB. This section describes how to update a container configuration to use these GPUs.

The GPUs in the GDC Sandbox AI Optimized SKU are associated with a pre-configured project, sandbox-gpu-project. You must deploy your container in this project to make use of the GPUs.
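To confirm that the GPUs are visible to the cluster before deploying, you can inspect the nodes' allocatable resources. The following is a quick sketch; the exact output depends on your environment:

```shell
# List nodes that expose the A100 GPU resource and show the allocatable count.
kubectl --kubeconfig ${KUBECONFIG} describe nodes | grep nvidia.com/gpu-pod
```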

Before you begin

  • To run commands against the org infrastructure cluster, make sure that you have the kubeconfig of the org-1-infra cluster, as described in Work with clusters:

    • Configure and authenticate with the gdcloud command line.
    • Generate the kubeconfig file for the org infrastructure cluster, and assign its path to the environment variable KUBECONFIG.
  • To run the workloads, you must have the sandbox-gpu-admin role assigned. By default, the role is assigned to the platform-admin user. You can assign the role to other users by signing in as the platform-admin and running the following command:

    kubectl --kubeconfig ${KUBECONFIG} create rolebinding ${NAME} --role=sandbox-gpu-admin \
    --user=${USER} --namespace=sandbox-gpu-project
    
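To verify that a user has the expected permissions in the project after creating the role binding, you can use kubectl auth can-i. This is a sketch; adjust the verb and resource to match the operations your workload needs:

```shell
# Check whether the target user can create deployments in the GPU project.
kubectl --kubeconfig ${KUBECONFIG} auth can-i create deployments \
    --as=${USER} --namespace=sandbox-gpu-project
```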

Configure a container to use GPU resources

  1. Add the .containers.resources.requests and .containers.resources.limits fields to your container specification to request GPUs for the workload. Containers in the sandbox-gpu-project can request up to a total of four GPUs across the entire project. The following example requests one GPU as part of the container specification.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      namespace: sandbox-gpu-project
      labels:
        app: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx:latest
            resources:
              requests:
                nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1
              limits:
                nvidia.com/gpu-pod-NVIDIA_A100_SXM4_80GB: 1
    
  2. Containers also require additional permissions to access GPUs. For each container that requests GPUs, add the following permissions to your container spec:

    securityContext:
      seLinuxOptions:
        type: unconfined_t
    
  3. Apply your container manifest file:

    kubectl apply -f ${CONTAINER_MANIFEST_FILE_PATH} \
        -n sandbox-gpu-project \
        --kubeconfig ${KUBECONFIG}
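
After applying the manifest, you can confirm that the workload was scheduled and that the GPU request was honored. The commands below are a sketch; ${POD_NAME} is a placeholder for the name of a pod created by your deployment:

```shell
# Check that the pods in the GPU project are running.
kubectl --kubeconfig ${KUBECONFIG} get pods -n sandbox-gpu-project

# Inspect a pod to confirm the GPU resource request and limit were applied.
kubectl --kubeconfig ${KUBECONFIG} describe pod ${POD_NAME} \
    -n sandbox-gpu-project
```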