This page shows you how to use CUDA Multi-Process Service (MPS) to let multiple workloads share a single NVIDIA GPU hardware accelerator in your Google Kubernetes Engine (GKE) nodes.
Overview
NVIDIA MPS is a GPU sharing solution that allows multiple containers to share a single physical NVIDIA GPU attached to a node.
NVIDIA MPS relies on NVIDIA's Multi-Process Service on CUDA. NVIDIA MPS is an alternative, binary-compatible implementation of the CUDA API designed to transparently enable co-operative multi-process CUDA applications to run concurrently on a single GPU device.
With NVIDIA MPS, you can specify the maximum number of containers that share a physical GPU. This value determines how much of the physical GPU each container gets, in terms of active threads and pinned device memory.
To learn more about how GPUs are scheduled with NVIDIA MPS and when you should use CUDA MPS, see About GPU sharing solutions in GKE.
Who should use this guide
The instructions in this section apply to you if you are one of the following:
- Platform administrator: Creates and manages a GKE cluster, plans infrastructure and resourcing requirements, and monitors the cluster's performance.
- Application developer: Designs and deploys workloads on GKE clusters. If you want instructions for requesting NVIDIA MPS with GPUs, refer to Deploy workloads that use NVIDIA MPS with GPUs.
Requirements
- GKE version: You can enable GPU sharing with NVIDIA MPS on GKE Standard clusters running GKE version 1.27.7-gke.1088000 and later.
- GPU type: You can enable NVIDIA MPS for all NVIDIA Tesla GPU types.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Ensure that you have sufficient NVIDIA Tesla GPU quota. If you need more quota, refer to Requesting an increase in quota.
- Plan your GPU capacity based on the resource needs of the workloads and the capacity of the underlying GPU.
- Review the limitations of NVIDIA MPS with GPUs.
Enable NVIDIA MPS with GPUs on GKE clusters
As a platform administrator, you must enable NVIDIA MPS with GPUs on a GKE Standard cluster. Then, application developers can deploy workloads that use NVIDIA MPS with GPUs. To enable NVIDIA MPS with GPUs on GKE, do the following:
- Enable NVIDIA MPS with GPUs on a new GKE cluster.
- Install NVIDIA GPU device drivers (if required).
- Verify the GPU resources available on your nodes.
Enable NVIDIA MPS with GPUs on a GKE cluster
You can enable NVIDIA MPS with GPUs when you create GKE Standard clusters. The default node pool in the cluster has the feature enabled. You still need to enable NVIDIA MPS with GPUs when you manually create new node pools in that cluster.
Create a cluster with NVIDIA MPS enabled using the Google Cloud CLI:
gcloud container clusters create CLUSTER_NAME \
--region=COMPUTE_REGION \
--cluster-version=CLUSTER_VERSION \
--machine-type=MACHINE_TYPE \
--accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=mps,max-shared-clients-per-gpu=CLIENTS_PER_GPU,gpu-driver-version=DRIVER_VERSION
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- COMPUTE_REGION: the Compute Engine region for your new cluster. For zonal clusters, specify --zone=COMPUTE_ZONE. The GPU type that you use must be available in the selected zone.
- CLUSTER_VERSION: the GKE version for the cluster control plane and nodes. Use GKE version 1.27.7-gke.1088000 or later. Alternatively, specify a release channel with that GKE version by using the --release-channel=RELEASE_CHANNEL flag.
- MACHINE_TYPE: the Compute Engine machine type for your nodes.
  - For H200 GPUs, use the A3 Ultra machine type.
  - For H100 GPUs, use an A3 machine type other than Ultra (Mega, High, or Edge).
  - For A100 GPUs, use an A2 machine type.
  - For L4 GPUs, use a G2 machine type.
  - For all other GPUs, use an N1 machine type.
- GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
- GPU_QUANTITY: the number of physical GPUs to attach to each node in the default node pool.
- CLIENTS_PER_GPU: the maximum number of containers that can share each physical GPU.
- DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
  - default: Install the default driver version for your GKE version.
  - latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  - disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
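The --accelerator value in the command above is a comma-separated list of key=value pairs. As a minimal sketch, the following Python helper composes that value from the placeholders described above. The function name build_accelerator_flag and its parameter names are hypothetical and are not part of the gcloud CLI; only the key names and the three driver version options come from this page.

```python
# Hypothetical helper: composes the --accelerator flag value shown above.
# Only the key names and driver version options come from this page.
VALID_DRIVER_VERSIONS = ("default", "latest", "disabled")


def build_accelerator_flag(gpu_type, gpu_quantity, clients_per_gpu,
                           driver_version="default"):
    """Return the --accelerator flag for a node pool that shares GPUs with MPS."""
    if driver_version not in VALID_DRIVER_VERSIONS:
        raise ValueError(f"unsupported driver version: {driver_version}")
    fields = [
        f"type={gpu_type}",
        f"count={gpu_quantity}",
        "gpu-sharing-strategy=mps",
        f"max-shared-clients-per-gpu={clients_per_gpu}",
        f"gpu-driver-version={driver_version}",
    ]
    return "--accelerator=" + ",".join(fields)


# Example: one V100 per node, shared by up to three containers.
print(build_accelerator_flag("nvidia-tesla-v100", 1, 3))
```

The helper only builds a string. It does not check that the GPU type is available in your zone or that the machine type matches the GPU.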
Enable NVIDIA MPS with GPUs on a new node pool
You can enable NVIDIA MPS with GPUs when you manually create new node pools in a GKE cluster. Create a node pool with NVIDIA MPS enabled using the Google Cloud CLI:
gcloud container node-pools create NODEPOOL_NAME \
--cluster=CLUSTER_NAME \
--machine-type=MACHINE_TYPE \
--region=COMPUTE_REGION \
--accelerator=type=GPU_TYPE,count=GPU_QUANTITY,gpu-sharing-strategy=mps,max-shared-clients-per-gpu=CONTAINER_PER_GPU,gpu-driver-version=DRIVER_VERSION
Replace the following:
- NODEPOOL_NAME: the name of your new node pool.
- CLUSTER_NAME: the name of your cluster, which must run GKE version 1.27.7-gke.1088000 or later.
- COMPUTE_REGION: the Compute Engine region of your cluster. For zonal clusters, specify --zone=COMPUTE_ZONE.
- MACHINE_TYPE: the Compute Engine machine type for your nodes. For A100 GPUs, use an A2 machine type. For all other GPUs, use an N1 machine type.
- GPU_TYPE: the GPU type, which must be an NVIDIA Tesla GPU platform such as nvidia-tesla-v100.
- GPU_QUANTITY: the number of physical GPUs to attach to each node in the node pool.
- CONTAINER_PER_GPU: the maximum number of containers that can share each physical GPU.
- DRIVER_VERSION: the NVIDIA driver version to install. Can be one of the following:
  - default: Install the default driver version for your GKE version.
  - latest: Install the latest available driver version for your GKE version. Available only for nodes that use Container-Optimized OS.
  - disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool. If you omit gpu-driver-version, this is the default option.
Install NVIDIA GPU device drivers
If you chose to disable automatic driver installation when creating the cluster, or if you use a GKE version earlier than 1.27.2-gke.1200, you must manually install a compatible NVIDIA driver to manage the NVIDIA MPS division of the physical GPUs. To install the drivers, you deploy a GKE installation DaemonSet that sets the drivers up.
For instructions, refer to Installing NVIDIA GPU device drivers.
Verify the GPU resources available
You can verify that the number of GPUs in your nodes matches the number you specified when you enabled NVIDIA MPS. You can also verify that the NVIDIA MPS control daemon is running.
Verify the GPU resources available on your nodes
To verify the GPU resources available on your nodes, run the following command:
kubectl describe nodes NODE_NAME
Replace NODE_NAME with the name of your node.
The output is similar to the following:
...
Capacity:
...
nvidia.com/gpu: 3
Allocatable:
...
nvidia.com/gpu: 3
In this output, the number of GPU resources on the node is 3 because of the following values:
- The value in max-shared-clients-per-gpu is 3.
- The count of physical GPUs to attach to the node is 1. If the count of physical GPUs was 2, the output would show 6 allocatable GPU resources, three on each physical GPU.
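The arithmetic behind the nvidia.com/gpu value can be sketched in a few lines of Python. The function name allocatable_gpu_resources is hypothetical and used only for illustration; the multiplication follows the explanation above.

```python
def allocatable_gpu_resources(physical_gpu_count, max_shared_clients_per_gpu):
    """Return the number of nvidia.com/gpu resources a node advertises with MPS.

    Each physical GPU is split into max_shared_clients_per_gpu sharing units.
    """
    return physical_gpu_count * max_shared_clients_per_gpu


print(allocatable_gpu_resources(1, 3))  # 3, as in the sample output
print(allocatable_gpu_resources(2, 3))  # 6, three on each physical GPU
```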
Verify that the MPS control daemon is running
The GPU device plugin performs a health check on the MPS control daemon. When the MPS control daemon is healthy, you can deploy a container.
To verify the MPS status, run the following command:
kubectl logs -l k8s-app=nvidia-gpu-device-plugin -n kube-system --tail=100 | grep MPS
The output is similar to the following:
I1118 08:08:41.732875 1 nvidia_gpu.go:75] device-plugin started
...
I1110 18:57:54.224832 1 manager.go:285] MPS is healthy, active thread percentage = 100.0
...
In the output, you might see the following:
- The failed to start GPU device manager error precedes the MPS is healthy message. This error is transient. If you see the MPS is healthy message, then the control daemon is running.
- The active thread percentage = 100.0 message means that the active thread percentage for the whole physical GPU resource is 100%.
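If you want to check the health message in a script instead of reading the logs by eye, the following Python sketch scans log text for the MPS is healthy line. It assumes that the message format matches the sample output on this page; the function name mps_health is hypothetical, and the device plugin's log format might differ between versions.

```python
import re

# Assumes the message format shown in the sample output on this page.
HEALTHY_PATTERN = re.compile(
    r"MPS is healthy, active thread percentage = ([0-9.]+)"
)


def mps_health(log_text):
    """Return the last reported active thread percentage, or None.

    None means that no "MPS is healthy" message was found in the log text.
    """
    matches = HEALTHY_PATTERN.findall(log_text)
    return float(matches[-1]) if matches else None


sample = (
    "I1118 08:08:41.732875 1 nvidia_gpu.go:75] device-plugin started\n"
    "I1110 18:57:54.224832 1 manager.go:285] MPS is healthy, "
    "active thread percentage = 100.0\n"
)
print(mps_health(sample))  # 100.0
```

You could feed this function the output of the kubectl logs command shown earlier, without the grep filter.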
Deploy workloads that use MPS
As an application operator who is deploying GPU workloads, you can tell GKE to share MPS sharing units in the same physical GPU. In the following example, you request one physical GPU and set max-shared-clients-per-gpu=3. The physical GPU gets three MPS sharing units, and the manifest starts an nvidia/samples:nbody Job with three Pods (containers) running in parallel.
Save the manifest as gpu-mps.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: nbody-sample
spec:
  completions: 3
  parallelism: 3
  template:
    spec:
      hostIPC: true
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: mps
      containers:
        - name: nbody-sample
          image: nvidia/samples:nbody
          command: ["/tmp/nbody"]
          args: ["-benchmark", "-i=5000"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: "Never"
  backoffLimit: 1
In this manifest:
- hostIPC: true enables Pods to talk to the MPS control daemon. It is required. However, consider that the hostIPC: true configuration allows containers to access host resources, which introduces security risks.
- 5,000 iterations run in benchmark mode.
Apply the manifest:
kubectl apply -f gpu-mps.yaml
Verify that all Pods are running:
kubectl get pods
The output is similar to the following:
NAME                           READY   STATUS    RESTARTS   AGE
nbody-sample-6948ff4484-54p6q  1/1     Running   0          2m6s
nbody-sample-6948ff4484-5qs6n  1/1     Running   0          2m6s
nbody-sample-6948ff4484-5zpdc  1/1     Running   0          2m5s
Check the logs from Pods to verify the Job completed:
kubectl logs -l job-name=nbody-sample -f
The output is similar to the following:
...
> Compute 8.9 CUDA device: [NVIDIA L4]
18432 bodies, total time for 5000 iterations: 9907.976 ms
= 171.447 billion interactions per second
= 3428.941 single-precision GFLOP/s at 20 flops per interaction
...
Because the Job runs 5,000 iterations, the logs might take several minutes to appear.
Clean up
Delete the Jobs and all of their Pods by running the following command:
kubectl delete job --all
Limit pinned device memory and active thread with NVIDIA MPS
By default, when you use GPUs with NVIDIA MPS on GKE, the following CUDA environment variables are injected into the GPU workload:
- CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: This variable indicates the percentage of available threads that each MPS sharing unit can use. By default, each MPS sharing unit of the GPU is set to 100 / MaxSharedClientsPerGPU to get an equal slice of the GPU compute in terms of streaming multiprocessors (SMs).
- CUDA_MPS_PINNED_DEVICE_MEM_LIMIT: This variable limits the amount of GPU memory that can be allocated by an MPS sharing unit of the GPU. By default, each MPS sharing unit of the GPU is set to total mem / MaxSharedClientsPerGPU to get an equal slice of the GPU memory.
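The two default formulas above can be worked through in a short Python sketch. The function name default_mps_limits and the input values are illustrative assumptions. The values that a workload actually observes can differ slightly from this arithmetic, because some memory is reserved and SM counts are whole numbers.

```python
def default_mps_limits(total_mem_mb, max_shared_clients_per_gpu):
    """Return the default (thread percentage, memory limit in MB) per sharing unit.

    Implements 100 / MaxSharedClientsPerGPU and total mem / MaxSharedClientsPerGPU.
    """
    if max_shared_clients_per_gpu < 1:
        raise ValueError("max_shared_clients_per_gpu must be at least 1")
    thread_percentage = 100 / max_shared_clients_per_gpu
    mem_limit_mb = total_mem_mb / max_shared_clients_per_gpu
    return thread_percentage, mem_limit_mb


# Illustrative input: a GPU that reports 22491 MB, shared by three clients.
pct, mem = default_mps_limits(22491, 3)
print(f"{pct:.1f}% active threads, {mem:.0f} MB pinned memory")
```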
To set resource limits for your GPU workloads, configure these NVIDIA MPS environment variables:
Review and build the image of the cuda-mps example in GitHub.
Save the following manifest as cuda-mem-and-sm-count.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-mem-and-sm-count
spec:
  hostIPC: true
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: mps
  containers:
    - name: cuda-mem-and-sm-count
      image: CUDA_MPS_IMAGE
      securityContext:
        privileged: true
      resources:
        limits:
          nvidia.com/gpu: 1
Replace CUDA_MPS_IMAGE with the name of the image that you built for the cuda-mps example.
NVIDIA MPS requires that you set hostIPC:true on Pods. The hostIPC:true configuration allows containers to access host resources, which introduces security risks.
Apply the manifest:
kubectl apply -f cuda-mem-and-sm-count.yaml
Check the logs for this Pod:
kubectl logs cuda-mem-and-sm-count
In an example that uses an NVIDIA Tesla L4 with gpu-sharing-strategy=mps and max-shared-clients-per-gpu=3, the output is similar to the following:
For device 0: Free memory: 7607 M, Total memory: 22491 M
For device 0: multiProcessorCount: 18
In this example, the NVIDIA Tesla L4 GPU has 60 SMs and 24 GB of memory. Each MPS sharing unit gets roughly 33% of the active threads and 8 GB of memory.
Update the manifest to request 2 nvidia.com/gpu:
resources:
  limits:
    nvidia.com/gpu: 2
The output is similar to the following:
For device 0: Free memory: 15230 M, Total memory: 22491 M
For device 0: multiProcessorCount: 38
Update the manifest to override the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT variables:
env:
  - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    value: "20"
  - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
    value: "0=8000M"
The output is similar to the following:
For device 0: Free memory: 7952 M, Total memory: 22491 M
For device 0: multiProcessorCount: 10
Limitations
- MPS on pre-Volta GPUs (P100) has limited capabilities compared with GPU types in and after Volta.
- With NVIDIA MPS, GKE ensures that each container gets limited pinned device memory and active thread. However, other resources like memory bandwidth, encoders or decoders are not captured as part of these resource limits. As a result, containers might negatively affect the performance of other containers if they are all requesting the same unlimited resource.
- NVIDIA MPS has memory protection and error containment limitations. We recommend that you evaluate these limitations to ensure compatibility with your workloads.
- NVIDIA MPS requires that you set hostIPC:true on Pods. The hostIPC:true configuration allows containers to access host resources, which introduces security risks.
- GKE might reject certain GPU requests when using NVIDIA MPS, to prevent unexpected behavior during capacity allocation.
- The maximum number of containers that can share a single physical GPU with NVIDIA MPS is 48 (pre-Volta GPUs support only 16). When planning your NVIDIA MPS configuration, consider the resource needs of your workloads and the capacity of the underlying physical GPUs to optimize your performance and responsiveness.
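As a planning aid, the sharing limit above can be checked before you create a node pool. The following Python sketch is illustrative only: the function and constant names are hypothetical, and the lower bound of 1 is an assumption that this page does not state. The upper bounds of 48 and 16 come from the limitation above.

```python
# Upper bounds taken from the limitation above.
MAX_CLIENTS_VOLTA_AND_LATER = 48
MAX_CLIENTS_PRE_VOLTA = 16


def validate_clients_per_gpu(clients_per_gpu, pre_volta=False):
    """Return clients_per_gpu if it is within the documented MPS limit.

    The lower bound of 1 is an assumption made for this sketch.
    """
    limit = MAX_CLIENTS_PRE_VOLTA if pre_volta else MAX_CLIENTS_VOLTA_AND_LATER
    if not 1 <= clients_per_gpu <= limit:
        raise ValueError(
            f"max-shared-clients-per-gpu must be between 1 and {limit}, "
            f"got {clients_per_gpu}"
        )
    return clients_per_gpu


print(validate_clients_per_gpu(3))  # 3
```

A check like this catches only the documented ceiling. It does not tell you whether your workloads fit in the resulting slice of memory and compute.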
What's next
- For more information about the GPU sharing strategies available in GKE, see About GPU sharing strategies in GKE.
- For more information about Multi-Process Service (MPS), refer to the NVIDIA documentation.