GPU (services)

This page describes GPU configuration for your Cloud Run service. GPUs work well for AI inference workloads, such as large language models (LLMs), as well as compute-intensive non-AI use cases such as video transcoding and 3D rendering. Google provides NVIDIA L4 GPUs with 24 GB of GPU memory (VRAM), which is separate from the instance memory.

To use the GPU feature, you must request the Total Nvidia L4 GPU allocation, per project per region quota under Cloud Run Admin API on the Quotas and system limits page.

GPU on Cloud Run is fully managed, with no extra drivers or libraries needed. The GPU feature offers on-demand availability with no reservations needed, similar to the way on-demand CPU and on-demand memory work in Cloud Run. Instances of a Cloud Run service that has been configured to use GPU can scale down to zero for cost savings when not in use.

Cloud Run instances with an attached L4 GPU with drivers pre-installed start in approximately 5 seconds, at which point the processes running in your container can start to use the GPU.

You can configure one GPU per Cloud Run instance. If you use sidecar containers, note that the GPU can only be attached to one container.

Supported regions

Pricing impact

See Cloud Run pricing for GPU pricing details. Note the following important considerations:

  • There are no per-request fees. Because you must use the CPU always allocated setting to use the GPU feature, minimum instances are charged at the full rate even when idle.
  • You must use a minimum of 4 CPU and 16 GiB of memory.
  • GPU is billed for the entire duration of the instance lifecycle.

Supported GPU types

You can use one L4 GPU per Cloud Run instance. An L4 GPU comes with the following driver pre-installed:

  • The current NVIDIA driver version: 535.129.03 (CUDA 12.2)

Before you begin

Complete the following steps to set up your environment and meet the requirements for using GPUs in Cloud Run:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Cloud Run API.

    Enable the API

  5. To use the GPU feature, you must request the Total Nvidia L4 GPU allocation, per project per region quota under Cloud Run Admin API on the Quotas and system limits page.
  6. Consult Best practices: AI inference on Cloud Run with GPUs for recommendations on building your container image and loading large models.
  7. Make sure your Cloud Run service has the following configurations:

    • CPU is set to always allocated
    • A minimum of 4 CPU and 16 GiB of memory
    • Maximum instances set at or below the GPU quota allocated for your project per region
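
As an alternative to the console flow above, you can run the project setup steps from the command line. The following is a minimal sketch; it assumes the gcloud CLI is installed and authenticated, and PROJECT_ID is a placeholder for your project:

  # Set the active project.
  gcloud config set project PROJECT_ID

  # Enable the Cloud Run API.
  gcloud services enable run.googleapis.com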

Required roles

To get the permissions that you need to configure and deploy Cloud Run services, ask your administrator to grant you the following IAM roles:

  • Cloud Run Developer (roles/run.developer) on the Cloud Run service
  • Service Account User (roles/iam.serviceAccountUser) on the service identity

For a list of IAM roles and permissions that are associated with Cloud Run, see Cloud Run IAM roles and Cloud Run IAM permissions. If your Cloud Run service interfaces with Google Cloud APIs, such as Cloud Client Libraries, see the service identity configuration guide. For more information about granting roles, see deployment permissions and manage access.
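
For illustration, granting these roles from the command line might look like the following sketch; USER_EMAIL, PROJECT_ID, and SERVICE_ACCOUNT_EMAIL are hypothetical placeholders, and the caller is assumed to have permission to modify IAM policy:

  # Grant the Cloud Run Developer role on the project.
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=user:USER_EMAIL \
    --role=roles/run.developer

  # Allow the user to act as the service identity used by the service.
  gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
    --member=user:USER_EMAIL \
    --role=roles/iam.serviceAccountUser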

Configure a Cloud Run service with GPU

Any configuration change leads to the creation of a new revision. Subsequent revisions will also automatically get this configuration setting unless you make explicit updates to change it.

You can use the Google Cloud console, the Google Cloud CLI, or YAML to configure GPU.

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Click Deploy container and select Service to configure a new service. If you are configuring an existing service, click the service, then click Edit and deploy new revision.

  3. If you are configuring a new service, fill out the initial service settings page, then click Container(s), volumes, networking, security to expand the service configuration page.

  4. Click the Container tab.

    • Configure CPU, memory, concurrency, execution environment, and startup probe following the recommendations in Before you begin.
    • Check the GPU checkbox, then select the GPU type from the GPU type menu, and the number of GPUs from the Number of GPUs menu.
  5. Click Create or Deploy.

gcloud

To deploy a service with GPU, or to update the GPU setting for an existing service, use the gcloud beta run deploy command:

  gcloud beta run deploy SERVICE \
    --image IMAGE_URL \
    --project PROJECT_ID \
    --region REGION \
    --port PORT \
    --cpu CPU \
    --memory MEMORY \
    --no-cpu-throttling \
    --gpu GPU_NUMBER \
    --gpu-type GPU_TYPE \
    --max-instances MAX_INSTANCE

Replace:

  • SERVICE with the name of your Cloud Run service.
  • IMAGE_URL with a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL has the shape LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
  • PROJECT_ID with the ID of the project you are deploying to.
  • REGION with the region you are deploying to. You must specify a region that supports GPU.
  • PORT with the port to send requests to. Note that the default port is 8080.
  • CPU with the number of CPUs. You must specify at least 4 CPU.
  • MEMORY with the amount of memory. You must specify at least 16Gi (16 GiB).
  • GPU_NUMBER with the value 1 (one). Only one GPU per Cloud Run instance is supported.
  • GPU_TYPE with the GPU type. You must use nvidia-l4 (with a lowercase L, not the number fourteen).
  • MAX_INSTANCE with the maximum number of instances. This number can't exceed the GPU quota allocated for your project.
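
For example, a deployment with the placeholders filled in might look like the following; the service name, image, region, and instance cap are illustrative values only:

  gcloud beta run deploy my-inference-service \
    --image us-docker.pkg.dev/my-project/my-repo/inference:latest \
    --project my-project \
    --region us-central1 \
    --port 8080 \
    --cpu 4 \
    --memory 16Gi \
    --no-cpu-throttling \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --max-instances 3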

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml
  2. Update the nvidia.com/gpu: attribute and the nodeSelector: run.googleapis.com/accelerator: attribute:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: SERVICE
      annotations:
        run.googleapis.com/launch-stage: BETA
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/maxScale: 'MAX_INSTANCE'
            run.googleapis.com/cpu-throttling: 'false'
        spec:
          containers:
          - image: IMAGE_URL
            ports:
            - containerPort: CONTAINER_PORT
              name: http1
            resources:
              limits:
                cpu: 'CPU'
                memory: 'MEMORY'
                nvidia.com/gpu: 'GPU_NUMBER'
            # Optional: use a longer startup probe to give slow-starting containers more time to become ready
            startupProbe:
              failureThreshold: 1800
              periodSeconds: 1
              tcpSocket:
                port: CONTAINER_PORT
              timeoutSeconds: 1
          nodeSelector:
            run.googleapis.com/accelerator: GPU_TYPE

    Replace:

    • SERVICE with the name of your Cloud Run service.
    • IMAGE_URL with a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL has the shape LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
    • CONTAINER_PORT with the container port set for your service.
    • CPU with the number of CPU. You must specify at least 4 CPU.
    • MEMORY with the amount of memory. You must specify at least 16Gi (16 GiB).
    • GPU_NUMBER with the value 1 (one), because only one GPU per Cloud Run instance is supported.
    • GPU_TYPE with the value nvidia-l4 (with a lowercase L, not the number fourteen).
    • MAX_INSTANCE with the maximum number of instances. This number can't exceed the GPU quota allocated for your project.
  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

View GPU settings

To view the current GPU settings for your Cloud Run service:

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Click the service you are interested in to open the Service details page.

  3. Click the Revisions tab.

  4. In the details panel at the right, the GPU setting is listed under the Container tab.

gcloud

  1. Use the following command:

    gcloud run services describe SERVICE
  2. Locate the GPU setting in the returned configuration.
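
For a quicker check, you can also export the configuration and filter for the GPU-related fields. This is a convenience sketch using standard shell tooling, not an official flag set:

  # Export the service config as YAML and show only the GPU lines.
  gcloud run services describe SERVICE --format export | grep -e 'nvidia.com/gpu' -e 'accelerator'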

Removing GPU

You can remove GPU using the Google Cloud console, the Google Cloud CLI, or YAML.

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Click Deploy container and select Service to configure a new service. If you are configuring an existing service, click the service, then click Edit and deploy new revision.

  3. If you are configuring a new service, fill out the initial service settings page, then click Container(s), volumes, networking, security to expand the service configuration page.

  4. Click the Container tab.

    • Uncheck the GPU checkbox.
  5. Click Create or Deploy.

gcloud

To remove GPU, set the number of GPUs to 0 using the gcloud beta run services update command:

  gcloud beta run services update SERVICE --gpu 0
  

Replace SERVICE with the name of your Cloud Run service.

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml
  2. Delete the nvidia.com/gpu: and the nodeSelector: run.googleapis.com/accelerator: nvidia-l4 lines.

  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

Libraries

By default, all of the NVIDIA L4 driver libraries are mounted under /usr/local/nvidia/lib64.

If your service is unable to find the provided libraries, update the search path for the dynamic linker by adding the line ENV LD_LIBRARY_PATH /usr/local/nvidia/lib64:${LD_LIBRARY_PATH} to your Dockerfile.
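
To check from inside a running container whether the driver libraries are where the linker expects them, a quick inspection like the following can help; this is a sketch that assumes your image includes a shell:

  # Confirm the driver libraries are mounted where expected.
  ls /usr/local/nvidia/lib64

  # Show the dynamic linker search path currently in effect.
  echo "$LD_LIBRARY_PATH"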

Note that you can also set LD_LIBRARY_PATH as an environment variable for the Cloud Run service, if you have an existing image and don't want to rebuild the image with an updated Dockerfile.
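
For example, the variable could be set on an existing service with a command along these lines, where SERVICE is a placeholder for your service name:

  # Set LD_LIBRARY_PATH as a service environment variable without rebuilding the image.
  gcloud run services update SERVICE \
    --update-env-vars LD_LIBRARY_PATH=/usr/local/nvidia/lib64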

If you want to use a CUDA version greater than 12.2, the easiest way is to depend on a newer NVIDIA base image with forward compatibility packages already installed. Another option is to manually install the NVIDIA forward compatibility packages and add them to LD_LIBRARY_PATH. Consult NVIDIA's compatibility matrix to determine which CUDA versions are forward compatible with the provided NVIDIA driver version (535.129.03).

About GPUs and maximum instances

The number of instances with GPUs is limited in two ways:

  • The Maximum instances setting limits the number of instances per service. This can't be set higher than your GPU quota per project per region.
  • The GPU quota per project per region limits the number of instances across all services in the same region.
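
For example, to keep a GPU service within quota, you can cap its maximum instances at update time; the value below is illustrative:

  # Limit the service to at most 3 instances (each with one L4 GPU).
  gcloud run services update SERVICE --max-instances 3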