GPU zonal redundancy for services

This page describes how to set GPU zonal redundancy options for your Cloud Run service. By default, GPUs have zonal redundancy enabled, so data and traffic are automatically load balanced across zones within a region. If a failure occurs in a particular zone, traffic automatically routes to the other zones.

If you instead want best-effort reliability at a lower cost per GPU-second, turn off zonal redundancy for GPU.

Supported regions

  • us-central1 (Iowa) Low CO2
  • asia-southeast1 (Singapore)
  • europe-west1 (Belgium) Low CO2
  • europe-west4 (Netherlands) Low CO2
  • asia-south1 (Mumbai)
    • Note: This region is available by invitation only. Contact your Google Account team if you are interested in this region.

Pricing impact

See Cloud Run pricing for details about GPU pricing, including the cost of zonal redundancy.

Request quota

By default, there is no quota for GPUs with zonal redundancy or for GPUs without zonal redundancy. You need to request quota before you can use either option. Use the following links to request the quota you need.

  • GPU with zonal redundancy turned on: Request GPU quota with zonal redundancy
  • GPU with zonal redundancy turned off: Request GPU quota without zonal redundancy
  • GPUs quota page (covers both zonal and non-zonal redundancy): Request GPU quota

Before you begin

Before you configure GPU zonal redundancy, complete the following steps:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Cloud Run API.

    Enable the API
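
    If you prefer to work from the command line, you can also enable the API with gcloud. This is an optional alternative to the console button; PROJECT_ID is a placeholder for your project ID:

      gcloud services enable run.googleapis.com --project PROJECT_ID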

Required roles

To get the permissions that you need to configure and deploy Cloud Run services, ask your administrator to grant you the following IAM roles:

  • Cloud Run Developer (roles/run.developer) on the Cloud Run service
  • Service Account User (roles/iam.serviceAccountUser) on the service identity

For a list of IAM roles and permissions that are associated with Cloud Run, see Cloud Run IAM roles and Cloud Run IAM permissions. If your Cloud Run service interfaces with Google Cloud APIs, such as Cloud Client Libraries, see the service identity configuration guide. For more information about granting roles, see deployment permissions and manage access.
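
If you manage access with the gcloud CLI, an administrator can grant one of these roles with a command along these lines (a generic sketch; PROJECT_ID and USER_EMAIL are placeholders):

  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/run.developer"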

Configure zonal redundancy for a Cloud Run service that has GPU

Any configuration change leads to the creation of a new revision. Subsequent revisions will also automatically get this configuration setting unless you make explicit updates to change it.
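
For example, after you deploy a configuration change, you can list a service's revisions to see the newly created revision (a minimal sketch; SERVICE and REGION are placeholders):

  gcloud run revisions list --service SERVICE --region REGION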

You can use the Google Cloud console, the Google Cloud CLI, or YAML to configure GPU zonal redundancy.

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Click Deploy container and select Service to configure a new service. If you are configuring an existing service, click the service, then click Edit and deploy new revision.

  3. If you are configuring a new service, fill out the initial service settings page, then click Container(s), volumes, networking, security to expand the service configuration page.

  4. Click the Container tab.

    • Select the GPU checkbox to show the GPU redundancy options.
      • Select No zonal redundancy to turn off zonal redundancy.
      • Select Zonal redundancy to turn on zonal redundancy.
  5. Click Create or Deploy.

gcloud

By default, GPU zonal redundancy is turned on. To turn off GPU zonal redundancy for a service, or to turn it back on if you previously turned it off, use the gcloud beta run services update command:

  gcloud beta run services update SERVICE \
    --image IMAGE_URL \
    --cpu CPU \
    --memory MEMORY \
    --no-cpu-throttling \
    --gpu GPU_NUMBER \
    --gpu-type GPU_TYPE \
    --max-instances MAX_INSTANCE \
    --GPU_ZONAL_REDUNDANCY

Replace:

  • SERVICE with the name of your Cloud Run service.
  • IMAGE_URL with a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL has the form LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
  • CPU with the number of CPUs. You must specify at least 4 CPUs.
  • MEMORY with the amount of memory. You must specify at least 16Gi (16 GiB).
  • GPU_NUMBER with the value 1 (one). If this is unspecified but a GPU_TYPE is present, the default is 1.
  • GPU_TYPE with the GPU type. If this is unspecified but a GPU_NUMBER is present, the default is nvidia-l4 (NVIDIA L4; lowercase L, not the numeral fourteen).
  • MAX_INSTANCE with the maximum number of instances. This number can't exceed the GPU quota allocated for your project.
  • GPU_ZONAL_REDUNDANCY with no-gpu-zonal-redundancy to turn off zonal redundancy, or gpu-zonal-redundancy to turn on zonal redundancy.
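
For example, a fully substituted command that turns off zonal redundancy might look like the following. The service name and instance limit here are illustrative placeholders rather than recommendations:

  gcloud beta run services update my-gpu-service \
    --image us-docker.pkg.dev/cloudrun/container/hello:latest \
    --cpu 4 \
    --memory 16Gi \
    --no-cpu-throttling \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --max-instances 3 \
    --no-gpu-zonal-redundancy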

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml
  2. Update the run.googleapis.com/gpu-zonal-redundancy-disabled annotation:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: SERVICE
      annotations:
        run.googleapis.com/launch-stage: BETA
    spec:
      template:
        metadata:
          annotations:
            run.googleapis.com/gpu-zonal-redundancy-disabled: 'GPU_ZONAL_REDUNDANCY'
            

    Replace:

    • SERVICE with the name of your Cloud Run service.
    • GPU_ZONAL_REDUNDANCY with false to turn on GPU zonal redundancy, or true to turn it off.
  3. Create or update the service using the following command:

    gcloud run services replace service.yaml
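
To confirm the setting on the deployed service, one option is to export its configuration again and look for the annotation; the grep filter is just a convenience:

  gcloud run services describe SERVICE --format export | grep gpu-zonal-redundancy-disabled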