Run Gemma 3 on Cloud Run

This guide describes how to deploy Gemma 3 open models on Cloud Run using a prebuilt container, and provides guidance on using the deployed Cloud Run service with the Google Gen AI SDK.

Before you begin

If you used Google AI Studio to deploy to Cloud Run, skip to the Securely interact with the Google Gen AI SDK section.

If you didn't use Google AI Studio, follow these steps before using Cloud Run to create a new service.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Set up your Cloud Run development environment in your Google Cloud project.
  5. Install and initialize the gcloud CLI.
  6. Ensure you have the required IAM roles granted to your account. To grant the roles, use either the console or the gcloud CLI:

    Console

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the Google Account email address that is used to deploy the Cloud Run service.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

    gcloud

    To grant the required IAM roles to your account on your project:

            gcloud projects add-iam-policy-binding PROJECT_ID \
                --member=PRINCIPAL \
                --role=ROLE
            

    Replace:

    • PROJECT_ID with your Google Cloud project ID.
    • PRINCIPAL with the account you are adding the binding for. This is typically the Google Account email address that is used to deploy the Cloud Run service.
    • ROLE with the role you are adding to the deployer account.
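
    For example, a filled-in sketch of this command, using a hypothetical project ID my-project, a hypothetical deployer account deployer@example.com, and the Cloud Run Admin role (roles/run.admin) as an example of a role you might need to grant:

            gcloud projects add-iam-policy-binding my-project \
                --member=user:deployer@example.com \
                --role=roles/run.admin
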
  7. Request Total Nvidia L4 GPU allocation, per project per region quota under the Cloud Run Admin API on the Quotas and system limits page.
  8. Review the Cloud Run pricing page. To generate a cost estimate based on your projected usage, use the pricing calculator.

Deploy a Gemma model with a prebuilt container

Cloud Run provides a prebuilt container for serving Gemma open models on Cloud Run.

To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:

gcloud run deploy SERVICE_NAME \
   --image us-docker.pkg.dev/cloudrun/container/gemma/GEMMA_PARAMETER \
   --concurrency 4 \
   --cpu 8 \
   --set-env-vars OLLAMA_NUM_PARALLEL=4 \
   --gpu 1 \
   --gpu-type nvidia-l4 \
   --max-instances 1 \
   --memory 32Gi \
   --no-allow-unauthenticated \
   --no-cpu-throttling \
   --timeout=600 \
   --region REGION

Replace:

  • SERVICE_NAME with a unique name for the Cloud Run service.
  • GEMMA_PARAMETER with the Gemma model you used:

    • Gemma 3 1B (gemma-3-1b-it): gemma3-1b
    • Gemma 3 4B (gemma-3-4b-it): gemma3-4b
    • Gemma 3 12B (gemma-3-12b-it): gemma3-12b
    • Gemma 3 27B (gemma-3-27b-it): gemma3-27b

    Optionally, replace the entire image URL with a Docker image you've built from the Gemma-on-Cloudrun GitHub repository.

  • REGION with the Google Cloud region where your Cloud Run service will be deployed, such as europe-west1. If you need to modify the region, see GPU configuration to learn about supported regions for GPU-enabled deployments.
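
For example, a filled-in sketch of the deploy command, assuming a hypothetical service named ollama-gemma that serves Gemma 3 4B in europe-west1 (adjust the values for your own deployment):

gcloud run deploy ollama-gemma \
   --image us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b \
   --concurrency 4 \
   --cpu 8 \
   --set-env-vars OLLAMA_NUM_PARALLEL=4 \
   --gpu 1 \
   --gpu-type nvidia-l4 \
   --max-instances 1 \
   --memory 32Gi \
   --no-allow-unauthenticated \
   --no-cpu-throttling \
   --timeout=600 \
   --region europe-west1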

The other settings are as follows:

  • --concurrency: The maximum number of requests that can be processed simultaneously by a given instance, such as 4. See Set concurrency for optimal performance for recommendations on optimal request latency.
  • --cpu: The amount of allocated CPU for your service, such as 8.
  • --set-env-vars: The environment variables set for your service, such as OLLAMA_NUM_PARALLEL=4. See Set concurrency for optimal performance for recommendations on optimal request latency.
  • --gpu: The GPU value for your service, such as 1.
  • --gpu-type: The type of GPU to use for your service, such as nvidia-l4.
  • --max-instances: The maximum number of container instances for your service, such as 1.
  • --memory: The amount of allocated memory for your service, such as 32Gi.
  • --no-allow-unauthenticated: Requires IAM authentication for requests to your service. See Securely interact with the Google Gen AI SDK for recommendations on how to secure your app.
  • --no-cpu-throttling: Disables CPU throttling when the container is not actively serving requests.
  • --timeout: The time within which a response must be returned, such as 600 seconds.

If you need to modify the default settings or add more customized settings to your Cloud Run service, see Configure services.
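
If you want to change one of these settings later without redeploying, one option (a sketch; the value shown is only an example) is the gcloud run services update command:

gcloud run services update SERVICE_NAME \
   --max-instances 2 \
   --region REGION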

When the deployment completes, a success message is displayed along with the Cloud Run endpoint URL, which ends with run.app.
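
If you need to retrieve the endpoint URL again later, one way to do so (a sketch using the gcloud CLI) is:

gcloud run services describe SERVICE_NAME \
   --region REGION \
   --format 'value(status.url)'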

Test the deployed Gemma service with curl

Now that you have deployed the Gemma service, you can send requests to it. However, if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized. This is intentional, because an LLM inference API is intended for other services to call, such as a front-end application. For more information on service-to-service authentication on Cloud Run, refer to Authenticating service-to-service.

To send requests to the Gemma service, add an Authorization header with a valid OIDC token to the requests, for example by using the Cloud Run developer proxy:

  1. Start the proxy for your service (this example uses a service named ollama-gemma), and when prompted to install the cloud-run-proxy component, choose Y:

    gcloud run services proxy ollama-gemma --port=9090
  2. Send a request to it in a separate terminal tab, leaving the proxy running. The proxy listens on localhost:9090. Replace gemma3:4b with the model variant that you deployed:

    curl http://localhost:9090/api/generate -d '{
      "model": "gemma3:4b",
      "prompt": "Why is the sky blue?"
    }'

    This command should provide streaming output similar to this:

    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.641492408Z","response":"That","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.687529153Z","response":"'","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.753284927Z","response":"s","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.812957381Z","response":" a","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.889102649Z","response":" fantastic","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.925748116Z","response":",","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.958391572Z","response":" decept","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.971035028Z","response":"ively","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.989678484Z","response":" tricky","done":false}
    {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.999321940Z","response":" question","done":false}
    ...
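
The developer proxy is convenient for local testing, but you can also call the service URL directly by attaching an identity token yourself, as the curl examples later in this guide do. A sketch, assuming your account has permission to invoke the service and that the prebuilt container exposes the Ollama API at the service root:

curl https://SERVICE_URL/api/generate \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -d '{
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?"
  }'

Replace SERVICE_URL with your Cloud Run endpoint URL ending in run.app.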
    

Securely interact with the Google Gen AI SDK

After you have deployed your Cloud Run service, you can use the Cloud Run endpoint with the Google Gen AI SDK.

Before you use the Google Gen AI SDK, ensure that your requests to the service include an appropriate identity token. To learn more about using IAM authentication with Cloud Run, see Authenticating service-to-service.

The following examples show how to use the Google Gen AI SDK with IAM authentication.

JavaScript or TypeScript

If you are using the Google Gen AI SDK for JavaScript and TypeScript, the code might look as follows:

import { GoogleGenAI } from "@google/genai";
import { GoogleAuth } from "google-auth-library";

// Replace with your Cloud Run service URL.
const cloudRunUrl = "https://CLOUD_RUN_SERVICE_URL";

const auth = new GoogleAuth();

async function main() {
  // Fetch an OIDC identity token whose audience is the Cloud Run service URL.
  const client = await auth.getIdTokenClient(cloudRunUrl);
  const headers = await client.getRequestHeaders(cloudRunUrl);
  const idToken = headers["Authorization"];

  const ai = new GoogleGenAI({
    apiKey: "placeholder",
    httpOptions: { baseUrl: cloudRunUrl, headers: { "Authorization": idToken } },
  });

  const response = await ai.models.generateContent({
    model: "gemma-3-1b-it",
    contents: "I want a pony",
  });
  console.log(response.text);
}

main();

curl

If you are using curl, run the following commands to call the same endpoints that the Google Gen AI SDK uses:

  • For Generate Content, use /v1beta/{model=models/*}:generateContent: Generates a model response given an input GenerateContentRequest.

    curl "<cloud_run_url>/v1beta/models/<model>:generateContent" \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
    -X POST \
    -d '{
      "contents": [{
        "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
        }]
        }'
    
  • For Stream Generate Content, use /v1beta/{model=models/*}:streamGenerateContent: Generates a streamed response from the model given an input GenerateContentRequest.

    curl "<cloud_run_url>/v1beta/models/<model>:streamGenerateContent" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
      -X POST \
      -d '{
        "contents": [{
          "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
          }]
          }'
    

Set concurrency for optimal performance

This section provides context on the recommended concurrency settings. For optimal request latency, ensure the --concurrency setting is equal to Ollama's OLLAMA_NUM_PARALLEL environment variable.

  • OLLAMA_NUM_PARALLEL determines how many request slots are available for each model to handle inference requests concurrently.
  • --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.

If --concurrency exceeds OLLAMA_NUM_PARALLEL, Cloud Run can send more requests to a model in Ollama than it has available request slots for. This leads to request queuing within Ollama, increasing request latency for the queued requests. It also leads to less responsive auto scaling, as the queued requests don't trigger Cloud Run to scale out and start new instances.

OLLAMA_NUM_PARALLEL is a per-model setting, and Ollama also supports serving multiple models from one GPU. To completely avoid request queuing on the Ollama instance, set --concurrency to match OLLAMA_NUM_PARALLEL.

Note that increasing OLLAMA_NUM_PARALLEL spreads the GPU across more simultaneous requests, so each parallel request takes longer.
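
For example, a sketch of keeping the two values in sync on an existing service (the value 4 is only an example):

gcloud run services update SERVICE_NAME \
   --concurrency 4 \
   --update-env-vars OLLAMA_NUM_PARALLEL=4 \
   --region REGION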

Optimize utilization

For optimal GPU utilization, increase --concurrency, keeping it within twice the value of OLLAMA_NUM_PARALLEL. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
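
Continuing the sketch above, this trade-off might look like keeping OLLAMA_NUM_PARALLEL=4 while raising --concurrency to 8 (twice the value):

gcloud run services update SERVICE_NAME \
   --concurrency 8 \
   --update-env-vars OLLAMA_NUM_PARALLEL=4 \
   --region REGION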

Clean up

To avoid incurring charges to your Google Cloud account, delete the Google Cloud resources that you created, such as the Cloud Run service.
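
For example, a sketch of deleting the service with the gcloud CLI:

gcloud run services delete SERVICE_NAME --region REGION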

What's next