This guide describes how to deploy Gemma 3 open models on Cloud Run using a prebuilt container, and provides guidance on using the deployed Cloud Run service with the Google Gen AI SDK.
Before you begin
If you used Google AI Studio to deploy to Cloud Run, skip to the Securely interact with the Google Gen AI SDK section.
If you didn't use Google AI Studio, follow these steps before using Cloud Run to create a new service.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Set up your Cloud Run development environment in your Google Cloud project.
- Install and initialize the gcloud CLI.
- Ensure you have the following IAM roles granted to your account:
  - Cloud Run Admin (roles/run.admin)
  - Project IAM Admin (roles/resourcemanager.projectIamAdmin)
  - Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
Learn how to grant the roles
Console
- In the Google Cloud console, go to the IAM page.
  Go to IAM
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the Google Account email address that is used to deploy the Cloud Run service.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
gcloud
To grant the required IAM roles to your account on your project:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=PRINCIPAL \
  --role=ROLE
Replace:
- PROJECT_NUMBER with your Google Cloud project number.
- PROJECT_ID with your Google Cloud project ID.
- PRINCIPAL with the account you are adding the binding for. This is typically the Google Account email address that is used to deploy the Cloud Run service.
- ROLE with the role you are adding to the deployer account.
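For example, to grant the Cloud Run Admin role to a deployer account, the command might look like the following; the project ID and email address are placeholders for illustration only:
gcloud projects add-iam-policy-binding my-project \
  --member=user:deployer@example.com \
  --role=roles/run.admin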
- Request Total Nvidia L4 GPU allocation, per project per region quota under Cloud Run Admin API in the Quotas and system limits page.
- Review the Cloud Run pricing page. To generate a cost estimate based on your projected usage, use the pricing calculator.
Deploy a Gemma model with a prebuilt container
Cloud Run provides a prebuilt container for serving Gemma open models on Cloud Run.
To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:
gcloud run deploy SERVICE_NAME \
  --image us-docker.pkg.dev/cloudrun/container/gemma/GEMMA_PARAMETER \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --timeout=600 \
  --region REGION
Replace:
- SERVICE_NAME with a unique name for the Cloud Run service.
- GEMMA_PARAMETER with the Gemma model you used:
  - Gemma 3 1B (gemma-3-1b-it): gemma3-1b
  - Gemma 3 4B (gemma-3-4b-it): gemma3-4b
  - Gemma 3 12B (gemma-3-12b-it): gemma3-12b
  - Gemma 3 27B (gemma-3-27b-it): gemma3-27b
  Optionally, replace the entire image URL with a Docker image you've built from the Gemma-on-Cloudrun GitHub repository.
- REGION with the Google Cloud region where your Cloud Run service will be deployed, such as europe-west1. If you need to modify the region, see GPU configuration to learn about supported regions for GPU-enabled deployments.
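As a concrete illustration, a filled-in command for the Gemma 3 4B variant deployed to europe-west1 might look like the following; the service name ollama-gemma matches the one used in the proxy example later in this guide, but any unique name works:
gcloud run deploy ollama-gemma \
  --image us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --timeout=600 \
  --region europe-west1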
The other settings are as follows:
| Option | Description |
|---|---|
| --concurrency | The maximum number of requests that can be processed simultaneously by a given instance, such as 4. |
| --cpu | The amount of allocated CPU for your service, such as 8. |
| --set-env-vars | The environment variables set for your service. For example, OLLAMA_NUM_PARALLEL=4. |
| --gpu | The GPU value for your service, such as 1. |
| --gpu-type | The type of GPU to use for your service, such as nvidia-l4. |
| --max-instances | The maximum number of container instances for your service, such as 1. |
| --memory | The amount of allocated memory for your service, such as 32Gi. |
| --no-invoker-iam-check | Disable invoker IAM checks. See Securely interact with the Google Gen AI SDK for recommendations on how to better secure your app. |
| --no-cpu-throttling | This setting disables CPU throttling when the container is not actively serving requests. |
| --timeout | The time within which a response must be returned, such as 600 seconds. |
If you need to modify the default settings or add more customized settings to your Cloud Run service, see Configure services.
When the deployment completes, a success message is displayed along with the Cloud Run endpoint URL ending with run.app.
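If you need the endpoint URL again later, one way to retrieve it is with gcloud, where SERVICE_NAME and REGION are the values you deployed with:
gcloud run services describe SERVICE_NAME \
  --region REGION \
  --format 'value(status.url)'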
Test the deployed Gemma service with curl
Now that you have deployed the Gemma service, you can send
requests to it. However, if you send a request directly, Cloud Run
responds with HTTP 401 Unauthorized
. This is intentional, because an LLM
inference API is intended for other services to call, such as a front-end
application. For more information on service-to-service
authentication on Cloud Run, refer to Authenticating service-to-service.
To send requests to the Gemma service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:
- Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:
  gcloud run services proxy ollama-gemma --port=9090
- Send a request to it in a separate terminal tab, leaving the proxy running. Note that the proxy runs on localhost:9090:
  curl http://localhost:9090/api/generate -d '{
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?"
  }'
This command should provide streaming output similar to this:
{"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.641492408Z","response":"That","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.687529153Z","response":"'","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.753284927Z","response":"s","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.812957381Z","response":" a","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.889102649Z","response":" fantastic","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.925748116Z","response":",","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.958391572Z","response":" decept","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.971035028Z","response":"ively","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.989678484Z","response":" tricky","done":false} {"model":"gemma3:4b","created_at":"2025-03-10T03:02:18.999321940Z","response":" question","done":false} ...
Securely interact with the Google Gen AI SDK
After you have deployed your Cloud Run service, you can use the Cloud Run endpoint with the Google Gen AI SDK.
Before you use the Google Gen AI SDK, ensure that incoming requests pass the appropriate identity token. To learn more about using IAM authentication and Cloud Run, see Authenticating service-to-service.
The following examples show how to use the Google Gen AI SDK with IAM authentication.
JavaScript or TypeScript
If you are using the Google Gen AI SDK for JavaScript and TypeScript, the code might look as follows:
import { GoogleGenAI } from "@google/genai";
import { GoogleAuth } from "google-auth-library";

// URL of the deployed Cloud Run service.
const url = 'https://CLOUD_RUN_SERVICE_URL';
const targetAudience = url;

const auth = new GoogleAuth();

async function main() {
  // Get an ID token client for the Cloud Run service and read the
  // Authorization header it produces.
  const client = await auth.getIdTokenClient(targetAudience);
  const headers = await client.getRequestHeaders(targetAudience);
  const idToken = headers['Authorization'];

  const ai = new GoogleGenAI({
    apiKey: "placeholder",
    httpOptions: { baseUrl: url, headers: { 'Authorization': idToken } },
  });

  const response = await ai.models.generateContent({
    model: "gemma-3-1b-it",
    contents: "I want a pony",
  });

  console.log(response.text);
}

main();
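To try this locally, you would typically install the two packages and run the file with Node.js. The filename below is arbitrary; because the snippet uses ES module imports, save it with an .mjs extension or set "type": "module" in package.json:
npm install @google/genai google-auth-library
node genai-cloudrun.mjs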
curl
If using curl, run the following commands to reach the Google Gen AI SDK endpoints:
For Generate Content, use /v1beta/{model=models/*}:generateContent, which generates a model response given an input GenerateContentRequest:
curl "<cloud_run_url>/v1beta/models/<model>:generateContent" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -X POST \
  -d '{
    "contents": [{
      "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
    }]
  }'
For Stream Generate Content, use /v1beta/{model=models/*}:streamGenerateContent, which generates a streamed response from the model given an input GenerateContentRequest:
curl "<cloud_run_url>/v1beta/models/<model>:streamGenerateContent" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -X POST \
  -d '{
    "contents": [{
      "parts":[{"text": "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."}]
    }]
  }'
Set concurrency for optimal performance
This section provides context on the recommended concurrency settings. For optimal
request latency, ensure the --concurrency
setting is equal to Ollama's
OLLAMA_NUM_PARALLEL
environment variable.
- OLLAMA_NUM_PARALLEL determines how many request slots are available per model to handle inference requests concurrently.
- --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.
If --concurrency
exceeds OLLAMA_NUM_PARALLEL
, Cloud Run can send
more requests to a model in Ollama than it has available request slots for.
This leads to request queuing within Ollama, increasing request latency for the
queued requests. It also leads to less responsive auto scaling, as the queued
requests don't trigger Cloud Run to scale out and start new instances.
Ollama also supports serving multiple models from one GPU. To completely
avoid request queuing on the Ollama instance, you should still set
--concurrency
to match OLLAMA_NUM_PARALLEL
.
It's important to note that increasing OLLAMA_NUM_PARALLEL
also makes parallel requests take longer.
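For example, to keep the two values in sync on an already deployed service, you could update both in a single command. This is a sketch; SERVICE_NAME and REGION are placeholders for your own values:
gcloud run services update SERVICE_NAME \
  --concurrency 4 \
  --update-env-vars OLLAMA_NUM_PARALLEL=4 \
  --region REGION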
Optimize utilization
For optimal GPU utilization, increase --concurrency
, keeping it within
twice the value of OLLAMA_NUM_PARALLEL
. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
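For instance, with OLLAMA_NUM_PARALLEL=4 you might raise --concurrency to 8 (twice the value), accepting some queuing in exchange for better GPU utilization:
gcloud run services update SERVICE_NAME \
  --concurrency 8 \
  --region REGION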
Clean up
To avoid incurring further charges, delete the Google Cloud resources that you created, such as the deployed Cloud Run service.
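For example, you can delete the service with gcloud, where SERVICE_NAME and REGION are the values you deployed with:
gcloud run services delete SERVICE_NAME --region REGION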
What's next
- Configure GPU
- Best practices: AI inference on Cloud Run with GPUs
- Run Gemma 3 models with various AI runtime frameworks