Run LLM inference on Cloud Run GPUs with Ollama


In this tutorial, you'll learn how to deploy Gemma 2, Google's open large language model (LLM), on a GPU-enabled Cloud Run service for fast inference.

You'll use Ollama, an LLM inference server for open models. Once you've completed the tutorial, feel free to also explore other open models that are supported by Ollama, including Llama 3.1 (8B), Mistral (7B), and Qwen2 (7B).

Objectives

  • Deploy Ollama with the Gemma 2 model on a GPU-enabled Cloud Run service.
  • Send prompts to the Ollama service on its private endpoint.

Costs

In this document, you use the following billable components of Google Cloud: Cloud Run, Artifact Registry, Cloud Build, and Cloud Storage.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Artifact Registry, Cloud Build, Cloud Run, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the gcloud CLI.
  6. Request the Total Nvidia L4 GPU allocation, per project per region quota for the Cloud Run Admin API on the Quotas and system limits page; you need this quota to complete this tutorial.

Required roles

To get the permissions that you need to complete the tutorial, ask your administrator to grant you the required IAM roles on your project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
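
Roles can be granted in the Google Cloud console or with the gcloud CLI. As a minimal sketch, granting a single role on the command line looks like the following; the Cloud Run Admin role and the USER_EMAIL placeholder are shown only as an illustration, and your administrator determines the exact roles you need:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/run.admin"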

Set up gcloud

To configure the Google Cloud CLI for your Cloud Run service:

  1. Set your default project:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with the ID of the project you created for this tutorial.

  2. Configure the Google Cloud CLI to use the region us-central1 for Cloud Run commands:

    gcloud config set run/region us-central1

Create an Artifact Registry Docker repository

Create a Docker repository to store the container images for this tutorial:

gcloud artifacts repositories create REPOSITORY \
  --repository-format=docker \
  --location=us-central1

Replace REPOSITORY with the name of the repository. For example, repo.

Use Docker to create a container image with Ollama and Gemma

  1. Create a directory for the Ollama service and change your working directory to this new directory:

    mkdir ollama-backend
    cd ollama-backend
  2. Create a file named Dockerfile with the following contents:

    FROM ollama/ollama:0.3.6
    
    # Listen on all interfaces, port 8080
    ENV OLLAMA_HOST 0.0.0.0:8080
    
    # Store model weight files in /models
    ENV OLLAMA_MODELS /models
    
    # Reduce logging verbosity
    ENV OLLAMA_DEBUG false
    
    # Never unload model weights from the GPU
    ENV OLLAMA_KEEP_ALIVE -1 
    
    # Store the model weights in the container image
    ENV MODEL gemma2:9b
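    # Start the server in the background, wait briefly for it to come up, then pull the weights into /models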
    RUN ollama serve & sleep 5 && ollama pull $MODEL 
    
    # Start Ollama
    ENTRYPOINT ["ollama", "serve"]

Store model weights in the container image for faster instance starts

Google recommends storing the model weights for Gemma 2 (9B) and similarly sized models directly in the container image.

Model weights are the numerical parameters that define the behavior of an LLM. Ollama must fully read these files and load the weights into GPU memory (VRAM) during container instance startup, before it can start serving inference requests.

On Cloud Run, a fast container instance startup is important for minimizing request latency. If your container instance has a slow startup time, the service takes longer to scale from zero to one instance, and it needs more time to scale out during a traffic spike.

To ensure a fast startup, store the model files in the container image itself. This is faster and more reliable than downloading the files from a remote location during startup. Cloud Run's internal container image storage is optimized for handling traffic spikes, allowing it to quickly set up the container's file system when an instance starts.

Note that the model weights for Gemma 2 (9B) take up 5.4 GB of storage. Larger models have larger model weight files, and these might be impractical to store in the container image. Refer to Best practices: AI inference on Cloud Run with GPUs for an overview of the trade-offs.

Build the container image using Cloud Build

To build the container image with Cloud Build and push it to the Artifact Registry repository:

gcloud builds submit \
   --tag us-central1-docker.pkg.dev/PROJECT_ID/REPOSITORY/ollama-gemma \
   --machine-type e2-highcpu-32

Note the following considerations:

  • For a faster build, this command selects a powerful machine type with more CPU and network bandwidth.
  • You should expect the build to take around 7 minutes.
  • An alternative is to build the image locally with Docker and push it to Artifact Registry, as sketched after this list. This might be slower than running on Cloud Build, depending on your network bandwidth.
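
If you choose the local Docker route, a minimal sketch looks like the following. It assumes Docker is installed locally and reuses the image path from the Cloud Build command above:

# Authenticate Docker to the Artifact Registry host
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build the image locally and push it to the repository
docker build -t us-central1-docker.pkg.dev/PROJECT_ID/REPOSITORY/ollama-gemma .
docker push us-central1-docker.pkg.dev/PROJECT_ID/REPOSITORY/ollama-gemma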

Deploy Ollama as a Cloud Run service

With the container image stored in an Artifact Registry repository, you're now ready to deploy Ollama as a Cloud Run service.

Create a dedicated service account

Create a dedicated service account that the Ollama service uses as its service identity:

gcloud iam service-accounts create OLLAMA_IDENTITY \
  --display-name="Service Account for Ollama Cloud Run service"

Replace OLLAMA_IDENTITY with the name of the service account you want to create, for example, ollama.

It's a best practice to create a dedicated service account for every Cloud Run service with the minimal required set of permissions. The Ollama service doesn't need to call any Google Cloud APIs, which means there's no need to grant its service account any permissions.

Deploy the service

Deploy the service to Cloud Run:

gcloud beta run deploy ollama-gemma \
  --image us-central1-docker.pkg.dev/PROJECT_ID/REPOSITORY/ollama-gemma \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 7 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --service-account OLLAMA_IDENTITY@PROJECT_ID.iam.gserviceaccount.com \
  --timeout=600

Note the following important flags in this command:

  • --concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
  • --gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
  • --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication. Refer to Managing access using IAM.
  • --no-cpu-throttling is required to enable GPU.
  • --service-account sets the service identity of the service.
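
Once the deployment finishes, one way to confirm that the service is up and to retrieve its URL is the following command; the region matches the run/region configuration set earlier:

gcloud run services describe ollama-gemma \
  --region us-central1 \
  --format 'value(status.url)'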

Setting concurrency for optimal performance

This section provides context on the recommended concurrency settings. For optimal request latency, ensure the --concurrency setting is equal to Ollama's OLLAMA_NUM_PARALLEL environment variable.

  • OLLAMA_NUM_PARALLEL determines how many request slots each model has available to handle inference requests concurrently.
  • --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.

If --concurrency exceeds OLLAMA_NUM_PARALLEL, Cloud Run can send more requests to a model in Ollama than it has available request slots for. This leads to request queuing within Ollama, increasing request latency for the queued requests. It also leads to less responsive auto scaling, as the queued requests don't trigger Cloud Run to scale out and start new instances.

Ollama also supports serving multiple models from one GPU. To completely avoid request queuing on the Ollama instance, you should still set --concurrency to match OLLAMA_NUM_PARALLEL.

It's important to note that increasing OLLAMA_NUM_PARALLEL also makes requests that are processed in parallel take longer, because the GPU is shared across them.

Optimizing utilization

For optimal GPU utilization, increase --concurrency, keeping it within twice the value of OLLAMA_NUM_PARALLEL. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
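
As a sketch of this trade-off, the following command raises --concurrency to twice OLLAMA_NUM_PARALLEL on the already deployed service; the values shown are only an illustration of the ratio described above:

gcloud run services update ollama-gemma \
  --concurrency 8 \
  --update-env-vars OLLAMA_NUM_PARALLEL=4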

Test the deployed Ollama service with curl

Now that you have deployed the Ollama service, you can send requests to it. However, if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized. This is intentional, because an LLM inference API is intended for other services to call, such as a frontend application. For more information on service-to-service authentication on Cloud Run, refer to Authenticating service-to-service.

To send requests to the Ollama service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:

  1. Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:

    gcloud run services proxy ollama-gemma --port=9090
  2. Send a request to it in a separate terminal tab, leaving the proxy running. Note that the proxy runs on localhost:9090:

    curl http://localhost:9090/api/generate -d '{
      "model": "gemma2:9b",
      "prompt": "Why is the sky blue?"
    }'

    This command should provide streaming output similar to this:

    {"model":"gemma2:9b","created_at":"2024-07-15T23:21:39.288463414Z","response":"The","done":false}
    {"model":"gemma2:9b","created_at":"2024-07-15T23:21:39.320937525Z","response":" sky","done":false}
    {"model":"gemma2:9b","created_at":"2024-07-15T23:21:39.353173544Z","response":" appears","done":false}
    {"model":"gemma2:9b","created_at":"2024-07-15T23:21:39.385284976Z","response":" blue","done":false}
    ...
    
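
As an alternative to the developer proxy, you can call the service URL directly and attach an identity token yourself. The following is a minimal sketch, assuming your account is allowed to invoke the service and that SERVICE_URL is the URL of the deployed service:

curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  SERVICE_URL/api/generate -d '{
    "model": "gemma2:9b",
    "prompt": "Why is the sky blue?"
  }'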

Clean up

  1. Delete the Google Cloud resources created in this tutorial:
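
    The following commands are a sketch, assuming the resource names used earlier in this tutorial (ollama-gemma, REPOSITORY, and OLLAMA_IDENTITY):

    gcloud run services delete ollama-gemma --region us-central1
    gcloud artifacts repositories delete REPOSITORY --location=us-central1
    gcloud iam service-accounts delete OLLAMA_IDENTITY@PROJECT_ID.iam.gserviceaccount.com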

What's next