Context Caching for Fine-tuned Gemini Models

This guide shows you how to use context caching with your fine-tuned Gemini models to improve performance and reduce costs for prompts that include large amounts of context. This page describes the following:

By caching frequently used context, you avoid re-sending large amounts of data with each request to your fine-tuned model.

The process for managing (Read, Update, Delete) a context cache is the same for both fine-tuned and base Gemini models. This page focuses on the specific steps for creating and using a cache with a fine-tuned model.

Supported models

Context caching is supported for models fine-tuned from the following base model:

  • gemini-2.0-flash-001

Prerequisites

Before you begin, you need the following:

  • A deployed fine-tuned model: You need a deployed fine-tuned Gemini model based on a supported base model. For details, see Fine-tune a Gemini model.
  • Model information:
    • The ID and the version of the tuned Gemini model.
    • The endpoint resource name for the deployed fine-tuned model. To get the endpoint, see Deploy a tuned model.

Create a context cache for a fine-tuned model

The process for creating a context cache for a fine-tuned model is similar to the process for a base model. For the general steps, see Create a context cache.

The key difference is that you must specify your fine-tuned model version in the model field instead of the base model. Use the format projects/{PROJECT}/locations/{LOCATION}/models/{MODEL}@{VERSION}.

The following example shows how to create a context cache with a tuned Gemini model.

REST

You can use REST to create a context cache by using the Vertex AI API to send a POST request to the publisher model endpoint. The following example shows how to create a context cache using a file stored in a Cloud Storage bucket.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request and where the cached content is stored. For a list of supported regions, see Available regions.
  • MODEL_ID: The fine-tuned Gemini model ID.
  • MODEL_VERSION: The fine-tuned Gemini model version.
  • CACHE_DISPLAY_NAME: A meaningful display name to describe and to help you identify each context cache.
  • MIME_TYPE: The MIME type of the content to cache.
  • CONTENT_TO_CACHE_URI: The Cloud Storage URI of the content to cache.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents

Request JSON body:

{
  "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID@MODEL_VERSION",
  "displayName": "CACHE_DISPLAY_NAME",
  "contents": [{
    "role": "user",
      "parts": [{
        "fileData": {
          "mimeType": "MIME_TYPE",
          "fileUri": "CONTENT_TO_CACHE_URI"
        }
      }]
  },
  {
    "role": "model",
      "parts": [{
        "text": "This is sample text to demonstrate explicit caching."
      }]
  }]
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

Example curl command

LOCATION="us-central1"
MODEL_ID="model-id"
PROJECT_ID="test-project"
MODEL_VERSION=1
MIME_TYPE="video/mp4"
CACHED_CONTENT_URI="gs://path-to-bucket/video-file-name.mp4"

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents -d \
'{
  "model":"projects/${PROJECT_ID}/locations/${LOCATION}/models/${MODEL_ID}@${MODEL_VERSION}",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "${MIME_TYPE}",
            "fileUri": "${CACHED_CONTENT_URI}"
          }
        }
      ]
    }
  ]
}'

Use a context cache for a fine-tuned model

The process for using a context cache with a fine-tuned model is similar to the process for a base model. For the general steps, see Use a context cache.

The key difference is that you send the request to your deployed fine-tuned model's endpoint (.../endpoints/{ENDPOINT_ID}) instead of the base model endpoint (.../publishers/google/models/{MODEL}).

When you use a context cache, you can't specify the following properties in your request:

  • GenerativeModel.system_instructions
  • GenerativeModel.tool_config
  • GenerativeModel.tools

The following code example shows how to use a context cache with a tuned Gemini model.

REST

You can use REST to specify a context cache with a prompt by using the Vertex AI API to send a POST request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region where the request to create the context cache was processed.
  • ENDPOINT_ID: The endpoint where the fine-tuned model is deployed.
  • MIME_TYPE: The text prompt to submit to the model.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent

Request JSON body:

{
  "cachedContent": "projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID",
  "contents": [
      {"role":"user","parts":[{"text":"PROMPT_TEXT"}]}
  ],
  "generationConfig": {
      "maxOutputTokens": 8192,
      "temperature": 1,
      "topP": 0.95,
  },
  "safetySettings": [
      {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_HARASSMENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      }
  ],
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Example curl command

LOCATION="us-central1"
PROJECT_ID="test-project"
ENDPOINT_ID=987654321

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:generateContent" -d \
'{
  "cachedContent": "projects/${PROJECT_NUMBER}/locations/${LOCATION}/cachedContents/${CACHE_ID}",
  "contents": [
      {"role":"user","parts":[{"text":"What are the benefits of exercise?"}]}
  ],
  "generationConfig": {
      "maxOutputTokens": 8192,
      "temperature": 1,
      "topP": 0.95,
  },
  "safetySettings": [
    {
      "category": "HARM_CATEGORY_HATE_SPEECH",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_HARASSMENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
  ],
}'