This guide shows you how to use context caching with your fine-tuned Gemini models to improve performance and reduce costs for prompts that include large amounts of context. This page describes the following:
- Supported models: Learn which fine-tuned models support context caching.
- Prerequisites: Set up your environment and fine-tuned model before you begin.
- Create a context cache for a fine-tuned model: Follow the steps to create a cache specifically for your fine-tuned model.
- Use a context cache for a fine-tuned model: Learn how to send requests to your fine-tuned model using the created cache.
By caching frequently used context, you avoid re-sending large amounts of data with each request to your fine-tuned model.
The process for managing (Read, Update, Delete) a context cache is the same for both fine-tuned and base Gemini models. This page focuses on the specific steps for creating and using a cache with a fine-tuned model.
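For orientation, the shared management calls look like the following minimal curl sketch. The variables are placeholders, and CACHE_ID is the ID returned when you create the cache; updates (for example, changing the expiration) use PATCH on the same resource.
# Get metadata for an existing context cache:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}"
# Delete a context cache you no longer need:
curl -X DELETE \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}"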
Supported models
Context caching is supported for models fine-tuned from the following base model:
gemini-2.0-flash-001
Prerequisites
Before you begin, you need the following:
- A deployed fine-tuned model: You need a deployed fine-tuned Gemini model based on a supported base model. For details, see Fine-tune a Gemini model.
- Model information:
- The ID and the version of the tuned Gemini model.
- The endpoint resource name for the deployed fine-tuned model. To get the endpoint, see Deploy a tuned model.
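If you don't have the endpoint resource name at hand, one way to look it up is with the gcloud CLI. This is a sketch; the display-name filter my-tuned-model is a placeholder for your own deployment:
gcloud ai endpoints list \
--region=us-central1 \
--filter="displayName:my-tuned-model" \
--format="value(name)"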
Create a context cache for a fine-tuned model
The process for creating a context cache for a fine-tuned model is similar to the process for a base model. For the general steps, see Create a context cache.
The key difference is that you must specify your fine-tuned model version in the model field instead of the base model. Use the format projects/{PROJECT}/locations/{LOCATION}/models/{MODEL}@{VERSION}.
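For example, a filled-in model resource name might look like the following (all values are placeholders):
projects/my-project/locations/us-central1/models/1234567890@1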
The following example shows how to create a context cache with a tuned Gemini model.
REST
You can use REST to create a context cache by using the Vertex AI API to send a POST request to the cachedContents endpoint. The following example shows how to create a context cache using a file stored in a Cloud Storage bucket.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request and where the cached content is stored. For a list of supported regions, see Available regions.
- MODEL_ID: The fine-tuned Gemini model ID.
- MODEL_VERSION: The fine-tuned Gemini model version.
- CACHE_DISPLAY_NAME: A meaningful display name that describes the context cache and helps you identify it.
- MIME_TYPE: The MIME type of the content to cache.
- CONTENT_TO_CACHE_URI: The Cloud Storage URI of the content to cache.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents
Request JSON body:
{ "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID@MODEL_VERSION", "displayName": "CACHE_DISPLAY_NAME", "contents": [{ "role": "user", "parts": [{ "fileData": { "mimeType": "MIME_TYPE", "fileUri": "CONTENT_TO_CACHE_URI" } }] }, { "role": "model", "parts": [{ "text": "This is sample text to demonstrate explicit caching." }] }] }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/cachedContents" | Select-Object -Expand Content
If the request succeeds, the response includes the resource name of the new context cache in the form projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID. You need this name to use the cache.
Example curl command
LOCATION="us-central1"
MODEL_ID="model-id"
PROJECT_ID="test-project"
MODEL_VERSION=1
MIME_TYPE="video/mp4"
CACHED_CONTENT_URI="gs://path-to-bucket/video-file-name.mp4"
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents" \
-d @- <<EOF
{
  "model": "projects/${PROJECT_ID}/locations/${LOCATION}/models/${MODEL_ID}@${MODEL_VERSION}",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "${MIME_TYPE}",
            "fileUri": "${CACHED_CONTENT_URI}"
          }
        }
      ]
    }
  ]
}
EOF
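After the cache is created, you typically need its ID for later requests. A small follow-up sketch, assuming jq is installed and the create response was saved to a file named response.json containing the resource name described above:
# Extract the cache ID (the last path segment of the resource name):
CACHE_NAME=$(jq -r '.name' response.json)
CACHE_ID="${CACHE_NAME##*/}"
echo "${CACHE_ID}"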
Use a context cache for a fine-tuned model
The process for using a context cache with a fine-tuned model is similar to the process for a base model. For the general steps, see Use a context cache.
The key difference is that you send the request to your deployed fine-tuned model's endpoint (.../endpoints/{ENDPOINT_ID}) instead of the base model endpoint (.../publishers/google/models/{MODEL}).
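Concretely, the two request URL forms differ as follows, where ENDPOINT_ID is the numeric ID of your deployment:
# Base Gemini model (publisher endpoint):
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL:generateContent
# Fine-tuned model (deployed model endpoint):
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent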
When you use a context cache, you can't specify the following properties in your request:
- GenerativeModel.system_instructions
- GenerativeModel.tool_config
- GenerativeModel.tools
The following code example shows how to use a context cache with a tuned Gemini model.
REST
You can use REST to specify a context cache with a prompt by using the Vertex AI API to send a POST request to the deployed fine-tuned model's endpoint.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region where the request to create the context cache was processed.
- ENDPOINT_ID: The endpoint where the fine-tuned model is deployed.
- PROJECT_NUMBER: Your project number.
- CACHE_ID: The ID of the context cache, which is the last segment of the resource name returned when you created the cache.
- PROMPT_TEXT: The text prompt to submit to the model.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent
Request JSON body:
{ "cachedContent": "projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID", "contents": [ {"role":"user","parts":[{"text":"PROMPT_TEXT"}]} ], "generationConfig": { "maxOutputTokens": 8192, "temperature": 1, "topP": 0.95, }, "safetySettings": [ { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" }, { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE" } ], }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:generateContent" | Select-Object -Expand Content
If the request succeeds, the response contains the model's generated content.
Example curl command
LOCATION="us-central1"
PROJECT_ID="test-project"
ENDPOINT_ID=987654321
PROJECT_NUMBER="123456789"
CACHE_ID="cache-id"
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:generateContent" \
-d @- <<EOF
{
  "cachedContent": "projects/${PROJECT_NUMBER}/locations/${LOCATION}/cachedContents/${CACHE_ID}",
  "contents": [
    {"role": "user", "parts": [{"text": "What are the benefits of exercise?"}]}
  ],
  "generationConfig": {
    "maxOutputTokens": 8192,
    "temperature": 1,
    "topP": 0.95
  },
  "safetySettings": [
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
  ]
}
EOF
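Optionally, you can pull just the generated text out of the response. A sketch assuming jq is installed and the standard generateContent response shape (candidates, content, parts, text):
# If you save the response to response.json, print only the first candidate's text:
jq -r '.candidates[0].content.parts[0].text' response.json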