Context caching overview

Context caching helps reduce the cost and latency of requests to Gemini that contain repeated content. Vertex AI offers two types of caching:

  • Implicit caching: Automatic caching, enabled by default, that provides cost savings when cache hits occur.
  • Explicit caching: Manual caching enabled through the Vertex AI API, where you explicitly declare the content you want to cache and whether your prompts should refer to the cached content.

For both implicit and explicit caching, the cachedContentTokenCount field in your response's metadata indicates the number of tokens in the cached part of your input. Caching requests must contain a minimum of 2,048 tokens.
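
For example, here is a minimal sketch of reading that field, assuming the Google Gen AI SDK for Python (the google-genai package); the project, location, and model name are placeholders:

```python
from google import genai

# Placeholder project, location, and model; replace with your own values.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize the attached onboarding guide for a new engineer.",
)

usage = response.usage_metadata
# cached_content_token_count is only populated when a cache hit occurred.
print("prompt tokens:", usage.prompt_token_count)
print("cached tokens:", usage.cached_content_token_count)
```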

Both implicit and explicit caching are supported when using the following models:

For both implicit and explicit caching, there is no additional charge to write to cache other than the standard input token costs. For explicit caching, there are storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see Vertex AI pricing.

Implicit caching

All Google Cloud projects have implicit caching enabled by default. Implicit caching provides a 75% discount on cached tokens compared to standard input tokens.

When enabled, implicit cache hit cost savings are automatically passed on to you. To increase the chances of an implicit cache hit:

  • Place large and common contents at the beginning of your prompt.
  • Send requests with a similar prefix in a short amount of time, as in the sketch after this list.
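
This sketch assumes the Google Gen AI SDK for Python; the project, model, and policy.txt file are placeholders:

```python
from google import genai

# Placeholder project, location, and model; replace with your own values.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")
model = "gemini-2.0-flash-001"

# A large, reusable block of context (for example, a long policy document).
# Placing it at the start of every prompt gives the requests a common prefix.
shared_context = open("policy.txt").read()

questions = [
    "What is the refund policy?",
    "How long is the warranty period?",
]

# Sending prompts with the same prefix close together in time increases the
# chance that later requests hit the implicit cache created by earlier ones.
for question in questions:
    response = client.models.generate_content(
        model=model,
        contents=shared_context + "\n\n" + question,
    )
    print(response.usage_metadata.cached_content_token_count)
```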

Explicit caching

Explicit caching offers more control and ensures a 75% discount when explicit caches are referenced.

Using the Vertex AI API, you can create a context cache, reference it in prompt requests, update its expiration time, and delete it. You can also use the Vertex AI API to retrieve information about a context cache.
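
The following is a minimal sketch of these operations, assuming the Google Gen AI SDK for Python; the project, model, display name, and document file are placeholders, and the cached content must still meet the minimum token count:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")
model = "gemini-2.0-flash-001"

# Create an explicit cache from a large document and a system instruction.
cache = client.caches.create(
    model=model,
    config=types.CreateCachedContentConfig(
        display_name="policy-docs-cache",
        system_instruction="Answer questions using only the cached documents.",
        contents=[open("large_policy_document.txt").read()],
        ttl="3600s",  # keep the cache for one hour
    ),
)

# Retrieve information about the cache.
info = client.caches.get(name=cache.name)
print(info.name, info.expire_time)

# Update the expiration time.
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)

# Reference the cache from a prompt request.
response = client.models.generate_content(
    model=model,
    contents="What is the refund policy?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)

# Delete the cache when it is no longer needed.
client.caches.delete(name=cache.name)
```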

Explicit caches interact with implicit caching, which can result in caching beyond the contents you specify when creating a cache. To prevent cache data retention, disable implicit caching and avoid creating explicit caches. For more information, see Enable and disable caching.

When to use context caching

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.

Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache also include text unique to each prompt. For example, each prompt request in a chat conversation might include the same context cache that references a video, along with unique text for each turn in the chat.
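
As an illustration, the following sketch reuses one existing explicit cache across chat turns, assuming the Google Gen AI SDK for Python; the project, cache name, and questions are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Resource name of an existing explicit cache (placeholder), for example one
# created from a long video file.
cache_name = "projects/my-project/locations/us-central1/cachedContents/1234567890"

# Every turn reuses the same cached context; only the text of each turn changes.
chat = client.chats.create(
    model="gemini-2.0-flash-001",
    config=types.GenerateContentConfig(cached_content=cache_name),
)

for turn in ["Summarize the first ten minutes.", "Who is speaking at 12:30?"]:
    reply = chat.send_message(turn)
    print(reply.text)
```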

Consider using context caching for use cases such as:

  • Chatbots with extensive system instructions
  • Repetitive analysis of lengthy video files
  • Recurring queries against large document sets
  • Frequent code repository analysis or bug fixing

Context caching support for Provisioned Throughput is in Preview and applies only to implicit caching; explicit caching is not supported with Provisioned Throughput. Refer to the Provisioned Throughput guide for more details.

Availability

Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.

VPC Service Controls support

Context caching supports VPC Service Controls, meaning your cache cannot be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, include your bucket in your service perimeter as well to protect your cache content.
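
As an illustration of building a cache from an object in a Cloud Storage bucket, here is a sketch assuming the Google Gen AI SDK for Python; the project, bucket, object, and model names are placeholders, and the bucket would need to be inside the same service perimeter:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache a large PDF stored in a Cloud Storage bucket inside the perimeter.
cache = client.caches.create(
    model="gemini-2.0-flash-001",
    config=types.CreateCachedContentConfig(
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(
                        file_uri="gs://my-perimeter-bucket/contracts/master-agreement.pdf",
                        mime_type="application/pdf",
                    )
                ],
            )
        ],
        ttl="3600s",
    ),
)
print(cache.name)
```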

For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.

What's next