This page provides prerequisites and detailed instructions for fine-tuning
Gemini on video data using supervised learning.

Supported models
The following Gemini models support video tuning:

Use cases
Fine-tuning lets you adapt base Gemini models for specialized tasks.
Here are some video use cases:

Automated video summarization: Tuning LLMs to generate concise and
coherent summaries of long videos, capturing the main themes, events, and
narratives. This is useful for content discovery, archiving, and quick
reviews.

Detailed event recognition and localization: Fine-tuning allows LLMs to
identify and pinpoint specific actions, events, or objects within a video
timeline with greater accuracy. For example, identifying all instances of a
particular product in a marketing video or a specific action in sports
footage.

Content moderation: Specialized tuning can improve an LLM's ability to
detect sensitive, inappropriate, or policy-violating content within videos,
going beyond simple object detection to understand context and nuance.

Video captioning and subtitling: While already a common application,
tuning can improve the accuracy, fluency, and context-awareness of
automatically generated captions and subtitles, including descriptions of
nonverbal cues.

Limitations
The maximum video length depends on the media resolution: the limit is
20 minutes with MEDIA_RESOLUTION_LOW, and a separate limit applies with
MEDIA_RESOLUTION_MEDIUM.

The mediaResolution for each example in the entire training dataset must be
consistent. All lines in the JSONL files used for training and validation
must have the same mediaResolution value.

Dataset format
The fileUri field specifies the location of the video file. It can be the URI
for a file in a Cloud Storage bucket, or it can be a publicly available HTTP
or HTTPS URL.

The mediaResolution field specifies the token count per frame for
the input videos, as one of the following values:
MEDIA_RESOLUTION_LOW: 64 tokens per frame
MEDIA_RESOLUTION_MEDIUM: 256 tokens per frame

Model tuning with MEDIA_RESOLUTION_LOW is roughly 4 times faster than tuning
with MEDIA_RESOLUTION_MEDIUM, which brings only a minimal performance
improvement.

When a video segment is used for training and validation, the video segment
is specified in the videoMetadata field. During tuning, this data point is
decoded to contain information from the segment extracted from the specified
video file, starting from timestamp startOffset (the start offset, in seconds)
until endOffset.

To see the generic format example, see
Dataset example for Gemini.

The following sections present video dataset format examples. Each schema is
added as a single line in the JSONL file.

JSON schema example for cases where the full video is used for training and validation
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "gs://<path to the mp4 video file>",
            "mimeType": "video/mp4"
          }
        },
        {
          "text": "You are a video analysis expert. Detect which animal appears in the video. The video can only have one of the following animals: dog, cat, rabbit.\nOutput Format:\nGenerate output in the following JSON format:\n[{\n  \"animal_name\": \"<CATEGORY>\"\n}]\n"
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "```json\n[{\"animal_name\": \"dog\"}]\n```"
        }
      ]
    }
  ],
  "generationConfig": {
    "mediaResolution": "MEDIA_RESOLUTION_LOW"
  }
}
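Because each example must occupy a single line of the JSONL file, it can help to build the record as a data structure and let a JSON library serialize it. The following is a minimal Python sketch, not an official utility: the helper names make_example and to_jsonl_line and the bucket path are hypothetical, and only the field layout is taken from the example above.

```python
import json


def make_example(file_uri, prompt, answer, media_resolution="MEDIA_RESOLUTION_LOW"):
    """Build one full-video training example using the field layout shown above."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"fileUri": file_uri, "mimeType": "video/mp4"}},
                    {"text": prompt},
                ],
            },
            {"role": "model", "parts": [{"text": answer}]},
        ],
        "generationConfig": {"mediaResolution": media_resolution},
    }


def to_jsonl_line(example):
    """Serialize one example to a single physical line."""
    # json.dumps escapes newlines inside strings as \n, so a multi-line
    # prompt still produces exactly one line of output.
    return json.dumps(example)


example = make_example(
    "gs://my-bucket/videos/clip.mp4",  # hypothetical path, for illustration only
    "Detect which animal appears in the video.\nAnswer with dog, cat, or rabbit.",
    "dog",
)
line = to_jsonl_line(example)
```

Writing one such line per example, each followed by a newline character, yields a JSONL file in the shape this page describes.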
JSON schema example for cases where a video segment is used for training and validation
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "gs://<path to the mp4 video file>",
            "mimeType": "video/mp4"
          },
          "videoMetadata": {
            "startOffset": "5s",
            "endOffset": "25s"
          }
        },
        {
          "text": "You are a video analysis expert. Detect which animal appears in the video. The video can only have one of the following animals: dog, cat, rabbit.\nOutput Format:\nGenerate output in the following JSON format:\n[{\n  \"animal_name\": \"<CATEGORY>\"\n}]\n"
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "```json\n[{\"animal_name\": \"dog\"}]\n```"
        }
      ]
    }
  ],
  "generationConfig": {
    "mediaResolution": "MEDIA_RESOLUTION_LOW"
  }
}
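Before submitting a tuning job, the two constraints described on this page can be checked locally: every line must share one mediaResolution value, and a segment's startOffset must come before its endOffset. The following Python sketch illustrates one way to do that. The function names parse_offset and check_dataset are hypothetical, and the offset parsing assumes values written as a number of seconds followed by "s", as in the example above.

```python
import json


def parse_offset(value):
    """Convert an offset such as "5s" or "2.5s" to seconds as a float."""
    if not value.endswith("s"):
        raise ValueError(f"offset must end with 's': {value!r}")
    return float(value[:-1])


def check_dataset(lines):
    """Return a list of problems found in an iterable of JSONL lines."""
    problems = []
    resolutions = set()
    for number, line in enumerate(lines, start=1):
        example = json.loads(line)
        resolutions.add(example.get("generationConfig", {}).get("mediaResolution"))
        for content in example["contents"]:
            for part in content["parts"]:
                meta = part.get("videoMetadata")
                if meta is None:
                    continue  # full-video example, nothing to check
                start = parse_offset(meta["startOffset"])
                end = parse_offset(meta["endOffset"])
                if not 0 <= start < end:
                    problems.append(f"line {number}: invalid segment {start}s to {end}s")
    if len(resolutions) > 1:
        found = sorted(map(str, resolutions))
        problems.append(f"inconsistent mediaResolution values: {found}")
    return problems


def _segment_line(start, end, resolution):
    """Build a minimal segment example for the demonstration below."""
    part = {
        "fileData": {"fileUri": "gs://my-bucket/clip.mp4", "mimeType": "video/mp4"},
        "videoMetadata": {"startOffset": start, "endOffset": end},
    }
    return json.dumps({
        "contents": [{"role": "user", "parts": [part]}],
        "generationConfig": {"mediaResolution": resolution},
    })


good = _segment_line("5s", "25s", "MEDIA_RESOLUTION_LOW")
bad = _segment_line("30s", "25s", "MEDIA_RESOLUTION_MEDIUM")
report = check_dataset([good, bad])
```

Here report contains two entries: one for the reversed segment on the second line and one for the mixed mediaResolution values. An empty list means neither problem was found; it does not guarantee that the tuning service will accept the file.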
Last updated 2025-08-21 UTC.