Audio Tuning

This page explains how to prepare audio data for supervised fine-tuning of Gemini models. It covers the following topics:

  • Use cases: Learn about common applications for audio model tuning.
  • Limitations: Understand the limitations for audio tuning, such as maximum file size and length.
  • Dataset format: Review the required JSONL format for your audio tuning dataset.

Use cases

Tuning an audio model can enhance its performance for specific tasks. Common use cases include the following:

  • Enhanced voice assistants: Develop voice-activated systems for food ordering and delivery.
  • Audio content analysis: Generate accurate transcripts in noisy environments, summarize key points from podcasts, or classify music by genre and mood.
  • Accessibility and assistive technologies: Provide real-time captioning for events, develop voice-controlled applications, or create language learning tools with personalized pronunciation feedback.

Limitations

Gemini 2.5 models

Specification                      Value
Maximum audio length per example   60 minutes
Maximum audio files per example    1
Maximum audio file size            100 MB

Gemini 2.0 Flash and Gemini 2.0 Flash-Lite

Specification                      Value
Maximum audio length per example   60 minutes
Maximum audio files per example    1
Maximum audio file size            100 MB

To learn more about audio sample requirements, see Audio understanding (speech only).
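
The limits above can be checked locally before you upload a file. The following is a minimal sketch, not an official validator. It assumes WAV input, because the Python standard library can read WAV durations without extra dependencies, and it assumes that "100 MB" means 100,000,000 bytes. The function names check_audio_limits and wav_duration_seconds are hypothetical.

```python
import os
import wave

# Limits taken from the tables above.
# Assumption: "100 MB" is read as decimal megabytes (100,000,000 bytes).
MAX_BYTES = 100 * 1000 * 1000
MAX_SECONDS = 60 * 60  # 60 minutes


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio_limits(path, max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS):
    """Return a list of limit violations for one local WAV file.

    An empty list means the file is within both limits.
    """
    problems = []
    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file size {size} bytes exceeds {max_bytes} bytes")
    duration = wav_duration_seconds(path)
    if duration > max_seconds:
        problems.append(f"duration {duration:.1f} s exceeds {max_seconds} s")
    return problems
```

Other formats, such as MP3, need a third-party library to measure duration, so this sketch covers only the WAV case.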

Dataset format

The fileUri for your dataset can be the URI for a file in a Cloud Storage bucket or a publicly available HTTP or HTTPS URL.
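
As a rough illustration of the rule above, the following sketch checks that a fileUri uses one of the accepted schemes. The helper name is_supported_file_uri is hypothetical, and the check is purely syntactic: it cannot confirm that a Cloud Storage object exists or that an HTTP or HTTPS URL is publicly reachable.

```python
from urllib.parse import urlparse

# Schemes described above: Cloud Storage (gs) and HTTP or HTTPS URLs.
ALLOWED_SCHEMES = {"gs", "http", "https"}


def is_supported_file_uri(uri):
    """Return True if the URI has an accepted scheme and a bucket or host."""
    parsed = urlparse(uri)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)
```

For example, is_supported_file_uri("gs://cloud-samples-data/generative-ai/audio/pixel.mp3") returns True, while a local path such as "/tmp/pixel.mp3" returns False.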

For a generic format example, see Dataset example for Gemini.

The following example shows an audio dataset.

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "audio/mpeg",
            "fileUri": "gs://cloud-samples-data/generative-ai/audio/pixel.mp3"
          }
        },
        {
          "text": "Please summarize the conversation in one sentence."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "The podcast episode features two product managers for Pixel devices discussing the new features coming to Pixel phones and watches."
        }
      ]
    }
  ]
}
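
The example above is shown across several lines for readability. In a JSONL file, each example occupies exactly one line. The following sketch shows one way to build records with the same structure and serialize them as JSONL. The helper names make_audio_example, count_audio_parts, and to_jsonl are hypothetical and are not part of any Google SDK.

```python
import json


def make_audio_example(file_uri, mime_type, prompt, response):
    """Build one tuning example with the structure shown above."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": mime_type, "fileUri": file_uri}},
                    {"text": prompt},
                ],
            },
            {"role": "model", "parts": [{"text": response}]},
        ]
    }


def count_audio_parts(example):
    """Count fileData parts with an audio MIME type (the limit is 1 per example)."""
    return sum(
        1
        for turn in example["contents"]
        for part in turn["parts"]
        if part.get("fileData", {}).get("mimeType", "").startswith("audio/")
    )


def to_jsonl(examples):
    """Serialize examples as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(example) + "\n" for example in examples)
```

A possible usage: build each record with make_audio_example, reject any record where count_audio_parts does not return 1, and then write the string returned by to_jsonl to a .jsonl file.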

What's next