Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
This page shows you how to prepare audio data for supervised fine-tuning of Gemini models. It covers the following topics:
Use cases: Learn about common applications for audio model tuning.
Limitations: Understand the limitations for audio tuning, such as maximum file size and length.
Dataset format: Review the required JSONL format for your audio tuning dataset.
Use cases
Tuning an audio model can enhance its performance for specific tasks. Common use cases include the following:
Enhanced voice assistants: Develop voice-activated systems for food ordering and delivery.
Audio content analysis: Generate accurate transcripts in noisy environments, summarize key points from podcasts, or classify music by genre and mood.
Accessibility and assistive technologies: Provide real-time captioning for events, develop voice-controlled applications, or create language learning tools with personalized pronunciation feedback.
Limitations
Limitations are specified separately for Gemini 2.5 models and for Gemini 2.0 Flash and Gemini 2.0 Flash-Lite.
To learn more about audio sample requirements, see the Audio understanding (speech only) page.
Dataset format
The fileUri for your dataset can be the URI for a file in a Cloud Storage bucket, or it can be a publicly available HTTP or HTTPS URL.
To see the generic format example, see Dataset example for Gemini.
The following is an example of an audio dataset.
{"contents":[{"role":"user","parts":[{"fileData":{"mimeType":"audio/mpeg","fileUri":"gs://cloud-samples-data/generative-ai/audio/pixel.mp3"}},{"text":"Please summarize the conversation in one sentence."}]},{"role":"model","parts":[{"text":"The podcast episode features two product managers for Pixel devices discussing the new features coming to Pixel phones and watches."}]}]}
What's next
To learn more about the Gemini audio understanding model, see Audio understanding (speech only).
To start tuning, see Tune Gemini models by using supervised fine-tuning.
To learn how supervised fine-tuning can be used in a solution that builds a generative AI knowledge base, see Jump Start Solution: Generative AI knowledge base.
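As a rough illustration of the dataset format described on this page, the following Python sketch builds records with the same structure as the audio dataset example and writes them as JSONL, one JSON object per line. The helper names (make_audio_example, to_jsonl, check_line) are hypothetical and are not part of any Vertex AI SDK. The checks reflect only what this page states (a fileUri is a Cloud Storage URI or an HTTP or HTTPS URL) plus the assumption that roles are limited to the "user" and "model" values seen in the example. It is not an official validator.

```python
import json


def make_audio_example(file_uri, mime_type, prompt, response):
    """Build one record shaped like the audio dataset example on this page."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": mime_type, "fileUri": file_uri}},
                    {"text": prompt},
                ],
            },
            {"role": "model", "parts": [{"text": response}]},
        ]
    }


def to_jsonl(examples):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in examples)


def check_line(line):
    """Hypothetical sanity check for one JSONL line; returns a list of problems."""
    problems = []
    record = json.loads(line)
    contents = record.get("contents") or []
    if not contents:
        problems.append("missing or empty 'contents'")
    for turn in contents:
        # Assumption: only the roles that appear in the page's example are expected.
        if turn.get("role") not in ("user", "model"):
            problems.append(f"unexpected role: {turn.get('role')!r}")
        for part in turn.get("parts", []):
            uri = part.get("fileData", {}).get("fileUri")
            # Per this page: a Cloud Storage URI or a public HTTP/HTTPS URL.
            if uri is not None and not uri.startswith(("gs://", "http://", "https://")):
                problems.append(f"unsupported fileUri scheme: {uri!r}")
    return problems


if __name__ == "__main__":
    example = make_audio_example(
        "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
        "audio/mpeg",
        "Please summarize the conversation in one sentence.",
        "Two product managers discuss new Pixel features.",
    )
    for line in to_jsonl([example]).splitlines():
        print(check_line(line) or "ok")
```

Running a check like this over a dataset file before uploading it can catch malformed lines early. It does not replace the validation that the tuning service performs.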