Gemini-TTS

Gemini-TTS is the latest evolution of our Text-to-Speech technology. It moves beyond naturalness to give granular control over generated audio through text-based prompts. With Gemini-TTS, you can synthesize speech from short snippets to long-form narratives and precisely dictate style, accent, pace, tone, and even emotional expression, all steerable through natural-language prompts.

Gemini-TTS capabilities are supported by the following models:

  • gemini-2.5-flash-preview-tts: Gemini 2.5 Flash Preview is suited to cost-efficient everyday applications.

  • gemini-2.5-pro-preview-tts: Gemini 2.5 Pro Preview is suited to highly controllable speech generation and to state-of-the-art quality on complex prompts.

Model | Optimized for | Input modality | Output modality | Single speaker
Gemini 2.5 Flash Preview TTS | Low latency, controllable, single- and multi-speaker Text-to-Speech audio generation for cost-efficient everyday applications | Text | Audio | ✔️
Gemini 2.5 Pro Preview TTS | High control for structured workflows like podcast generation, audiobooks, customer support, and more | Text | Audio | ✔️

Additional controls and capabilities include the following:

  1. Natural conversation: Voice interactions of remarkable quality, more appropriate expressivity, and prosody (patterns of rhythm) are delivered with very low latency so you can converse fluidly.

  2. Style control: Using natural language prompts, you can adapt the delivery within the conversation by steering it to adopt specific accents and produce a range of tones and expressions including a whisper.

  3. Dynamic performance: These models can bring text to life for expressive readings of poetry, newscasts, and engaging storytelling. They can also perform with specific emotions and produce accents when requested.

  4. Enhanced pace and pronunciation control: Controlling delivery speed helps ensure more accurate pronunciation, including of specific words.

Examples

model: "gemini-2.5-pro-preview-tts"
prompt: "You are having a casual conversation with a friend. Say the following in a friendly and amused way."
text: "hahah I did NOT expect that. Can you believe it!"
speaker: "Callirrhoe"

model: "gemini-2.5-flash-preview-tts"
prompt: "Say the following in a curious way"
text: "OK, so... tell me about this [uhm] AI thing."
speaker: "Orus"

model: "gemini-2.5-flash-preview-tts"
prompt: "Say the following"
text: "[extremely fast] Availability and terms may vary. Check our website or your local store for complete details and restrictions."
speaker: "Kore"

See the Use Gemini-TTS section for details on how to use these voices programmatically.
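The three examples above share the same four fields (model, prompt, text, speaker). As a minimal sketch, they could be held as plain records before being passed to a synthesis call. The names `TtsExample`, `EXAMPLES`, and `bracketed_cues` are illustrative and not part of any Google library, and the speaker spelling follows the voice table. The helper only extracts bracketed cues such as [uhm] from the text; how the model interprets them is not specified here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsExample:
    """One example request (illustrative container, not a library type)."""
    model: str
    prompt: str
    text: str
    speaker: str


# The three examples from this page, expressed as records.
EXAMPLES = [
    TtsExample(
        model="gemini-2.5-pro-preview-tts",
        prompt=("You are having a casual conversation with a friend. "
                "Say the following in a friendly and amused way."),
        text="hahah I did NOT expect that. Can you believe it!",
        speaker="Callirrhoe",
    ),
    TtsExample(
        model="gemini-2.5-flash-preview-tts",
        prompt="Say the following in a curious way",
        text="OK, so... tell me about this [uhm] AI thing.",
        speaker="Orus",
    ),
    TtsExample(
        model="gemini-2.5-flash-preview-tts",
        prompt="Say the following",
        text=("[extremely fast] Availability and terms may vary. Check our "
              "website or your local store for complete details and "
              "restrictions."),
        speaker="Kore",
    ),
]


def bracketed_cues(text: str) -> list[str]:
    """Return the contents of bracketed cues found in the text, in order."""
    return re.findall(r"\[([^\]]+)\]", text)


if __name__ == "__main__":
    for example in EXAMPLES:
        print(example.speaker, example.model, bracketed_cues(example.text))
```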

Voice Options

Gemini-TTS offers a wide range of voice options similar to our existing Chirp 3: HD Voices, each with distinct characteristics:

Name Gender
Achernar Female
Achird Male
Algenib Male
Algieba Male
Alnilam Male
Aoede Female
Autonoe Female
Callirrhoe Female
Charon Male
Despina Female
Enceladus Male
Erinome Female
Fenrir Male
Gacrux Female
Iapetus Male
Kore Female
Laomedeia Female
Leda Female
Orus Male
Pulcherrima Female
Puck Male
Rasalgethi Male
Sadachbia Male
Sadaltager Male
Schedar Male
Sulafat Female
Umbriel Male
Vindemiatrix Female
Zephyr Female
Zubenelgenubi Male
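
Voice names are easy to misspell, so it can help to check a name against the table before sending a request. The following is a small sketch that copies the table above into a dictionary. `VOICES`, `voices_by_gender`, and `validate_voice` are hypothetical helper names, and the check is purely local (it does not query the API).

```python
import difflib

# Voice name -> gender, copied from the table above.
VOICES = {
    "Achernar": "Female", "Achird": "Male", "Algenib": "Male",
    "Algieba": "Male", "Alnilam": "Male", "Aoede": "Female",
    "Autonoe": "Female", "Callirrhoe": "Female", "Charon": "Male",
    "Despina": "Female", "Enceladus": "Male", "Erinome": "Female",
    "Fenrir": "Male", "Gacrux": "Female", "Iapetus": "Male",
    "Kore": "Female", "Laomedeia": "Female", "Leda": "Female",
    "Orus": "Male", "Pulcherrima": "Female", "Puck": "Male",
    "Rasalgethi": "Male", "Sadachbia": "Male", "Sadaltager": "Male",
    "Schedar": "Male", "Sulafat": "Female", "Umbriel": "Male",
    "Vindemiatrix": "Female", "Zephyr": "Female", "Zubenelgenubi": "Male",
}


def voices_by_gender(gender: str) -> list[str]:
    """Return the voice names listed with the given gender, sorted by name."""
    return sorted(name for name, g in VOICES.items() if g == gender)


def validate_voice(name: str) -> str:
    """Return the name if it is in the table, else raise with a suggestion."""
    if name in VOICES:
        return name
    close = difflib.get_close_matches(name, list(VOICES), n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown voice {name!r}.{hint}")


if __name__ == "__main__":
    print(voices_by_gender("Female"))
    print(validate_voice("Kore"))
```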

Language availability

Gemini-TTS supports the following languages:

Language BCP-47 Code
English (United States) en-US

Regional availability

Gemini-TTS models are available in the following Google Cloud regions:

Google Cloud region Launch readiness
us Public Preview

Supported output formats

The default response format is LINEAR16. Other supported formats include the following:

API method Format
batch ALAW, MULAW, MP3, OGG_OPUS, and PCM
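
When the output path is chosen by the caller, one option is to derive the encoding name from the file extension. This is a sketch only: the encoding names come from the list above, while the extension mapping (`EXTENSION_TO_ENCODING`) and the helper `encoding_for` are assumptions made for illustration, not part of the API.

```python
from pathlib import Path

# Encoding names listed in this section (LINEAR16 is the stated default).
SUPPORTED_ENCODINGS = {"LINEAR16", "ALAW", "MULAW", "MP3", "OGG_OPUS", "PCM"}

# Assumed extension conventions; adjust to match your own pipeline.
EXTENSION_TO_ENCODING = {
    ".wav": "LINEAR16",
    ".mp3": "MP3",
    ".ogg": "OGG_OPUS",
    ".pcm": "PCM",
}


def encoding_for(path: str, default: str = "LINEAR16") -> str:
    """Pick an encoding name from a file extension, falling back to default."""
    suffix = Path(path).suffix.lower()
    name = EXTENSION_TO_ENCODING.get(suffix, default)
    if name not in SUPPORTED_ENCODINGS:
        raise ValueError(f"Unsupported encoding: {name!r}")
    return name


if __name__ == "__main__":
    for candidate in ("output.mp3", "clip.OGG", "narration"):
        print(candidate, "->", encoding_for(candidate))
```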

Use Gemini-TTS

This section shows how to use Gemini-TTS models to synthesize single-speaker speech from up to 800 characters of text.
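
Longer passages need to be split before synthesis to stay within the 800-character limit. The sketch below splits on sentence boundaries and packs sentences into chunks. It assumes the limit applies to the text field alone and is measured in Python characters (`len`); both points are assumptions, and `split_for_tts` is a hypothetical helper name.

```python
import re

MAX_CHARS = 800  # Limit stated above; measuring it with len() is an assumption.


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_text = "This is one sentence of a long narrative. " * 60
    for index, chunk in enumerate(split_for_tts(long_text), start=1):
        print(index, len(chunk))
```

Each chunk can then be sent as a separate request and the resulting audio concatenated by the caller.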

Perform a synchronous speech synthesis request

Python

import os
from google.cloud import texttospeech

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")

def synthesize(prompt: str, text: str, model_name: str, output_filepath: str = "output.mp3"):
   """Synthesizes speech from the input text and saves it to an MP3 file.

   Args:
       prompt: Styling instructions on how to synthesize the content in
         the text field.
       text: The text to synthesize.
       model_name: Gemini model to use. Currently, the available models are
         gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts.
       output_filepath: The path to save the generated audio file.
         Defaults to "output.mp3".
   """
   client = texttospeech.TextToSpeechClient()

   synthesis_input = texttospeech.SynthesisInput(text=text, prompt=prompt)

   # Select the voice you want to use.
   voice = texttospeech.VoiceSelectionParams(
       language_code="en-US",
       name="Charon",  # Example voice, adjust as needed
       model_name=model_name
   )

   audio_config = texttospeech.AudioConfig(
       audio_encoding=texttospeech.AudioEncoding.MP3
   )

   # Perform the text-to-speech request on the text input with the selected
   # voice parameters and audio file type.
   response = client.synthesize_speech(
       input=synthesis_input, voice=voice, audio_config=audio_config
   )

   # The response's audio_content is binary.
   with open(output_filepath, "wb") as out:
       out.write(response.audio_content)
       print(f"Audio content written to file: {output_filepath}")
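
To synthesize several prompt and text pairs in one run, the `synthesize` function above can be wrapped in a small driver. This is a sketch under assumptions: `synthesize_all` and the `clip_NN.mp3` naming scheme are hypothetical, and the function takes the synthesis callable as an argument so that it can be exercised without network access.

```python
from typing import Callable, Iterable


def synthesize_all(
    synthesize_fn: Callable[..., None],
    jobs: Iterable[tuple[str, str]],
    model_name: str = "gemini-2.5-flash-preview-tts",
    prefix: str = "clip",
) -> list[str]:
    """Call synthesize_fn once per (prompt, text) pair; return the file paths."""
    paths = []
    for index, (prompt, text) in enumerate(jobs, start=1):
        path = f"{prefix}_{index:02d}.mp3"
        synthesize_fn(
            prompt=prompt,
            text=text,
            model_name=model_name,
            output_filepath=path,
        )
        paths.append(path)
    return paths
```

For example, `synthesize_all(synthesize, [("Say the following in a curious way", "OK, so... tell me about this [uhm] AI thing.")])` would write a single file named clip_01.mp3 using the Flash model.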