Chirp 3: HD voices

Chirp 3: HD voices represent the latest generation of Text-to-Speech technology. Powered by our cutting-edge LLMs, these voices deliver an unparalleled level of realism and emotional resonance.


Voice Options

A range of voice options are available, each with distinct characteristics:

Name Gender
Aoede Female
Puck Male
Charon Male
Kore Female
Fenrir Male
Leda Female
Orus Male
Zephyr Female

Language Availability

Chirp 3: HD voices are supported in the following languages:

Language BCP-47 Code
German (Germany) de-DE
English (Australia) en-AU
English (United Kingdom) en-GB
English (India) en-IN
Spanish (United States) es-US
French (France) fr-FR
Hindi (India) hi-IN
Portuguese (Brazil) pt-BR
Arabic (Generic) ar-XA
Spanish (Spain) es-ES
French (Canada) fr-CA
Indonesian (Indonesia) id-ID
Italian (Italy) it-IT
Japanese (Japan) ja-JP
Turkish (Turkey) tr-TR
Vietnamese (Vietnam) vi-VN
Bengali (India) bn-IN
Gujarati (India) gu-IN
Kannada (India) kn-IN
Malayalam (India) ml-IN
Marathi (India) mr-IN
Tamil (India) ta-IN
Telugu (India) te-IN
Dutch (Netherlands) nl-NL
Korean (South Korea) ko-KR
Mandarin Chinese (China) cmn-CN
Polish (Poland) pl-PL
Russian (Russia) ru-RU
Thai (Thailand) th-TH
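
To check which Chirp 3: HD voices are offered for one of these languages at runtime, you can list the available voices and filter by name. The following is a minimal sketch using the Python client library shown later on this page; the de-DE locale and the Chirp3-HD name filter are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# List the voices for a given BCP-47 code and keep only the Chirp 3: HD ones.
response = client.list_voices(language_code="de-DE")
for voice in response.voices:
    if "Chirp3-HD" in voice.name:
        print(voice.name)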

Regional Availability

Chirp 3: HD voices are available in the following Google Cloud regions:

Google Cloud region Launch readiness
global GA
us GA
eu GA
asia-southeast1 GA

Supported output formats

The default response format is LINEAR16; the other supported formats are:

API Method Format
streaming ALAW, MULAW, OGG_OPUS and PCM
batch ALAW, MULAW, MP3, OGG_OPUS and PCM
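
For example, to request Ogg Opus output from a batch (online) request, set the audio encoding in the audio configuration. This is a minimal sketch using the Python client library shown in the examples below; the choice of OGG_OPUS is only illustrative.

from google.cloud import texttospeech

# Request Ogg Opus audio instead of the default LINEAR16 output.
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.OGG_OPUS
)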

Use Chirp 3: HD voices

Discover how to use Chirp 3: HD voices to synthesize speech.

Perform a streaming speech synthesis request

Python

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Python API reference documentation.

To authenticate to Text-to-Speech, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

"""Synthesizes speech from a stream of input text."""
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# See https://cloud.google.com/text-to-speech/docs/voices for all voices.
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        name="en-US-Chirp3-HD-Charon",
        language_code="en-US",
    )
)

# Set the config for your stream. The first request must contain your config,
# and then each subsequent request must contain text.
config_request = texttospeech.StreamingSynthesizeRequest(
    streaming_config=streaming_config
)

text_iterator = [
    "Hello there. ",
    "How are you ",
    "today? It's ",
    "such nice weather outside.",
]

# Request generator. Consider using Gemini or another LLM with output streaming as a generator.
def request_generator():
    yield config_request
    for text in text_iterator:
        yield texttospeech.StreamingSynthesizeRequest(
            input=texttospeech.StreamingSynthesisInput(text=text)
        )

streaming_responses = client.streaming_synthesize(request_generator())

for response in streaming_responses:
    print(f"Audio content size in bytes is: {len(response.audio_content)}")

Perform an online speech synthesis request

Python

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Python API reference documentation.

To authenticate to Text-to-Speech, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

def synthesize_text():
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    text = "Hello there."
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon",
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text,
        voice=voice,
        audio_config=audio_config,
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Scripting and Prompting Tips

Creating engaging and natural-sounding audio from text requires understanding the nuances of spoken language and translating them into script form. The following tips will help you craft scripts that sound authentic and capture the chosen tone.

Understanding the Goal: Natural Speech

The primary objective is to make the synthesized voice sound as close to a natural human speaker as possible. This involves:

  • Mimicking Natural Pacing: How quickly or slowly someone speaks.
  • Creating Smooth Flow: Ensuring seamless transitions between sentences and phrases.
  • Adding Realistic Pauses: Incorporating pauses for emphasis and clarity.
  • Capturing Conversational Tone: Making the audio sound like a real conversation.

Key Techniques for Natural Speech

  • Punctuation for Pacing and Flow

    • Periods (.): Indicate a full stop and a longer pause. Use them to separate complete thoughts and create clear sentence boundaries.
    • Commas (,): Signal shorter pauses within sentences. Use them to separate clauses, list items, or introduce brief breaks for breath.
    • Ellipses (...): Represent a longer, more deliberate pause. They can indicate trailing thoughts, hesitation, or a dramatic pause.
      • Example: "And then... it happened."
    • Hyphens (-): Can be used to indicate a brief pause or a sudden break in thought.
      • Example: "I wanted to say - but I couldn't."
  • Incorporating Pauses and Disfluencies

    • Strategic Pauses: Use ellipses, commas, or hyphens to create pauses in places where a human speaker would naturally pause for breath or emphasis.
    • Disfluencies (Ums and Uhs): While some Text-to-Speech models handle disfluencies automatically, understanding their role is crucial. They add authenticity and make the speech sound less robotic. Even if the model adds them, being aware of where they would naturally occur in human speech helps you understand the overall flow of your script.
  • Experimentation and Iteration

    • Re-synthesizing: Don't be afraid to re-synthesize the same message with the same voice multiple times. Minor tweaks to punctuation, spacing, or word choice can significantly impact the final audio.
    • Listen Critically: Pay close attention to the pacing, flow, and overall tone of the synthesized audio. Identify areas that sound unnatural and adjust your script accordingly.
    • Voice Variation: If the system allows for it, try using different voices to see which one best suits your script and chosen tone.
  • Practical Scripting Tips

    • Read Aloud: Before synthesizing, read your script aloud. This will help you identify awkward phrasing, unnatural pauses, and areas that need adjustment.
    • Write Conversationally: Use contractions (e.g., "it's," "we're") and informal language to make the script sound more natural.
    • Consider the Context: The tone and pacing of your script should match the context of the audio. A formal presentation will require a different approach than a casual conversation.
    • Break Down Complex Sentences: Long, convoluted sentences can be difficult for TTS engines to handle. Break them down into shorter, more manageable sentences.
  • Sample Script Improvements

    • Original Script (Robotic): "The product is now available. We have new features. It is very exciting."

    • Improved Script (Natural): "The product is now available... and we've added some exciting new features. It's, well, it's very exciting."

    • Original Script (Robotic): "This is an automated confirmation message. Your reservation has been processed. The following details pertain to your upcoming stay. Reservation number is 12345. Guest name registered is Anthony Vasquez Arrival date is March 14th. Departure date is March 16th. Room type is Deluxe Suite. Number of guests is 1 guest. Check-in time is 3 PM. Check-out time is 11 AM. Please note, cancellation policy requires notification 48 hours prior to arrival. Failure to notify within this timeframe will result in a charge of one night's stay. Additional amenities included in your reservation are: complimentary Wi-Fi, access to the fitness center, and complimentary breakfast. For any inquiries, please contact the hotel directly at 855-555-6689 Thank you for choosing our hotel."

    • Improved Script (Natural): "Hi Anthony Vasquez! We're so excited to confirm your reservation with us! You're all set for your stay from March 14th to March 16th in our beautiful Deluxe Suite. That's for 1 guest. Your confirmation number is 12345, just in case you need it.

      So, just a quick reminder, check-in is at 3 PM, and check-out is at, well, 11 AM.

      Now, just a heads-up about our cancellation policy… if you need to cancel, just let us know at least 48 hours before your arrival, okay? Otherwise, there'll be a charge for one night's stay.

      And to make your stay even better, you'll have complimentary Wi-Fi, access to our fitness center, and a delicious complimentary breakfast each morning!

      If you have any questions at all, please don't hesitate to call us at 855-555-6689. We can't wait to welcome you to the hotel!"

    • Explanation of Changes:

      • The ellipses (...) create a pause for emphasis.
      • "and we've" uses a contraction for a more conversational tone.
      • "It's, well, it's very exciting" adds a small amount of disfluency, and emphasis.
      • "Okay?" friendly reminder softens tone.

    By following these guidelines, you can create text-to-audio scripts that sound natural, engaging, and human-like. Remember that practice and experimentation are key to mastering this skill.

Chirp 3: HD voice controls

The following voice control features are available specifically for Chirp 3: HD voice synthesis. Note that HD voices don't support SSML, and that pace and pause control can produce inconsistent results.

Language availability for voice controls

Chirp 3: HD voice controls are currently only available in US English.

Pace control

You can adjust the speed of the generated audio with the speaking_rate parameter, which lets you slow down or speed up the speech. Set it to a float value between 0.25 (very slow) and 2.0 (very fast): values below 1.0 slow the speech down, values above 1.0 speed it up, and 1.0 leaves the pace unadjusted.

Sample SynthesizeSpeechRequest using pace control:

{
  "audio_config": {
    "audio_encoding": "LINEAR16",
    "speaking_rate": 2.0
  },
  "input": {
    "text": "Once upon a time, there was a cute cat. He was so cute that he got lots of treats."
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}

Sample StreamingSynthesizeConfig using pace control:

{
  "streaming_audio_config": {
    "audio_encoding": "LINEAR16",
    "speaking_rate": 2.0
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}
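
The same setting is available through the Python client library. The following is a minimal sketch that reuses the batch synthesis pattern shown earlier on this page; the text, voice, and output file name are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        text="Once upon a time, there was a cute cat."
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Leda",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        # Values below 1.0 slow the speech down; values above 1.0 speed it up.
        speaking_rate=2.0,
    ),
)

with open("output.wav", "wb") as out:
    out.write(response.audio_content)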

Pause control

You can insert pauses into AI-generated speech by embedding special tags directly into your text using the markup input field. Pause tags work only in the markup field, not in the text field.

These tags signal the AI to create silences, but the precise length of these pauses isn't fixed. The AI adjusts the duration based on context, much like natural human speech varies with speaker, location, and sentence structure. The available pause tags are [pause short], [pause long], and [pause]. For alternative methods of creating pauses without using markup tags, refer to our prompting and crafting guidelines.

The AI model might occasionally disregard the pause tags, especially if they are placed in unnatural positions in the text. You can combine multiple pause tags for longer silences, but excessive use can lead to problems.

Sample SynthesizeSpeechRequest using pause control:

{
  "audio_config": {
    "audio_encoding": "LINEAR16"
  },
  "input": {
    "markup": "Let me take a look, [pause long] yes, I see it."
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}

Sample StreamingSynthesisInput using pause control:

{
  "markup": "Let me take a look, [pause long] yes, I see it."
}
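
In the Python client library, the markup field can be set on the synthesis input in the same way, assuming your installed client version exposes it; treat the following as an unverified sketch rather than a confirmed API surface, and note that the voice and output file name are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    # Assumption: SynthesisInput exposes a markup field in this client version.
    # Pause tags such as [pause long] are only honored in the markup field.
    input=texttospeech.SynthesisInput(
        markup="Let me take a look, [pause long] yes, I see it."
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Leda",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("pause_output.mp3", "wb") as out:
    out.write(response.audio_content)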

FAQ

Common questions and their answers:

How do I control pacing and flow to improve the speech output?

Use the scripting and prompting tips above to refine your text prompt and improve your speech output.

How do I access voices in supported languages?

Voice names follow a specific format, which lets you use any voice in its supported languages by specifying it uniquely. The format is <locale>-<model>-<voice>. For example, to use the Kore voice for English (United States) with the Chirp 3: HD voices model, specify en-US-Chirp3-HD-Kore.
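
As a small illustration of that naming scheme, the hypothetical helper below simply assembles the pieces; it doesn't validate that a given locale and voice combination actually exists.

# Hypothetical helper illustrating the <locale>-<model>-<voice> naming format.
def chirp3_hd_voice_name(locale: str, voice: str) -> str:
    return f"{locale}-Chirp3-HD-{voice}"

print(chirp3_hd_voice_name("en-US", "Kore"))  # en-US-Chirp3-HD-Kore
print(chirp3_hd_voice_name("de-DE", "Puck"))  # de-DE-Chirp3-HD-Puck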

Do Chirp 3: HD voices support SSML?

While Chirp 3: HD voices don't work with SSML, you can still manage pace and pause control through the HD voice control options.