Chirp 3: HD voices

Chirp 3: HD voices represent the latest generation of Text-to-Speech technology. Powered by our cutting-edge LLMs, these voices deliver an unparalleled level of realism and emotional resonance.


Voice Options

A range of voice options are available, each with distinct characteristics:

Name Gender
Aoede Female
Puck Male
Charon Male
Kore Female
Fenrir Male
Leda Female
Orus Male
Zephyr Female

Language Availability

Chirp 3: HD voices are supported in the following languages:

Language BCP-47 Code
German (Germany) de-DE
English (Australia) en-AU
English (United Kingdom) en-GB
English (India) en-IN
Spanish (United States) es-US
French (France) fr-FR
Hindi (India) hi-IN
Portuguese (Brazil) pt-BR
Arabic (Generic) ar-XA
Spanish (Spain) es-ES
French (Canada) fr-CA
Indonesian (Indonesia) id-ID
Italian (Italy) it-IT
Japanese (Japan) ja-JP
Turkish (Turkey) tr-TR
Vietnamese (Vietnam) vi-VN
Bengali (India) bn-IN
Gujarati (India) gu-IN
Kannada (India) kn-IN
Malayalam (India) ml-IN
Marathi (India) mr-IN
Tamil (India) ta-IN
Telugu (India) te-IN
Dutch (Netherlands) nl-NL
Korean (South Korea) ko-KR
Mandarin Chinese (China) cmn-CN
Polish (Poland) pl-PL
Russian (Russia) ru-RU
Thai (Thailand) th-TH
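
To check which Chirp 3: HD voices are offered for one of these languages at runtime, you can list the available voices and filter by name. The following is a minimal sketch using the Python client library shown later on this page; the de-DE locale and the Chirp3-HD name filter are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# List the voices for a given BCP-47 code and keep only the Chirp 3: HD ones.
response = client.list_voices(language_code="de-DE")
for voice in response.voices:
    if "Chirp3-HD" in voice.name:
        print(voice.name)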

Regional Availability

Chirp 3: HD voices are available in the following Google Cloud regions:

Google Cloud region Launch readiness
global GA
us GA
eu GA
asia-southeast1 GA

Supported output formats

The default response format is LINEAR16; the other supported formats are:

API Method Format
streaming ALAW, MULAW, OGG_OPUS and PCM
batch ALAW, MULAW, MP3, OGG_OPUS and PCM
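
For example, to request Ogg Opus output from a batch (online) request, set the audio encoding in the audio configuration. This is a minimal sketch using the Python client library shown in the examples below; the choice of OGG_OPUS is only illustrative.

from google.cloud import texttospeech

# Request Ogg Opus audio instead of the default LINEAR16 output.
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.OGG_OPUS
)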

Use Chirp 3: HD voices

Discover how to use Chirp 3: HD voices to synthesize speech.

Perform a streaming speech synthesis request

Python

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Python API reference documentation.

To authenticate to Text-to-Speech, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

"""Synthesizes speech from a stream of input text."""
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# See https://cloud.google.com/text-to-speech/docs/voices for all voices.
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        name="en-US-Chirp3-HD-Charon",
        language_code="en-US",
    )
)

# Set the config for your stream. The first request must contain your config,
# and then each subsequent request must contain text.
config_request = texttospeech.StreamingSynthesizeRequest(
    streaming_config=streaming_config
)

text_iterator = [
    "Hello there. ",
    "How are you ",
    "today? It's ",
    "such nice weather outside.",
]

# Request generator. Consider using Gemini or another LLM with output streaming as a generator.
def request_generator():
    yield config_request
    for text in text_iterator:
        yield texttospeech.StreamingSynthesizeRequest(
            input=texttospeech.StreamingSynthesisInput(text=text)
        )

streaming_responses = client.streaming_synthesize(request_generator())

for response in streaming_responses:
    print(f"Audio content size in bytes is: {len(response.audio_content)}")

Perform an online speech synthesis request

Python

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Python API reference documentation.

To authenticate to Text-to-Speech, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

def synthesize_text():
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    text = "Hello there."
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon",
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text,
        voice=voice,
        audio_config=audio_config,
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Scripting and Prompting Tips

Creating engaging and natural-sounding audio from text requires understanding the nuances of spoken language and translating them into script form. The following tips will help you craft scripts that sound authentic and capture the chosen tone.

Understanding the Goal: Natural Speech

The primary objective is to make the synthesized voice sound as close to a natural human speaker as possible. This involves:

  • Mimicking Natural Pacing: How quickly or slowly someone speaks.
  • Creating Smooth Flow: Ensuring seamless transitions between sentences and phrases.
  • Adding Realistic Pauses: Incorporating pauses for emphasis and clarity.
  • Capturing Conversational Tone: Making the audio sound like a real conversation.

Key Techniques for Natural Speech

  • Punctuation for Pacing and Flow

    • Periods (.): Indicate a full stop and a longer pause. Use them to separate complete thoughts and create clear sentence boundaries.
    • Commas (,): Signal shorter pauses within sentences. Use them to separate clauses, list items, or introduce brief breaks for breath.
    • Ellipses (...): Represent a longer, more deliberate pause. They can indicate trailing thoughts, hesitation, or a dramatic pause.
      • Example: "And then... it happened."
    • Hyphens (-): Can be used to indicate a brief pause or a sudden break in thought.
      • Example: "I wanted to say - but I couldn't."
  • Incorporating Pauses and Disfluencies

    • Strategic Pauses: Use ellipses, commas, or hyphens to create pauses in places where a human speaker would naturally pause for breath or emphasis.
    • Disfluencies (Ums and Uhs): While some Text-to-Speech models handle disfluencies automatically, understanding their role is crucial. They add authenticity and make the speech sound less robotic. Even if the model adds them, being aware of where they would naturally occur in human speech helps you understand the overall flow of your script.
  • Experimentation and Iteration

    • Re-synthesizing: Don't be afraid to re-synthesize the same message with the same voice multiple times. Minor tweaks to punctuation, spacing, or word choice can significantly impact the final audio.
    • Listen Critically: Pay close attention to the pacing, flow, and overall tone of the synthesized audio. Identify areas that sound unnatural and adjust your script accordingly.
    • Voice Variation: If the system allows for it, try using different voices to see which one best suits your script and chosen tone.
  • Practical Scripting Tips

    • Read Aloud: Before synthesizing, read your script aloud. This will help you identify awkward phrasing, unnatural pauses, and areas that need adjustment.
    • Write Conversationally: Use contractions (e.g., "it's," "we're") and informal language to make the script sound more natural.
    • Consider the Context: The tone and pacing of your script should match the context of the audio. A formal presentation will require a different approach than a casual conversation.
    • Break Down Complex Sentences: Long, convoluted sentences can be difficult for TTS engines to handle. Break them down into shorter, more manageable sentences.
  • Sample Script Improvements

    • Original Script (Robotic): "The product is now available. We have new features. It is very exciting."

    • Improved Script (Natural): "The product is now available... and we've added some exciting new features. It's, well, it's very exciting."

    • Original Script (Robotic): "This is an automated confirmation message. Your reservation has been processed. The following details pertain to your upcoming stay. Reservation number is 12345. Guest name registered is Anthony Vasquez Arrival date is March 14th. Departure date is March 16th. Room type is Deluxe Suite. Number of guests is 1 guest. Check-in time is 3 PM. Check-out time is 11 AM. Please note, cancellation policy requires notification 48 hours prior to arrival. Failure to notify within this timeframe will result in a charge of one night's stay. Additional amenities included in your reservation are: complimentary Wi-Fi, access to the fitness center, and complimentary breakfast. For any inquiries, please contact the hotel directly at 855-555-6689 Thank you for choosing our hotel."

    • Improved Script (Natural): "Hi Anthony Vasquez! We're so excited to confirm your reservation with us! You're all set for your stay from March 14th to March 16th in our beautiful Deluxe Suite. That's for 1 guest. Your confirmation number is 12345, just in case you need it.

      So, just a quick reminder, check-in is at 3 PM, and check-out is at, well, 11 AM.

      Now, just a heads-up about our cancellation policy… if you need to cancel, just let us know at least 48 hours before your arrival, okay? Otherwise, there'll be a charge for one night's stay.

      And to make your stay even better, you'll have complimentary Wi-Fi, access to our fitness center, and a delicious complimentary breakfast each morning!

      If you have any questions at all, please don't hesitate to call us at 855-555-6689. We can't wait to welcome you to the hotel!"

    • Explanation of Changes:

      • The ellipses (...) create a pause for emphasis.
      • "and we've" uses a contraction for a more conversational tone.
      • "It's, well, it's very exciting" adds a small amount of disfluency, and emphasis.
      • "Okay?" friendly reminder softens tone.

    By following these guidelines, you can create text-to-audio scripts that sound natural, engaging, and human-like. Remember that practice and experimentation are key to mastering this skill.

Chirp 3: HD voice controls

The following voice control features are available specifically for Chirp 3: HD voice synthesis. Note that HD voices don't support SSML, and that pace and pause control can produce inconsistent results.

Language availability for voice controls

Chirp 3: HD voice controls are currently only available in US English.

Pace control

You can adjust the speed of the generated audio with the speaking_rate parameter, which lets you slow down or speed up the speech. Set it to a float value between 0.25 (very slow) and 2.0 (very fast): values below 1.0 slow the speech down, values above 1.0 speed it up, and 1.0 leaves the pace unadjusted.

Sample SynthesizeSpeechRequest using pace control:

{
  "audio_config": {
    "audio_encoding": "LINEAR16",
    "speaking_rate": 2.0
  },
  "input": {
    "text": "Once upon a time, there was a cute cat. He was so cute that he got lots of treats."
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}

Sample StreamingSynthesizeConfig using pace control:

{
  "streaming_audio_config": {
    "audio_encoding": "LINEAR16",
    "speaking_rate": 2.0
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}
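
The same setting is available through the Python client library. The following is a minimal sketch that reuses the batch synthesis pattern shown earlier on this page; the text, voice, and output file name are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        text="Once upon a time, there was a cute cat."
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Leda",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        # Values below 1.0 slow the speech down; values above 1.0 speed it up.
        speaking_rate=2.0,
    ),
)

with open("output.wav", "wb") as out:
    out.write(response.audio_content)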

Pause control

You can insert pauses into AI-generated speech by embedding special tags directly into your text using the markup input field. Pause tags work only in the markup field, not in the text field.

These tags signal the AI to create silences, but the precise length of these pauses isn't fixed. The AI adjusts the duration based on context, much like natural human speech varies with speaker, location, and sentence structure. The available pause tags are [pause short], [pause long], and [pause]. For alternative methods of creating pauses without using markup tags, refer to our prompting and crafting guidelines.

The AI model might occasionally disregard the pause tags, especially if they are placed in unnatural positions in the text. You can combine multiple pause tags for longer silences, but excessive use can lead to problems.

Sample SynthesizeSpeechRequest using pause control:

{
  "audio_config": {
    "audio_encoding": "LINEAR16"
  },
  "input": {
    "markup": "Let me take a look, [pause long] yes, I see it."
  },
  "voice": {
    "language_code": "en-US",
    "name": "en-US-Chirp3-HD-Leda"
  }
}

Sample StreamingSynthesisInput using pause control:

{
  "markup": "Let me take a look, [pause long] yes, I see it."
}
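
In the Python client library, the markup field can be set on the synthesis input in the same way, assuming your installed client version exposes it; treat the following as an unverified sketch rather than a confirmed API surface, and note that the voice and output file name are only illustrative.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    # Assumption: SynthesisInput exposes a markup field in this client version.
    # Pause tags such as [pause long] are only honored in the markup field.
    input=texttospeech.SynthesisInput(
        markup="Let me take a look, [pause long] yes, I see it."
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Leda",
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("pause_output.mp3", "wb") as out:
    out.write(response.audio_content)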

FAQ

Common questions and their answers:

How do I control pacing and flow to improve the speech output?

Use the scripting and prompting tips above to refine your text prompt and improve your speech output.

How do I access voices in supported languages?

Voice names follow a specific format, which lets you use any voice in its supported languages by specifying it uniquely. The format is <locale>-<model>-<voice>. For example, to use the Kore voice for English (United States) with the Chirp 3: HD voices model, specify en-US-Chirp3-HD-Kore.
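
As a small illustration of that naming scheme, the hypothetical helper below simply assembles the pieces; it doesn't validate that a given locale and voice combination actually exists.

# Hypothetical helper illustrating the <locale>-<model>-<voice> naming format.
def chirp3_hd_voice_name(locale: str, voice: str) -> str:
    return f"{locale}-Chirp3-HD-{voice}"

print(chirp3_hd_voice_name("en-US", "Kore"))  # en-US-Chirp3-HD-Kore
print(chirp3_hd_voice_name("de-DE", "Puck"))  # de-DE-Chirp3-HD-Puck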

Do Chirp 3: HD voices support SSML?

While Chirp 3: HD voices don't work with SSML, you can still manage pace and pause control through the HD voice control options.