Chirp 3: Instant custom voice

Instant Custom Voice in Text-to-Speech lets users create a personalized voice model from their own high-quality audio recordings. Personal voices can be generated rapidly and then used to synthesize audio through the Cloud TTS API, with support for both streaming and long-form text.

Due to safety considerations, access to this voice cloning capability is restricted to allow-listed users. To access this feature, contact a member of the sales team to be added to the allow list.

Language availability

Instant Custom Voice creation and synthesis are supported in the following languages:

Language | BCP-47 code | Consent statement
Arabic (XA) | ar-XA | أنا مالك هذا الصوت وأوافق على أن تستخدم Google هذا الصوت لإنشاء نموذج صوتي اصطناعي.
Bengali (India) | bn-IN | আমি এই ভয়েসের মালিক এবং আমি একটি সিন্থেটিক ভয়েস মডেল তৈরি করতে এই ভয়েস ব্যবহার করে Google-এর সাথে সম্মতি দিচ্ছি।
Chinese (China) | cmn-CN | 我是此声音的拥有者并授权谷歌使用此声音创建语音合成模型
German (Germany) | de-DE | Ich bin der Eigentümer dieser Stimme und bin damit einverstanden, dass Google diese Stimme zur Erstellung eines synthetischen Stimmmodells verwendet.
English (Australia) | en-AU | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (UK) | en-GB | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (India) | en-IN | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (US) | en-US | I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
Spanish (Spain) | es-ES | Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.
Spanish (US) | es-US | Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.
French (Canada) | fr-CA | Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
French (France) | fr-FR | Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
Gujarati (India) | gu-IN | હું આ વોઈસનો માલિક છું અને સિન્થેટિક વોઈસ મોડલ બનાવવા માટે આ વોઈસનો ઉપયોગ કરીને google ને હું સંમતિ આપું છું
Hindi (India) | hi-IN | मैं इस आवाज का मालिक हूं और मैं सिंथेटिक आवाज मॉडल बनाने के लिए Google को इस आवाज का उपयोग करने की सहमति देता हूं
Indonesian (Indonesia) | id-ID | Saya pemilik suara ini dan saya menyetujui Google menggunakan suara ini untuk membuat model suara sintetis.
Italian (Italy) | it-IT | Sono il proprietario di questa voce e acconsento che Google la utilizzi per creare un modello di voce sintetica.
Kannada (India) | kn-IN | ನಾನು ಈ ಧ್ವನಿಯ ಮಾಲಿಕ ಮತ್ತು ಸಂಶ್ಲೇಷಿತ ಧ್ವನಿ ಮಾದರಿಯನ್ನು ರಚಿಸಲು ಈ ಧ್ವನಿಯನ್ನು ಬಳಸಿಕೊಂಡುಗೂಗಲ್ ಗೆ ನಾನು ಸಮ್ಮತಿಸುತ್ತೇನೆ.
Korean (Korea) | ko-KR | 나는 이 음성의 소유자이며 구글이 이 음성을 사용하여 음성 합성 모델을 생성할 것을 허용합니다.
Malayalam (India) | ml-IN | ഈ ശബ്ദത്തിന്റെ ഉടമ ഞാനാണ്, ഒരു സിന്തറ്റിക് വോയ്‌സ് മോഡൽ സൃഷ്ടിക്കാൻ ഈ ശബ്‌ദം ഉപയോഗിക്കുന്നതിന് ഞാൻ Google-ന് സമ്മതം നൽകുന്നു.
Marathi (India) | mr-IN | मी या आवाजाचा मालक आहे आणि सिंथेटिक व्हॉइस मॉडेल तयार करण्यासाठी हा आवाज वापरण्यासाठी मी Google ला संमती देतो
Dutch (Netherlands) | nl-NL | Ik ben de eigenaar van deze stem en ik geef Google toestemming om deze stem te gebruiken om een synthetisch stemmodel te maken.
Polish (Poland) | pl-PL | Jestem właścicielem tego głosu i wyrażam zgodę na wykorzystanie go przez Google w celu utworzenia syntetycznego modelu głosu.
Portuguese (Brazil) | pt-BR | Eu sou o proprietário desta voz e autorizo o Google a usá-la para criar um modelo de voz sintética.
Russian (Russia) | ru-RU | Я являюсь владельцем этого голоса и даю согласие Google на использование этого голоса для создания модели синтетического голоса.
Tamil (India) | ta-IN | நான் இந்த குரலின் உரிமையாளர் மற்றும் செயற்கை குரல் மாதிரியை உருவாக்க இந்த குரலை பயன்படுத்த குகல்க்கு நான் ஒப்புக்கொள்கிறேன்.
Telugu (India) | te-IN | నేను ఈ వాయిస్ యజమానిని మరియు సింతటిక్ వాయిస్ మోడల్ ని రూపొందించడానికి ఈ వాయిస్ ని ఉపయోగించడానికి googleకి నేను సమ్మతిస్తున్నాను.
Thai (Thailand) | th-TH | ฉันเป็นเจ้าของเสียงนี้ และฉันยินยอมให้ Google ใช้เสียงนี้เพื่อสร้างแบบจำลองเสียงสังเคราะห์
Turkish (Turkey) | tr-TR | Bu sesin sahibi benim ve Google'ın bu sesi kullanarak sentetik bir ses modeli oluşturmasına izin veriyorum.
Vietnamese (Vietnam) | vi-VN | Tôi là chủ sở hữu giọng nói này và tôi đồng ý cho Google sử dụng giọng nói này để tạo mô hình giọng nói tổng hợp.
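
Because the consent statement depends on the language, it can help to keep the statement and its BCP-47 code together in one place. The following is a minimal sketch, not part of the API: `CONSENT_STATEMENTS` and `consent_fields` are hypothetical names, and only a few entries from the table above are copied in. The returned keys mirror the `consent_script` and `language_code` fields used in the creation request later on this page.

```python
# Hypothetical helper: a partial copy of the consent statement table.
# Extend it with the other languages as needed.
CONSENT_STATEMENTS = {
    "en-US": (
        "I am the owner of this voice and I consent to Google using this "
        "voice to create a synthetic voice model."
    ),
    "de-DE": (
        "Ich bin der Eigentümer dieser Stimme und bin damit einverstanden, "
        "dass Google diese Stimme zur Erstellung eines synthetischen "
        "Stimmmodells verwendet."
    ),
    "fr-FR": (
        "Je suis le propriétaire de cette voix et j'autorise Google à "
        "utiliser cette voix pour créer un modèle de voix synthétique."
    ),
}


def consent_fields(language_code: str) -> dict:
    """Return matching consent_script and language_code request fields."""
    try:
        script = CONSENT_STATEMENTS[language_code]
    except KeyError:
        raise ValueError(
            f"No consent statement registered for {language_code!r}"
        ) from None
    return {"consent_script": script, "language_code": language_code}
```

Looking both values up from a single key avoids sending a consent script in one language with the language code of another.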

Regional availability

Instant Custom Voice creation and synthesis are available in the following Google Cloud regions. The methods supported differ by region:

Google Cloud region | Supported methods | Launch readiness
global | Creation, Synthesis | Private Preview
us | Synthesis | Private Preview
eu | Synthesis | Private Preview
asia-southeast1 | Synthesis | Private Preview
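
The table above can be encoded as a small lookup so that client code fails early when a method is not offered in a region. This is only an illustrative sketch: `SUPPORTED_METHODS` and `regions_supporting` are hypothetical names, and the data simply mirrors the table at the time of writing.

```python
# Hypothetical lookup mirroring the regional availability table.
SUPPORTED_METHODS = {
    "global": {"creation", "synthesis"},
    "us": {"synthesis"},
    "eu": {"synthesis"},
    "asia-southeast1": {"synthesis"},
}


def regions_supporting(method: str) -> list:
    """Return the sorted list of regions that offer the given method."""
    return sorted(
        region
        for region, methods in SUPPORTED_METHODS.items()
        if method in methods
    )


print(regions_supporting("creation"))
```

With the data above, voice creation is only offered in `global`, while synthesis is offered in all four regions.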

Supported output formats

The default response format is LINEAR16. The following formats are also supported:

API method | Formats
streaming | ALAW, MULAW, OGG_OPUS and PCM
batch | ALAW, MULAW, MP3, OGG_OPUS and PCM
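
As a rough illustration of the table above, a client could check a requested encoding before sending a request. The names here are hypothetical, and the sketch assumes two things the text does not strictly promise: that the listed formats are the complete set, and that the LINEAR16 default is accepted by both API methods.

```python
# Hypothetical pre-flight check based on the output format table.
DEFAULT_ENCODING = "LINEAR16"
SUPPORTED_ENCODINGS = {
    "streaming": {"ALAW", "MULAW", "OGG_OPUS", "PCM"},
    "batch": {"ALAW", "MULAW", "MP3", "OGG_OPUS", "PCM"},
}


def is_supported_encoding(api_method: str, encoding: str = DEFAULT_ENCODING) -> bool:
    """Return True if the encoding is listed for the given API method."""
    if api_method not in SUPPORTED_ENCODINGS:
        raise ValueError(f"Unknown API method: {api_method!r}")
    # Assumption: the default format is accepted by every method.
    return encoding == DEFAULT_ENCODING or encoding in SUPPORTED_ENCODINGS[api_method]
```

For example, `is_supported_encoding("streaming", "MP3")` is false under these assumptions, because MP3 appears only in the batch row.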

Feature support and limitations

Feature | Support | Description
SSML | No | SSML tags to personalize synthetic audio
Text-Based Prompting | Experimental | Use punctuation, pauses, and disfluency to add natural flow and pacing to Text-to-Speech.
Timestamps | No | Word-level timestamps
Pause Tags | No | Introduce on-demand pauses to synthesized audio
Pace Control | No | Adjust the speed of synthesized audio, from 0.25x speed to 2x speed.
Pronunciation Control | No | Custom pronunciations of words or phrases using IPA or X-SAMPA phonetic encoding

Use Chirp 3: Instant Custom Voice

Let's explore how to use Chirp 3: Instant Custom Voice capabilities in the Text-to-Speech API.

  1. Record the consent statement: To comply with legal and ethical guidelines for Instant Custom Voice, record the required consent statement in the appropriate language as a mono WAV file with LINEAR16 encoding and a 24 kHz sampling rate. For English, the statement is: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."
  2. Record reference audio: Use your computer microphone to record up to 10 seconds of audio as a LINEAR16-encoded, mono WAV file at a 24 kHz sampling rate. Ensure there is no background noise during the recording. Both the consent and reference audio must be recorded in the same environment.
  3. Store audio files: Save the recorded audio files in a designated Cloud Storage location.
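
Before uploading, the recording requirements in the steps above can be checked locally. The following sketch uses Python's standard `wave` module. `check_recording` is a hypothetical helper, and it assumes the files are plain PCM WAV files, which is the only kind `wave` can open.

```python
import wave


def check_recording(path, max_seconds=None):
    """Return a list of problems found in a WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        # LINEAR16 means 16-bit samples, i.e. 2 bytes per sample.
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != 24000:
            problems.append(f"expected 24000 Hz, found {wav.getframerate()} Hz")
        duration = wav.getnframes() / wav.getframerate()
        if max_seconds is not None and duration > max_seconds:
            problems.append(f"expected at most {max_seconds} s, found {duration:.1f} s")
    return problems
```

For the reference audio, calling `check_recording("reference.wav", max_seconds=10)` applies the 10-second limit from step 2. The file name is a placeholder. Background noise and recording environment cannot be checked this way and still need a manual listen.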

Create an Instant Custom Voice

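import json

import requests

def create_instant_custom_voice_key(
    access_token, project_id, reference_audio_bytes, consent_audio_bytes
):
    """Request a voice cloning key. Returns the key string, or None on failure.

    Both audio arguments are placed directly in the JSON request body, so
    they must be base64-encoded strings rather than raw bytes.
    """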
    url = "https://texttospeech.googleapis.com/v1beta1/voices:generateVoiceCloningKey"

    request_body = {
        "reference_audio": {
            "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 24000},
            "content": reference_audio_bytes,
        },
        "voice_talent_consent": {
            "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 24000},
            "content": consent_audio_bytes,
        },
        "consent_script": "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.",
        "language_code": "en-US",
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8",
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        return response_json.get("voiceCloningKey")

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
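
The `content` fields in the request body are JSON values, so the audio has to be passed as base64 text rather than raw bytes. The sketch below shows one way to prepare it, assuming the recordings are available as local files. `load_audio_as_base64` is a hypothetical helper and the file names in the usage comment are placeholders.

```python
import base64
from pathlib import Path


def load_audio_as_base64(path):
    """Read an audio file and return its contents as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Usage sketch (requires a valid access token and project ID):
# voice_key = create_instant_custom_voice_key(
#     access_token,
#     project_id,
#     load_audio_as_base64("reference.wav"),
#     load_audio_as_base64("consent.wav"),
# )
```

The function above returns `None` when the request fails, so it is worth checking `voice_key` before using it for synthesis.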

Synthesize with an Instant Custom Voice

import base64
import json

import requests
from IPython.display import Audio, display

def synthesize_text_with_cloned_voice(access_token, project_id, voice_key, text):
    """Synthesize text with a cloned voice and play the result in a notebook."""
    url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"

    request_body = {
        "input": {
            "text": text
        },
        "voice": {
            "language_code": "en-US",
            "voice_clone": {
                "voice_cloning_key": voice_key,
            }
        },
        "audio_config": {
            "audio_encoding": "LINEAR16",
            "sample_rate_hertz": 24000
        }
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8"
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        audio_content = response_json.get("audioContent")

        if audio_content:
            display(Audio(base64.b64decode(audio_content), rate=24000))
        else:
            print("Error: Audio content not found in the response.")
            print(response_json)

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
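
Outside a notebook, the synthesized audio can be written to disk instead of being played inline. The following sketch assumes the caller has the parsed JSON response available, for example by changing the function above to return `response_json`. `save_audio_content` is a hypothetical helper, not part of the API.

```python
import base64
from pathlib import Path


def save_audio_content(response_json, out_path):
    """Decode the base64 audioContent field and write it to out_path.

    Returns the number of bytes written.
    """
    audio_content = response_json.get("audioContent")
    if not audio_content:
        raise ValueError("Response has no audioContent field")
    data = base64.b64decode(audio_content)
    Path(out_path).write_bytes(data)
    return len(data)
```

Raising an error when `audioContent` is missing, rather than printing, lets calling code decide how to handle a failed synthesis.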