Chirp 3: Enhanced multilingual accuracy

Chirp 3 is the latest generation of Google's multilingual, ASR-specific generative models, shaped by user feedback and experience. It improves on the original Chirp and Chirp 2 models in both accuracy and speed, and adds key new capabilities such as speaker diarization.

Model details

Model identifiers

Chirp 3 is available only in the Speech-to-Text API V2, and you can use it like any other model. Specify the appropriate identifier in your recognition request when using the API, or the model name when using the Google Cloud console.

Model Model identifier
Chirp 3 chirp_3

API methods

Not all recognition methods support the same set of languages. Because Chirp 3 is available in the Speech-to-Text API V2, it supports the following recognition methods:

API API method Support
v2 Speech.BatchRecognize (good for long audio 1 min to 1 hour) Supported
v2 Speech.Recognize (good for audio shorter than one minute) Not supported
v2 Speech.StreamingRecognize (good for streaming and real-time audio) Not supported

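The duration guidance in the table above can be encoded as a small pre-flight check before submitting a batch job. This is only a sketch; the one-minute and one-hour bounds come from the table, and the constant and function names are ours:

```python
# Pre-flight check for Chirp 3 batch requests. Per the table above,
# Speech.BatchRecognize is the only supported method and is suited to
# audio between roughly one minute and one hour.
MIN_BATCH_SECONDS = 60        # ~1 minute
MAX_BATCH_SECONDS = 60 * 60   # ~1 hour

def fits_batch_recognize(duration_seconds: float) -> bool:
    """Return True if the audio duration fits the BatchRecognize guidance."""
    return MIN_BATCH_SECONDS <= duration_seconds <= MAX_BATCH_SECONDS
```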

Regional availability

Chirp 3 is available in the following Google Cloud regions, with more planned:

Google Cloud region Launch stage
us-west1 Private Preview

Using the locations API, you can always find the latest list of supported Google Cloud regions, languages and locales, and features for each transcription model.
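When a new region launches, the only request parameters that change are the regional API endpoint and the recognizer resource path. A minimal sketch of how those strings are derived (the helper names are ours; the pattern matches the us-west1 example later on this page):

```python
def regional_endpoint(region: str) -> str:
    """Regional Speech-to-Text V2 endpoint, e.g. us-west1-speech.googleapis.com."""
    return f"{region}-speech.googleapis.com"

def recognizer_path(project_id: str, region: str, recognizer: str = "_") -> str:
    """Fully qualified recognizer resource name ('_' selects the default recognizer)."""
    return f"projects/{project_id}/locations/{region}/recognizers/{recognizer}"
```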

Language availability for transcription

Chirp 3 supports transcription, in BatchRecognize only, in the following languages:

Language BCP-47 Code
Arabic (Egypt) ar-EG
Arabic (Saudi Arabia) ar-SA
Bengali (Bangladesh) bn-BD
Bengali (India) bn-IN
Czech (Czech Republic) cs-CZ
Danish (Denmark) da-DK
Greek (Greece) el-GR
Spanish (Mexico) es-MX
Estonian (Estonia) et-EE
Persian (Iran) fa-IR
Finnish (Finland) fi-FI
Filipino (Philippines) fil-PH
French (Canada) fr-CA
Gujarati (India) gu-IN
Croatian (Croatia) hr-HR
Hungarian (Hungary) hu-HU
Indonesian (Indonesia) id-ID
Hebrew (Israel) iw-IL
Kannada (India) kn-IN
Lithuanian (Lithuania) lt-LT
Latvian (Latvia) lv-LV
Malayalam (India) ml-IN
Marathi (India) mr-IN
Dutch (Netherlands) nl-NL
Norwegian (Norway) no-NO
Punjabi (India) pa-IN
Polish (Poland) pl-PL
Portuguese (Portugal) pt-PT
Romanian (Romania) ro-RO
Russian (Russia) ru-RU
Slovak (Slovakia) sk-SK
Slovenian (Slovenia) sl-SI
Serbian (Serbia) sr-RS
Swedish (Sweden) sv-SE
Tamil (India) ta-IN
Telugu (India) te-IN
Thai (Thailand) th-TH
Turkish (Turkey) tr-TR
Ukrainian (Ukraine) uk-UA
Urdu (Pakistan) ur-PK
Vietnamese (Vietnam) vi-VN
Chinese (China) zh-CN
Chinese (Taiwan) zh-TW
Zulu (South Africa) zu-ZA
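For request validation it can help to check a locale against this table before submitting a batch job. A small sketch (the set below is transcribed from the table; always prefer the locations API for the authoritative, up-to-date list):

```python
# BCP-47 codes transcribed from the Chirp 3 transcription-language table.
CHIRP_3_TRANSCRIPTION_LOCALES = {
    "ar-EG", "ar-SA", "bn-BD", "bn-IN", "cs-CZ", "da-DK", "el-GR", "es-MX",
    "et-EE", "fa-IR", "fi-FI", "fil-PH", "fr-CA", "gu-IN", "hr-HR", "hu-HU",
    "id-ID", "iw-IL", "kn-IN", "lt-LT", "lv-LV", "ml-IN", "mr-IN", "nl-NL",
    "no-NO", "pa-IN", "pl-PL", "pt-PT", "ro-RO", "ru-RU", "sk-SK", "sl-SI",
    "sr-RS", "sv-SE", "ta-IN", "te-IN", "th-TH", "tr-TR", "uk-UA", "ur-PK",
    "vi-VN", "zh-CN", "zh-TW", "zu-ZA",
}

def supports_transcription(locale: str) -> bool:
    """True if the locale appears in the Chirp 3 transcription table."""
    return locale in CHIRP_3_TRANSCRIPTION_LOCALES
```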

Language availability for diarization

Language BCP-47 Code
Chinese (Simplified, China) cmn-Hans-CN
German (Germany) de-DE
English (Australia) en-AU
English (United Kingdom) en-GB
English (India) en-IN
English (United States) en-US
Spanish (Spain) es-ES
Spanish (United States) es-US
French (France) fr-FR
Hindi (India) hi-IN
Italian (Italy) it-IT
Japanese (Japan) ja-JP
Korean (Korea) ko-KR
Portuguese (Brazil) pt-BR
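Because the diarization list is narrower than the transcription list, a request can guard against enabling diarization for an unsupported locale. A sketch, with codes transcribed from the table above (note that in BCP-47, Spanish (Spain) is es-ES and Spanish (United States) is es-US):

```python
# BCP-47 codes transcribed from the Chirp 3 diarization-language table.
CHIRP_3_DIARIZATION_LOCALES = {
    "cmn-Hans-CN", "de-DE", "en-AU", "en-GB", "en-IN", "en-US",
    "es-ES", "es-US", "fr-FR", "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR",
}

def supports_diarization(locale: str) -> bool:
    """True if Chirp 3 lists speaker diarization support for this locale."""
    return locale in CHIRP_3_DIARIZATION_LOCALES
```

A caller can use this to decide whether to attach a diarization config to the recognition request, falling back to plain transcription otherwise.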

Feature support and limitations

Chirp 3 supports the following features:

Feature Description Launch Stage
Automatic punctuation Automatically generated by the model and can be optionally disabled. Preview
Automatic capitalization Automatically generated by the model and can be optionally disabled. Preview
Speaker Diarization Automatically identify the different speakers in a single-channel audio sample. Preview
Language-agnostic audio transcription The model automatically infers the spoken language in your audio file and transcribes in the most prevalent language. Preview

Chirp 3 doesn't support the following features:

Feature Description
Word-timings (Timestamps) Provides start and end time offsets for each transcribed word.
Word-level confidence scores The API returns a value, but it isn't truly a confidence score.
Speech adaptation (Biasing) Provide hints to the model in the form of phrases or words to improve recognition accuracy for specific terms or proper nouns.

Using Chirp 3

This section shows how to use Chirp 3 for transcription and diarization tasks.

Transcribe using Chirp 3 batch request with diarization

Discover how to use Chirp 3 for your transcription needs.

Perform batch speech recognition

Allow the Cloud Speech service to read your Cloud Storage bucket (this is needed temporarily during private preview). You can do this from the command line with the Google Cloud CLI:

gcloud storage buckets add-iam-policy-binding gs://<YOUR_BUCKET_NAME_HERE> --member=serviceAccount:service-727103546492@gcp-sa-aiplatform.iam.gserviceaccount.com --role=roles/storage.objectViewer

Or use the Cloud console: navigate to https://console.cloud.google.com/storage/browser, pick your bucket, click Permissions > Grant Access, and add the service account, like so:

Screenshot of the Speech-to-text service account being granted IAM permission.

import os

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech
from google.api_core.client_options import ClientOptions

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")

def transcribe_batch_chirp3(
    audio_uri: str,
) -> cloud_speech.BatchRecognizeResults:
    """Transcribes an audio file from a Google Cloud Storage URI using the Chirp 3 model of Google Cloud Speech-to-Text V2 API.
    Args:
        audio_uri (str): The Google Cloud Storage URI of the input
          audio file. E.g., gs://[BUCKET]/[FILE]
    Returns:
        cloud_speech.BatchRecognizeResults: The transcription results
           returned by the Speech-to-Text API.
    """

    # Instantiates a client
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-west1-speech.googleapis.com",
        )
    )

    speaker_diarization_config = cloud_speech.SpeakerDiarizationConfig(
        min_speaker_count=1,  # minimum number of speakers
        max_speaker_count=6,  # maximum expected number of speakers
    )

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],  # Use "auto" to detect language
        model="chirp_3",
        features=cloud_speech.RecognitionFeatures(
            diarization_config=speaker_diarization_config,
        ),
    )

    file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)

    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/us-west1/recognizers/_",
        config=config,
        files=[file_metadata],
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )

    # Transcribes the audio into text    
    operation = client.batch_recognize(request=request)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=120)

    for result in response.results[audio_uri].transcript.results:
        print(f"Transcript: {result.alternatives[0].transcript}")
        print(f"Detected Language: {result.language_code}")
        print(f"Speakers per word: {result.alternatives[0].words}")

    return response.results[audio_uri].transcript
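When diarization is enabled, speaker labels are attached per word (the `words` list printed above). The following sketch collapses those labels into per-speaker turns; it uses a stand-in word type since the real objects come from the API response, but it only relies on the `word` and `speaker_label` fields:

```python
from dataclasses import dataclass

@dataclass
class Word:
    # Stand-in for the word objects in result.alternatives[0].words;
    # only the two fields used below are modeled here.
    word: str
    speaker_label: str

def group_turns(words: list[Word]) -> list[tuple[str, str]]:
    """Collapse consecutive words with the same speaker label into turns."""
    turns: list[tuple[str, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker_label:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (w.speaker_label, turns[-1][1] + " " + w.word)
        else:
            # Speaker changed: start a new turn.
            turns.append((w.speaker_label, w.word))
    return turns
```

For example, words labeled `1, 1, 2` become two turns: one for speaker 1 and one for speaker 2.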

Use Chirp 3 in the Google Cloud console