# Speech transcription
Speech Transcription transcribes spoken audio in a video or video segment
into text and returns blocks of text for each portion of the transcribed audio.
## Supported models
The Video Intelligence API supports only English (US). For other languages, use
the Speech-to-Text API. For the list of languages it supports, see [Language
support](/speech-to-text/docs/speech-to-text-supported-languages) in the
Speech-to-Text documentation.
To transcribe speech from a video, call the
[`annotate`](/video-intelligence/docs/reference/rest/v1/videos/annotate)
method and specify
[`SPEECH_TRANSCRIPTION`](/video-intelligence/docs/reference/rest/v1/videos#Feature)
in the `features` field.
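For example, a minimal request using the Python client library might look like
the following sketch. The `gs://` URI is a placeholder; substitute your own
Cloud Storage path.

```python
# pip install google-cloud-videointelligence
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# Speech transcription requires a config with at least a language code.
config = videointelligence.SpeechTranscriptionConfig(language_code="en-US")
video_context = videointelligence.VideoContext(speech_transcription_config=config)

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
        "input_uri": "gs://your-bucket/your-video.mp4",  # placeholder URI
        "video_context": video_context,
    }
)

# annotate_video is a long-running operation; block until it finishes.
result = operation.result(timeout=600)
for transcription in result.annotation_results[0].speech_transcriptions:
    for alternative in transcription.alternatives:
        print(alternative.transcript, alternative.confidence)
```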
You can use the following features when transcribing speech; a configuration
sketch combining these options follows the list:
- **Alternative words**: Use the `maxAlternatives` option to specify
  the maximum number of alternative transcriptions to include in the
  response. This value can be an integer from 1 to 30. The default is 1.
  The API returns multiple transcriptions in descending order of
  confidence. Alternative transcriptions
  do not include word-level entries.
- **Profanity filtering**: Use the `filterProfanity` option to filter out known
  profanities in transcriptions. Matched words are replaced with the leading
  character of the word followed by asterisks. The default is `false`.
- **Transcription hints**: Use the `speechContexts` option to provide common or
  unusual phrases that appear in your audio. The transcription service uses
  these phrases to create more accurate transcriptions. You provide
  a transcription hint as a
  [`SpeechContext`](/video-intelligence/docs/reference/rest/v1/videos#SpeechContext)
  object.
- **Audio track selection**: Use the `audioTracks` option to specify which
  tracks to transcribe from a multi-track video. You can specify up to two
  tracks. The default is `0`.

  When the language code is set to `en-US`, the request is routed to an
  enhanced model that is trained on en-US audio; the model does not detect or
  understand other languages. If you feed Spanish audio to the enhanced model,
  transcription still runs, but the results may have low confidence scores, or
  there may be no output at all. This is the expected behavior for audio in a
  language the model was not trained on.
- **Automatic punctuation**: Use the `enableAutomaticPunctuation` option
  to include punctuation in the transcribed text. The default is `false`.
- **Multiple speakers**: Use the `enableSpeakerDiarization` option to identify
  different speakers in a video. In the response, each recognized word includes
  a `speakerTag` field that identifies the speaker to whom the word is
  attributed.
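Putting the options above together, a sketch of a fuller configuration might
look like the following. The hint phrase, track number, and URI are
illustrative placeholders, not values the API requires.

```python
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

config = videointelligence.SpeechTranscriptionConfig(
    language_code="en-US",
    max_alternatives=2,    # up to 30 alternative transcriptions
    filter_profanity=True, # mask known profanities with asterisks
    speech_contexts=[
        # Hypothetical hint phrase; supply terms that occur in your audio.
        videointelligence.SpeechContext(phrases=["Cloud Video Intelligence"])
    ],
    audio_tracks=[0],      # up to two tracks from a multi-track video
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
)

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
        "input_uri": "gs://your-bucket/your-video.mp4",  # placeholder URI
        "video_context": videointelligence.VideoContext(
            speech_transcription_config=config
        ),
    }
)
result = operation.result(timeout=600)

for transcription in result.annotation_results[0].speech_transcriptions:
    # The first alternative is the most likely transcription and carries
    # the word-level entries; further alternatives do not.
    alternative = transcription.alternatives[0]
    print(alternative.transcript)
    for word_info in alternative.words:
        # With diarization enabled, each word carries a speaker_tag.
        print(word_info.word, word_info.speaker_tag)
```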
For best results, provide audio recorded at a sampling rate of 16,000 Hz or
greater.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-03 UTC."],[],[],null,["# Speech transcription\n\n*Speech Transcription* transcribes spoken audio in a video or video segment\ninto text and returns blocks of text for each portion of the transcribed audio.\n\nSupported models\n----------------\n\nThe Video Intelligence only supports English (US). For other languages, use\nthe Speech-to-Text API, which supports all available languages. For the list of\navailable languages, see [Language\nsupport](/speech-to-text/docs/speech-to-text-supported-languages) in the\nSpeech-to-Text documentation.\n\nTo transcribe speech from a video, call the\n[`annotate`](/video-intelligence/docs/reference/rest/v1/videos/annotate)\nmethod and specify\n[`SPEECH_TRANSCRIPTION`](/video-intelligence/docs/reference/rest/v1/videos#Feature)\nin the `features` field.\n\nYou can use the following features when transcribing speech:\n\n- **Alternative words** : Use the `maxAlternatives` option to specify\n the maximum number of options for recognized text translations to include in the\n response. This value can be an integer from 1 to 30. The default is 1.\n The API returns multiple transcriptions in descending order based on\n the confidence value for the transcription. Alternative transcriptions\n do not include word-level entries.\n\n- **Profanity filtering** : Use the `filterProfanity` option to filter out known\n profanities in transcriptions. Matched words are replaced with the leading\n character of the word followed by asterisks. The default is false.\n\n- **Transcription hints** : Use the `speechContexts` option to provide common or\n unusual phrases in your audio. Those phrases are then used to assist the\n transcription service to create more accurate transcriptions. You provide\n a transcription hint as a\n [SpeechContext](/video-intelligence/docs/reference/rest/v1/videos#SpeechContext)\n object.\n\n- **Audio track selection** : Use the `audioTracks` option to specify which track\n to transcribe from multi-track video. Users can specify up to two tracks.\n Default is 0.\n Once the language code is set to en-US, the request is routed to the enhanced\n mode, which is trained on en-US audio; it does not really *know* en-US or\n any other languages per se. If we feed a Spanish audio into the enhanced model,\n transcription will run its course but there may be outputs with low confidence\n scores, or no output at all -- which is what is expected of a good model.\n\n- **Automatic punctuation** : Use the `enableAutomaticPunctuation` option\n to include punctuation in the transcribed text. The default is false.\n\n- **Multiple speakers** : Use the `enableSpeakerDiarization` option to identify\n different speakers in a video. 
In the response, each recognized word includes\n a `speakerTag` field that identifies which speaker the recognized word is\n attributed to.\n\nFor best results, provide audio recorded at 16,000Hz or greater sampling rate.\n\nCheck out the [Video Intelligence API visualizer](https://zackakil.github.io/video-intelligence-api-visualiser/#Speech%20Transcription) to see this feature in action.\n\nFor examples of requesting speech transcription,\nsee [Speech Transcription](/video-intelligence/docs/transcription)."]]