- HTTP request
- Path parameters
- Request body
- Response body
- Authorization scopes
- IAM Permissions
- SpeechRecognitionResult
- SpeechRecognitionAlternative
- WordInfo
- RecognitionResponseMetadata
Performs synchronous Speech recognition: receive results after all audio has been sent and processed.
HTTP request
POST https://{endpoint}/v2/{recognizer=projects/*/locations/*/recognizers/*}:recognize
Where {endpoint} is one of the supported service endpoints.
The URLs use gRPC Transcoding syntax.
Path parameters
Parameters | |
---|---|
recognizer | Required. The name of the Recognizer to use during recognition. The expected format is |
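As an illustration of the URL template above, the following minimal sketch assembles the request URL from an endpoint and a recognizer name. The helper name `build_recognize_url`, the endpoint `speech.googleapis.com`, and the project and recognizer IDs are assumptions chosen for the example, not values stated on this page.

```python
import re

# Shape taken from the URL template: projects/*/locations/*/recognizers/*
_RECOGNIZER_NAME = re.compile(r"projects/[^/]+/locations/[^/]+/recognizers/[^/]+")


def build_recognize_url(endpoint: str, recognizer: str) -> str:
    """Return the POST URL for the recognize method (hypothetical helper)."""
    if not _RECOGNIZER_NAME.fullmatch(recognizer):
        raise ValueError(f"unexpected recognizer name: {recognizer!r}")
    return f"https://{endpoint}/v2/{recognizer}:recognize"


# Assumed values; substitute a supported endpoint and your own resource name.
url = build_recognize_url(
    "speech.googleapis.com",
    "projects/my-project/locations/global/recognizers/my-recognizer",
)
print(url)
```

Validating the name locally only catches malformed paths early; the service remains the authority on whether the recognizer exists.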
Request body
The request body contains data with the following structure:
JSON representation |
---|
{ "config": { object ( |
Fields | |
---|---|
config | Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the |
config | The list of fields in This is a comma-separated list of fully qualified names of fields. Example: |
Union field audio_source. The audio source, which is either inline content or a Google Cloud Storage URI. audio_source can be only one of the following: | |
content | The audio data bytes encoded as specified in A base64-encoded string. |
uri | URI that points to a file that contains audio data bytes as specified in |
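The `audio_source` union can be sketched as a small builder that accepts either inline bytes or a URI. The helper name `build_recognize_body` is hypothetical, the `config` object is passed through as an opaque dictionary because its fields are not listed here, and requiring exactly one source is a choice made for this example.

```python
import base64
import json
from typing import Optional


def build_recognize_body(config: dict,
                         content: Optional[bytes] = None,
                         uri: Optional[str] = None) -> dict:
    """Build a request body with exactly one audio source (hypothetical helper)."""
    # audio_source is a union: content and uri must not be set together.
    if (content is None) == (uri is None):
        raise ValueError("set exactly one of content or uri")
    body = {"config": config}
    if content is not None:
        # Inline audio travels as a base64-encoded string in JSON.
        body["content"] = base64.b64encode(content).decode("ascii")
    else:
        body["uri"] = uri
    return body


# Placeholder audio bytes and an empty config, for illustration only.
inline_body = build_recognize_body({}, content=b"\x00\x01\x02")
# Hypothetical Cloud Storage location.
uri_body = build_recognize_body({}, uri="gs://my-bucket/audio.wav")
print(json.dumps(inline_body))
print(json.dumps(uri_body))
```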
Response body
Response message for the recognizers.recognize method.
If successful, the response body contains data with the following structure:
JSON representation |
---|
{ "results": [ { object ( |
Fields | |
---|---|
results[] | Sequential list of transcription results corresponding to sequential portions of audio. |
metadata | Metadata about the recognition. |
Authorization scopes
Requires the following OAuth scope:
https://www.googleapis.com/auth/cloud-platform
For more information, see the Authentication Overview.
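A minimal sketch of attaching credentials to the call, assuming the access token (obtained elsewhere with the scope above) is sent as a Bearer token in the `Authorization` header. The helper name, URL, bucket, and token string are placeholders, and the request is constructed but not sent.

```python
import json
import urllib.request


def make_recognize_request(url: str, body: dict,
                           access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: standard OAuth 2.0 Bearer token header.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = make_recognize_request(
    # Assumed endpoint and resource name.
    "https://speech.googleapis.com/v2/projects/my-project/locations/global"
    "/recognizers/my-recognizer:recognize",
    {"config": {}, "uri": "gs://my-bucket/audio.wav"},
    "ACCESS_TOKEN",  # placeholder, not a real token
)

# Sending requires valid credentials and network access:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```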
IAM Permissions
Requires the following IAM permission on the recognizer resource:
speech.recognizers.recognize
For more information, see the IAM documentation.
SpeechRecognitionResult
A speech recognition result corresponding to a portion of the audio.
JSON representation |
---|
{
"alternatives": [
{
object ( |
Fields | |
---|---|
alternatives[] | May contain one or more recognition hypotheses. These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
channel | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For |
result | Time offset of the end of this result relative to the beginning of the audio. A duration in seconds with up to nine fractional digits, ending with ' |
language | Output only. The BCP-47 language tag of the language in this result. This language code was detected as the most likely to be spoken in the audio. |
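Because alternatives are ordered with the most probable first, a full transcript can be assembled by taking the first alternative of each sequential result. The sketch below assumes a response already parsed into a dictionary; the helper name and the sample payload are invented for illustration.

```python
def top_transcripts(response: dict) -> list:
    """Collect the first (most probable) transcript of each result."""
    transcripts = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:  # skip results that carry no hypotheses
            transcripts.append(alternatives[0].get("transcript", ""))
    return transcripts


# Invented sample response, shaped like the JSON representation above.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "hello world", "confidence": 0.92},
                          {"transcript": "hello word"}]},
        {"alternatives": []},
        {"alternatives": [{"transcript": "how are you"}]},
    ]
}

full_text = " ".join(top_transcripts(sample_response))
print(full_text)
```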
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
JSON representation |
---|
{
"transcript": string,
"confidence": number,
"words": [
{
object ( |
Fields | |
---|---|
transcript | Transcript text representing the words that the user spoke. |
confidence | The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where |
words[] | A list of word-specific information for each recognized word. When the |
WordInfo
Word-specific information for recognized words.
JSON representation |
---|
{ "startOffset": string, "endOffset": string, "word": string, "confidence": number, "speakerLabel": string } |
Fields | |
---|---|
startOffset | Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if A duration in seconds with up to nine fractional digits, ending with ' |
endOffset | Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if A duration in seconds with up to nine fractional digits, ending with ' |
word | The word corresponding to this set of information. |
confidence | The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where |
speakerLabel | A distinct label is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. |
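The offset fields arrive as duration strings rather than numbers. The sketch below converts them and computes how long each word was spoken. It assumes the strings end with the suffix `s` (for example `"0.400s"`), which is the usual JSON form of a protobuf Duration. The helper names and the sample words are invented.

```python
from decimal import Decimal


def parse_duration(value: str) -> Decimal:
    """Convert a duration string such as "1.500s" to seconds.

    Assumption: the value ends with the suffix "s". Decimal keeps all
    nine possible fractional digits exact.
    """
    if not value.endswith("s"):
        raise ValueError(f"not a duration string: {value!r}")
    return Decimal(value[:-1])


def word_durations(words: list) -> list:
    """Return (word, seconds spoken) pairs for entries that carry offsets."""
    spans = []
    for info in words:
        if "startOffset" in info and "endOffset" in info:
            start = parse_duration(info["startOffset"])
            end = parse_duration(info["endOffset"])
            spans.append((info["word"], end - start))
    return spans


# Invented sample data shaped like the WordInfo JSON representation.
sample_words = [
    {"startOffset": "0.400s", "endOffset": "0.900s", "word": "hello", "speakerLabel": "1"},
    {"startOffset": "0.900s", "endOffset": "1.250s", "word": "world", "speakerLabel": "1"},
    {"word": "untimed"},  # offsets may be absent
]

for word, seconds in word_durations(sample_words):
    print(word, seconds)
```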
RecognitionResponseMetadata
Metadata about the recognition request and response.
JSON representation |
---|
{ "requestId": string, "totalBilledDuration": string } |
Fields | |
---|---|
requestId | Global request identifier auto-generated by the API. |
totalBilledDuration | When available, billed audio seconds for the corresponding request. A duration in seconds with up to nine fractional digits, ending with ' |