API documentation for speech_v1p1beta1.types module.
Classes
Any
API documentation for speech_v1p1beta1.types.Any class.
CancelOperationRequest
API documentation for speech_v1p1beta1.types.CancelOperationRequest class.
DeleteOperationRequest
API documentation for speech_v1p1beta1.types.DeleteOperationRequest class.
Duration
API documentation for speech_v1p1beta1.types.Duration class.
GetOperationRequest
API documentation for speech_v1p1beta1.types.GetOperationRequest class.
ListOperationsRequest
API documentation for speech_v1p1beta1.types.ListOperationsRequest class.
ListOperationsResponse
API documentation for speech_v1p1beta1.types.ListOperationsResponse class.
LongRunningRecognizeMetadata
Describes the progress of a long-running LongRunningRecognize call. It is included in the metadata field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.
Time when the request was received.
LongRunningRecognizeRequest
The top-level message sent by the client for the LongRunningRecognize method.
Required. The audio data to be recognized.
LongRunningRecognizeResponse
The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages. It is included in the result.response field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.
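For orientation, a minimal sketch of this asynchronous flow with the pre-2.0 google-cloud-speech Python client follows; the bucket URI, encoding, and timeout are illustrative assumptions, not values taken from this reference:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
# Hypothetical Cloud Storage URI; long audio is usually referenced by uri.
audio = speech.types.RecognitionAudio(uri="gs://my-bucket/my-audio.wav")

operation = client.long_running_recognize(config, audio)
# operation.metadata carries LongRunningRecognizeMetadata (e.g. progress so far).
response = operation.result(timeout=300)  # blocks; returns LongRunningRecognizeResponse
for result in response.results:
    print(result.alternatives[0].transcript)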
Operation
API documentation for speech_v1p1beta1.types.Operation class.
OperationInfo
API documentation for speech_v1p1beta1.types.OperationInfo class.
RecognitionAudio
Contains audio data in the encoding specified in the RecognitionConfig. Either content or uri must be supplied. Supplying both or neither returns google.rpc.Code.INVALID_ARGUMENT. See content limits (/speech-to-text/quotas#content).
The audio data bytes encoded as specified in RecognitionConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64.
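As a sketch (the file name and URI are hypothetical), the two mutually exclusive ways to populate RecognitionAudio look like this:

from google.cloud import speech_v1p1beta1 as speech

# Inline audio bytes; over gRPC these stay binary, base64 applies only to JSON.
with open("local-audio.wav", "rb") as f:
    inline_audio = speech.types.RecognitionAudio(content=f.read())

# Or reference the audio by Cloud Storage URI instead; never set both fields.
gcs_audio = speech.types.RecognitionAudio(uri="gs://my-bucket/my-audio.wav")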
RecognitionConfig
Provides information to the recognizer that specifies how to process the request.
Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.
This needs to be set to 'true' explicitly and audio_channel_count > 1 to get each channel recognized separately. The recognition result will contain a channel_tag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio.
Optional. A list of up to 3 additional BCP-47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tags, listing possible alternative languages of the supplied audio. See Language Support (/speech-to-text/docs/languages) for a list of the currently supported language codes. If alternative languages are listed, the recognition result will contain recognition in the most likely language detected, including the main language_code. The recognition result will include the language tag of the language detected in the audio. Note: This feature is only supported for Voice Command and Voice Search use cases, and performance may vary for other use cases (e.g., phone call transcription).
Optional. If set to true, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***". If set to false or omitted, profanities won't be filtered out.
Optional. If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false.
Optional. If 'true', adds punctuation to recognition result hypotheses. This feature is only available in select languages. Setting this for requests in other languages has no effect at all. The default 'false' value does not add punctuation to result hypotheses. Note: This is currently offered as an experimental service, complimentary to all users. In the future this may be exclusively available as a premium feature.
Optional. If set, specifies the estimated number of speakers in the conversation. If not set, defaults to '2'. Ignored unless enable_speaker_diarization is set to true.
Optional. Which model to select for the given request. Select the model best suited to your domain to get best results. If a model is not explicitly specified, then we auto-select a model based on the parameters in the RecognitionConfig.
Model | Description
command_and_search | Best for short queries such as voice commands or voice search.
phone_call | Best for audio that originated from a phone call (typically recorded at an 8khz sampling rate).
video | Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16khz or greater sampling rate. This is a premium model that costs more than the standard rate.
default | Best for audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16khz or greater sampling rate.
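A minimal sketch combining several of the fields described above (the values shown are illustrative choices, not defaults):

from google.cloud import speech_v1p1beta1 as speech

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,                        # 8000-48000 valid; 16000 optimal
    language_code="en-US",
    alternative_language_codes=["es-ES", "fr-FR"],  # up to 3 alternatives
    profanity_filter=True,
    enable_word_time_offsets=True,
    enable_automatic_punctuation=True,
    model="video",                                  # see the model table above
)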
RecognitionMetadata
Description of audio data to be recognized.
The industry vertical to which this speech recognition request most closely applies. This is most indicative of the topics contained in the audio. Use the 6-digit NAICS code to identify the industry vertical - see https://www.naics.com/search/.
The original media the speech was recorded on.
The device used to make the recording. Examples: 'Nexus 5X' or 'Polycom SoundStation IP 6000' or 'POTS' or 'VoIP' or 'Cardioid Microphone'.
Obfuscated (privacy-protected) ID of the user, to identify number of unique users using the service.
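A sketch of attaching this metadata to a request; the field values below are assumptions chosen for illustration:

from google.cloud import speech_v1p1beta1 as speech

metadata = speech.types.RecognitionMetadata(
    industry_naics_code_of_audio=519190,   # 6-digit NAICS code for the vertical
    original_media_type=speech.enums.RecognitionMetadata.OriginalMediaType.AUDIO,
    recording_device_name="Polycom SoundStation IP 6000",
    obfuscated_id=1234567890,              # privacy-protected user identifier
)
config = speech.types.RecognitionConfig(
    language_code="en-US",
    metadata=metadata,
)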
RecognizeRequest
The top-level message sent by the client for the Recognize method.
Required. The audio data to be recognized.
RecognizeResponse
The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
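A minimal synchronous sketch (the local file name is a placeholder); the response carries zero or more SpeechRecognitionResult messages:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
with open("short-clip.wav", "rb") as f:
    audio = speech.types.RecognitionAudio(content=f.read())

response = client.recognize(config, audio)  # RecognizeResponse
for result in response.results:
    top = result.alternatives[0]            # most likely hypothesis
    print(top.transcript, top.confidence)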
SpeechContext
Provides "hints" to the speech recognizer to favor specific words and phrases in the results.
Hint Boost. Positive value will increase the probability that a specific phrase will be recognized over other similar sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Negative boost values would correspond to anti-biasing. Anti-biasing is not enabled, so negative boost will simply be ignored. Though boost can accept a wide range of positive values, most use cases are best served with values between 0 and 20. We recommend using a binary search approach to finding the optimal value for your use case.
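As a sketch, hints with a modest boost might be supplied like this (the phrases and the value 10.0 are illustrative; most use cases are served by values between 0 and 20):

from google.cloud import speech_v1p1beta1 as speech

context = speech.types.SpeechContext(
    phrases=["weather forecast", "Hartsfield-Jackson"],  # hypothetical hints
    boost=10.0,  # positive biases toward these phrases; negative values are ignored
)
config = speech.types.RecognitionConfig(
    language_code="en-US",
    speech_contexts=[context],
)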
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or of a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.
SpeechRecognitionResult
A speech recognition result corresponding to a portion of the audio.
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'.
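A multi-channel sketch, assuming two-channel audio and a hypothetical Cloud Storage URI; each returned SpeechRecognitionResult then reports which channel it came from:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
)
audio = speech.types.RecognitionAudio(uri="gs://my-bucket/stereo-call.wav")

response = client.recognize(config, audio)
for result in response.results:
    # channel_tag ranges from 1 to audio_channel_count.
    print(result.channel_tag, result.alternatives[0].transcript)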
Status
API documentation for speech_v1p1beta1.types.Status class.
StreamingRecognitionConfig
Provides information to the recognizer that specifies how to process the request.
Optional. If false or omitted, the recognizer will perform continuous recognition (continuing to wait for and process audio even if the user pauses speaking) until the client closes the input stream (gRPC API) or until the maximum time limit has been reached. May return multiple StreamingRecognitionResult messages with the is_final flag set to true. If true, the recognizer will detect a single spoken utterance. When it detects that the user has paused or stopped speaking, it will return an END_OF_SINGLE_UTTERANCE event and cease recognition. It will return no more than one StreamingRecognitionResult with the is_final flag set to true.
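A sketch of a streaming configuration wrapping a basic RecognitionConfig (the flag values are illustrative):

from google.cloud import speech_v1p1beta1 as speech

recognition_config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
streaming_config = speech.types.StreamingRecognitionConfig(
    config=recognition_config,
    single_utterance=False,  # keep recognizing until the stream is closed
    interim_results=True,    # also return is_final=false hypotheses
)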
StreamingRecognitionResult
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
Output only. If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult; the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.
Output only. Time offset of the end of this result relative to the beginning of the audio.
Output only. The BCP-47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio.
StreamingRecognizeRequest
The top-level message sent by the client for the StreamingRecognize method. Multiple StreamingRecognizeRequest messages are sent. The first message must contain a streaming_config message and must not contain audio data. All subsequent messages must contain audio data and must not contain a streaming_config message.
Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.
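A sketch of how this message sequence is typically produced with the pre-2.0 Python client, whose streaming helper prepends the streaming_config message before the audio-only messages; the chunk size and file name are assumptions:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
streaming_config = speech.types.StreamingRecognitionConfig(
    config=speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,
)

def audio_chunks(path, chunk_size=4096):
    # Every generated message carries only audio_content, never streaming_config.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield speech.types.StreamingRecognizeRequest(audio_content=chunk)

# The helper sends a request containing only streaming_config, then the audio.
responses = client.streaming_recognize(streaming_config, audio_chunks("clip.wav"))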
StreamingRecognizeResponse
StreamingRecognizeResponse is the only message returned to the client by StreamingRecognize. A series of zero or more StreamingRecognizeResponse messages are streamed back to the client. If there is no recognizable audio, and single_utterance is set to false, then no messages are streamed back to the client.
Here's an example of a series of StreamingRecognizeResponse messages that might be returned while processing audio:
1. results { alternatives { transcript: "tube" } stability: 0.01 }
2. results { alternatives { transcript: "to be a" } stability: 0.01 }
3. results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 }
4. results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true }
5. results { alternatives { transcript: " that's" } stability: 0.01 }
6. results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 }
7. results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true }
Notes:
Only two of the above responses (#4 and #7) contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question".
The others contain interim results. #3 and #6 contain two interim results: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high-stability results.
The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary.
In each response, only one of these fields will be set: error, speech_event_type, or one or more (repeated) results.
Output only. This repeated list contains zero or more results that correspond to consecutive portions of the audio currently being processed. It contains zero or one is_final=true result (the newly settled portion), followed by zero or more is_final=false results (the interim results).
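Continuing the streaming sketch above (responses as returned by streaming_recognize), the interim/final distinction is typically consumed like this; the printing is purely illustrative:

for response in responses:
    for result in response.results:
        transcript = result.alternatives[0].transcript
        if result.is_final:
            print("final:", transcript)  # settled portion; confidence is populated
        else:
            print("interim:", transcript, result.stability)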
WaitOperationRequest
API documentation for speech_v1p1beta1.types.WaitOperationRequest class.
WordInfo
Word-specific information for recognized words.
Output only. Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.
Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or of a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.
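A sketch of reading WordInfo after requesting word time offsets (the URI is a placeholder); as noted above, timings and per-word confidence appear only in the top hypothesis:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,
)
audio = speech.types.RecognitionAudio(uri="gs://my-bucket/my-audio.wav")

response = client.recognize(config, audio)
for result in response.results:
    for word in result.alternatives[0].words:
        # start_time and end_time are Duration offsets from the start of the audio.
        print(word.word, word.start_time, word.end_time, word.confidence)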