
Voice transcription

Voice transcription lets you convert your streaming audio data to transcribed text in real time. Agent Assist makes suggestions based on text, so audio data must be converted before it can be used. You can also use transcribed streaming audio with Conversational Insights to gather real-time data about agent conversations (for example, topic modeling).

There are two ways to transcribe streaming audio for use with Agent Assist: by using the SIPREC feature, or by making gRPC calls with audio data as the payload. This page describes the process of transcribing streaming audio data using gRPC calls.

Voice transcription works by using Speech-to-Text streaming speech recognition. Speech-to-Text offers multiple recognition models, standard and enhanced. Voice transcription is supported at the GA level only when it is used with the telephony model.

Prerequisites

Create a project in Google Cloud.
Enable the Dialogflow API.
Contact your Google representative to make sure that your account has access to the Speech-to-Text enhanced models.

Create a conversation profile

To create a conversation profile, use the Agent Assist console or call the create method on the ConversationProfile resource directly.

For voice transcription, we recommend that you configure ConversationProfile.stt_config as the default InputAudioConfig when sending audio data in a conversation.
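
As a rough illustration, a conversation profile request body that carries a default speech configuration might look like the dictionary built below. This is a sketch, not the authoritative schema: the REST-style field names (sttConfig, model, speechModelVariant) are assumptions modeled on the settings discussed on this page, and build_conversation_profile_body and the display name are hypothetical. Check the ConversationProfile reference before relying on any of them.

```python
import json


def build_conversation_profile_body(display_name, language_code="en-US"):
    """Return a hypothetical ConversationProfile body with a default STT config."""
    return {
        "displayName": display_name,
        "languageCode": language_code,
        # Assumed field names; they mirror the InputAudioConfig settings
        # (model and model variant) described on this page.
        "sttConfig": {
            "model": "telephony",
            "speechModelVariant": "USE_BEST_AVAILABLE",
        },
    }


if __name__ == "__main__":
    body = build_conversation_profile_body("voice-transcription-demo")
    print(json.dumps(body, indent=2))
```

Keeping the speech settings in the profile means that individual streams only need to override the fields that differ for a given call.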

Get transcriptions at conversation runtime

To get transcriptions at conversation runtime, you need to create participants for the conversation and send audio data for each participant.

Create participants

There are three types of participant. See the reference documentation for more details about their roles. Call the create method on the participant and specify the role. Only an END_USER or a HUMAN_AGENT participant can call StreamingAnalyzeContent, which is required to get a transcription.
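
The role restriction can be checked on the client before a stream is opened. The following sketch is illustrative only: the Role enum and the can_stream_audio helper are hypothetical names, not part of any client library, and the check simply encodes the rule stated above.

```python
from enum import Enum


class Role(Enum):
    """The three participant roles."""

    HUMAN_AGENT = "HUMAN_AGENT"
    AUTOMATED_AGENT = "AUTOMATED_AGENT"
    END_USER = "END_USER"


# Roles that may call StreamingAnalyzeContent, per the rule stated above.
STREAMING_ROLES = frozenset({Role.END_USER, Role.HUMAN_AGENT})


def can_stream_audio(role):
    """Return True if a participant with this role can be transcribed.

    Accepts either a Role member or its string value; an unknown string
    raises ValueError.
    """
    return Role(role) in STREAMING_ROLES


if __name__ == "__main__":
    for role in Role:
        print(role.value, can_stream_audio(role))
```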

Send audio data and get a transcript

You can use StreamingAnalyzeContent to send a participant's audio to Google and get a transcription, with the following parameters:

The first request in the stream must be InputAudioConfig. (The fields configured here override the corresponding settings in ConversationProfile.stt_config.) Don't send any audio input until the second request.
- audioEncoding must be set to AUDIO_ENCODING_LINEAR_16 or AUDIO_ENCODING_MULAW.
- model: This is the Speech-to-Text model that you want to use to transcribe your audio. Set this field to telephony. The variant doesn't affect transcription quality, so you can leave the speech model variant unspecified or choose Use best available.
- singleUtterance should be set to false for the best transcription quality. You should not expect END_OF_SINGLE_UTTERANCE if singleUtterance is false, but you can rely on isFinal==true inside StreamingAnalyzeContentResponse.recognition_result to half-close the stream.
- Optional additional parameters: The following parameters are optional. To get access to these parameters, contact your Google representative.
  - languageCode: The language_code of the audio. The default value is en-US.
  - alternativeLanguageCodes: This is a preview feature. Additional languages that might be detected in the audio. Agent Assist uses the language_code field to automatically detect the language at the beginning of the audio and uses it by default in all following conversation turns. The alternativeLanguageCodes field lets you specify more options for Agent Assist to choose from.
  - phraseSets: The resource name of the Speech-to-Text model adaptation phraseSet. To use model adaptation with voice transcription, you must first create the phraseSet using the Speech-to-Text API and specify the resource name here.
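
The ordering rule (configuration first, audio only from the second request onward) can be sketched with plain dictionaries standing in for StreamingAnalyzeContentRequest messages. The build_requests helper and the participant resource name are hypothetical, and the snake_case keys are assumptions modeled on the parameters listed above; a real client would build the corresponding request messages instead.

```python
def build_requests(participant_name, audio_chunks, sample_rate_hertz=16000):
    """Yield one config-only request, then one audio-only request per chunk."""
    # First request: configuration only, no audio payload.
    yield {
        "participant": participant_name,
        "audio_config": {
            "audio_encoding": "AUDIO_ENCODING_LINEAR_16",
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": "en-US",
            "model": "telephony",
            "single_utterance": False,
        },
    }
    # Second and later requests: audio payload only. Empty chunks are skipped.
    for chunk in audio_chunks:
        if chunk:
            yield {"input_audio": chunk}


if __name__ == "__main__":
    # Hypothetical resource name, for illustration only.
    name = "projects/my-project/conversations/my-conv/participants/my-part"
    for request in build_requests(name, [b"\x00\x01", b"\x02\x03"]):
        print(sorted(request))
```

Writing the sequence as a generator matches how the Python sample later on this page feeds microphone audio into the stream.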
After you send the second request with the audio payload, you should start receiving StreamingAnalyzeContentResponses from the stream.
- You can half-close the stream (or, in some programming languages such as Python, stop sending requests) when you see is_final set to true in StreamingAnalyzeContentResponse.recognition_result.
- After you half-close the stream, the server sends back the response with the final transcript, along with potential Dialogflow suggestions or Agent Assist suggestions.
You can find the final transcription in the following locations:
- StreamingAnalyzeContentResponse.message.content.
- If you enable Pub/Sub notifications, you can also see the transcription in Pub/Sub.
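
The response handling described above can be sketched without a network connection by treating each response as a plain dictionary. This is a simulation under stated assumptions: consume_stream is a hypothetical helper, and the dictionaries only imitate the recognition_result and message fields of StreamingAnalyzeContentResponse.

```python
def consume_stream(responses):
    """Return (interim_transcripts, final_transcript) from simulated responses."""
    interim = []
    final_transcript = None
    half_closed = False
    for response in responses:
        result = response.get("recognition_result")
        if result and not half_closed:
            if result.get("is_final"):
                # A real client would half-close (stop sending audio) here.
                half_closed = True
            else:
                interim.append(result["transcript"])
        message = response.get("message")
        if message and message.get("content"):
            # After the half-close, the final transcript is in message.content.
            final_transcript = message["content"]
    return interim, final_transcript


if __name__ == "__main__":
    simulated = [
        {"recognition_result": {"transcript": "hello", "is_final": False}},
        {"recognition_result": {"transcript": "hello world", "is_final": True}},
        {"message": {"content": "hello world"}},
    ]
    print(consume_stream(simulated))
```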
Start a new stream after the previous stream is closed.
- Audio re-send: For the best transcription quality, audio data generated between the last speech_end_offset of the response with is_final=true and the start time of the new stream must be re-sent to StreamingAnalyzeContent.
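
The amount of audio to re-send follows from the offset and the audio format. The following sketch assumes 16-bit mono linear PCM and a buffer that holds everything sent on the stream that just closed; audio_to_resend is a hypothetical helper, not part of the sample code on this page.

```python
BYTES_PER_SAMPLE = 2  # Assumption: 16-bit linear PCM, single channel.


def audio_to_resend(buffered_audio, speech_end_offset_ms, sample_rate_hertz=16000):
    """Return the bytes recorded after speech_end_offset_ms.

    buffered_audio must start at the first byte of the stream that just
    closed, because speech_end_offset is relative to the stream start.
    """
    processed_samples = speech_end_offset_ms * sample_rate_hertz // 1000
    processed_bytes = processed_samples * BYTES_PER_SAMPLE
    # Clamp so that an offset past the end of the buffer yields no audio.
    processed_bytes = max(0, min(processed_bytes, len(buffered_audio)))
    return buffered_audio[processed_bytes:]


if __name__ == "__main__":
    one_second = bytes(16000 * BYTES_PER_SAMPLE)
    print(len(audio_to_resend(one_second, 750)))
```

Slicing forward from the processed byte count, rather than backward from the end of the buffer, avoids accidentally returning the whole buffer when nothing is left to re-send.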
The following diagram illustrates how the stream works.

Streaming recognition request code sample

The following code sample illustrates how to send a streaming transcription request:

Python

To authenticate to Agent Assist, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Google Cloud Dialogflow API sample code using the StreamingAnalyzeContent
API.

Also please contact Google to get credentials of this project and set up the
credential file json locations by running:
export GOOGLE_APPLICATION_CREDENTIALS=<cred_json_file_location>

Example usage:
    export GOOGLE_CLOUD_PROJECT='cloud-contact-center-ext-demo'
    export CONVERSATION_PROFILE='FnuBYO8eTBWM8ep1i-eOng'
    export GOOGLE_APPLICATION_CREDENTIALS='/Users/ruogu/Desktop/keys/cloud-contact-center-ext-demo-78798f9f9254.json'
    python streaming_transcription.py

Then start talking in English; you should see the transcription appear as you speak.

Say "Quit" or "Exit" to stop.
"""

import os
import re
import sys

from google.api_core.exceptions import DeadlineExceeded

import pyaudio

import queue

import conversation_management
import participant_management

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
CONVERSATION_PROFILE_ID = os.getenv("CONVERSATION_PROFILE")

# Audio recording parameters
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms
RESTART_TIMEOUT = 160  # seconds
MAX_LOOKBACK = 3  # seconds

YELLOW = "\033[0;33m"


class ResumableMicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk_size):
        self._rate = rate
        self.chunk_size = chunk_size
        self._num_channels = 1
        self._buff = queue.Queue()
        self.is_final = False
        self.closed = True
        # Count the number of times the stream analyze content restarts.
        self.restart_counter = 0
        self.last_start_time = 0
        # Time end of the last is_final in millisec since last_start_time.
        self.is_final_offset = 0
        # Save the audio chunks generated from the start of the audio stream for
        # replay after restart.
        self.audio_input_chunks = []
        self.new_stream = True
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=self._num_channels,
            rate=self._rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

    def __enter__(self):
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, *args, **kwargs):
        """Continuously collect data from the audio stream into the buffer,
        one chunk at a time."""

        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        """Stream Audio from microphone to API and to local buffer"""
        try:
            # Handle restart.
            print("restart generator")
            # Flip the bit of is_final so it can continue stream.
            self.is_final = False
            total_processed_time = self.last_start_time + self.is_final_offset
            # Offsets are in milliseconds; 16-bit samples take 2 bytes each.
            processed_bytes_length = int(
                total_processed_time * SAMPLE_RATE * 16 / 8 / 1000
            )
            self.last_start_time = total_processed_time
            # Send out the bytes stored in self.audio_input_chunks that come
            # after processed_bytes_length.
            if processed_bytes_length != 0:
                audio_bytes = b"".join(self.audio_input_chunks)
                # Look back for unprocessed audio data.
                need_to_process_length = min(
                    len(audio_bytes) - processed_bytes_length,
                    int(MAX_LOOKBACK * SAMPLE_RATE * 16 / 8),
                )
                # Skip the replay when nothing is pending: a slice that starts
                # at -0 would otherwise return the whole buffer.
                if need_to_process_length > 0:
                    yield audio_bytes[-need_to_process_length:]

            while not self.closed and not self.is_final:
                data = []
                # Use a blocking get() to ensure there's at least one chunk of
                # data, and stop iteration if the chunk is None, indicating the
                # end of the audio stream.
                chunk = self._buff.get()

                if chunk is None:
                    return
                data.append(chunk)
                # Now try to get the rest of the chunks, if any are left in _buff.
                while True:
                    try:
                        chunk = self._buff.get(block=False)

                        if chunk is None:
                            return
                        data.append(chunk)

                    except queue.Empty:
                        break
                self.audio_input_chunks.extend(data)
                if data:
                    yield b"".join(data)
        finally:
            print("Stop generator")


def main():
    """start bidirectional streaming from microphone input to Dialogflow API"""
    # Create conversation.
    conversation = conversation_management.create_conversation(
        project_id=PROJECT_ID, conversation_profile_id=CONVERSATION_PROFILE_ID
    )

    conversation_id = conversation.name.split("conversations/")[1].rstrip()

    # Create end user participant.
    end_user = participant_management.create_participant(
        project_id=PROJECT_ID, conversation_id=conversation_id, role="END_USER"
    )
    participant_id = end_user.name.split("participants/")[1].rstrip()

    mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    print(mic_manager.chunk_size)
    sys.stdout.write(YELLOW)
    sys.stdout.write('\nListening, say "Quit" or "Exit" to stop.\n\n')
    sys.stdout.write("End (ms)       Transcript Results/Status\n")
    sys.stdout.write("=====================================================\n")

    with mic_manager as stream:
        while not stream.closed:
            terminate = False
            while not terminate:
                try:
                    print(f"New Streaming Analyze Request: {stream.restart_counter}")
                    stream.restart_counter += 1
                    # Send request to streaming and get response.
                    responses = participant_management.analyze_content_audio_stream(
                        conversation_id=conversation_id,
                        participant_id=participant_id,
                        sample_rate_herz=SAMPLE_RATE,
                        stream=stream,
                        timeout=RESTART_TIMEOUT,
                        language_code="en-US",
                        single_utterance=False,
                    )

                    # Now, print the final transcription responses to user.
                    for response in responses:
                        if response.message:
                            print(response)
                        if response.recognition_result.is_final:
                            print(response)
                            # offset return from recognition_result is relative
                            # to the beginning of audio stream.
                            offset = response.recognition_result.speech_end_offset
                            stream.is_final_offset = int(
                                offset.seconds * 1000 + offset.microseconds / 1000
                            )
                            transcript = response.recognition_result.transcript
                            # Half-close the stream with gRPC (in Python just stop yielding requests)
                            stream.is_final = True
                            # Exit recognition if any of the transcribed phrase could be
                            # one of our keywords.
                            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                                sys.stdout.write(YELLOW)
                                sys.stdout.write("Exiting...\n")
                                terminate = True
                                stream.closed = True
                                break
                except DeadlineExceeded:
                    print("Deadline Exceeded, restarting.")

            if terminate:
                conversation_management.complete_conversation(
                    project_id=PROJECT_ID, conversation_id=conversation_id
                )
                break


if __name__ == "__main__":
    main()