
Transcription

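Transcription converts streaming audio data into text in real time. Agent Assist generates suggestions from text, so audio data must be converted before it can be used. You can also use transcribed streaming audio with Conversational Insights to gather real-time data about agent conversations (for example, topic modeling).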

There are two ways to transcribe streaming audio for use with Agent Assist: use the SIPREC feature, or make gRPC calls with the audio data as the payload. This page describes how to transcribe streaming audio data using gRPC calls.

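Transcription is performed with Speech-to-Text streaming speech recognition. Speech-to-Text offers multiple recognition models, both standard and enhanced. Transcription is supported at the GA level only when it is used with the telephony model.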

Prerequisites

Create a project in Google Cloud.
Enable the Dialogflow API.
Contact your Google representative to confirm that your account has access to the Speech-to-Text enhanced models.

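Create a conversation profile

To create a conversation profile, use the Agent Assist console or call the create method on the ConversationProfile resource directly.

For transcription, we recommend configuring ConversationProfile.stt_config as the default InputAudioConfig used when audio data is sent in a conversation.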
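As a rough illustration of the recommendation above, the body of a create request with a default speech configuration could be assembled as follows. This is a minimal sketch, not the documented procedure: the helper `build_conversation_profile_body` and the display name are hypothetical, and the camelCase field names (`displayName`, `languageCode`, `sttConfig`, `model`) are assumed to match the REST form of ConversationProfile, so verify them against the reference documentation.

```python
import json


def build_conversation_profile_body(display_name, language_code="en-US"):
    """Return a dict shaped like a ConversationProfile create request body.

    The field names are assumptions based on the REST (camelCase) form of
    the resource; check them against the reference before relying on them.
    """
    return {
        "displayName": display_name,
        "languageCode": language_code,
        # Default speech settings applied when audio is sent in a conversation.
        "sttConfig": {
            # The telephony model is the one this page describes as GA.
            "model": "telephony",
        },
    }


body = build_conversation_profile_body("transcription-profile")
print(json.dumps(body, indent=2))
```

Settings placed here act only as defaults; values sent in the first streaming request take precedence over them.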

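Get transcriptions at conversation runtime

To get transcriptions at conversation runtime, create the participants for the conversation and send audio data for each participant.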

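Create participants

There are three types of participant: END_USER, HUMAN_AGENT, and AUTOMATED_AGENT. See the reference documentation for details about these roles. Call the create method on participant and specify the role. Only an END_USER or HUMAN_AGENT participant can call StreamingAnalyzeContent, which is required to get a transcription.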
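The role restriction can be checked on the client side before a stream is opened. The sketch below is illustrative only: `Role`, `build_participant_body`, and `can_stream_for_transcription` are hypothetical names, and the request body shape (a single `role` field) is an assumption rather than the full resource definition.

```python
from enum import Enum


class Role(Enum):
    """Participant roles, named after the API's role values."""

    HUMAN_AGENT = "HUMAN_AGENT"
    AUTOMATED_AGENT = "AUTOMATED_AGENT"
    END_USER = "END_USER"


# Roles that may call StreamingAnalyzeContent to get a transcription.
TRANSCRIBABLE_ROLES = frozenset({Role.END_USER, Role.HUMAN_AGENT})


def build_participant_body(role):
    """Return a minimal (assumed) request body for creating a participant."""
    return {"role": role.value}


def can_stream_for_transcription(role):
    """Return True if a participant with this role can be transcribed."""
    return role in TRANSCRIBABLE_ROLES


for role in Role:
    print(role.value, can_stream_for_transcription(role))
```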

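Send audio data and get a transcript

You can use StreamingAnalyzeContent to send a participant's audio to Google and get a transcription, using the following parameters.

The first request in the stream must carry the InputAudioConfig. (Fields configured here override the corresponding settings in ConversationProfile.stt_config.) Don't send any audio input until the second request.
- audioEncoding must be set to AUDIO_ENCODING_LINEAR_16 or AUDIO_ENCODING_MULAW.
- model: The Speech-to-Text model used to transcribe the audio. Set this field to telephony. The variant does not affect transcription quality, so you can leave Speech model variant unspecified or choose Use best available.
- For the best transcription quality, set singleUtterance to false. When singleUtterance is false, don't expect END_OF_SINGLE_UTTERANCE; instead, rely on isFinal==true in StreamingAnalyzeContentResponse.recognition_result to half-close the stream.
- Optional additional parameters: The following parameters are optional. To get access to them, contact your Google representative.
  - languageCode: The language_code of the audio. The default value is en-US.
  - alternativeLanguageCodes: This is a preview feature. Additional languages that might be detected in the audio. Agent Assist uses the language_code field to detect the language automatically at the beginning of the audio and then uses it as the default for all subsequent conversation turns. The alternativeLanguageCodes field lets you specify more options for Agent Assist to choose from.
  - phraseSets: The resource name of a Speech-to-Text model adaptation phraseSet. To use model adaptation with transcription, first create the phraseSet with the Speech-to-Text API, then specify its resource name here.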
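The ordering rule above (configuration first, audio only from the second request onward) can be sketched with plain dictionaries. This is a simplified model and not the client library: `build_audio_config` and `request_stream` are hypothetical helpers, and the snake_case keys (`audio_config`, `input_audio`, `single_utterance`, and so on) are assumed to mirror the gRPC message fields.

```python
ALLOWED_ENCODINGS = {"AUDIO_ENCODING_LINEAR_16", "AUDIO_ENCODING_MULAW"}


def build_audio_config(sample_rate_hertz,
                       audio_encoding="AUDIO_ENCODING_LINEAR_16",
                       language_code="en-US"):
    """Return an InputAudioConfig-like dict using the settings described above."""
    if audio_encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported audio encoding: {audio_encoding}")
    return {
        "audio_encoding": audio_encoding,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "model": "telephony",
        # False gives the best quality; rely on is_final to half-close.
        "single_utterance": False,
    }


def request_stream(participant_name, audio_config, audio_chunks):
    """Yield the config-only request first, then one request per audio chunk."""
    yield {"participant": participant_name, "audio_config": audio_config}
    for chunk in audio_chunks:
        yield {"input_audio": chunk}


requests = list(
    request_stream("participants/example", build_audio_config(16000), [b"a", b"b"])
)
print(len(requests))
```

Keeping the configuration in its own request means the audio generator can be restarted later without rebuilding the settings.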
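After you send the second request, which carries the audio payload, you start receiving StreamingAnalyzeContentResponses from the stream.
- You can half-close the stream (or, in some languages such as Python, simply stop sending) once is_final is set to true in StreamingAnalyzeContentResponse.recognition_result.
- After you half-close the stream, the server returns a response containing the final transcript, along with any Dialogflow suggestions or Agent Assist suggestions.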
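The half-close behavior can be modeled without a network connection. In the sketch below, `FakeRecognitionResult`, `AudioSource`, and `consume` are hypothetical stand-ins for the real response type and client code; the point is only that the sender stops yielding requests once a result with is_final set to true has been seen.

```python
from dataclasses import dataclass


@dataclass
class FakeRecognitionResult:
    """Stand-in for StreamingAnalyzeContentResponse.recognition_result."""

    transcript: str = ""
    is_final: bool = False


class AudioSource:
    """Yields audio chunks until told to stop sending."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self.stop_sending = False

    def requests(self):
        for chunk in self._chunks:
            if self.stop_sending:
                # Returning ends the request iterator, which is how a Python
                # gRPC client half-closes its side of the stream.
                return
            yield chunk


def consume(results, source):
    """Collect final transcripts and stop the sender once is_final is seen."""
    finals = []
    for result in results:
        if result.is_final:
            source.stop_sending = True
            finals.append(result.transcript)
    return finals


source = AudioSource([b"chunk-1", b"chunk-2", b"chunk-3"])
sender = source.requests()
first_chunk = next(sender)  # one chunk goes out before any final result
finals = consume(
    [FakeRecognitionResult("hello"), FakeRecognitionResult("hello world", True)],
    source,
)
remaining = list(sender)  # empty: the sender stopped after is_final
print(finals, remaining)
```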
You can find the final transcription in the following locations:
- StreamingAnalyzeContentResponse.message.content.
- If you enable Pub/Sub notifications, you can also see the transcription in Pub/Sub.
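Start a new stream after the previous stream is closed.
- Audio re-send: For the best transcription quality, audio generated between the last speech_end_offset of the response with is_final=true and the start time of the new stream must be re-sent to StreamingAnalyzeContent.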
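The amount of audio to re-send follows from the offset and the audio format. The helper below is a minimal sketch with a hypothetical name, `bytes_to_resend`; it assumes uncompressed mono audio with a fixed sample width, buffered from the start of the closed stream, and an offset expressed in milliseconds.

```python
def bytes_to_resend(audio_since_stream_start, speech_end_offset_ms,
                    sample_rate_hertz=16000, bytes_per_sample=2):
    """Return the buffered audio recorded after speech_end_offset.

    audio_since_stream_start holds every byte captured since the closed
    stream began; speech_end_offset_ms comes from the is_final response.
    """
    # Convert the offset to whole samples first so the cut stays aligned
    # to a sample boundary, then convert samples to bytes.
    processed_samples = speech_end_offset_ms * sample_rate_hertz // 1000
    processed_bytes = processed_samples * bytes_per_sample
    # Never slice past the end of what was actually captured.
    processed_bytes = min(processed_bytes, len(audio_since_stream_start))
    return audio_since_stream_start[processed_bytes:]


# One second of 16 kHz, 16-bit silence is 32000 bytes; with the final result
# ending at 750 ms, the last 250 ms (8000 bytes) must go to the new stream.
one_second = bytes(32000)
print(len(bytes_to_resend(one_second, 750)))
```

Slicing from a computed start index, rather than with a negative length, avoids the case where a length of zero would select the whole buffer.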
The following diagram illustrates how the stream works.

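Streaming recognition request code sample

The following code sample shows how to send a streaming transcription request.

Python

To authenticate to Agent Assist, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.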

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Google Cloud Dialogflow API sample code using the StreamingAnalyzeContent
API.

Also please contact Google to get credentials of this project and set up the
credential file json locations by running:
export GOOGLE_APPLICATION_CREDENTIALS=<cred_json_file_location>

Example usage:
    export GOOGLE_CLOUD_PROJECT='cloud-contact-center-ext-demo'
    export CONVERSATION_PROFILE='FnuBYO8eTBWM8ep1i-eOng'
    export GOOGLE_APPLICATION_CREDENTIALS='/Users/ruogu/Desktop/keys/cloud-contact-center-ext-demo-78798f9f9254.json'
    python streaming_transcription.py

Then start talking in English; you should see the transcription appear as you speak.

Say "Quit" or "Exit" to stop.
"""

import os
import re
import sys

from google.api_core.exceptions import DeadlineExceeded

import pyaudio

import queue

import conversation_management
import participant_management

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
CONVERSATION_PROFILE_ID = os.getenv("CONVERSATION_PROFILE")

# Audio recording parameters
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms
RESTART_TIMEOUT = 160  # seconds
MAX_LOOKBACK = 3  # seconds

YELLOW = "\033[0;33m"


class ResumableMicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk_size):
        self._rate = rate
        self.chunk_size = chunk_size
        self._num_channels = 1
        self._buff = queue.Queue()
        self.is_final = False
        self.closed = True
        # Count the number of times the stream analyze content restarts.
        self.restart_counter = 0
        self.last_start_time = 0
        # Time end of the last is_final in millisec since last_start_time.
        self.is_final_offset = 0
        # Save the audio chunks generated from the start of the audio stream for
        # replay after restart.
        self.audio_input_chunks = []
        self.new_stream = True
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=self._num_channels,
            rate=self._rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

    def __enter__(self):
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, *args, **kwargs):
        """Continuously collect data from the audio stream, into the buffer in
        chunksize."""

        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        """Stream Audio from microphone to API and to local buffer"""
        try:
            # Handle restart.
            print("restart generator")
            # Flip the bit of is_final so it can continue stream.
            self.is_final = False
            total_processed_time = self.last_start_time + self.is_final_offset
            # 16-bit mono audio: SAMPLE_RATE * 16 / 8 bytes per second.
            processed_bytes_length = int(
                total_processed_time * SAMPLE_RATE * 16 / 8 / 1000
            )
            self.last_start_time = total_processed_time
            # Send out bytes stored in self.audio_input_chunks that come after
            # processed_bytes_length.
            if processed_bytes_length != 0:
                audio_bytes = b"".join(self.audio_input_chunks)
                # Look back for unprocessed audio data, capped at MAX_LOOKBACK.
                need_to_process_length = min(
                    len(audio_bytes) - processed_bytes_length,
                    int(MAX_LOOKBACK * SAMPLE_RATE * 16 / 8),
                )
                # Guard against a zero or negative length: audio_bytes[-0:]
                # would return the entire buffer instead of nothing.
                if need_to_process_length > 0:
                    yield audio_bytes[-need_to_process_length:]

            while not self.closed and not self.is_final:
                data = []
                # Use a blocking get() to ensure there's at least one chunk of
                # data, and stop iteration if the chunk is None, indicating the
                # end of the audio stream.
                chunk = self._buff.get()

                if chunk is None:
                    return
                data.append(chunk)
                # Now try to get the rest of the chunks if any are left in _buff.
                while True:
                    try:
                        chunk = self._buff.get(block=False)

                        if chunk is None:
                            return
                        data.append(chunk)

                    except queue.Empty:
                        break
                self.audio_input_chunks.extend(data)
                if data:
                    yield b"".join(data)
        finally:
            print("Stop generator")


def main():
    """start bidirectional streaming from microphone input to Dialogflow API"""
    # Create conversation.
    conversation = conversation_management.create_conversation(
        project_id=PROJECT_ID, conversation_profile_id=CONVERSATION_PROFILE_ID
    )

    conversation_id = conversation.name.split("conversations/")[1].rstrip()

    # Create end user participant.
    end_user = participant_management.create_participant(
        project_id=PROJECT_ID, conversation_id=conversation_id, role="END_USER"
    )
    participant_id = end_user.name.split("participants/")[1].rstrip()

    mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    print(mic_manager.chunk_size)
    sys.stdout.write(YELLOW)
    sys.stdout.write('\nListening, say "Quit" or "Exit" to stop.\n\n')
    sys.stdout.write("End (ms)       Transcript Results/Status\n")
    sys.stdout.write("=====================================================\n")

    with mic_manager as stream:
        while not stream.closed:
            terminate = False
            while not terminate:
                try:
                    print(f"New Streaming Analyze Request: {stream.restart_counter}")
                    stream.restart_counter += 1
                    # Send request to streaming and get response.
                    responses = participant_management.analyze_content_audio_stream(
                        conversation_id=conversation_id,
                        participant_id=participant_id,
                        sample_rate_herz=SAMPLE_RATE,
                        stream=stream,
                        timeout=RESTART_TIMEOUT,
                        language_code="en-US",
                        single_utterance=False,
                    )

                    # Now, print the final transcription responses to user.
                    for response in responses:
                        if response.message:
                            print(response)
                        if response.recognition_result.is_final:
                            print(response)
                            # offset return from recognition_result is relative
                            # to the beginning of audio stream.
                            offset = response.recognition_result.speech_end_offset
                            stream.is_final_offset = int(
                                offset.seconds * 1000 + offset.microseconds / 1000
                            )
                            transcript = response.recognition_result.transcript
                            # Half-close the stream with gRPC (in Python just stop yielding requests)
                            stream.is_final = True
                            # Exit recognition if any of the transcribed phrase could be
                            # one of our keywords.
                            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                                sys.stdout.write(YELLOW)
                                sys.stdout.write("Exiting...\n")
                                terminate = True
                                stream.closed = True
                                break
                except DeadlineExceeded:
                    print("Deadline Exceeded, restarting.")

            if terminate:
                conversation_management.complete_conversation(
                    project_id=PROJECT_ID, conversation_id=conversation_id
                )
                break


if __name__ == "__main__":
    main()