拡張モデルを使用して音声を認識する

このページでは、Speech-to-Text に音声文字変換リクエストを送信するときに拡張音声認識モデルを指定する方法について説明します。

現在、電話通話モデルと動画モデルの 2 つの拡張モデルがあります。これらのモデルは、これらの特定のソースの音声データをより正確に文字変換できるように最適化されています。サポート対象言語のページで、拡張モデルがご希望の言語で使用可能かを確認してください。

Google では、データロギングを介して収集されたデータに基づいて拡張モデルを作成し、改善しています。拡張モデルを使用するためにデータロギングを有効にする必要はありませんが、有効にすれば、Google のこうしたモデル改善の取り組みにご協力いただくことになり、使用料も割引となります。

拡張認識モデルを使用するには、RecognitionConfig で、次のフィールドを設定します。

useEnhanced を true に設定します。
model フィールドに phone_call またはvideo の文字列を渡します。

Speech-to-Text では、speech:recognize、speech:longrunningrecognize、ストリーミングのどの音声認識方法でも拡張モデルがサポートされます。

次のサンプルコードは、音声文字変換リクエストで拡張モデルの使用を指定する方法を示しています。

プロトコル

詳細については、speech:recognize API エンドポイントをご覧ください。

同期音声認識を実行するには、POST リクエストを作成し、適切なリクエスト本文を指定します。次は、curl を使用した POST リクエストの例です。この例では、Google Cloud CLI を使用してアクセストークンを生成します。gcloud CLI のインストール手順については、クイックスタートをご覧ください。

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "enableWordTimeOffsets": false,
        "enableAutomaticPunctuation": true,
        "model": "phone_call",
        "useEnhanced": true
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    }
}'

リクエスト本文の構成の詳細については、RecognitionConfig のリファレンスドキュメントをご覧ください。

リクエストが成功すると、サーバーは 200 OK HTTP ステータスコードと JSON 形式のレスポンスを返します。

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Hi, I'd like to buy a Chromecast. I was wondering whether you could help me with that.",
          "confidence": 0.8930228
        }
      ],
      "resultEndTime": "5.640s"
    },
    {
      "alternatives": [
        {
          "transcript": " Certainly, which color would you like? We are blue black and red.",
          "confidence": 0.9101991
        }
      ],
      "resultEndTime": "10.220s"
    },
    {
      "alternatives": [
        {
          "transcript": " Let's go with the black one.",
          "confidence": 0.8818244
        }
      ],
      "resultEndTime": "13.870s"
    },
    {
      "alternatives": [
        {
          "transcript": " Would you like the new Chromecast Ultra model or the regular Chromecast?",
          "confidence": 0.94733626
        }
      ],
      "resultEndTime": "18.460s"
    },
    {
      "alternatives": [
        {
          "transcript": " Regular Chromecast is fine. Thank you. Okay. Sure. Would you like to ship it regular or Express?",
          "confidence": 0.9519095
        }
      ],
      "resultEndTime": "25.930s"
    },
    {
      "alternatives": [
        {
          "transcript": " Express, please.",
          "confidence": 0.9101229
        }
      ],
      "resultEndTime": "28.260s"
    },
    {
      "alternatives": [
        {
          "transcript": " Terrific. It's on the way. Thank you. Thank you very much. Bye.",
          "confidence": 0.9321616
        }
      ],
      "resultEndTime": "34.150s"
    }
 ]
}

Go

Speech-to-Text 用のクライアントライブラリをインストールして使用する方法については、Speech-to-Text クライアントライブラリをご覧ください。詳細については、Speech-to-Text の Go API リファレンスドキュメントをご覧ください。

Speech-to-Text に対する認証を行うには、アプリケーションのデフォルト認証情報を設定します。詳細については、ローカル開発環境の認証を設定するをご覧ください。


func enhancedModel(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/commercial_mono.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 8000,
			LanguageCode:    "en-US",
			UseEnhanced:     true,
			// A model must be specified to use enhanced model.
			Model: "phone_call",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	})
	if err != nil {
		return fmt.Errorf("client.Recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Python

Speech-to-Text 用のクライアントライブラリをインストールして使用する方法については、Speech-to-Text クライアントライブラリをご覧ください。詳細については、Speech-to-Text の Python API リファレンスドキュメントをご覧ください。


from google.cloud import speech


def transcribe_file_with_enhanced_model(audio_file: str) -> speech.RecognizeResponse:
    """Transcribe the given audio file using an enhanced model.
    Args:
        audio_file (str): Path to the local audio file to be transcribed.
            Example: "resources/commercial_mono.wav"
    Returns:
        speech.RecognizeResponse: The response containing the transcription results.
    """

    client = speech.SpeechClient()

    # audio_file = 'resources/commercial_mono.wav'
    with open(audio_file, "rb") as f:
        audio_content = f.read()

    audio = speech.RecognitionAudio(content=audio_content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        use_enhanced=True,
        # A model must be specified to use enhanced model.
        model="phone_call",
    )

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print(f"First alternative of result {i}")
        print(f"Transcript: {alternative.transcript}")

    return response

Java

Speech-to-Text 用のクライアントライブラリをインストールして使用する方法については、Speech-to-Text クライアントライブラリをご覧ください。詳細については、Speech-to-Text の Java API リファレンスドキュメントをご覧ください。

/**
 * Transcribe the given audio file using an enhanced model.
 *
 * @param fileName the path to an audio file.
 */
public static void transcribeFileWithEnhancedModel(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speechClient = SpeechClient.create()) {
    // Get the contents of the local audio file
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    // Configure request to enable enhanced models
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setUseEnhanced(true)
            // A model must be specified to use enhanced model.
            .setModel("phone_call")
            .build();

    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);

    // Print out the results
    for (SpeechRecognitionResult result : recognizeResponse.getResultsList()) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternatives(0);
      System.out.format("Transcript: %s\n\n", alternative.getTranscript());
    }
  }
}

Node.js

Speech-to-Text 用のクライアントライブラリをインストールして使用する方法については、Speech-to-Text クライアントライブラリをご覧ください。詳細については、Speech-to-Text の Node.js API リファレンスドキュメントをご覧ください。

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  languageCode: languageCode,
  useEnhanced: true,
  model: 'phone_call',
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
response.results.forEach(result => {
  const alternative = result.alternatives[0];
  console.log(alternative.transcript);
});

その他の言語

C#: クライアントライブラリページの C# の設定手順を行ってから、.NET の Speech-to-Text のリファレンスドキュメントをご覧ください。

PHP: クライアントライブラリページの PHP の設定手順を行ってから、PHP の Speech-to-Text のリファレンスドキュメントをご覧ください。

Ruby: クライアントライブラリページの Ruby の設定手順を行ってから、Ruby の Speech-to-Text のリファレンスドキュメントをご覧ください。

次のステップ

音声文字の同期変換をリクエストする方法を確認する。

拡張モデルを使用して音声を認識する コレクションでコンテンツを整理 必要に応じて、コンテンツの保存と分類を行います。