Transcribing audio with multiple channels

This page describes how to use Speech-to-Text to transcribe audio files that include more than one channel. Multichannel recognition is available for most, but not all, of the audio encodings that Speech-to-Text supports. To see how many channels can be recognized for audio files in each encoding type, see the audioChannelCount field.

Audio data often includes a channel for each speaker present on the recording. For example, audio of two people talking over the phone might contain two channels, where each person's voice is recorded separately.

To transcribe audio data that includes multiple channels, you must provide the number of channels in your request to the Speech-to-Text API. In your request, set the audioChannelCount field to the number of channels present in the audio.
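
If you aren't sure how many channels a local recording contains, you can inspect the file before building the request. The following is a minimal sketch that uses Python's standard wave module; the file name is a hypothetical placeholder for your own WAV recording.

import wave

# Hypothetical local file; substitute the path to your own recording.
with wave.open("commercial_stereo.wav", "rb") as wav_file:
    channel_count = wav_file.getnchannels()  # e.g. 2 for a stereo call recording

print(f"Set audioChannelCount to {channel_count} in the request config.")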

When you send a request with multiple channels, Speech-to-Text returns results that identify the different channels in the audio, labeling the alternatives for each result with the channelTag field.

The following code samples demonstrate how to transcribe audio that contains multiple channels.

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.

The following example shows how to send a POST request using curl, where the body of the request specifies the number of channels present in the audio sample.

curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data '{
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "audioChannelCount": 2,
        "enableSeparateRecognitionPerChannel": true
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_stereo.wav"
    }
}' "https://speech.googleapis.com/v1/speech:recognize" > multi-channel.txt

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named multi-channel.txt.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast I'm always wondering whether you could help me with that",
          "confidence": 0.8991147
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": "certainly which color would you like we have blue black and red",
          "confidence": 0.9408236
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " let's go with the black one",
          "confidence": 0.98783094
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " would you like the new Chromecast Ultra model or the regular Chromecast",
          "confidence": 0.9573053
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " regular Chromecast is fine thank you",
          "confidence": 0.9671048
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " okay sure would you like to ship it regular or Express",
          "confidence": 0.9544821
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " express please",
          "confidence": 0.9487205
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " terrific it's on the way thank you",
          "confidence": 0.97655964
        }
      ],
      "channelTag": 2,
      "languageCode": "en-us"
    },
    {
      "alternatives": [
        {
          "transcript": " thank you very much bye",
          "confidence": 0.9735077
        }
      ],
      "channelTag": 1,
      "languageCode": "en-us"
    }
  ]
}
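
To pull the per-channel transcripts back out of the saved response, here is a minimal sketch in Python, assuming the multi-channel.txt file written by the curl command above:

import json

# Load the JSON response saved by the curl command.
with open("multi-channel.txt", "r", encoding="utf-8") as f:
    response = json.load(f)

# Print the top transcript for each result, labeled with its channel.
for result in response["results"]:
    top_alternative = result["alternatives"][0]
    print(f"Channel {result['channelTag']}: {top_alternative['transcript']}")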

Go

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import (
	"context"
	"fmt"
	"io"
	"os"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribeMultichannel generates a transcript from a multichannel speech file and tags the speech from each channel.
func transcribeMultichannel(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/commercial_stereo.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:                            speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz:                     44100,
			LanguageCode:                        "en-US",
			AudioChannelCount:                   2,
			EnableSeparateRecognitionPerChannel: true,
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	})
	if err != nil {
		return fmt.Errorf("Recognize: %w", err)
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "Channel %v: %v\n", result.ChannelTag, alt.Transcript)
		}
	}
	return nil
}

Java

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.speech.v1.LongRunningRecognizeMetadata;
import com.google.cloud.speech.v1.LongRunningRecognizeResponse;
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;

/**
 * Transcribe a remote audio file with multi-channel recognition
 *
 * @param gcsUri the path to the audio file
 */
public static void transcribeMultiChannelGcs(String gcsUri) throws Exception {

  try (SpeechClient speechClient = SpeechClient.create()) {

    // Configure request to enable multiple channels
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(44100)
            .setAudioChannelCount(2)
            .setEnableSeparateRecognitionPerChannel(true)
            .build();

    // Set the remote path for the audio file
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speechClient.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }
    // Print the transcript and channel tag for each result.
    for (SpeechRecognitionResult result : response.get().getResultsList()) {

      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);

      // Print out the result
      System.out.printf("Transcript : %s\n", alternative.getTranscript());
      System.out.printf("Channel Tag : %s\n", result.getChannelTag());
    }
  }
}

Node.js

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

const speech = require('@google-cloud/speech').v1;

// Creates a client
const client = new speech.SpeechClient();

// Cloud Storage URI of a multichannel audio file, for example the sample
// used in the curl request above.
const gcsUri = 'gs://cloud-samples-tests/speech/commercial_stereo.wav';

async function transcribeMultiChannelGcs() {
  const config = {
    encoding: 'LINEAR16',
    languageCode: 'en-US',
    audioChannelCount: 2,
    enableSeparateRecognitionPerChannel: true,
  };

  const audio = {
    uri: gcsUri,
  };

  const request = {
    config: config,
    audio: audio,
  };

  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(
      result =>
        ` Channel Tag: ${result.channelTag} ${result.alternatives[0].transcript}`
    )
    .join('\n');
  console.log(`Transcription: \n${transcription}`);
}

transcribeMultiChannelGcs();

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from google.cloud import speech


def transcribe_file_with_multichannel(audio_file: str) -> speech.RecognizeResponse:
    """Transcribe the given audio file synchronously with multi channel.
    Args:
        audio_file (str): Path to the local audio file to be transcribed.
            Example: "resources/multi.wav"
    Returns:
         speech.RecognizeResponse: The full response object which includes the transcription results.
    """
    client = speech.SpeechClient()

    with open(audio_file, "rb") as f:
        audio_content = f.read()

    audio = speech.RecognitionAudio(content=audio_content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=44100,
        language_code="en-US",
        audio_channel_count=2,
        enable_separate_recognition_per_channel=True,
    )

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print(f"First alternative of result {i}")
        print(f"Transcript: {alternative.transcript}")
        print(f"Channel Tag: {result.channel_tag}")

    return response

Other languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for Ruby.