This page was translated by the Cloud Translation API.

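Detect different speakers in an audio recording

This page describes how to label different speakers in audio data transcribed by Speech-to-Text.

Audio data sometimes contains samples of more than one person talking. For example, audio from a telephone call usually features voices from two or more people. Ideally, a transcription of the call includes information about who is speaking at which times.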

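Speaker diarization

Speech-to-Text can recognize multiple speakers in the same audio clip. When you send an audio transcription request to Speech-to-Text, you can include a parameter that tells Speech-to-Text to identify the different speakers in the audio sample. This feature, called speaker diarization, detects when the speaker changes and labels each individual voice detected in the audio with a number.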

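When you enable speaker diarization in a transcription request, Speech-to-Text attempts to distinguish the different voices in the audio sample. The transcription result tags each word with the number assigned to an individual speaker. Words spoken by the same speaker carry the same number. The number of distinct labels in a transcription result depends on how many speakers Speech-to-Text can identify in the audio sample.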
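To make the per-word labeling concrete, the following minimal sketch counts how many words carry each speaker label. The `Word` type and the `wordsPerSpeaker` helper are hypothetical stand-ins for illustration only and are not part of any Speech-to-Text client library; the sample words are taken from the example response shown later on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// Word is a hypothetical, simplified stand-in for one entry of the "words"
// array in a transcription result. It is not a client library type.
type Word struct {
	Text         string
	SpeakerLabel string
}

// wordsPerSpeaker counts how many words were attributed to each speaker label.
// The number of keys in the returned map is the number of distinct speakers
// that appear in the result.
func wordsPerSpeaker(words []Word) map[string]int {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w.SpeakerLabel]++
	}
	return counts
}

func main() {
	words := []Word{
		{"hi", "2"}, {"I'd", "2"}, {"like", "2"}, {"to", "2"},
		{"certainly", "1"}, {"which", "1"}, {"color", "1"},
	}
	counts := wordsPerSpeaker(words)

	// Sort the labels so the output order is stable.
	labels := make([]string, 0, len(counts))
	for label := range counts {
		labels = append(labels, label)
	}
	sort.Strings(labels)

	fmt.Printf("distinct speaker labels: %d\n", len(counts))
	for _, label := range labels {
		fmt.Printf("speaker %s: %d words\n", label, counts[label])
	}
}
```

The labels are only identifiers that separate one voice from another within a single result; the sketch does not assume they say anything about who the speakers are.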

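When you use speaker diarization, Speech-to-Text produces a running aggregate of all the results in the transcription. Each result includes the words from the previous result. As a consequence, the words array in the final result provides the complete transcription, with every word labeled by speaker.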
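Because the results are cumulative, a client usually only needs the words array of the last result. The sketch below illustrates that selection step, assuming simplified, hypothetical `Result`, `Alternative`, and `Word` types rather than the real client library messages.

```go
package main

import "fmt"

// Word, Alternative, and Result are hypothetical, simplified stand-ins for
// the response structure. They are not client library types.
type Word struct {
	Text         string
	SpeakerLabel string
}

type Alternative struct{ Words []Word }

type Result struct{ Alternatives []Alternative }

// finalWords returns the words of the first alternative of the last result.
// With diarization enabled, that array is expected to hold the complete
// transcription. The boolean is false when there is nothing to return.
func finalWords(results []Result) ([]Word, bool) {
	if len(results) == 0 {
		return nil, false
	}
	last := results[len(results)-1]
	if len(last.Alternatives) == 0 {
		return nil, false
	}
	return last.Alternatives[0].Words, true
}

func main() {
	first := []Word{{"hi", "2"}, {"I'd", "2"}}
	// The second result repeats the earlier words and appends new ones.
	second := append(append([]Word{}, first...), Word{"certainly", "1"})
	results := []Result{
		{Alternatives: []Alternative{{Words: first}}},
		{Alternatives: []Alternative{{Words: second}}},
	}

	words, ok := finalWords(results)
	if !ok {
		fmt.Println("no diarized words found")
		return
	}
	for _, w := range words {
		fmt.Printf("speaker %s: %s\n", w.SpeakerLabel, w.Text)
	}
}
```

Checking for empty slices before indexing avoids a panic when a request returns no results, for example for silent audio.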

See the language support page to check whether this feature is available for your language.

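Enable speaker diarization in a request

To enable speaker diarization, set the diarization_config field in RecognitionFeatures. You must set the min_speaker_count and max_speaker_count values according to the number of speakers you expect in the transcript.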
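As a rough illustration of where these fields sit in a JSON request, the sketch below builds a request body with the Go standard library only. The struct types, the `buildBody` helper, and the local check that 1 <= min <= max are assumptions made for this example; the JSON field names mirror the curl request shown on this page, and the service's own validation rules may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// The types below are hypothetical and only mirror the JSON shape of the
// curl request body on this page. They are not client library types.
type diarizationConfig struct {
	MinSpeakerCount int `json:"minSpeakerCount"`
	MaxSpeakerCount int `json:"maxSpeakerCount"`
}

type features struct {
	DiarizationConfig diarizationConfig `json:"diarizationConfig"`
}

type config struct {
	Features features `json:"features"`
}

type recognizeBody struct {
	Config config `json:"config"`
	URI    string `json:"uri"`
}

// buildBody returns a JSON request body with diarization enabled.
// The range check is a local sanity check assumed for this sketch.
func buildBody(uri string, minSpeakers, maxSpeakers int) ([]byte, error) {
	if minSpeakers < 1 || maxSpeakers < minSpeakers {
		return nil, errors.New("speaker counts must satisfy 1 <= min <= max")
	}
	body := recognizeBody{
		Config: config{Features: features{DiarizationConfig: diarizationConfig{
			MinSpeakerCount: minSpeakers,
			MaxSpeakerCount: maxSpeakers,
		}}},
		URI: uri,
	}
	return json.MarshalIndent(body, "", "  ")
}

func main() {
	body, err := buildBody("gs://cloud-samples-tests/speech/commercial_mono.wav", 2, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

Generating the body with encoding/json rather than writing it by hand also guards against syntax slips such as trailing commas.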

Speech-to-Text supports speaker diarization for all speech recognition methods: speech:recognize and streaming.

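Use a local file

The following code snippets demonstrate how to enable speaker diarization in a transcription request to Speech-to-Text using a local file.

Protocol

For complete details, see the speech:recognize API endpoint.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following is an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.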

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v2/projects/{project}/locations/{location}/recognizers/{recognizer}:recognize \
    --data '{
    "config": {
        "features": {
            "diarizationConfig": {
              "minSpeakerCount": 2,
              "maxSpeakerCount": 2
            }
        }
    },
    "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
}' > speaker-diarization.txt

If the request succeeds, the server returns a 200 OK HTTP status code and a response in JSON format, which the command saves to a file named speaker-diarization.txt.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast and I was wondering whether you could help me with that certainly which color would you like we have blue black and red uh let's go with the black one would you like the new Chromecast Ultra model or the regular Chrome Cast regular Chromecast is fine thank you okay sure we like to ship it regular or Express Express please terrific it's on the way thank you thank you very much bye",
          "confidence": 0.92142606,
          "words": [
            {
              "startOffset": "0s",
              "endOffset": "1.100s",
              "word": "hi",
              "speakerLabel": "2"
            },
            {
              "startOffset": "1.100s",
              "endOffset": "2s",
              "word": "I'd",
              "speakerLabel": "2"
            },
            {
              "startOffset": "2s",
              "endOffset": "2s",
              "word": "like",
              "speakerLabel": "2"
            },
            {
              "startOffset": "2s",
              "endOffset": "2.100s",
              "word": "to",
              "speakerLabel": "2"
            },
            ...
            {
              "startOffset": "6.500s",
              "endOffset": "6.900s",
              "word": "certainly",
              "speakerLabel": "1"
            },
            {
              "startOffset": "6.900s",
              "endOffset": "7.300s",
              "word": "which",
              "speakerLabel": "1"
            },
            {
              "startOffset": "7.300s",
              "endOffset": "7.500s",
              "word": "color",
              "speakerLabel": "1"
            },
            ...
          ]
        }
      ],
      "languageCode": "en-us"
    }
  ]
}
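One way to turn a response like the one above into a readable dialogue is to merge consecutive words that share a speakerLabel into turns. The following is a minimal sketch using only the Go standard library. The type names and the `speakerTurns` and `parseOffset` helpers are hypothetical, the JSON field names are taken from the response above, and `sampleResponse` is a trimmed copy that contains only the words shown there. Treating a missing offset as zero is an assumption of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// word mirrors one entry of the "words" array in the response above.
type word struct {
	StartOffset  string `json:"startOffset"`
	EndOffset    string `json:"endOffset"`
	Word         string `json:"word"`
	SpeakerLabel string `json:"speakerLabel"`
}

type response struct {
	Results []struct {
		Alternatives []struct {
			Words []word `json:"words"`
		} `json:"alternatives"`
	} `json:"results"`
}

// turn is a run of consecutive words attributed to the same speaker.
type turn struct {
	Speaker    string
	Start, End time.Duration
	Text       string
}

// parseOffset converts an offset such as "1.100s" into a time.Duration.
// An empty string is treated as zero (an assumption of this sketch).
func parseOffset(s string) (time.Duration, error) {
	if s == "" {
		return 0, nil
	}
	return time.ParseDuration(s)
}

// speakerTurns decodes the response and groups the words of the last
// result's first alternative into speaker turns.
func speakerTurns(raw []byte) ([]turn, error) {
	var resp response
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	if len(resp.Results) == 0 {
		return nil, nil
	}
	last := resp.Results[len(resp.Results)-1]
	if len(last.Alternatives) == 0 {
		return nil, nil
	}
	var turns []turn
	for _, w := range last.Alternatives[0].Words {
		start, err := parseOffset(w.StartOffset)
		if err != nil {
			return nil, fmt.Errorf("parse startOffset %q: %w", w.StartOffset, err)
		}
		end, err := parseOffset(w.EndOffset)
		if err != nil {
			return nil, fmt.Errorf("parse endOffset %q: %w", w.EndOffset, err)
		}
		// Extend the current turn while the speaker stays the same.
		if n := len(turns); n > 0 && turns[n-1].Speaker == w.SpeakerLabel {
			turns[n-1].End = end
			turns[n-1].Text += " " + w.Word
			continue
		}
		turns = append(turns, turn{Speaker: w.SpeakerLabel, Start: start, End: end, Text: w.Word})
	}
	return turns, nil
}

// sampleResponse is a trimmed copy of the response shown above.
const sampleResponse = `{"results": [{"alternatives": [{"words": [
  {"startOffset": "0s", "endOffset": "1.100s", "word": "hi", "speakerLabel": "2"},
  {"startOffset": "1.100s", "endOffset": "2s", "word": "I'd", "speakerLabel": "2"},
  {"startOffset": "2s", "endOffset": "2s", "word": "like", "speakerLabel": "2"},
  {"startOffset": "2s", "endOffset": "2.100s", "word": "to", "speakerLabel": "2"},
  {"startOffset": "6.500s", "endOffset": "6.900s", "word": "certainly", "speakerLabel": "1"},
  {"startOffset": "6.900s", "endOffset": "7.300s", "word": "which", "speakerLabel": "1"},
  {"startOffset": "7.300s", "endOffset": "7.500s", "word": "color", "speakerLabel": "1"}
]}]}]}`

func main() {
	turns, err := speakerTurns([]byte(sampleResponse))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range turns {
		fmt.Printf("[%v - %v] speaker %s: %s\n", t.Start, t.End, t.Speaker, t.Text)
	}
}
```

To run the same grouping on the saved output of the curl command, the contents of speaker-diarization.txt could be read with os.ReadFile and passed to `speakerTurns` in place of the sample constant.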

Go

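To learn how to install and use the client library for Speech-to-Text, see the client libraries guide. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".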


import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribe_diarization transcribes a local audio file using speaker diarization.
func transcribe_diarization(w io.Writer) error {

	ctx := context.Background()
	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	diarizationConfig := &speechpb.SpeakerDiarizationConfig{
		EnableSpeakerDiarization: true,
		MinSpeakerCount:          2,
		MaxSpeakerCount:          2,
	}

	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:          speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz:   8000,
		LanguageCode:      "en-US",
		DiarizationConfig: diarizationConfig,
	}

	// Get the contents of the local audio file
	content, err := os.ReadFile("../resources/commercial_mono.wav")
	if err != nil {
		return fmt.Errorf("error reading file %w", err)
	}
	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Content{Content: content},
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}

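	// Speaker tags are only included in the last result object, which has only one
	// alternative. Guard against empty responses before indexing into them.
	if len(response.Results) == 0 {
		return fmt.Errorf("no results in response")
	}
	alternatives := response.Results[len(response.Results)-1].Alternatives
	if len(alternatives) == 0 || len(alternatives[0].GetWords()) == 0 {
		return fmt.Errorf("no word-level results in response")
	}
	alternative := alternatives[0]

	wordInfo := alternative.GetWords()[0]
	currentSpeakerTag := wordInfo.GetSpeakerTag()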

	var speakerWords strings.Builder

	speakerWords.WriteString(fmt.Sprintf("Speaker %d: %s", wordInfo.GetSpeakerTag(), wordInfo.GetWord()))

	// For each word, get all the words associated with one speaker, once the speaker changes,
	// add a new line with the new speaker and their spoken words.
	for i := 1; i < len(alternative.Words); i++ {
		wordInfo := alternative.Words[i]
		if currentSpeakerTag == wordInfo.GetSpeakerTag() {
			speakerWords.WriteString(" ")
			speakerWords.WriteString(wordInfo.GetWord())
		} else {
			speakerWords.WriteString(fmt.Sprintf("\nSpeaker %d: %s",
				wordInfo.GetSpeakerTag(), wordInfo.GetWord()))
			currentSpeakerTag = wordInfo.GetSpeakerTag()
		}
	}
	// Use Fprint so that any '%' in the transcript is not treated as a format verb.
	fmt.Fprint(w, speakerWords.String())
	return nil
}

Python

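To learn how to install and use the client library for Speech-to-Text, see the client libraries guide. For more information, see the Speech-to-Text Python API reference documentation.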