Improve transcription results with model adaptation

Overview

You can use the model adaptation feature to bias Speech-to-Text toward recognizing specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word "weather". When Speech-to-Text encounters the word "weather", you want it to transcribe the word as "weather" more often than "whether". In this case, you can use model adaptation to bias Speech-to-Text toward recognizing "weather".

Model adaptation is particularly helpful in the following use cases:

  • Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.

  • Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using model adaptation.

  • Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.

Optionally, you can fine-tune the biasing of the recognition model by using the model adaptation boost feature.

Improve recognition of words and phrases

To increase the probability that Speech-to-Text recognizes the word "weather" when it transcribes your audio data, you can pass the single word "weather" in the PhraseSet object in a SpeechAdaptation resource.

When you provide a multi-word phrase, Speech-to-Text is more likely to recognize those words in sequence. Providing a phrase also increases the probability of recognizing portions of the phrase, including individual words. See the Content limits page for limits on the number and size of these phrases.
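As an illustrative sketch, the helper below builds the phrase entries as plain Python dicts, which is the shape the v2 client library accepts for `PhraseSet(phrases=...)`; the helper name itself is hypothetical:

```python
def build_phrases(terms):
    """Build the list of phrase entries for a PhraseSet from plain strings.

    Each entry is a dict with a "value" key, the shape accepted by
    cloud_speech.PhraseSet(phrases=...) in the v2 client library.
    """
    return [{"value": term} for term in terms]


# Favor both the single word and a longer phrase that contains it.
phrases = build_phrases(["weather", "weather forecast"])
print(phrases)
```

Passing both the single word and the containing phrase covers audio that matches either form.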

Improve recognition using classes

Classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class allows you to improve transcription accuracy for large groups of words that map to a common concept but that don't always include identical words or phrases.

For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal number ("one-hundred twenty-third"). However, not everyone lives at "123 Main Street", and it's impractical to list every possible street address in a PhraseSet resource. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is. In this example, Speech-to-Text could then more accurately transcribe phrases like "123 Main Street" and "987 Grand Boulevard" because they are both recognized as address numbers.

Class tokens

To use a class in model adaptation, include a class token in the phrases field of a PhraseSet resource. Refer to the list of supported class tokens to see which tokens are available for your language. For example, to improve the transcription of address numbers from your source audio, provide the token $ADDRESSNUM within a phrase in a PhraseSet.

You can use classes either as standalone items in the phrases array or embed one or more class tokens in longer multi-word phrases. For example, you can indicate an address number in a larger phrase by including the class token in a string: ["my address is $ADDRESSNUM"]. However, this phrase doesn't help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to additionally include the class token by itself: ["my address is $ADDRESSNUM", "$ADDRESSNUM"]. If you use an invalid or malformed class token, Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.
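Put together, a phrases array that both embeds the token and lists it standalone might look like this ($ADDRESSNUM is a real class token; the surrounding wording is illustrative):

```python
# Embed the class token in a longer phrase, and also list it on its own
# so that similar-but-not-identical phrases still benefit.
phrases = [
    {"value": "my address is $ADDRESSNUM"},
    {"value": "$ADDRESSNUM"},
]

# The standalone token entry is what helps with audio like
# "I am at 123 Main Street" that doesn't match the longer phrase.
print([p["value"] for p in phrases])
```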

Custom classes

You can also create your own CustomClass, a class composed of your own custom list of related items or values. For example, you want to transcribe audio data that is likely to include the name of any one of several hundred regional restaurants. Restaurant names are relatively rare in general speech and therefore less likely to be chosen as the "correct" answer by the recognition model. You can bias the recognition model toward correctly recognizing these names when they appear in your audio by using a custom class.

To use a custom class, create a CustomClass resource that includes each restaurant name as a ClassItem. Custom classes function in the same way as the prebuilt class tokens. A phrase can include both prebuilt class tokens and custom classes.
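As a sketch, the ClassItem entries can be built from a plain list of names (the restaurant names here are hypothetical; the dict shape matches `CustomClass(items=...)` in the v2 client library):

```python
# Hypothetical list of regional restaurant names to recognize.
restaurant_names = [
    "Sushi Mizutani",
    "Trattoria Del Ponte",
    "The Copper Kettle",
]

# Each ClassItem is a dict with a "value" key; the resulting list can be
# passed as cloud_speech.CustomClass(items=items).
items = [{"value": name} for name in restaurant_names]
print(len(items))
```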

Fine-tune transcription results using boost

By default, model adaptation provides a relatively small effect, especially for one-word phrases. The model adaptation boost feature allows you to increase the recognition model bias by assigning more weight to some phrases than others. We recommend that you implement boost if all of the following are true:

  1. You have already implemented model adaptation.
  2. You would like to further adjust the strength of model adaptation effects on your transcription results. To see whether the boost feature is available for your language, see the language support page.

For example, you have many recordings of people asking about the "fare to get into the county fair", with the word "fair" occurring more frequently than "fare". In this case, you can use model adaptation to increase the probability of the model recognizing both "fair" and "fare" by adding them as phrases in a PhraseSet resource. This tells Speech-to-Text to recognize "fair" and "fare" more often than, for example, "hare" or "lair".

However, "fair" should be recognized more often than "fare" due to its more frequent appearances in the audio. You might have already transcribed your audio using the Speech-to-Text API and found a high number of errors recognizing the correct word ("fair"). In this case, you might want to use the boost feature to assign a higher boost value to "fair" than to "fare". The higher weighted value assigned to "fair" biases the Speech-to-Text API toward picking "fair" more frequently than "fare". Without boost values, the recognition model recognizes "fair" and "fare" with equal probability.

Boost basics

When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.

If you assign a boost value to a multi-word phrase, boost is applied to the entire phrase and only the entire phrase. For example, you want to assign a boost value to the phrase "My favorite exhibit at the American Museum of Natural History is the blue whale". If you add that phrase to a phrase object and assign a boost value, the recognition model is more likely to recognize that phrase in its entirety, word-for-word.

If you don't get the results you're looking for by boosting a multi-word phrase, we suggest that you add all bigrams (2 words, in order) that make up the phrase as additional phrase items and assign a boost value to each. Continuing the example above, you could investigate adding additional bigrams and n-grams (more than two words), such as "my favorite", "my favorite exhibit", "favorite exhibit", "my favorite exhibit at the American Museum of Natural History", "American Museum of Natural History", and "blue whale". The Speech-to-Text recognition model is then more likely to recognize related phrases in your audio that contain parts of the original boosted phrase but don't match it word-for-word.
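A small helper for generating those bigram (or longer n-gram) phrase entries might look like the following; it is an illustrative sketch, not part of the client library:

```python
def ngram_phrases(phrase, n=2, boost=None):
    """Split a phrase into in-order n-grams and wrap them as phrase entries.

    With n=2 this produces the bigrams suggested above; pass a boost value
    to weight each generated entry.
    """
    words = phrase.split()
    grams = [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]
    if boost is None:
        return [{"value": g} for g in grams]
    return [{"value": g, "boost": boost} for g in grams]


# Generate boosted bigram entries for part of the example phrase.
entries = ngram_phrases("blue whale exhibit", n=2, boost=10)
print(entries)
```

The generated entries can be appended to the phrases list alongside the original full phrase.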

Set boost values

Boost values must be a float value greater than 0. The practical maximum for boost values is 20. For best results, experiment by adjusting your boost values up or down until you get accurate transcription results.

Higher boost values can result in fewer false negatives, which are cases where the word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text. However, boost can also increase the likelihood of false positives; that is, cases where the word or phrase appears in the transcription even though it didn't occur in the audio.
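Because boost must be a float greater than 0 with a practical maximum of 20, a small guard like the following can catch out-of-range values early (a hypothetical helper, not part of the API):

```python
def phrase_with_boost(value, boost):
    """Return a phrase entry, validating the boost range described above."""
    boost = float(boost)
    if boost <= 0:
        raise ValueError("boost must be greater than 0")
    if boost > 20:
        raise ValueError("boost above 20 exceeds the practical maximum")
    return {"value": value, "boost": boost}


# Bias "fair" more strongly than "fare", as in the county-fair example.
phrases = [phrase_with_boost("fair", 15), phrase_with_boost("fare", 10)]
print(phrases)
```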

Example use case using model adaptation

The following example walks you through the process of using model adaptation to transcribe an audio recording of someone saying "The word is fare." In this case, without speech adaptation, Speech-to-Text identifies the word "fair." Using speech adaptation, Speech-to-Text can identify the word "fare."

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Speech-to-Text APIs.

    Enable the APIs

  5. Make sure that you have the following role or roles on the project: Cloud Speech Administrator

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.
  6. Install the Google Cloud CLI.

  7. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  8. To initialize the gcloud CLI, run the following command:

    gcloud init
  9. Client libraries can use Application Default Credentials to easily authenticate with Google APIs and send requests to those APIs. With Application Default Credentials, you can test your application locally and deploy it without changing the underlying code. For more information, see Authenticate for using client libraries.

  10. If you're using a local shell, then create local authentication credentials for your user account:

    gcloud auth application-default login

    You don't need to do this if you're using Cloud Shell.

    If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

  11. Also make sure that you have installed the client library.

    Improve transcription using a PhraseSet

    1. The following example builds a PhraseSet with the phrase "fare" and adds it as an inline_phrase_set in a recognition request:

    Python

    import os
    
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech
    
    PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
    
    
    def adaptation_v2_inline_phrase_set(audio_file: str) -> cloud_speech.RecognizeResponse:
        """Enhances speech recognition accuracy using an inline phrase set.
        The inline custom phrase set helps the recognizer produce more accurate transcriptions for specific terms.
        Phrases are given a boost to increase their chances of being recognized correctly.
        Args:
            audio_file (str): Path to the local audio file to be transcribed.
        Returns:
            cloud_speech.RecognizeResponse: The full response object which includes the transcription results.
        """
    
        # Instantiates a client
        client = SpeechClient()
    
        # Reads a file as bytes
        with open(audio_file, "rb") as f:
            audio_content = f.read()
    
        # Build inline phrase set to produce a more accurate transcript
        phrase_set = cloud_speech.PhraseSet(
            phrases=[{"value": "fare", "boost": 10}, {"value": "word", "boost": 20}]
        )
        adaptation = cloud_speech.SpeechAdaptation(
            phrase_sets=[
                cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                    inline_phrase_set=phrase_set
                )
            ]
        )
        config = cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            adaptation=adaptation,
            language_codes=["en-US"],
            model="short",
        )
    
        # Prepare the request which includes specifying the recognizer, configuration, and the audio content
        request = cloud_speech.RecognizeRequest(
            recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
            config=config,
            content=audio_content,
        )
    
        # Transcribes the audio into text
        response = client.recognize(request=request)
    
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")
    
        return response
    
    
    2. This example creates a PhraseSet resource with the same phrase, then references that resource in a recognition request:

    Python

    import os
    
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech
    
    PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
    
    
    def adaptation_v2_phrase_set_reference(
        audio_file: str,
        phrase_set_id: str,
    ) -> cloud_speech.RecognizeResponse:
        """Transcribe audio files using a PhraseSet.
        Args:
            audio_file (str): Path to the local audio file to be transcribed.
            phrase_set_id (str): The unique ID of the PhraseSet to use.
        Returns:
            cloud_speech.RecognizeResponse: The full response object which includes the transcription results.
        """
    
        # Instantiates a client
        client = SpeechClient()
    
        # Reads a file as bytes
        with open(audio_file, "rb") as f:
            audio_content = f.read()
    
        # Creating operation of creating the PhraseSet on the cloud.
        operation = client.create_phrase_set(
            parent=f"projects/{PROJECT_ID}/locations/global",
            phrase_set_id=phrase_set_id,
            phrase_set=cloud_speech.PhraseSet(phrases=[{"value": "fare", "boost": 10}]),
        )
        phrase_set = operation.result()
    
        # Add a reference of the PhraseSet into the recognition request
        adaptation = cloud_speech.SpeechAdaptation(
            phrase_sets=[
                cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                    phrase_set=phrase_set.name
                )
            ]
        )
    
        # Automatically detect audio encoding. Use "short" model for short utterances.
        config = cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            adaptation=adaptation,
            language_codes=["en-US"],
            model="short",
        )
        #  Prepare the request which includes specifying the recognizer, configuration, and the audio content
        request = cloud_speech.RecognizeRequest(
            recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
            config=config,
            content=audio_content,
        )
        # Transcribes the audio into text
        response = client.recognize(request=request)
    
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")
    
        return response
    
    

    Improve transcription results using a CustomClass

    1. The following example builds a CustomClass containing the item "fare". It then references the CustomClass within an inline_phrase_set in a recognition request:

    Python

    import os
    
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech
    
    PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
    
    
    def adaptation_v2_inline_custom_class(
        audio_file: str,
    ) -> cloud_speech.RecognizeResponse:
        """Transcribe audio file using inline custom class.
        The inline custom class helps the recognizer produce more accurate transcriptions for specific terms.
        Args:
            audio_file (str): Path to the local audio file to be transcribed.
        Returns:
            cloud_speech.RecognizeResponse: The response object which includes the transcription results.
        """
        # Instantiates a client
        client = SpeechClient()
    
        # Reads a file as bytes
        with open(audio_file, "rb") as f:
            audio_content = f.read()
    
        # Define an inline custom class to enhance recognition accuracy with specific items like "fare" etc.
        custom_class_name = "your-class-name"
        custom_class = cloud_speech.CustomClass(
            name=custom_class_name,
            items=[{"value": "fare"}],
        )
    
        # Build inline phrase set to produce a more accurate transcript
        phrase_set = cloud_speech.PhraseSet(
            phrases=[{"value": custom_class_name, "boost": 20}]
        )
        adaptation = cloud_speech.SpeechAdaptation(
            phrase_sets=[
                cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                    inline_phrase_set=phrase_set
                )
            ],
            custom_classes=[custom_class],
        )
        config = cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            adaptation=adaptation,
            language_codes=["en-US"],
            model="short",
        )
    
        # Prepare the request which includes specifying the recognizer, configuration, and the audio content
        request = cloud_speech.RecognizeRequest(
            recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
            config=config,
            content=audio_content,
        )
    
        # Transcribes the audio into text
        response = client.recognize(request=request)
    
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")
    
        return response
    
    
    2. This example creates a CustomClass resource with the same item. It then creates a PhraseSet resource with a phrase that references the CustomClass resource name, and references that PhraseSet resource in a recognition request:

    Python

    import os
    
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech
    
    PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
    
    
    def adaptation_v2_custom_class_reference(
        audio_file: str, phrase_set_id: str, custom_class_id: str
    ) -> cloud_speech.RecognizeResponse:
        """Transcribe audio file using a custom class.
        Args:
            audio_file (str): Path to the local audio file to be transcribed.
            phrase_set_id (str): The unique ID of the phrase set to use.
            custom_class_id (str): The unique ID of the custom class to use.
        Returns:
            cloud_speech.RecognizeResponse: The full response object which includes the transcription results.
        """
        # Instantiates a speech client
        client = SpeechClient()
    
        # Reads a file as bytes
        with open(audio_file, "rb") as f:
            audio_content = f.read()
    
        # Create a custom class to improve recognition accuracy for specific terms
        custom_class = cloud_speech.CustomClass(items=[{"value": "fare"}])
        operation = client.create_custom_class(
            parent=f"projects/{PROJECT_ID}/locations/global",
            custom_class_id=custom_class_id,
            custom_class=custom_class,
        )
        custom_class = operation.result()
    
        # Create a persistent PhraseSet to reference in a recognition request
        created_phrase_set = cloud_speech.PhraseSet(
            phrases=[
                {
                    "value": f"${{{custom_class.name}}}",
                    "boost": 20,
                },  # Using custom class reference
            ]
        )
        operation = client.create_phrase_set(
            parent=f"projects/{PROJECT_ID}/locations/global",
            phrase_set_id=phrase_set_id,
            phrase_set=created_phrase_set,
        )
        phrase_set = operation.result()
    
        # Add a reference of the PhraseSet into the recognition request
        adaptation = cloud_speech.SpeechAdaptation(
            phrase_sets=[
                cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                    phrase_set=phrase_set.name
                )
            ]
        )
        # Automatically detect the audio's encoding with short audio model
        config = cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            adaptation=adaptation,
            language_codes=["en-US"],
            model="short",
        )
    
        # Prepare the request which includes specifying the recognizer, configuration, and the audio content
        request = cloud_speech.RecognizeRequest(
            recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
            config=config,
            content=audio_content,
        )
    
        # Transcribes the audio into text
        response = client.recognize(request=request)
    
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")
    
        return response
    
    

    Clean up

    To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

    1. Optional: Revoke the authentication credentials that you created, and delete the local credential file.

      gcloud auth application-default revoke
    2. Optional: Revoke credentials from the gcloud CLI.

      gcloud auth revoke

    Console

    1. In the Google Cloud console, go to the Manage resources page.

      Go to Manage resources

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

    gcloud

    Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

    What's next