[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-08-18。"],[],[],null,["# Measure and improve speech accuracy\n\nOverview\n--------\n\n[Automated Speech Recognition (ASR)](https://en.wikipedia.org/wiki/Speech_recognition), also known as machine transcription or Speech-to-Text (STT), uses machine learning to turn audio containing speech into text. ASR has many applications from subtitling, to virtual assistants, to [Interactive Voice Responses (IVRs)](https://en.wikipedia.org/wiki/Interactive_voice_response), to dictation, and more. However, machine learning systems are rarely 100% accurate, and ASR is no exception. If you plan to rely on ASR for critical systems, it's very important to measure its accuracy or overall quality to understand how it performs in your broader system that integrates it.\n\nOnce you measure your accuracy, it's possible to tune the systems to provide even greater accuracy for your specific situation. In Google's [Cloud Speech-to-Text API](/speech-to-text), accuracy tuning can be done by choosing the most appropriate recognition model and by using our Speech Adaptation API. We offer a wide variety of models tailored for different use cases, such as long-form audio, medical or over-the-phone conversations.\n\nDefining speech accuracy\n------------------------\n\nSpeech accuracy can be measured in a variety of ways. It might be useful for you to use multiple metrics, depending on your needs. However, the industry standard method for comparison is [Word Error Rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate), often abbreviated as WER. WER measures the percentage of incorrect word transcriptions in the entire set. 
A lower WER means that the system is more accurate.

You might also see the term *ground truth* used in the context of ASR accuracy. Ground truth is the 100% accurate transcription, typically human-provided, against which you compare and measure accuracy.

### Word Error Rate (WER)

WER is the combination of three types of transcription errors that can occur:

- **Insertion errors (I):** Words present in the hypothesis transcript that aren't present in the ground truth.
- **Substitution errors (S):** Words that are present in both the hypothesis and the ground truth but aren't transcribed correctly.
- **Deletion errors (D):** Words that are missing from the hypothesis but present in the ground truth.

\[WER = {S + D + I \over N}\]

To find the WER, add up the total number of each of these errors, and divide by the total number of words (N) in the ground truth transcript. The WER can be greater than 100% in situations with very low accuracy, for example, when a large amount of new text is inserted.

Note: A substitution is essentially a deletion followed by an insertion, and some substitutions are less severe than others. For example, substituting a single letter is less severe than substituting an entire word.

### Relation of WER to a confidence score

The WER metric is independent of a [confidence score](/speech-to-text/docs/speech-to-text-requests#confidence-values), and the two usually don't correlate with each other. A confidence score is based on likelihood, while WER is based on whether each word is correctly identified or not. Because WER counts any incorrectly identified word, even minor grammatical errors can cause a high WER.
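Unlike a confidence score, WER isn't returned by the API; you compute it yourself against your ground truth. As a minimal sketch, the following Python computes WER with a word-level edit-distance alignment (an illustrative stand-in, not the scoring code the UI tool uses; it assumes a non-empty ground truth):

```python
def wer(ground_truth: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N, via word-level edit distance."""
    ref = ground_truth.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub,  # substitution (or match)
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i have four cats", "i have for cats"))  # 0.25: one substitution out of four words
```

Note that this combined edit distance doesn't break the total down into separate S, D, and I counts; a full evaluation tool backtracks through the alignment to report each error type.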
Conversely, a correctly identified word contributes to a low WER but can still have a low likelihood, which drives the confidence score down if the word isn't frequent or the audio is very noisy.

Similarly, a frequently used word has a high likelihood of being transcribed, which drives the confidence score up even when the transcription is wrong. For example, transcribing "eye" as "I" might come with high confidence, because "I" is a more common word, but the substitution still worsens the WER metric.

In summary, the confidence and WER metrics are independent and shouldn't be expected to correlate.

### Normalization

When computing the WER metric, the machine transcription is compared to a human-provided ground truth transcription. The text from both transcriptions is normalized before the comparison is done: punctuation is removed, and capitalization is ignored.

### Ground-truth conventions

It is important to recognize that there isn't a single agreed-upon transcription format for any given audio. There are many aspects to consider. For example, audio might contain non-speech vocalizations, like "huh", "yep", or "umm". Some Cloud STT models, like ["medical_conversation"](/speech-to-text/docs/medical-models), include these vocalizations, while others don't. Therefore, it is important that your ground-truth conventions match the conventions of the model being evaluated. The following high-level guidelines are used to prepare a ground-truth text transcription for a given audio:

- In addition to standard letters, you can use the digits 0-9.
- Don't use symbols like "@", "#", "$", or ".".
Use words like \"at\", \"hash\", \"dollar\", \"dot\".\n- Use \"%\" but only when preceded by a number; otherwise, use the word \"percent\".\n- Use \"\\\\$\" only when followed by a number, like \"Milk is \\\\$3.99\".\n\n- Use words for numbers less than 10.\n\n - For example, \"I have four cats and 12 hats.\"\n- Use numbers for measures, currency, and large factors like million, billion, or trillion. For example, \"7.5 million\" instead of \"seven and a half million.\"\n\n- Don't use abbreviations in the following cases:\n\nMeasuring speech accuracy\n-------------------------\n\nThe following steps get you started with determining accuracy using your audio:\n\n### Gather test audio files\n\nGather a representative sample of audio files to measure their quality. This sample should be random and should be as close to the target environment as possible. For example, if you want to transcribe conversations from a call center to aid in quality assurance, you should randomly select a few actual calls recorded on the same equipment that your production audio comes through. If your audio is recorded on your cell phone or computer microphone and isn't representative of your use case, then don't use the recorded audio.\n\nRecord at least 30 minutes of audio to get a statistically significant accuracy metric. We recommend using between 30 minutes and 3 hours of audio. In this lab, the audio is provided for you.\n\n### Get ground truth transcriptions\n\nGet accurate transcriptions of the audio. This usually involves a single or a double-pass human transcription of the target audio. Your goal is to have a 100% accurate transcription to measure the automated results against.\n\nIt's important when getting ground truth transcriptions to match the transcription conventions of your target ASR system as closely as possible. 
For example, ensure that punctuation, numbers, and capitalization are handled consistently. One practical approach is to obtain a machine transcription first and then fix any issues you notice in the text.

### Get the machine transcription

Send the audio to the Google Speech-to-Text API and get your hypothesis transcription by using the [Speech-to-Text UI](/speech-to-text/docs/transcribe-console).

### Pair ground truth to the audio

In the UI tool, click 'Attach Ground Truth' to associate a given audio file with the provided ground truth. After the attachment is finished, you can see your WER metric and a visualization of all the differences.
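If you want to reproduce the comparison outside the UI, the normalization step described earlier (remove punctuation, ignore capitalization) can be sketched in Python before scoring. This is a simplified illustration; the function name is ours, and it handles only ASCII punctuation, not every convention a production evaluation applies:

```python
import string

def normalize(transcript: str) -> str:
    """Lowercase and strip punctuation so ground truth and hypothesis compare fairly."""
    # Remove ASCII punctuation, then lowercase and collapse whitespace.
    cleaned = transcript.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.lower().split())

print(normalize("Hello, World!"))  # hello world
```

Run both the ground truth and the hypothesis transcript through the same normalization before computing WER, so that punctuation and capitalization differences don't count as word errors.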