[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-07-24。"],[],[],null,["# Overview of custom speech models\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nCustom Speech-to-Text models help you fine-tune speech recognition models to your specific needs. This service is designed to enhance the accuracy and relevance of speech recognition service in diverse environments and use cases, using your domain-specific audio and text data.\n\nAccessible in both our Google Cloud console and API, Custom Speech-to-Text models allow to train, evaluate and deploy a dedicated speech model in a no-code integrated environment. For training you can provide audio data only that are representative of your audio conditions, without reference transcriptions as a training set. However, you need to provide audio data and their reference transcriptions as part of your evaluation set.\n\nCreating and using a custom Speech-to-Text model involves the following steps:\n\n1. Prepare and upload training data in a Cloud Storage bucket.\n2. Train a new custom model.\n3. Deploy and manage your custom model using endpoints.\n4. Use and evaluate your custom model in your application.\n\nHow does it work\n----------------\n\nYou can use Custom Speech-to-Text models to augment a base transcription model to improve transcription recognition. Some audio conditions, including sirens, music, and excessive background noise can pose acoustic challenges. Certain accents or unusual vocabulary, such product names can as well.\n\nEvery Custom Speech-to-Text model uses pre-trained, Conformer-based architecture as a base model trained with proprietary data of commonly spoken language. During the training process, the base model is fine-tuned by adapting a significant percentage of the original weights to improve recognition of domain-specific vocabulary and audio conditions specific to your application.\n\nFor the effective training of a Custom Speech-to-Text model, you must provide:\n\n- Minimum 100 audio-hours of training data, either audio-only or audio with the corresponding text transcript as ground-truth. This data is crucial for the initial training phase, so the model comprehensively learns the nuances of the speech patterns and vocabulary. For details, see [Create a ground-truth dataset](/speech-to-text/v2/docs/custom-speech-models/prepare-data#ground-truth_annotation_guidelines).\n- A separate dataset of at least 10 audio-hours of validation data, with the corresponding text transcript as ground-truth. 
You can learn more about the expected format and the ground-truth conventions to follow in the [data preparation instructions](/speech-to-text/v2/docs/custom-speech-models/prepare-data).

Following successful training, you can deploy a Custom Speech-to-Text model to an endpoint with one click and use it directly through the Cloud Speech-to-Text V2 API for inference and benchmarking (see the sketch at the end of this page).

Supported models, languages, and regions
----------------------------------------

Custom Speech-to-Text models support the following combinations of models, languages, and locales for training:

Additionally, to comply with your data residency requirements, we offer training and deployment hardware in different regions. Dedicated hardware is supported in the following combinations of models and regions:

Quota
-----

For Custom Speech-to-Text model training, each Google Cloud project has enough default quota to run multiple training jobs concurrently, which is intended to meet the needs of most projects without additional adjustments. However, if you need to run a higher number of concurrent training jobs or require more extensive labeling or compute resources, request additional quota.

For a Custom Speech-to-Text model deployed to an endpoint, each endpoint has a theoretical limit of [20 queries per second (QPS)](/speech-to-text/quotas). If you require higher throughput, request additional serving quota.

Pricing
-------

Creating and using a Custom Speech-to-Text model involves costs that are primarily based on the resources used during training and the subsequent deployment of the model. Specifically, a Custom Speech-to-Text model incurs the following costs in a typical model lifecycle:

- **Training**: You are charged for the number of model-training hours. This time is proportional to the number of audio-hours in the training dataset. As a rule of thumb, training takes about one tenth of the number of audio-hours in the dataset; for example, a 100 audio-hour training dataset takes roughly 10 model-training hours.
- **Deployment**: You are charged for each hour that a model is deployed on an endpoint.
- **Inference**: You are charged for the number of streamed seconds of audio sent for transcription, in alignment with general Speech-to-Text billing.

Understanding these costs is crucial for effective budgeting and resource allocation. For more information, see the Custom Speech-to-Text models section of [Cloud Speech-to-Text pricing](https://cloud.google.com/speech-to-text/pricing).

What's next
-----------

Follow these resources to take advantage of custom speech models in your application:

- [Prepare your training data](/speech-to-text/v2/docs/custom-speech-models/prepare-data)
- [Train and manage your custom models](/speech-to-text/v2/docs/custom-speech-models/train-model)
- [Deploy and manage model endpoints](/speech-to-text/v2/docs/custom-speech-models/deploy-model)
- [Use your custom models](/speech-to-text/v2/docs/custom-speech-models/use-model)
- [Evaluate your custom models](/speech-to-text/v2/docs/custom-speech-models/evaluate-model)
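As a quick reference for the inference step, the following is a minimal sketch of transcribing a short audio file through the Cloud Speech-to-Text V2 Python client once a custom model is deployed to an endpoint. The project ID, region, recognizer name, and the way the deployed custom model is referenced in `model` are placeholders and assumptions; the [Use your custom models](/speech-to-text/v2/docs/custom-speech-models/use-model) guide describes the exact values to use.

```python
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Assumed values: replace with your project, region, and recognizer.
PROJECT_ID = "my-project"     # hypothetical project ID
REGION = "us-central1"        # region where the custom model endpoint is deployed
RECOGNIZER = "my-recognizer"  # hypothetical recognizer name

# Use the regional API endpoint that matches the deployment region.
client = SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    # Assumption: point this at your deployed custom model as described in the
    # "Use your custom models" guide; "long" is only a stand-in default here.
    model="long",
)

response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/{RECOGNIZER}",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    print(result.alternatives[0].transcript)
```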