Prepare training data
Cloud Translation trains custom models by using matching pairs of segments in the
source and target languages. It treats each segment pair as an independent
training item, without assuming any correlation between separate pairs.
The segment pairs that are used to train your custom model must be in the
tab-separated values (.tsv) or Translation Memory eXchange (.tmx) format.
For more information, see Prepare example translations.
Segment pairs are always de-duplicated across all imported pairs. A segment pair
is a duplicate of another pair when the two pairs have the same source segment.
Cloud Translation doesn't allow you to import files with the same content.
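Because duplicates are decided by the source segment, it can help to de-duplicate a file locally before importing it. The following is a minimal sketch, not the service's own logic: it assumes one source segment and one target segment per line separated by a tab, exact string matching on the source, and keeping the first pair seen. Which pair Cloud Translation keeps, and whether it normalizes case or whitespace, is not stated here. The function names are hypothetical.

```python
def read_tsv_pairs(text):
    """Parse tab-separated text into (source, target) tuples.

    Lines that do not contain exactly two tab-separated fields are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def dedupe_by_source(pairs):
    """Keep only the first pair for each source segment (an assumption)."""
    seen_sources = set()
    unique = []
    for source, target in pairs:
        if source in seen_sources:
            continue  # same source segment already kept
        seen_sources.add(source)
        unique.append((source, target))
    return unique


sample = "Hello\tBonjour\nHello\tSalut\nGoodbye\tAu revoir\n"
unique_pairs = dedupe_by_source(read_tsv_pairs(sample))
print(unique_pairs)
```

In this sample, the second "Hello" pair is dropped even though its target differs, because only the source segment is compared.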
Data split
AutoML Translation uses the segment pairs that you provide for three
different purposes while creating your custom model:
Train - Segment pairs to train the model. Allocate most of your data for
this purpose.
Validation - Segment pairs to validate the results that the model returns
during training.
Test - Segment pairs to generate the final evaluation metrics of your
model. These metrics indicate how the model might perform in production.
You can control which segment pairs AutoML Translation uses for each
purpose by uploading separate files for the training, validation, and testing
sets. If you don't explicitly specify which files to use for these three
purposes, AutoML Translation automatically divides your segment pairs
into three sets. AutoML Translation uses approximately 80% of your data
for training, 10% for validation, and 10% for testing. AutoML Translation
randomly assigns your segment pairs to the three sets. The validation and
testing sets can each hold a maximum of 10,000 segment pairs. Any pairs beyond
that limit are assigned to the training set.
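The arithmetic of the automatic split can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the text says the split is "approximately" 80/10/10, so the rounding used here (integer division) is a guess, and the function names and the `seed` parameter are hypothetical.

```python
import random

EVAL_SET_CAP = 10_000  # maximum pairs for validation and for testing


def auto_split_sizes(total, cap=EVAL_SET_CAP):
    """Return approximate set sizes for an 80/10/10 split with a cap."""
    validation = min(total // 10, cap)
    test = min(total // 10, cap)
    # Everything not used for validation or testing goes to training,
    # including the overflow past the cap.
    train = total - validation - test
    return {"train": train, "validation": validation, "test": test}


def auto_split(pairs, cap=EVAL_SET_CAP, seed=0):
    """Randomly assign pairs to the three sets using the sizes above."""
    sizes = auto_split_sizes(len(pairs), cap)
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    v, t = sizes["validation"], sizes["test"]
    return {
        "validation": shuffled[:v],
        "test": shuffled[v:v + t],
        "train": shuffled[v + t:],
    }


print(auto_split_sizes(1_000))
print(auto_split_sizes(200_000))
```

With 1,000 pairs the sizes come out to 800, 100, and 100. With 200,000 pairs the cap applies, so validation and testing get 10,000 each and training gets the remaining 180,000.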
If you do multiple data imports into the same dataset, you can manually specify
the data split for one import and use the automatic split for another. Data is
always re-balanced with respect to your manual division after each import and
file deletion.
Data requirements
Your training data must conform to the following requirements:
If you let AutoML Translation automatically split your data, you must
submit at least 1,000 segment pairs to train a custom model.
If you manually split your data, you must provide at least three segment
pairs for the TRAIN set, and you must have at least 100 segment pairs each
for the VALIDATION and TEST sets.
You cannot provide more than 10,000 segment pairs each for the VALIDATION
and TEST sets.
Your dataset cannot exceed the maximum of 15 million segment pairs.
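The requirements above can be checked on your side before an import. The following is a rough sketch that only encodes the numeric limits listed in this section. The function and constant names are hypothetical, and the checks that the service itself performs may differ.

```python
MIN_AUTO_SPLIT_PAIRS = 1_000
MIN_TRAIN_PAIRS = 3
MIN_EVAL_PAIRS = 100
MAX_EVAL_PAIRS = 10_000
MAX_DATASET_PAIRS = 15_000_000


def check_auto_split(total):
    """Return a list of problems for a dataset that is split automatically."""
    problems = []
    if total < MIN_AUTO_SPLIT_PAIRS:
        problems.append(f"need at least {MIN_AUTO_SPLIT_PAIRS} pairs, got {total}")
    if total > MAX_DATASET_PAIRS:
        problems.append(f"dataset exceeds {MAX_DATASET_PAIRS} pairs")
    return problems


def check_manual_split(train, validation, test):
    """Return a list of problems for a manually split dataset."""
    problems = []
    if train < MIN_TRAIN_PAIRS:
        problems.append(f"TRAIN needs at least {MIN_TRAIN_PAIRS} pairs, got {train}")
    for name, count in (("VALIDATION", validation), ("TEST", test)):
        if count < MIN_EVAL_PAIRS:
            problems.append(f"{name} needs at least {MIN_EVAL_PAIRS} pairs, got {count}")
        if count > MAX_EVAL_PAIRS:
            problems.append(f"{name} allows at most {MAX_EVAL_PAIRS} pairs, got {count}")
    if train + validation + test > MAX_DATASET_PAIRS:
        problems.append(f"dataset exceeds {MAX_DATASET_PAIRS} pairs")
    return problems


print(check_manual_split(train=5_000, validation=500, test=500))
print(check_manual_split(train=2, validation=50, test=20_000))
```

An empty list means that the counts satisfy every limit listed above. The second call reports three problems, one for each set.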
Data recommendations
The following recommendations can help you increase the quality of your model:
Use at least 5,000 segment pairs for TRAIN, 500 segment pairs for
VALIDATION, and 500 segment pairs for TEST. That said, use more data if
possible. Having more data for the TRAIN set helps the model learn patterns,
and having more data for the VALIDATION and TEST sets helps verify that the
model generalizes to a wider variety of scenarios in your domain.
Keep segments to roughly 200 words or fewer. AutoML Translation might drop
segment pairs that are longer than that. For more information, see Import
issues.
Fix common source data issues, as described in the "Clean up messy data" part
in the data preparation section of the overview.
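To find segment pairs that risk being dropped for length, you can count words before importing. This is a simple sketch that assumes words are separated by whitespace. How AutoML Translation counts words is not stated here, and whitespace splitting is a poor fit for languages that don't separate words with spaces. The function name and the `max_words` parameter are hypothetical.

```python
def find_long_segments(pairs, max_words=200):
    """Return indexes of pairs whose source or target exceeds max_words.

    Words are counted by splitting on whitespace (an assumption).
    """
    long_indexes = []
    for index, (source, target) in enumerate(pairs):
        if len(source.split()) > max_words or len(target.split()) > max_words:
            long_indexes.append(index)
    return long_indexes


pairs = [
    ("A short source segment.", "Un court segment source."),
    ("word " * 201, "mot"),  # 201 words in the source segment
]
print(find_long_segments(pairs))
```

Pairs that are flagged this way can be split into shorter segments or reviewed by hand before you import the file.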
Note: If you exceed 100,000 segment pairs, you might have an unbalanced data
split because you can only include a maximum of 10,000 segment pairs for the
validation and test sets. AutoML Translation warns you but still allows training
to proceed.
Important: Do not use this product with any personal identifiable information
(PII) or any data or images that have been obtained from a website or online
service (including mobile apps) directed to children or from children under the
age of 13, as covered by the Children's Online Privacy Protection Act,
regardless of any parental consent. Data should be sanitized before sending it
to the API. For support with removing PII or sensitive data visit
https://cloud.google.com/dlp
What's next
Create a dataset and import your data.
Last updated 2025-08-28 UTC.