A glossary is an optional feature of the Translation service that lets you define terminology specific to your domain. A glossary contains term pairs, each consisting of a source language term and a target language term. The term pairs ensure that your terminology is translated consistently.
Use a glossary in the following cases:
- Product names. Don't translate words that refer to product names. For example, "Google Home" must translate to "Google Home."
- Ambiguous words. Specify the meaning of vague words and homonyms. For example, "bat" can mean a piece of sports equipment or an animal.
- Borrowed words. Clarify the meaning of words adopted from a different language. For example, "bouillabaisse" in French translates to "bouillabaisse" in English, a fish stew dish.
The terms in a glossary can be single tokens (words) or short phrases, usually fewer than five words. Translation ignores any matching glossary entries if the words are glossary stopwords.
The steps for using a glossary are the following:
- Meet the prerequisites in the Before you begin section.
- Prepare your environment for a glossary.
- Create a glossary file.
- Save a copy of your glossary file.
- To make the glossary file available to the Translation API, choose one of the following:
  - Create a unidirectional glossary resource for a pair of source and target terms in a specific language.
  - Create an equivalent term sets glossary resource in multiple languages on each row.
- Specify which glossary to use when you request a translation. To get a list of the available glossaries, see List glossaries.
The following table summarizes the glossary methods:
Method | Description |
---|---|
CreateGlossary | Creates a glossary. |
GetGlossary | Returns a stored glossary. |
DeleteGlossary | Deletes a glossary that you no longer need. |
ListGlossaries | Returns a list of glossary IDs in a project. |
Before you begin
Before using a glossary to translate a different language into English text, you must do the following:
Create a project, which is stored in the platform namespace. The following sample YAML file creates a project:
apiVersion: resourcemanager.gdc.goog/v1
kind: Project
metadata:
  labels:
    atat.config.google.com/clin-number: CLIN_NUMBER
    atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
  name: translation-glossary-project
  namespace: platform
To get the permissions you need to use a glossary, ask your Project IAM Admin to grant you the following roles in your project namespace:
- AI Translation Developer: accesses the Vertex AI Translation service. Request the AI Translation Developer (ai-translation-developer) role.
- Project Bucket Admin: manages the storage buckets and objects within buckets to create or upload the file to storage. Request the Project Bucket Admin (project-bucket-admin) role.
Prepare your environment
Follow these steps to prepare your environment to use a glossary:
- Create a storage bucket in your project, and select the Standard class.
- Grant read permissions on the storage bucket to the service account (g-vai-translation-sie-sa) used by the Translation service.
Use the following sample code to create the storage bucket, role, and role binding:
Create the storage bucket.
apiVersion: object.gdc.goog/v1
kind: Bucket
metadata:
  name: glossary-bucket
  namespace: translation-glossary-project
spec:
  description: bucket for translation glossary
  storageClass: Standard
  bucketPolicy:
    lockingPolicy:
      defaultObjectRetentionDays: 90
Create the role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-translation-glossary-reader
  namespace: translation-glossary-project
rules:
- apiGroups:
  - object.gdc.goog
  resources:
  - buckets
  verbs:
  - read-object
Create the role binding.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-translation-glossary-reader-rolebinding
  namespace: translation-glossary-project
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ai-translation-glossary-reader
subjects:
- kind: ServiceAccount
  name: g-vai-translation-sie-sa
  namespace: g-vai-translation-sie
Save a copy of your glossary
Backing up the glossary isn't supported. Therefore, save a copy of the input file. If the glossary gets erased for any reason, upload your saved input file to the storage bucket, and re-run the CreateGlossary request to restore your glossary.
Create a glossary file
You must create a glossary file to store your source language and target language terms. You can define your terms using one of the following two glossary layouts:
Glossary layout | Description | File format |
---|---|---|
Unidirectional glossary | Specifies the expected translation for a pair of source and target terms in a specific language. | TSV, CSV, TMX |
Equivalent term sets glossary | Specifies the expected translation in multiple languages on each row. | CSV |
Unidirectional glossary (TSV and CSV file formats)
The Translation API accepts Tab-Separated Values (TSV) and Comma-Separated Values (CSV) files. Each row contains a pair of terms separated by a tab (\t) or a comma (,). The first column contains the source language term, and the second column contains the target language term.
When you create a glossary resource, you can define a header row. The glossary resource makes the glossary file available to the Translation API.
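As an illustration of the two-column layout, the following Python sketch renders a small unidirectional glossary as TSV or CSV text. The term pairs are borrowed from the en-to-es TMX sample on this page, and the `render_glossary` helper is a hypothetical name used only for this example, not part of the Translation API.

```python
import csv
import io

# Illustrative term pairs (source term, target term), taken from the
# en -> es TMX sample in this document.
TERM_PAIRS = [
    ("account", "cuenta"),
    ("directions", "indicaciones"),
]


def render_glossary(pairs, delimiter="\t"):
    """Return glossary text with one source/target pair per row.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for source_term, target_term in pairs:
        # Column 1: source language term. Column 2: target language term.
        writer.writerow([source_term, target_term])
    return buffer.getvalue()


tsv_text = render_glossary(TERM_PAIRS)       # tab-separated (TSV)
csv_text = render_glossary(TERM_PAIRS, ",")  # comma-separated (CSV)
print(tsv_text, end="")
```

Using the `csv` module rather than joining strings by hand means that a term containing the delimiter is quoted automatically.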
Unidirectional glossary (TMX file format)
The Translation API accepts the Translation Memory eXchange (TMX) format, which is a standard XML format for providing the source and the target term pairs that are translated.
The Translation API supports input files in a format based on TMX version 1.4. This example illustrates the required structure:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header segtype="sentence" o-tmf="UTF-8" adminlang="en" srclang="en" datatype="PlainText"/>
<body>
<tu>
<tuv xml:lang="en">
<seg>account</seg>
</tuv>
<tuv xml:lang="es">
<seg>cuenta</seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="en">
<seg>directions</seg>
</tuv>
<tuv xml:lang="es">
<seg>indicaciones</seg>
</tuv>
</tu>
</body>
</tmx>
If your file contains XML tags that aren't shown in this example, the Translation API ignores the XML tags.
To be processed successfully by the Translation API, your TMX file must meet the following requirements:
- The <header> element of a valid TMX file must identify the source language using the srclang attribute.
- All the <tu> elements must contain a pair of <tuv> elements with the same source and target languages.
  - Each <tuv> element must identify the language of the contained text using the xml:lang attribute, and ISO-639-1 codes are used to identify the source and target languages.
  - If a <tu> element contains more than two <tuv> elements, the Translation API processes only the first <tuv> matching the source language and the first matching the target language and ignores the rest.
  - If a <tu> element doesn't have a matching pair of <tuv> elements, the Translation API skips over the invalid <tu> element.
- The Translation API strips the markup tags from a <seg> element before processing it. If a <tuv> element contains more than one <seg> element, the Translation API concatenates the text into a single element with a space between them.
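To check a TMX file locally before uploading it, the rules above can be approximated with the Python standard library. This is a minimal sketch of one reading of those rules, not the Translation API's actual parser; the `extract_term_pairs` function and its `target_lang` parameter are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_term_pairs(tmx_text, target_lang):
    """Return (source, target) term pairs from TMX text.

    Hypothetical helper: approximates the documented rules locally.
    """
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    if header is None or not header.get("srclang"):
        raise ValueError("<header> must identify the source language with srclang")
    source_lang = header.get("srclang")

    pairs = []
    for tu in root.iter("tu"):
        first_by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            if lang in first_by_lang:
                continue  # only the first <tuv> per language is used
            # itertext() drops inline markup; several <seg> texts are
            # joined with a single space.
            segs = ["".join(seg.itertext()).strip() for seg in tuv.findall("seg")]
            first_by_lang[lang] = " ".join(segs)
        if source_lang in first_by_lang and target_lang in first_by_lang:
            pairs.append((first_by_lang[source_lang], first_by_lang[target_lang]))
        # A <tu> without a matching source/target pair is skipped.
    return pairs
```

Calling `extract_term_pairs(tmx_text, "es")` on the sample file shown earlier should yield the account/cuenta and directions/indicaciones pairs.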
After you have identified the glossary terms in your unidirectional glossary, make the glossary file available to the Translation API by creating and importing a glossary resource.
Equivalent term sets glossary
For equivalent term sets, the Translation API accepts glossary files using the CSV format. To define equivalent term sets, create a multi-column CSV file in which each row lists a single glossary term in multiple languages.
The header is the first row in the file, which identifies the language for each column. The header row uses the ISO-639-1 or BCP-47 standard language codes. The Translation API doesn't use part-of-speech (pos) information, and specific position values aren't validated.
Each subsequent row contains equivalent glossary terms in the languages identified in the header. You can leave columns blank if the term isn't available in all languages.
After you have identified the glossary terms in your equivalent term set, make the glossary file available to the Translation API by creating and importing a glossary resource.
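The multi-column layout described above can be sketched in Python as follows. The language codes and terms are illustrative sample data, and `render_term_sets` is a hypothetical helper name; the sketch only shows the header row of language codes and how a blank column is written when a term isn't available in a language.

```python
import csv
import io

# Header row: one language code per column (illustrative choice).
LANGUAGE_CODES = ["en", "es", "fr"]

# Illustrative rows; an empty string leaves that language's column blank.
ROWS = [
    {"en": "account", "es": "cuenta", "fr": "compte"},
    {"en": "directions", "es": "indicaciones", "fr": ""},
]


def render_term_sets(language_codes, rows):
    """Return CSV text with a language-code header and one term set per row.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=language_codes, lineterminator="\n")
    writer.writeheader()    # first row identifies the language of each column
    writer.writerows(rows)  # each later row lists one term in every language
    return buffer.getvalue()


term_sets_csv = render_term_sets(LANGUAGE_CODES, ROWS)
print(term_sets_csv, end="")
```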
Create an equivalent term sets glossary resource
To create an equivalent term sets glossary, make the following replacements before using any of the request data:
- PROJECT_ID: Your Distributed Cloud project ID.
- GLOSSARY_ID: Your glossary ID, which is your resource name.
- BUCKET_NAME: Name of bucket where your glossary file is located.
- GLOSSARY_FILENAME: Filename of your glossary.
This is the syntax for the HTTP request.
POST https://ENDPOINT/v3/projects/PROJECT_ID/glossaries
The following is a sample JSON request body:
{
  "name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
  "language_codes_set": {
    "language_codes": ["en", "en-GB", "ru", "fr", "pt-BR", "pt-PT", "es"]
  },
  "input_config": {
    "s3_source": {
      "input_uri": "s3://BUCKET_NAME/FILE_BASENAME/GLOSSARY_FILENAME"
    }
  }
}
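If you prefer to generate the request body rather than edit it by hand, a sketch along these lines may help. It assumes the `language_codes_set` field name used in the Create a glossary section, and `build_term_set_request` and the sample argument values are hypothetical, chosen only for illustration.

```python
import json


def build_term_set_request(project_id, glossary_id, language_codes, input_uri):
    """Return the request body for an equivalent term sets glossary.

    Hypothetical helper: it only assembles the JSON structure shown above.
    """
    return {
        "name": f"projects/{project_id}/glossaries/{glossary_id}",
        "language_codes_set": {"language_codes": list(language_codes)},
        "input_config": {"s3_source": {"input_uri": input_uri}},
    }


# Placeholder values; substitute your own project, glossary, and file URI.
body = build_term_set_request(
    project_id="PROJECT_ID",
    glossary_id="GLOSSARY_ID",
    language_codes=["en", "es"],
    input_uri="s3://BUCKET_NAME/GLOSSARY_FILENAME",
)
request_json = json.dumps(body, indent=2)
print(request_json)
```

Writing `request_json` to a file named request.json produces the input expected by the commands that follow.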
To send your request, choose one of these options:
curl
Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.
Save the request body in a file named request.json, and run the following command:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries"
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_ID/operations/GLOSSARY_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
"name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
"state": "RUNNING",
"submitTime": TIME
}
}
PowerShell
Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.
Save the request body in a file named request.json, and run the following command:
$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_ID/operations/GLOSSARY_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
"name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
"state": "RUNNING",
"submitTime": TIME
}
}
Python
from google.cloud import translate_v3 as translate


def create_glossary(
    project_id="PROJECT_ID",
    input_uri="INPUT_URI",
    glossary_id="GLOSSARY_ID",
    timeout=180,
):
    """
    Create an equivalent term sets glossary. A glossary can consist of a word or short phrases.
    """
    client = translate.TranslationServiceClient()

    # Supported language codes
    source_lang_code = "en"
    target_lang_code = "ja"
Create a unidirectional glossary resource
To create a unidirectional glossary, specify a language pair (language_pair) with a source language (source_language_code) and a target language (target_language_code). The following example uses the REST API and command line, but you can also use the client libraries to create a unidirectional glossary.
When you create a new glossary, you must specify a project ID and a glossary ID in the resource name:
projects/PROJECT_ID/glossaries/GLOSSARY_ID
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Distributed Cloud project ID.
- GLOSSARY_ID: Your glossary ID, which is your resource name.
- BUCKET_NAME: Name of bucket where your glossary file is located.
- GLOSSARY_FILENAME: Filename of your glossary.
The following is a sample HTTP request:
POST https://ENDPOINT/v3/projects/PROJECT_ID/glossaries
The following is a sample JSON body:
{
  "name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
  "language_pair": {
    "source_language_code": "en",
    "target_language_code": "ru"
  },
  "input_config": {
    "s3_source": {
      "input_uri": "s3://BUCKET_NAME/FILE_BASENAME/GLOSSARY_FILENAME"
    }
  }
}
To send your request, do the following:
- Ensure the GOOGLE_APPLICATION_CREDENTIALS environment variable is set to the path of your service account's private key file.
- Save the request body in a file named request.json.
- Run one of the following commands:
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries"
PowerShell
$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
"name": "projects/PROJECT_ID/operations/operation-id",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
"name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
"state": "RUNNING",
"submitTime": TIME
}
}
Create a glossary
The CreateGlossary method creates a glossary and returns the identifier of the long-running operation that generates the glossary. To call the CreateGlossary method, specify your project ID and the endpoint.
curl
curl -X POST http://ENDPOINT/v3/projects/PROJECT_ID/glossaries -d '{"parent": "projects/PROJECT_ID", "glossary": {"input_config": {"s3_source": {"input_uri": "s3://BUCKET_PROPAGATED_NAME/GLOSSARY_FILE_NAME"}}, "language_codes_set": {"language_codes": ["en", "es"]}, "display_name": "glossary_display_name"}}'
Creating a glossary resource is a long-running operation. Depending on the file size, it typically takes less than 10 minutes to complete. Poll the status of this operation to see if it has completed.
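Polling can be sketched as a small loop that re-checks the operation state until it leaves RUNNING. The sketch below is an assumption-laden illustration: this page doesn't document the operation status endpoint, so the caller supplies a `fetch_state` callable that returns the current state string. The terminal state names (SUCCEEDED, FAILED, CANCELLED) are assumed from the Cloud Translation v3 CreateGlossaryMetadata states and might differ in your environment; `wait_for_operation` is a hypothetical helper name.

```python
import time

# Assumed terminal states; only RUNNING appears in the sample responses above.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_operation(
    fetch_state,
    interval_seconds=10.0,
    timeout_seconds=600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call fetch_state() until it returns a terminal state or time runs out.

    fetch_state is a caller-supplied function (hypothetical) that looks up
    the operation and returns its state string, for example "RUNNING".
    """
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"operation still {state} after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

The default timeout of 600 seconds reflects the typical upper bound of 10 minutes mentioned above; larger files may need a longer limit.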
Get a glossary
The GetGlossary method returns a stored glossary. If the glossary doesn't exist, the NOT_FOUND value is returned. To call the GetGlossary method, specify your project ID and the glossary ID. Both the CreateGlossary and ListGlossaries methods return the glossary ID.
curl
curl -X GET http://ENDPOINT/v3/projects/PROJECT_ID/glossaries/GLOSSARY_ID
Python
Use the following Python code sample to get information about a specific glossary:
from google.cloud import translate_v3 as translate


def get_glossary(project_id="PROJECT_ID", glossary_id="GLOSSARY_ID"):
    """Get a particular glossary based on the glossary ID."""
    client = translate.TranslationServiceClient()
    name = client.glossary_path(project_id, glossary_id)
    response = client.get_glossary(name=name)
    print(u"Glossary name: {}".format(response.name))
    print(u"Input URI: {}".format(response.input_config.s3_source.input_uri))
Delete a glossary
The DeleteGlossary method deletes a glossary. If the glossary doesn't exist, the NOT_FOUND value is returned. To call the DeleteGlossary method, specify your project ID and the endpoint.
curl
curl -X DELETE http://ENDPOINT/v3/projects/PROJECT_ID/glossaries/GLOSSARY_ID
List glossaries
The ListGlossaries method returns a list of glossary IDs in a project. If the project doesn't exist, the NOT_FOUND value is returned. To call the ListGlossaries method, specify your project ID and the endpoint.
curl
curl -X GET "http://ENDPOINT/v3/projects/PROJECT_ID/glossaries?page_size=10"
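The same list request can be prepared from Python with the standard library. This sketch only builds the request object and doesn't send it; `build_list_request` and its optional `token` parameter are hypothetical, and the http scheme simply mirrors the curl sample above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_list_request(endpoint, project_id, page_size=10, token=None):
    """Build (but do not send) a ListGlossaries GET request.

    Hypothetical helper; the URL layout follows the curl sample above.
    """
    query = urlencode({"page_size": page_size})
    url = f"http://{endpoint}/v3/projects/{project_id}/glossaries?{query}"
    # The bearer token is optional here because the curl sample omits it.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return Request(url, headers=headers, method="GET")


request = build_list_request("ENDPOINT", "PROJECT_ID", page_size=10)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request once ENDPOINT and PROJECT_ID are replaced with real values.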
Glossary file and term limits
Description | Limit |
---|---|
Maximum file size | 10,485,760 UTF-8 bytes (10 MiB) |
Maximum length of a glossary term | 1,024 UTF-8 bytes |
Maximum number of glossary resources for a project | 10,000 |
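A glossary file can be checked against the file size and term length limits before upload. The following Python sketch assumes a simple delimiter-separated file without quoted fields; `check_glossary_limits` is a hypothetical helper, and the per-project glossary count isn't checked because it can't be determined from a single file.

```python
MAX_FILE_BYTES = 10_485_760  # maximum file size from the table above
MAX_TERM_BYTES = 1_024       # maximum length of a glossary term


def check_glossary_limits(text, delimiter="\t"):
    """Return a list of limit violations found in glossary text.

    Hypothetical helper; splits rows naively on the delimiter, so it
    does not handle quoted CSV fields.
    """
    problems = []
    size = len(text.encode("utf-8"))
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes; limit is {MAX_FILE_BYTES}")
    for line_number, line in enumerate(text.splitlines(), start=1):
        for term in line.split(delimiter):
            # Limits are expressed in UTF-8 bytes, not characters.
            term_bytes = len(term.encode("utf-8"))
            if term_bytes > MAX_TERM_BYTES:
                problems.append(
                    f"line {line_number}: term is {term_bytes} bytes; "
                    f"limit is {MAX_TERM_BYTES}"
                )
    return problems
```

Measuring encoded bytes matters for non-ASCII terms: a 600-character term made of two-byte characters exceeds the 1,024-byte limit even though it has fewer than 1,024 characters.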