Define terms to translate

A glossary is an optional resource that you can use with the Translation service to define terminology that is specific to your domain. A glossary contains term pairs, each made up of a source language term and a target language term. The term pairs ensure that your terminology is translated consistently.

Use a glossary in the following cases:

  • Product names. Don't translate words that refer to product names. For example, "Google Home" must translate to "Google Home."
  • Ambiguous words. Specify the meaning of vague words and homonyms. For example, "bat" can mean a piece of sports equipment or an animal.
  • Borrowed words. Clarify the meaning of words adopted from a different language. For example, "bouillabaisse" in French translates to "bouillabaisse" in English, a fish stew dish.

The terms in a glossary can be single tokens (words) or short phrases, usually fewer than five words. Translation ignores any matching glossary entries if the words are glossary stopwords.

The steps for using a glossary are the following:

  1. Meet the prerequisites in the Before you begin section.
  2. Prepare your environment for a glossary.
  3. Create a glossary file.
  4. Save a copy of your glossary file.
  5. To make the glossary file available to the Translation API, create either an equivalent term sets glossary resource or a unidirectional glossary resource.
  6. Specify which glossary to use when you request a translation. To get a list of the available glossaries, see List glossaries.

A summary of the glossary methods includes the following:

Method | Description
CreateGlossary | Creates a glossary.
GetGlossary | Returns a stored glossary.
DeleteGlossary | Deletes a glossary that you no longer need.
ListGlossaries | Returns a list of the glossary IDs in a project.

Before you begin

Before using a glossary to translate a different language into English text, you must do the following:

  1. Create a project, which is stored in the platform namespace. The following sample YAML file creates a project:

      apiVersion: resourcemanager.gdc.goog/v1
      kind: Project
      metadata:
        labels:
          atat.config.google.com/clin-number: CLIN_NUMBER
          atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
        name: translation-glossary-project
        namespace: platform
    
  2. To get the permissions you need to use a glossary, ask your Project IAM Admin to grant you the following roles in your project namespace:

    • AI Translation Developer: accesses the Vertex AI Translation service. Request the AI Translation Developer (ai-translation-developer) role.
    • Project Bucket Admin: manages the storage buckets and objects within buckets to create or upload the file to storage. Request the Project Bucket Admin (project-bucket-admin) role.

Prepare your environment

Follow these steps to prepare your environment to use a glossary:

  1. Create a storage bucket in your project, and select the Standard class.
  2. Grant read permissions on the storage bucket to the service account (g-vai-translation-sie-sa) used by the Translation service.
  3. Enter the following sample code to create the storage bucket, role, and role binding:

    1. Create the storage bucket.

        apiVersion: object.gdc.goog/v1
        kind: Bucket
        metadata:
          name: glossary-bucket
          namespace: translation-glossary-project
        spec:
          description: bucket for translation glossary
          storageClass: Standard
          bucketPolicy:
            lockingPolicy:
              defaultObjectRetentionDays: 90
      
    2. Create the role.

        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role
        metadata:
          name: ai-translation-glossary-reader
          namespace: translation-glossary-project
        rules:
          -
            apiGroups:
              - object.gdc.goog
            resources:
              - buckets
            verbs:
              - read-object
      
    3. Create the role binding.

        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: ai-translation-glossary-reader-rolebinding
          namespace: translation-glossary-project
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: Role
          name: ai-translation-glossary-reader
        subjects:
          -
            kind: ServiceAccount
            name: g-vai-translation-sie-sa
            namespace: g-vai-translation-sie
      

Save a copy of your glossary

Backing up the glossary isn't supported. Therefore, save a copy of the input file. If the glossary gets erased for any reason, upload your saved input file to the storage bucket, and re-run the CreateGlossary request to restore your glossary.

Create a glossary file

You must create a glossary file to store your source language and target language terms. There are two different glossary layouts that you can use to define your terms. The glossary layouts include the following:

Glossary layout | Description | File format
Unidirectional glossary | Specifies the expected translation for a pair of source and target terms in a specific language. | TSV, CSV, TMX
Equivalent term sets glossary | Specifies the expected translation in multiple languages on each row. | CSV

Unidirectional glossary (TSV and CSV file formats)

The Translation API accepts Tab-Separated Values (TSVs) and Comma-Separated Values (CSVs). Each row contains a pair of terms separated by a tab (\t) or a comma (,). In the following example, the first column shows the source language term, and the second column shows the target language term:

[Figure: Equivalent glossary terms example]
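As a minimal sketch of the two-column layout described above, the following Python snippet writes source and target term pairs as CSV or TSV text. The function name `write_unidirectional_glossary` is hypothetical, and the term pairs are borrowed from the TMX example on this page.

```python
import csv
import io


def write_unidirectional_glossary(pairs, delimiter=","):
    """Return glossary text with one source/target term pair per row."""
    if delimiter not in (",", "\t"):
        raise ValueError("delimiter must be a comma (CSV) or a tab (TSV)")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for source_term, target_term in pairs:
        # First column: source language term. Second column: target language term.
        writer.writerow([source_term, target_term])
    return buffer.getvalue()


pairs = [("account", "cuenta"), ("directions", "indicaciones")]
csv_text = write_unidirectional_glossary(pairs)
tsv_text = write_unidirectional_glossary(pairs, delimiter="\t")
print(csv_text)
```

Save the returned text to a file with a .csv or .tsv extension before uploading it to your storage bucket.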

When you create a glossary resource, you can define a header row. The glossary resource makes the glossary file available to the Translation API.

Unidirectional glossary (TMX file format)

The Translation API accepts the Translation Memory eXchange (TMX) format, which is a standard XML format for providing the source and the target term pairs that are translated.

The Translation API supports input files in a format based on TMX version 1.4. This example illustrates the required structure:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header segtype="sentence" o-tmf="UTF-8" adminlang="en" srclang="en" datatype="PlainText"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>account</seg>
      </tuv>
      <tuv xml:lang="es">
        <seg>cuenta</seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg>directions</seg>
      </tuv>
      <tuv xml:lang="es">
        <seg>indicaciones</seg>
      </tuv>
    </tu>
  </body>
</tmx>

If your file contains XML tags that aren't shown in this example, the Translation API ignores the XML tags.

For the Translation API to process your TMX file successfully, the file must meet the following requirements:

  • The <header> element of a valid TMX file must identify the source language using the srclang attribute.

  • All the <tu> elements must contain a pair of <tuv> elements with the same source and target languages.

    • Each <tuv> element must identify the language of the contained text using the xml:lang attribute. Use ISO-639-1 codes to identify the source and target languages.
    • If a <tu> element contains more than two <tuv> elements, the Translation API processes only the first <tuv> matching the source language and the first matching the target language and ignores the rest.
    • If a <tu> element doesn't have a matching pair of <tuv> elements, the Translation API skips over the invalid <tu> element.
  • The Translation API strips the markup tags from a <seg> element before processing it. If a <tuv> element contains more than one <seg> element, the Translation API concatenates the text into a single element with a space between them.
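The rules above can be approximated locally before you upload a file. The following is a sketch only, written with the Python standard library; it is not the Translation API's own parser, and the function name `extract_term_pairs` is hypothetical. It reads the source language from the header, keeps the first matching <tuv> for each language, joins multiple <seg> elements with a space, and counts the <tu> elements that would be skipped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_term_pairs(tmx_bytes, target_lang):
    """Return (source_lang, term_pairs, skipped_tu_count) for a TMX document."""
    root = ET.fromstring(tmx_bytes)
    header = root.find("header")
    if header is None or not header.get("srclang"):
        raise ValueError("the <header> element must set the srclang attribute")
    source_lang = header.get("srclang")

    pairs = []
    skipped = 0
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            # Keep only the first <tuv> for the source and target languages.
            if lang in (source_lang, target_lang) and lang not in segments:
                # itertext() drops inline markup; multiple <seg> are joined by a space.
                segments[lang] = " ".join(
                    "".join(seg.itertext()).strip() for seg in tuv.findall("seg")
                )
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
        else:
            skipped += 1  # no matching pair of <tuv> elements
    return source_lang, pairs, skipped


sample = b"""<?xml version='1.0' encoding='utf-8'?>
<tmx version="1.4">
  <header segtype="sentence" adminlang="en" srclang="en" datatype="PlainText"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>account</seg></tuv>
      <tuv xml:lang="es"><seg>cuenta</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>directions</seg></tuv>
    </tu>
  </body>
</tmx>
"""
result = extract_term_pairs(sample, target_lang="es")
print(result)
```

In this sample, the second <tu> element has no Spanish <tuv>, so it is counted as skipped rather than returned as a pair.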

After you have identified the glossary terms in your unidirectional glossary, make the glossary file available to the Translation API by creating and importing a glossary resource.

Equivalent term sets glossary

For equivalent term sets, the Translation API accepts glossary files using the CSV format. To define equivalent term sets, create a multi-column CSV file in which each row lists a single glossary term in multiple languages.

[Figure: Equivalent glossary terms example]

The first row in the file is the header, which identifies the language for each column. The header row uses the ISO-639-1 or BCP-47 standard language codes. The Translation API doesn't use part-of-speech (pos) information, and specific position values aren't validated.

Each subsequent row contains equivalent glossary terms in the languages identified in the header. You can leave columns blank if the term isn't available in all languages.
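To illustrate the multi-column layout, here is a small Python sketch that builds the CSV text from a list of language codes and a list of term sets. The function name `build_equivalent_term_sets_csv` and the French term are illustrative assumptions; a language missing from a term set becomes a blank column.

```python
import csv
import io


def build_equivalent_term_sets_csv(language_codes, term_sets):
    """Return CSV text: a header of language codes, then one term set per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(language_codes)  # header row
    for terms in term_sets:
        # Leave the column blank when a term isn't available in a language.
        writer.writerow([terms.get(code, "") for code in language_codes])
    return buffer.getvalue()


language_codes = ["en", "es", "fr"]
term_sets = [
    {"en": "account", "es": "cuenta", "fr": "compte"},
    {"en": "directions", "es": "indicaciones"},  # no French term available
]
equivalent_csv = build_equivalent_term_sets_csv(language_codes, term_sets)
print(equivalent_csv)
```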

After you have identified the glossary terms in your equivalent term set, make the glossary file available to the Translation API by creating and importing a glossary resource.

Create an equivalent term sets glossary resource

To create an equivalent term sets glossary, make the following replacements before using any of the request data:

  • PROJECT_ID: Your Distributed Cloud project ID.
  • GLOSSARY_ID: Your glossary ID, which is your resource name.
  • BUCKET_NAME: Name of bucket where your glossary file is located.
  • GLOSSARY_FILENAME: Filename of your glossary.

This is the syntax for the HTTP request.

POST https://ENDPOINT/v3/projects/PROJECT_ID/glossaries

The following is a sample JSON request body:

{
  "name":"projects/PROJECT_ID/glossaries/GLOSSARY_ID",
  "language_codes_set": {
    "language_codes": ["en", "en-GB", "ru", "fr", "pt-BR", "pt-PT", "es"]
  },
  "input_config": {
    "s3_source": {
      "input_uri": "s3://BUCKET_NAME/FILE_BASENAME/GLOSSARY_FILENAME"
    }
  }
}
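If you prefer to generate the request body instead of editing it by hand, the following sketch assembles the same fields with the Python standard library. The function name `build_equivalent_term_sets_request` and the sample project, glossary, and file names are hypothetical; the field names mirror the sample body above, using the lowercase language_codes_set spelling.

```python
import json


def build_equivalent_term_sets_request(project_id, glossary_id, input_uri, language_codes):
    """Return the request body for an equivalent term sets glossary as a dict."""
    return {
        "name": f"projects/{project_id}/glossaries/{glossary_id}",
        "language_codes_set": {"language_codes": list(language_codes)},
        "input_config": {"s3_source": {"input_uri": input_uri}},
    }


request_body = build_equivalent_term_sets_request(
    project_id="my-project",          # hypothetical project ID
    glossary_id="my-glossary",        # hypothetical glossary ID
    input_uri="s3://glossary-bucket/glossary.csv",  # hypothetical file location
    language_codes=["en", "es"],
)
request_json = json.dumps(request_body, indent=2)
print(request_json)
```

You can save the printed text as request.json and send it with any of the options that follow.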

To send your request, choose one of these options:

curl

Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Save the request body in a file named request.json, and run the following command:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries"

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/operations/GLOSSARY_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
    "name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
    "state": "RUNNING",
    "submitTime": TIME
  }
}

PowerShell

Ensure you have set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path for your service account private key file.

Save the request body in a file named request.json, and run the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/operations/GLOSSARY_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
    "name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
    "state": "RUNNING",
    "submitTime": TIME
  }
}

Python

from google.cloud import translate_v3 as translate

def create_glossary(
    project_id="PROJECT_ID",
    input_uri="INPUT_URI",
    glossary_id="GLOSSARY_ID",
    timeout=180,
):
    """
    Create an equivalent term sets glossary. A glossary can consist of single words or short phrases.
    """
    client = translate.TranslationServiceClient()

    # Supported language codes
    source_lang_code = "en"
    target_lang_code = "ja"

Create a unidirectional glossary resource

To create a unidirectional glossary, specify a language pair (language_pair) with a source language (source_language_code) and a target language (target_language_code). The following example uses the REST API and command line, but you can also use the client libraries to create a unidirectional glossary.

When you create a new glossary, you must specify a project ID and a glossary ID.

projects/PROJECT_ID/glossaries/GLOSSARY_ID
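As a small illustration of the resource name format, the following sketch assembles the name from its two parts and rejects values that would break the path. The helper `glossary_name` and the sample IDs are hypothetical and aren't part of any client library.

```python
def glossary_name(project_id, glossary_id):
    """Return the glossary resource name for a project ID and a glossary ID."""
    for label, value in (("project_id", project_id), ("glossary_id", glossary_id)):
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return f"projects/{project_id}/glossaries/{glossary_id}"


name = glossary_name("my-project", "my-glossary")
print(name)
```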

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Distributed Cloud project ID.
  • GLOSSARY_ID: Your glossary ID, which is your resource name.
  • BUCKET_NAME: Name of bucket where your glossary file is located.
  • GLOSSARY_FILENAME: Filename of your glossary.

The following is a sample HTTP request:

POST https://ENDPOINT/v3/projects/PROJECT_ID/glossaries

The following is a sample JSON body:

{
  "name":"projects/PROJECT_ID/glossaries/GLOSSARY_ID",
  "language_pair": {
    "source_language_code": "en",
    "target_language_code": "ru"
  },
  "input_config": {
    "s3_source": {
      "input_uri": "s3://BUCKET_NAME/FILE_BASENAME/GLOSSARY_FILENAME"
    }
  }
}

To send your request, do the following:

  1. Ensure the GOOGLE_APPLICATION_CREDENTIALS environment variable is set to the path of your service account's private key file.
  2. Save the request body in a file named request.json.
  3. Run one of the following commands:

curl

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries"

PowerShell

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://ENDPOINT/v3/projects/PROJECT_ID/glossaries" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
    "name": "projects/PROJECT_ID/operations/operation-id",
    "metadata": {
        "@type": "type.googleapis.com/google.cloud.translation.v3.CreateGlossaryMetadata",
        "name": "projects/PROJECT_ID/glossaries/GLOSSARY_ID",
        "state": "RUNNING",
        "submitTime": TIME
    }
}

Create a glossary

The CreateGlossary method creates a glossary and returns the identifier of the long-running operation that generates the glossary. To call the CreateGlossary method, specify the endpoint and your project ID.

curl

curl -X POST http://ENDPOINT/v3/projects/PROJECT_ID/glossaries -d '{"parent": "projects/PROJECT_ID", "glossary": {"input_config": {"s3_source": {"input_uri": "s3://BUCKET_PROPAGATED_NAME/GLOSSARY_FILE_NAME"}}, "language_codes_set": {"language_codes": ["en", "es"]}, "display_name": "glossary_display_name"}}'

Creating a glossary resource is a long-running operation. Depending on the file size, it typically takes less than 10 minutes to complete. Poll the status of this operation to see if it has completed.
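A polling loop for this kind of operation might look like the following sketch. It makes no assumption about how the status is retrieved: `fetch_state` is a hypothetical callable that you supply, and the sketch treats any state other than RUNNING as final. The 600-second default reflects the typical 10-minute upper bound mentioned above, and the SUCCEEDED value in the demonstration is illustrative only.

```python
import time


def wait_for_operation(fetch_state, poll_seconds=10.0, timeout_seconds=600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until it returns something other than "RUNNING"."""
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state()
        if state != "RUNNING":
            return state  # any non-RUNNING state is treated as final
        if clock() >= deadline:
            raise TimeoutError("operation is still RUNNING after the timeout")
        sleep(poll_seconds)


# Demonstration with a fake status source instead of a real request.
fake_states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_operation(
    lambda: next(fake_states),
    poll_seconds=0,
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final_state)
```

Injecting the `sleep` and `clock` functions keeps the loop easy to test without waiting in real time.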

Get a glossary

The GetGlossary method returns a stored glossary. If the glossary doesn't exist, the NOT_FOUND value is returned. To call the GetGlossary method, specify your project ID and the glossary ID. Both the CreateGlossary and ListGlossaries methods return the glossary ID.

curl

curl -X GET http://ENDPOINT/v3/projects/PROJECT_ID/glossaries/GLOSSARY_ID

Python

Use the following Python code sample to get information about a specific glossary:

from google.cloud import translate_v3 as translate

def get_glossary(project_id="PROJECT_ID", glossary_id="GLOSSARY_ID"):
    """Get a particular glossary based on the glossary ID."""

    client = translate.TranslationServiceClient()

    # Build the full resource name of the glossary.
    name = client.glossary_path(project_id, glossary_id)

    response = client.get_glossary(name=name)
    print(u"Glossary name: {}".format(response.name))
    print(u"Input URI: {}".format(response.input_config.s3_source.input_uri))

Delete a glossary

The DeleteGlossary method deletes a glossary. If the glossary doesn't exist, the NOT_FOUND value is returned. To call the DeleteGlossary method, specify the endpoint, your project ID, and the glossary ID.

curl

curl -X DELETE http://ENDPOINT/v3/projects/PROJECT_ID/glossaries/GLOSSARY_ID

List glossaries

The ListGlossaries method returns a list of glossary IDs in a project. If the project doesn't exist, the NOT_FOUND value is returned. To call the ListGlossaries method, specify your project ID and the endpoint.

curl

curl -X GET "http://ENDPOINT/v3/projects/PROJECT_ID/glossaries?page_size=10"

Glossary file and term limits

Description | Limit
Maximum file size | 10.4 million (10,485,760) UTF-8 bytes
Maximum length of a glossary term | 1,024 UTF-8 bytes
Maximum number of glossary resources for a project | 10,000
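Because the file and term limits are measured in UTF-8 bytes rather than characters, it can help to check a glossary before uploading it. The following is a minimal sketch; the function name `find_limit_violations` is hypothetical, and the check runs locally without calling the Translation API.

```python
MAX_FILE_BYTES = 10_485_760  # maximum glossary file size in UTF-8 bytes
MAX_TERM_BYTES = 1_024       # maximum glossary term length in UTF-8 bytes


def find_limit_violations(glossary_text, terms):
    """Return a list of messages describing any exceeded limits."""
    problems = []
    file_size = len(glossary_text.encode("utf-8"))
    if file_size > MAX_FILE_BYTES:
        problems.append(f"file is {file_size} bytes; the limit is {MAX_FILE_BYTES}")
    for term in terms:
        term_size = len(term.encode("utf-8"))
        if term_size > MAX_TERM_BYTES:
            problems.append(
                f"term starting with {term[:20]!r} is {term_size} bytes; "
                f"the limit is {MAX_TERM_BYTES}"
            )
    return problems


ok = find_limit_violations("account,cuenta\n", ["account", "cuenta"])
too_long = find_limit_violations("x", ["é" * 513])  # 513 characters, 1,026 bytes
print(ok, too_long)
```

The second call shows why the byte count matters: a term of 513 accented characters exceeds the limit even though it has fewer than 1,024 characters.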