Translate documents

Translation and OCR combine to provide the Document Vision Service (DVS), a document processing feature that uses the Translate Document API to directly translate formatted documents such as PDF files. Unlike plain-text translation, DVS preserves the original formatting and layout of your documents, helping you retain much of the original context, such as paragraph breaks. DVS supports inline document translation, translation from storage buckets, and batch translation.

This document guides the Application Operator (AO) through the process of using the Vertex AI Translate Document pre-trained API on Google Distributed Cloud (GDC) air-gapped.

Supported formats

DVS supports the following input file types and their associated output file types.

Input   Document MIME type                                                          Output
PDF     application/pdf                                                             PDF, DOCX
DOC     application/msword                                                          DOC, DOCX
DOCX    application/vnd.openxmlformats-officedocument.wordprocessingml.document     DOCX
PPT     application/vnd.ms-powerpoint                                               PPT, PPTX
PPTX    application/vnd.openxmlformats-officedocument.presentationml.presentation   PPTX
XLS     application/vnd.ms-excel                                                    XLS, XLSX
XLSX    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet           XLSX
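The table above can also be expressed as a lookup for validating inputs before calling the API. The MIME types come from the table; the constant and helper names are illustrative, not part of the API:

```python
# Supported input MIME types and their allowed output formats,
# taken from the table above. Names here are illustrative only.
SUPPORTED_FORMATS = {
    "application/pdf": ["PDF", "DOCX"],
    "application/msword": ["DOC", "DOCX"],
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ["DOCX"],
    "application/vnd.ms-powerpoint": ["PPT", "PPTX"],
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ["PPTX"],
    "application/vnd.ms-excel": ["XLS", "XLSX"],
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ["XLSX"],
}

def is_supported(mime_type):
    """Return True if DVS accepts this input MIME type."""
    return mime_type in SUPPORTED_FORMATS
```

Checking the input type client-side avoids a round trip to the service for a file that would be rejected anyway.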

Original and scanned PDF document translations

DVS supports both original and scanned PDF files, including translation to and from right-to-left languages. DVS also preserves hyperlinks, font size, and font color from the source files.

Before you begin

Follow these steps before trying DVS:

  1. Create the dvs-project project. For information about creating and using projects, see Create a project.

    Alternatively, you can create the project using a custom resource (CR):

    apiVersion: resourcemanager.gdc.goog/v1
    kind: Project
    metadata:
      labels:
        atat.config.google.com/clin-number: CLIN_NUMBER
        atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
      name: dvs-project
      namespace: platform
    
  2. Ask your Project IAM Admin to grant you the AI Translation Developer (ai-translation-developer) role in the dvs-project project namespace. For more information, see Grant access to project resources.

  3. Enable both the Translation and OCR pre-trained APIs.

  4. Download the gdcloud command-line interface (CLI).

  5. Install the Vertex AI client libraries. Download the Vision and Translation client libraries for your operating system.

Set up your service account

Set up your service account with a name, project ID, and service key.

  ${HOME}/gdcloud init  # set URI and project

  ${HOME}/gdcloud auth login

  ${HOME}/gdcloud iam service-accounts create SERVICE_ACCOUNT  --project=PROJECT_ID

  ${HOME}/gdcloud iam service-accounts keys create SERVICE_KEY.json --project=PROJECT_ID --iam-account=SERVICE_ACCOUNT

Replace the following:

  • SERVICE_ACCOUNT: the name you want to give to your service account.
  • PROJECT_ID: the ID of your project.
  • SERVICE_KEY: the name of the JSON file for the service key.
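After creating the key, you can sanity-check that the file was written and parses as JSON before using it elsewhere. This sketch assumes only that the key is a JSON file; it makes no assumptions about the fields inside:

```python
import json
from pathlib import Path

def key_file_is_valid_json(path):
    """Return True if the service key file exists and parses as JSON."""
    p = Path(path)
    if not p.is_file():
        return False
    try:
        json.loads(p.read_text())
        return True
    except json.JSONDecodeError:
        return False
```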

Grant access to project resources

Grant the Translation API service account access to project resources by binding the ai-translation-developer role, providing your project ID and the name of your service account.

Ask your Project IAM Admin to grant you the AI Translation Developer (ai-translation-developer) role. To learn how to grant permissions to a subject, see Grant and revoke access.

  ${HOME}/gdcloud iam service-accounts add-iam-policy-binding --project=PROJECT_ID --iam-account=SERVICE_ACCOUNT --role=role/ai-translation-developer

Authenticate the request

You must get a token to authenticate requests to the Translation pre-trained service. Use one of the following options:

gdcloud CLI

Export the identity token for the specified account to an environment variable:

export TOKEN="$($HOME/gdcloud auth print-identity-token --audiences=https://ENDPOINT)"

Replace ENDPOINT with the Translation endpoint. For more information, view service statuses and endpoints.

Python

  1. Install the google-auth client library.

    pip install google-auth
    
  2. Save the following code to a Python script.

    import google.auth
    from google.auth.transport import requests
    
    api_endpoint = "https://ENDPOINT"
    
    creds, project_id = google.auth.default()
    creds = creds.with_gdch_audience(api_endpoint)
    
    def test_get_token():
      req = requests.Request()
      creds.refresh(req)
      print(creds.token)
    
    if __name__ == "__main__":
      test_get_token()
    

    Replace ENDPOINT with the Translation endpoint. For more information, view service statuses and endpoints.

  3. Run the script to fetch the token.

For any curl request, you must replace TOKEN with the fetched token in the header as in the following example:

-H "Authorization: Bearer TOKEN"
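In Python, the same header can be built programmatically once you have fetched the token as shown above. This is a small sketch; the helper name is illustrative:

```python
def auth_headers(token):
    """Build the HTTP headers expected by the Translation endpoint:
    a bearer token plus the JSON content type used in the examples."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

You can pass the returned dictionary as the `headers` argument of any HTTP client when calling the endpoint.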

Translate documents online

DVS in Distributed Cloud provides two types of online document translation: translating a document from a storage bucket and translating a document inline.

Translate a document from a storage bucket

To translate a document that is stored in a bucket, follow these steps:

Prepare your environment

Before using the Translation API, follow these steps:

  1. Create a storage bucket in the dvs-project project, using the Standard class.
  2. Grant read and write permissions on the bucket to the Vertex AI Translation system service account (g-vai-translation-sie-sa) used by the Translation service.

Alternatively, you can follow these steps to create the storage bucket, role, and role binding using custom resources (CR):

  1. Create the storage bucket by deploying a Bucket CR in the dvs-project namespace:

    apiVersion: object.gdc.goog/v1
    kind: Bucket
    metadata:
      name: dvs-bucket
      namespace: dvs-project
    spec:
      description: bucket for document vision service
      storageClass: Standard
      bucketPolicy:
        lockingPolicy:
          defaultObjectRetentionDays: 90
    
  2. Create the role by deploying a Role CR in the dvs-project namespace:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: dvs-reader-writer
      namespace: dvs-project
    rules:
      -
        apiGroups:
          - object.gdc.goog
        resources:
          - buckets
        verbs:
          - read-object
          - write-object
    
  3. Create the role binding by deploying a RoleBinding CR in the dvs-project namespace:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: dvs-reader-writer-rolebinding
      namespace: dvs-project
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: dvs-reader-writer
    subjects:
      -
        kind: ServiceAccount
        name: g-vai-translation-sie-sa
        namespace: g-vai-translation-sie
    

Upload files to the storage bucket

You must upload your documents to the storage bucket to let the Translation service process the files.

To upload files to the storage bucket, follow these steps:

  1. Configure the gdcloud CLI storage by following the instructions from Configure the gdcloud CLI for object storage.
  2. Upload your document to the storage bucket you created. For more information about how to upload objects to storage buckets, see Upload and download storage objects in projects.

The following example translates a file from a bucket and outputs the result to another bucket path. The response also returns a byte stream. You can specify the MIME type; if you don't, DVS determines it by using the input file's extension.

DVS supports language auto-detection for documents stored in buckets. If you don't specify a source language code, DVS detects the language for you. The detected language is included in the output in the detectedLanguageCode field.
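If you omit the source language, you can read the detected language back from the response. The sketch below assumes the detectedLanguageCode field sits under a documentTranslation object, which matches the Cloud Translation v3 response shape but should be verified against your deployment:

```python
def detected_language(response):
    """Return detectedLanguageCode from a translateDocument response
    dict, or None if the field is absent. The nesting under
    documentTranslation is an assumption about the response shape."""
    return response.get("documentTranslation", {}).get("detectedLanguageCode")
```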

HTTP

The following example uses the curl tool to make an HTTP call with an input PDF document in a storage bucket.

  1. Save the following request.json file:

    cat <<- EOF > request.json
    {
      "parent": "projects/PROJECT_ID",
      "source_language_code": "SOURCE_LANGUAGE",
      "target_language_code": "TARGET_LANGUAGE",
      "document_input_config": {
        "mime_type": "application/pdf",
        "s3_source": {
          "input_uri": "s3://INPUT_FILE_PATH"
        }
      },
      "document_output_config": {
        "mime_type": "application/pdf"
      },
      "enable_rotation_correction": "true"
    }
    EOF
    

    Replace the following:

    • PROJECT_ID: The ID of the project that you want to use.
    • SOURCE_LANGUAGE: the language in which your document is written. For a list of supported languages, see Get supported languages.
    • TARGET_LANGUAGE: the language or languages into which you want to translate your document. For a list of supported languages, see Get supported languages.
    • INPUT_FILE_PATH: the path of your document file in the storage bucket.
  2. Use the curl tool to call the endpoint and take the request from the request.json file:

    curl -vv --cacert CACERT --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" https://ENDPOINT.GDC_URL:443/v3/projects/PROJECT_ID:translateDocument < request.json
    

    Replace the following:

    • CACERT: the path to find the CA certificate.
    • TOKEN: the token you obtained when you authenticated the gdcloud CLI.
    • ENDPOINT: the Translation endpoint that you use for your organization.
    • GDC_URL: the URL of your organization in Distributed Cloud, for example, org-1.zone1.gdch.test.
    • PROJECT_ID: The ID of the project that you want to use.

The translated document is returned in the response after the command completes.
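The request.json file above can also be built programmatically. This sketch mirrors the fields shown in the example; the function name and sample values are illustrative:

```python
def bucket_translate_request(project_id, source_lang, target_lang,
                             input_uri, mime_type="application/pdf"):
    """Build the translateDocument request body for a bucket-stored
    file, mirroring the request.json example above."""
    return {
        "parent": f"projects/{project_id}",
        "source_language_code": source_lang,
        "target_language_code": target_lang,
        "document_input_config": {
            "mime_type": mime_type,
            "s3_source": {"input_uri": f"s3://{input_uri}"},
        },
        "document_output_config": {"mime_type": mime_type},
        "enable_rotation_correction": "true",
    }

# Illustrative usage with placeholder values:
body = bucket_translate_request("my-project", "en", "es", "dvs-bucket/input.pdf")
```

Serialize the result with `json.dumps` and send it as the request body in place of request.json.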

Translate a document inline

The following example sends a document inline as part of the request. You must include the MIME type for inline document translations.

DVS supports language auto-detection for inline text translations. If you don't specify a source language code, DVS detects the language for you. The detected language is included in the output in the detectedLanguageCode field.

HTTP

The following example uses the curl tool to make an HTTP call with an inline PDF document.

echo '{"parent": "projects/PROJECT_ID","source_language_code": "SOURCE_LANGUAGE", "target_language_code": "TARGET_LANGUAGE", "document_input_config": { "mime_type": "application/pdf", "content": "'$(base64 -w 0 INPUT_FILE_PATH)'" }, "document_output_config": { "mime_type": "application/pdf" }, "enable_rotation_correction": "true"}' | curl --cacert CACERT --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" https://ENDPOINT.GDC_URL/v3/projects/PROJECT_ID:translateDocument

Replace the following:

  • PROJECT_ID: The ID of the project that you want to use.
  • SOURCE_LANGUAGE: the language in which your document is written. For a list of supported languages, see Get supported languages.
  • TARGET_LANGUAGE: the language or languages into which you want to translate your document. For a list of supported languages, see Get supported languages.
  • INPUT_FILE_PATH: the path of your document file locally.
  • ENDPOINT: the Translation endpoint that you use for your organization.
  • GDC_URL: the URL of your organization in Distributed Cloud, for example, org-1.zone1.gdch.test.
  • TOKEN: the token you obtained when you authenticated the gdcloud CLI.

The translated document is returned in the response after the command completes.
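The base64 encoding that the shell command performs with `base64 -w 0` can be done in Python instead. This sketch builds the same inline request body as the example above; the function name is illustrative:

```python
import base64

def inline_translate_request(project_id, source_lang, target_lang, file_path):
    """Build the translateDocument request body with the PDF content
    embedded inline (base64-encoded), as in the curl example above."""
    with open(file_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("ascii")
    return {
        "parent": f"projects/{project_id}",
        "source_language_code": source_lang,
        "target_language_code": target_lang,
        "document_input_config": {
            "mime_type": "application/pdf",
            "content": content,
        },
        "document_output_config": {"mime_type": "application/pdf"},
        "enable_rotation_correction": "true",
    }
```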

Translate documents in batch

Batch translation lets you translate multiple files into multiple languages in a single request. For each request, you can send up to 100 files with a total content size of up to 1 GB or 100 million Unicode codepoints, whichever limit is reached first. You can specify a particular translation model for each language.

For more information, see batchTranslateDocument.
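Because of these limits, a client-side pre-flight check can reject an oversized batch before sending it. The limits below come from the text; the helper is illustrative and checks file count and total bytes only, not Unicode codepoints:

```python
MAX_FILES = 100
MAX_TOTAL_BYTES = 1_000_000_000  # 1 GB, per the batch limit above

def batch_within_limits(file_sizes):
    """Return True if a batch of files (given as sizes in bytes)
    stays within the documented per-request batch limits."""
    return len(file_sizes) <= MAX_FILES and sum(file_sizes) <= MAX_TOTAL_BYTES
```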

Translate multiple documents

The following example includes multiple input configurations. Each input configuration is a pointer to a file in a storage bucket.

REST

Before using any of the request data, make the following replacements:

  • ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
  • PROJECT_ID: the numeric or alphanumeric ID of your project.
  • SOURCE_LANGUAGE: The language code of the input documents. Set to one of the language codes listed in Supported languages.
  • TARGET_LANGUAGE: The target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
  • INPUT_FILE_PATH: The storage bucket location and filename of one or more input documents.
  • OUTPUT_FILE_PREFIX: The storage bucket location where all output documents are stored.

HTTP method and URL:

POST https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument

Request JSON body:

{
  "source_language_code": "SOURCE_LANGUAGE",
  "target_language_codes": ["TARGET_LANGUAGE", ...],
  "input_configs": [
    {
      "s3_source": {
        "input_uri": "s3://INPUT_FILE_PATH_1"
      }
    },
    {
      "s3_source": {
        "input_uri": "s3://INPUT_FILE_PATH_2"
      }
    },
    ...
  ],
  "output_config": {
    "s3_destination": {
      "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
    }
  }
}

To send your request, save the request body in a file named request.json, and run the following command:

curl -X POST \
   -H "Authorization: Bearer TOKEN" \
   -H "Content-Type: application/json; charset=utf-8" \
   -d @request.json \
   "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"

The response contains the ID for a long-running operation.

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}
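The batch request body shown above can be assembled like this. The field names mirror the JSON example; the function name and sample values are illustrative:

```python
def batch_translate_request(source_lang, target_langs, input_uris, output_prefix):
    """Build the batchTranslateDocument body shown above for a list
    of bucket-stored input files and one or more target languages."""
    return {
        "source_language_code": source_lang,
        "target_language_codes": list(target_langs),
        "input_configs": [
            {"s3_source": {"input_uri": f"s3://{uri}"}} for uri in input_uris
        ],
        "output_config": {
            "s3_destination": {"output_uri_prefix": f"s3://{output_prefix}"}
        },
    }

# Illustrative usage with placeholder bucket paths:
body = batch_translate_request("en", ["es", "fr"],
                               ["dvs-bucket/a.pdf", "dvs-bucket/b.pdf"],
                               "dvs-bucket/out/")
```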

Translate and convert an original PDF file

The following example translates an original PDF file and converts it to a DOCX file. You can specify multiple inputs of various file types; they don't all have to be original PDF files. However, scanned PDF files can't be included when you request a conversion; if any are present, the request is rejected and no translations are performed. Only original PDF files are translated and converted to DOCX files. Files of other types are translated without conversion; for example, PPTX files are translated and returned as PPTX files.

If you regularly translate a mix of scanned and original PDF files, we recommend that you organize them into separate buckets. That way, when you request a batch translation and conversion, you can exclude the bucket that contains scanned PDF files instead of having to exclude individual files.

REST

Before using any of the request data, make the following replacements:

  • ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
  • PROJECT_ID: the numeric or alphanumeric ID of your project.
  • SOURCE_LANGUAGE: The language code of the input documents. Set to one of the language codes listed in Supported languages.
  • TARGET_LANGUAGE: The target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
  • INPUT_FILE_PATH: The storage bucket location and filename of one or more input documents.
  • OUTPUT_FILE_PREFIX: The storage bucket location where all output documents are stored.

HTTP method and URL:

POST https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument

Request JSON body:

{
  "source_language_code": "SOURCE_LANGUAGE",
  "target_language_codes": ["TARGET_LANGUAGE", ...],
  "input_configs": [
    {
      "s3_source": {
        "input_uri": "s3://INPUT_FILE_PATH_1"
      }
    },
    {
      "s3_source": {
        "input_uri": "s3://INPUT_FILE_PATH_2"
      }
    },
    ...
  ],
  "output_config": {
    "s3_destination": {
      "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
    }
  },
  "format_conversions": {
    "application/pdf": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  }
}

To send your request, save the request body in a file named request.json, and run the following command:

curl -X POST \
  -H "Authorization: Bearer TOKEN" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"

The response contains the ID for a long-running operation.

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}
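Requesting the conversion is a one-field addition on top of a plain batch body. This sketch shows that change; the MIME types are the ones from the request above, and the helper name is illustrative:

```python
PDF = "application/pdf"
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

def with_pdf_to_docx(batch_body):
    """Return a copy of a batchTranslateDocument body that also
    converts original PDF inputs to DOCX, as in the example above."""
    body = dict(batch_body)
    body["format_conversions"] = {PDF: DOCX}
    return body

# Illustrative usage on a minimal base body:
base = {"source_language_code": "en"}
converted = with_pdf_to_docx(base)
```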

Use a glossary

You can include a glossary to handle domain-specific terminology. If you specify a glossary, you must also specify the source language. The following example uses a glossary. You can specify up to 10 target languages, each with its own glossary.

If you specify a glossary for only some target languages, the system doesn't use any glossary for the remaining languages.

REST

Before using any of the request data, make the following replacements:

  • ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
  • PROJECT_ID: the numeric or alphanumeric ID of your project.
  • SOURCE_LANGUAGE: The language code of the input documents. Set to one of the language codes listed in Supported languages.
  • TARGET_LANGUAGE: The target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
  • INPUT_FILE_PATH: The storage bucket location and filename of one or more input documents.
  • OUTPUT_FILE_PREFIX: The storage bucket location where all output documents are stored.
  • GLOSSARY_PROJECT_ID: The project ID where the glossary is located.

HTTP method and URL:

POST https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument

Request JSON body:

{
  "source_language_code": "SOURCE_LANGUAGE",
  "target_language_codes": ["TARGET_LANGUAGE", ...],
  "input_configs": [
    {
      "s3_source": {
        "input_uri": "s3://INPUT_FILE_PATH"
      }
    }
  ],
  "output_config": {
    "s3_destination": {
      "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
    }
  },
  "glossaries": {
    "TARGET_LANGUAGE": {
      "glossary": "projects/GLOSSARY_PROJECT_ID"
    },
    ...
  }
}

To send your request, save the request body in a file named request.json, and run the following command:

curl -X POST \
   -H "Authorization: Bearer TOKEN" \
   -H "Content-Type: application/json; charset=utf-8" \
   -d @request.json \
   "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"

The response contains the ID for a long-running operation.

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}
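Per-target-language glossaries can be attached to a batch body the same way. The glossaries field mirrors the request above; the helper name is illustrative, and the glossary resource name is left as the same placeholder form the example uses:

```python
def with_glossaries(batch_body, glossary_by_lang):
    """Return a copy of a batchTranslateDocument body with a
    per-target-language glossaries map, as in the example above."""
    body = dict(batch_body)
    body["glossaries"] = {
        lang: {"glossary": name} for lang, name in glossary_by_lang.items()
    }
    return body

# Illustrative usage with a placeholder glossary resource name:
body = with_glossaries({}, {"es": "projects/GLOSSARY_PROJECT_ID"})
```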