The Document Vision Service (DVS) combines Translation and OCR to provide document processing, using the Translate Document API to translate formatted documents such as PDF files directly. Compared to plain-text translation, the feature preserves the original formatting and layout in your translated documents, helping you retain much of the original context, such as paragraph breaks. DVS supports inline document translations, translations from storage buckets, and batch translations.
This document guides the Application Operator (AO) through the process of using the Vertex AI Translate Document pre-trained API on Google Distributed Cloud (GDC) air-gapped.
Supported formats
DVS supports the following input file types and their associated output file types.
Inputs | Document MIME type | Output
---|---|---
PDF | application/pdf | PDF, DOCX
DOC | application/msword | DOC, DOCX
DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | DOCX
PPT | application/vnd.ms-powerpoint | PPT, PPTX
PPTX | application/vnd.openxmlformats-officedocument.presentationml.presentation | PPTX
XLS | application/vnd.ms-excel | XLS, XLSX
XLSX | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | XLSX
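
If you build requests programmatically, a small lookup keyed on the file extension helps avoid typos in the long MIME type strings. The following Python sketch is illustrative only: the mapping mirrors the table, and the helper name is arbitrary.

import pathlib

# Maps supported input file extensions to their document MIME types,
# mirroring the table of supported formats above.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def mime_type_for(filename: str) -> str:
    """Return the document MIME type for a supported file, or raise if unsupported."""
    suffix = pathlib.Path(filename).suffix.lower()
    if suffix not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported input file type: {suffix}")
    return SUPPORTED_MIME_TYPES[suffix]

# Example: mime_type_for("report.docx") returns the DOCX MIME type.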
Original and scanned PDF document translations
DVS supports both original and scanned PDF files, including translations to or from right-to-left languages. DVS also preserves hyperlinks, font size, and font color from the source files.
Before you begin
Follow these steps before trying DVS:
- Create the dvs-project project. For information about creating and using projects, see Create a project. Alternatively, you can create the project using a custom resource (CR):

  apiVersion: resourcemanager.gdc.goog/v1
  kind: Project
  metadata:
    labels:
      atat.config.google.com/clin-number: CLIN_NUMBER
      atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
    name: dvs-project
    namespace: platform
- Ask your Project IAM Admin to grant you the AI Translation Developer (ai-translation-developer) role in the dvs-project project namespace. For more information, see Grant access to project resources.
- Download the gdcloud command-line interface (CLI).
- Install the Vertex AI client libraries. You must download the Vision and Translation client libraries according to your operating system.
Set up your service account
Set up your service account with a name, project ID, and service key.
${HOME}/gdcloud init # set URI and project
${HOME}/gdcloud auth login
${HOME}/gdcloud iam service-accounts create SERVICE_ACCOUNT --project=PROJECT_ID
${HOME}/gdcloud iam service-accounts keys create "SERVICE_KEY".json --project=PROJECT_ID --iam-account=SERVICE_ACCOUNT
Replace the following:

- SERVICE_ACCOUNT: the name you want to give to your service account.
- PROJECT_ID: your project ID number.
- SERVICE_KEY: the name of the JSON file for the service key.
Grant access to project resources
Grant access to the Translation API service account by providing your project ID, the name of your service account, and the ai-translation-developer role.

Ask your Project IAM Admin to grant you the AI Translation Developer (ai-translation-developer) role. To learn how to grant permissions to a subject, see Grant and revoke access.
${HOME}/gdcloud iam service-accounts add-iam-policy-binding --project=PROJECT_ID --iam-account=SERVICE_ACCOUNT --role=role/ai-translation-developer
Authenticate the request
You must get a token to authenticate the requests to the Translation pre-trained services. Follow these steps:
gdcloud CLI
Export the identity token for the specified account to an environment variable:
export TOKEN="$($HOME/gdcloud auth print-identity-token --audiences=https://ENDPOINT)"
Replace ENDPOINT
with the Translation endpoint. For more information, view service statuses and endpoints.
Python
Install the google-auth client library:

pip install google-auth
Save the following code to a Python script.
import google.auth
from google.auth.transport import requests

api_endpoint = "https://ENDPOINT"

creds, project_id = google.auth.default()
creds = creds.with_gdch_audience(api_endpoint)

def test_get_token():
    req = requests.Request()
    creds.refresh(req)
    print(creds.token)

if __name__ == "__main__":
    test_get_token()
Replace ENDPOINT with the Translation endpoint. For more information, view service statuses and endpoints.

Run the script to fetch the token.
For any curl
request, you must replace TOKEN
with the fetched token in the header as in the following example:
-H "Authorization: Bearer TOKEN"
Translate documents online
DVS in Distributed Cloud provides two types of online document translations: translating a document stored in a storage bucket and translating a document sent inline as part of the request.
Translate a document from a storage bucket
To translate a document that is stored in a bucket, follow these steps:
Prepare your environment
Before using the Translation API, follow these steps:
- Create a storage bucket in the dvs-project project, using the Standard class.
- Grant read and write permissions on the bucket to the Vertex AI Translation system service account (g-vai-translation-sie-sa) used by the Translation service.
Alternatively, you can follow these steps to create the storage bucket, role, and role binding using custom resources (CR):
Create the storage bucket by deploying a Bucket CR in the dvs-project namespace:

apiVersion: object.gdc.goog/v1
kind: Bucket
metadata:
  name: dvs-bucket
  namespace: dvs-project
spec:
  description: bucket for document vision service
  storageClass: Standard
  bucketPolicy:
    lockingPolicy:
      defaultObjectRetentionDays: 90
Create the role by deploying a Role CR in the dvs-project namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dvs-reader-writer
  namespace: dvs-project
rules:
- apiGroups:
  - object.gdc.goog
  resources:
  - buckets
  verbs:
  - read-object
  - write-object
Create the role binding by deploying a RoleBinding CR in the dvs-project namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dvs-reader-writer-rolebinding
  namespace: dvs-project
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dvs-reader-writer
subjects:
- kind: ServiceAccount
  name: g-vai-translation-sie-sa
  namespace: g-vai-translation-sie
Upload files to the storage bucket
You must upload your documents to the storage bucket to let the Translation service process the files.
To upload files to the storage bucket, follow these steps:
- Configure the gdcloud CLI storage by following the instructions from Configure the gdcloud CLI for object storage.
- Upload your document to the storage bucket you created. For more information about how to upload objects to storage buckets, see Upload and download storage objects in projects.
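
If you prefer to script the upload, the following Python sketch uses boto3 against the bucket's S3-compatible endpoint. This is an assumption-laden sketch: the endpoint URL, access key ID, and secret access key are placeholders that come from your project's object storage configuration, and the object key is arbitrary.

import boto3

# Placeholders: take these values from your project's object storage configuration.
s3 = boto3.client(
    "s3",
    endpoint_url="https://OBJECT_STORAGE_ENDPOINT",
    aws_access_key_id="ACCESS_KEY_ID",
    aws_secret_access_key="SECRET_ACCESS_KEY",
)

# Upload a local PDF to the bucket created for DVS.
s3.upload_file("document.pdf", "dvs-bucket", "input/document.pdf")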
The following example translates a file from a bucket and outputs the result to another bucket path. The response also returns a byte stream. You can specify the MIME type; if you don't, DVS determines it by using the input file's extension.
DVS supports language auto-detection for documents stored in buckets. If you
don't specify a source language code, DVS detects the language for you. The
detected language is included in the output in the detectedLanguageCode
field.
HTTP
The following example uses the curl
tool to make an HTTP call with an input
PDF document in a storage bucket.
Save the following request.json file:

cat <<- EOF > request.json
{
  "parent": "projects/PROJECT_ID",
  "source_language_code": "SOURCE_LANGUAGE",
  "target_language_code": "TARGET_LANGUAGE",
  "document_input_config": {
    "mime_type": "application/pdf",
    "s3_source": {
      "input_uri": "s3://INPUT_FILE_PATH"
    }
  },
  "document_output_config": {
    "mime_type": "application/pdf"
  },
  "enable_rotation_correction": "true"
}
EOF
Replace the following:

- PROJECT_ID: the ID of the project that you want to use.
- SOURCE_LANGUAGE: the language in which your document is written. For a list of supported languages, see Get supported languages.
- TARGET_LANGUAGE: the language or languages into which you want to translate your document. For a list of supported languages, see Get supported languages.
- INPUT_FILE_PATH: the path of your document file in the storage bucket.
Use the curl tool to call the endpoint and take the request from the request.json file:

curl -vv --cacert CACERT \
  --data-binary @- \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer TOKEN" \
  https://ENDPOINT.GDC_URL:443/v3/projects/PROJECT_ID:translateDocument < request.json
Replace the following:

- CACERT: the path to find the CA certificate.
- TOKEN: the token you obtained when you authenticated the gdcloud CLI.
- ENDPOINT: the Translation endpoint that you use for your organization.
- GDC_URL: the URL of your organization in Distributed Cloud, for example, org-1.zone1.gdch.test.
- PROJECT_ID: the ID of the project that you want to use.
The output appears after the command completes.
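
As an alternative to curl, the following Python sketch sends the same storage bucket request with the requests library. It uses the same placeholders as the curl example; because the exact shape of the response body can differ, it falls back to printing the whole response if the detectedLanguageCode field is not where it expects.

import json
import os

import requests

# Placeholders: same values as in the curl example above.
ENDPOINT = "ENDPOINT.GDC_URL"
PROJECT_ID = "PROJECT_ID"
CACERT = "CACERT"  # path to the CA certificate

body = {
    "parent": f"projects/{PROJECT_ID}",
    # Omit source_language_code to let DVS auto-detect the source language.
    "target_language_code": "TARGET_LANGUAGE",
    "document_input_config": {
        "mime_type": "application/pdf",
        "s3_source": {"input_uri": "s3://INPUT_FILE_PATH"},
    },
    "document_output_config": {"mime_type": "application/pdf"},
    "enable_rotation_correction": "true",
}

response = requests.post(
    f"https://{ENDPOINT}:443/v3/projects/{PROJECT_ID}:translateDocument",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['TOKEN']}",
    },
    data=json.dumps(body),
    verify=CACERT,
)
response.raise_for_status()
result = response.json()

# Assumption: the detected language is nested under documentTranslation.
print(result.get("documentTranslation", {}).get("detectedLanguageCode", result))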
Translate a document inline
The following example sends a document inline as part of the request. You must include the MIME type for inline document translations.
DVS supports language auto-detection for inline text translations. If you don't
specify a source language code, DVS detects the language for you. The detected
language is included in the output in the detectedLanguageCode
field.
HTTP
The following example uses the curl
tool to make an HTTP call with an inline PDF document.
echo '{"parent": "projects/PROJECT_ID","source_language_code": "SOURCE_LANGUAGE", "target_language_code": "TARGET_LANGUAGE", "document_input_config": { "mime_type": "application/pdf", "content": "'$(base64 -w 0 INPUT_FILE_PATH)'" }, "document_output_config": { "mime_type": "application/pdf" }, "enable_rotation_correction": "true"}' | curl --cacert CACERT --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" https://ENDPOINT.GDC_URL/v3/projects/PROJECT_ID:translateDocument
Replace the following:

- PROJECT_ID: the ID of the project that you want to use.
- SOURCE_LANGUAGE: the language in which your document is written. For a list of supported languages, see Get supported languages.
- TARGET_LANGUAGE: the language or languages into which you want to translate your document. For a list of supported languages, see Get supported languages.
- INPUT_FILE_PATH: the local path of your document file.
- ENDPOINT: the Translation endpoint that you use for your organization.
- GDC_URL: the URL of your organization in Distributed Cloud, for example, org-1.zone1.gdch.test.
- TOKEN: the token you obtained when you authenticated the gdcloud CLI.
The output appears after the command completes.
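
The same inline request can be sent from Python. The following sketch base64-encodes a local file, posts it, and writes the translated bytes to disk; it assumes the translated document comes back base64-encoded in the response JSON under documentTranslation.byteStreamOutputs, so adjust the field access if your response is shaped differently.

import base64
import json
import os

import requests

ENDPOINT = "ENDPOINT.GDC_URL"  # placeholder
PROJECT_ID = "PROJECT_ID"      # placeholder
CACERT = "CACERT"              # placeholder: path to the CA certificate

# Read and base64-encode the local document, like base64 -w 0 in the shell example.
with open("INPUT_FILE_PATH", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

body = {
    "parent": f"projects/{PROJECT_ID}",
    "source_language_code": "SOURCE_LANGUAGE",
    "target_language_code": "TARGET_LANGUAGE",
    "document_input_config": {"mime_type": "application/pdf", "content": content},
    "document_output_config": {"mime_type": "application/pdf"},
    "enable_rotation_correction": "true",
}

response = requests.post(
    f"https://{ENDPOINT}/v3/projects/{PROJECT_ID}:translateDocument",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['TOKEN']}",
    },
    data=json.dumps(body),
    verify=CACERT,
)
response.raise_for_status()
result = response.json()

# Assumption: translated bytes are base64-encoded in documentTranslation.byteStreamOutputs.
outputs = result.get("documentTranslation", {}).get("byteStreamOutputs", [])
if outputs:
    with open("translated.pdf", "wb") as out:
        out.write(base64.b64decode(outputs[0]))
else:
    print(json.dumps(result, indent=2))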
Translate documents in batch
Batch translation lets you translate multiple files into multiple languages in a single request. For each request, you can send up to 100 files with a total content size of up to 1 GB or 100 million Unicode codepoints, whichever limit is hit first. You can specify a particular translation model for each language.
For more information, see batchTranslateDocument.
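
You can also assemble the batch request body programmatically before sending it with the commands in the following sections. The sketch below is illustrative: it builds the input_configs list from bucket URIs you supply, enforces the 100-file limit described above, and writes the body to request.json for use with curl.

import json

def build_batch_request(source_language, target_languages, input_uris, output_uri_prefix):
    """Build a batchTranslateDocument request body from storage bucket URIs."""
    if len(input_uris) > 100:
        raise ValueError("A batch request supports at most 100 input files.")
    return {
        "source_language_code": source_language,
        "target_language_codes": list(target_languages),
        "input_configs": [{"s3_source": {"input_uri": uri}} for uri in input_uris],
        "output_config": {"s3_destination": {"output_uri_prefix": output_uri_prefix}},
    }

# Placeholders: replace with your language codes and bucket paths.
body = build_batch_request(
    "SOURCE_LANGUAGE",
    ["TARGET_LANGUAGE"],
    ["s3://INPUT_FILE_PATH_1", "s3://INPUT_FILE_PATH_2"],
    "s3://OUTPUT_FILE_PREFIX",
)
with open("request.json", "w") as f:
    json.dump(body, f, indent=2)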
Translate multiple documents
The following example includes multiple input configurations. Each input configuration is a pointer to a file in a storage bucket.
REST
Before using any of the request data, make the following replacements:
- ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
- PROJECT_ID: the numeric or alphanumeric ID of your project.
- SOURCE_LANGUAGE: the language code of the input documents. Set to one of the language codes listed in Supported languages.
- TARGET_LANGUAGE: the target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
- INPUT_FILE_PATH: the storage bucket location and filename of one or more input documents.
- OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.
HTTP method and URL:
https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument
Request JSON body:
{
"source_language_code": "SOURCE_LANGUAGE",
"target_language_codes": ["TARGET_LANGUAGE", ...],
"input_configs": [
{
"s3_source": {
"input_uri": "s3://INPUT_FILE_PATH_1"
}
},
{
"s3_source": {
"input_uri": "s3://INPUT_FILE_PATH_2"
}
},
...
],
"output_config": {
"s3_destination": {
"output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
}
}
}
To send your request, save the request body in a file named request.json
, and run the following command:
curl -X POST \
-H "Authorization: Bearer TOKEN" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
The response contains the ID for a long-running operation.
{
"name": "projects/PROJECT_ID/operations/OPERATION_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
"state": "RUNNING"
}
}
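
Batch translations run asynchronously, so you typically poll the returned operation until it completes. The following sketch assumes the operation resource can be read back with a GET on /v3/ followed by the operation name from the response; verify the operations endpoint for your Translation service before relying on this pattern.

import os
import time

import requests

ENDPOINT = "ENDPOINT"  # placeholder: Translation endpoint
CACERT = "CACERT"      # placeholder: path to the CA certificate
OPERATION_NAME = "projects/PROJECT_ID/operations/OPERATION_ID"  # from the response above

headers = {"Authorization": f"Bearer {os.environ['TOKEN']}"}

# Assumption: the operation is readable at /v3/<operation name>.
while True:
    response = requests.get(
        f"https://{ENDPOINT}:443/v3/{OPERATION_NAME}",
        headers=headers,
        verify=CACERT,
    )
    response.raise_for_status()
    operation = response.json()
    if operation.get("done"):
        print(operation)
        break
    time.sleep(30)  # wait before polling again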
Translate and convert an original PDF file
The following example translates and converts an original PDF file to a DOCX file. You can specify multiple inputs of various file types; they don't all have to be original PDF files. However, a request that includes a format conversion cannot include scanned PDF files; if it does, the request is rejected and no translations are done. Only original PDF files are translated and converted to DOCX files. Files of other types are translated but not converted; for example, if you include PPTX files, they are translated and returned as PPTX files.
If you regularly translate a mix of scanned and original PDF files, we recommend that you organize them into separate buckets. That way, when you request a batch translation and conversion, you can exclude the bucket that contains scanned PDF files instead of having to exclude individual files.
REST
Before using any of the request data, make the following replacements:
- ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
- PROJECT_ID: the numeric or alphanumeric ID of your project.
- SOURCE_LANGUAGE: the language code of the input documents. Set to one of the language codes listed in Supported languages.
- TARGET_LANGUAGE: the target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
- INPUT_FILE_PATH: the storage bucket location and filename of one or more input documents.
- OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.
HTTP method and URL:
https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument
Request JSON body:
{
"source_language_code": "SOURCE_LANGUAGE",
"target_language_codes": ["TARGET_LANGUAGE", ...],
"input_configs": [
{
"s3_source": {
"input_uri": "s3://INPUT_FILE_PATH_1"
}
},
{
"s3_source": {
"input_uri": "s3://INPUT_FILE_PATH_2"
}
},
...
],
"output_config": {
"s3_destination": {
"output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
}
},
"format_conversions": {
"application/pdf": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
}
To send your request, save the request body in a file named request.json
, and run the following command:
curl -X POST \
-H "Authorization: Bearer TOKEN" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
The response contains the ID for a long-running operation.
{
"name": "projects/PROJECT_ID/operations/OPERATION_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
"state": "RUNNING"
}
}
Use a glossary
You can include a glossary to handle domain-specific terminology. If you specify a glossary, you must also specify the source language. The following example uses a glossary. You can specify up to 10 target languages, each with its own glossary.
If you specify a glossary for some target languages, the system doesn't use any glossary for the unspecified languages.
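
The glossaries field in the request body below is a map keyed by target language code. If you assemble the request programmatically, a sketch like the following keeps each target language paired with its glossary; the language codes and glossary resource names are placeholders.

# Placeholders: replace the language codes and glossary resource names with your own.
glossaries = {
    "TARGET_LANGUAGE_1": {"glossary": "projects/GLOSSARY_PROJECT_ID"},
    "TARGET_LANGUAGE_2": {"glossary": "projects/GLOSSARY_PROJECT_ID"},
}

request_body = {
    "source_language_code": "SOURCE_LANGUAGE",  # required when you use a glossary
    "target_language_codes": list(glossaries.keys()),
    "input_configs": [{"s3_source": {"input_uri": "s3://INPUT_FILE_PATH"}}],
    "output_config": {"s3_destination": {"output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"}},
    "glossaries": glossaries,
}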
REST
Before using any of the request data, make the following replacements:
- ENDPOINT: the Translation endpoint that you use for your organization. For more information, view service status and endpoints.
- PROJECT_ID: the numeric or alphanumeric ID of your project.
- SOURCE_LANGUAGE: the language code of the input documents. Set to one of the language codes listed in Supported languages.
- TARGET_LANGUAGE: the target language or languages to translate the input documents to. Use the language codes listed in Supported languages.
- INPUT_FILE_PATH: the storage bucket location and filename of one or more input documents.
- OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.
- GLOSSARY_PROJECT_ID: the project ID where the glossary is located.
HTTP method and URL:
https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument
Request JSON body:
{
"source_language_code": "SOURCE_LANGUAGE",
"target_language_codes": "[TARGET_LANGUAGE`, ...]",
"input_configs": [
{
"s3_source": {
"input_uri": "s3://INPUT_FILE_PATH"
}
}
],
"output_config": {
"s3_destination": {
"output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
}
},
"glossaries": {
"TARGET_LANGUAGE": {
"glossary": "projects/GLOSSARY_PROJECT_ID"
},
...
}
}
To send your request, save the request body in a file named request.json
, and run the following command:
curl -X POST \
-H "Authorization: Bearer TOKEN" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
The response contains the ID for a long-running operation.
{
"name": "projects/PROJECT_ID/operations/OPERATION_ID",
"metadata": {
"@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
"state": "RUNNING"
}
}