You can transfer documents from Document AI Warehouse to Document AI Workbench using the export-to-Workbench pipeline. The pipeline exports the documents to a Cloud Storage folder and then imports them into a Document AI dataset. You provide both the Cloud Storage folder and the Document AI dataset.
Prerequisites
Before you begin, you need the following:
- Under the same Google Cloud project, follow the steps to create a processor.
- Dedicate an empty Cloud Storage folder for storing the exported documents (see the sketch after this list).
- On the custom processor page, click Configure Your Dataset, and then click Continue to initialize the dataset.
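If you don't already have a dedicated bucket, you can create one and pick an unused prefix to serve as the export folder. The following is a minimal sketch using the gcloud CLI; the bucket name, location, and prefix are placeholder assumptions, not values from this guide:

# Create a bucket in the same Google Cloud project (name and location are assumptions).
gcloud storage buckets create gs://my-cdw-export-bucket --location=us

# Cloud Storage has no real folders; an unused prefix such as
# gs://my-cdw-export-bucket/cdw-export/ works as the export folder path.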
Run the pipeline
REST
curl --location --request POST 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION:runPipeline' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${AUTH_TOKEN}" \
--data '{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION",
  "export_cdw_pipeline": {
    "documents": [
      "projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT"
    ],
    "export_folder_path": "gs://CLOUD_STORAGE_FOLDER",
    "doc_ai_dataset": "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR/dataset",
    "training_split_ratio": RATIO
  },
  "request_metadata": {
    "user_info": {
      "id": "user:USER_EMAIL_ADDRESS"
    }
  }
}'
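The ${AUTH_TOKEN} variable must hold an OAuth 2.0 access token for a principal with access to the Document AI Warehouse project and the processor. Assuming the gcloud CLI is installed and authenticated, one way to set it is:

export AUTH_TOKEN=$(gcloud auth print-access-token)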
Specify the training and test split ratio in the training_split_ratio field as a floating-point number. For example, for a set of 10 documents with a ratio of 0.8, 8 documents are added to the training set and the remaining 2 to the test set.
This command returns the resource name of a long-running operation. Use it to track the progress of the pipeline in the next step.
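The response resembles the following sketch; the trailing OPERATION segment of name identifies the operation to poll (exact metadata fields vary by pipeline):

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION"
}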
Get long-running operation result
REST
curl --location --request GET 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION' \
--header "Authorization: Bearer ${AUTH_TOKEN}"
Next step
- Go to your Document AI dataset to check the exported documents.