Run a predefined pipeline.
HTTP request
POST https://contentwarehouse.googleapis.com/v1/{name}:runPipeline
Path parameters
| Parameters | |
|---|---|
| name | string. Required. The resource name which owns the resources of the pipeline. Format: projects/{projectNumber}/locations/{location}. |
Request body
The request body contains data with the following structure:
JSON representation

```
{
  "requestMetadata": {
    object (RequestMetadata)
  },

  // Union field pipeline can be only one of the following:
  "gcsIngestPipeline": {
    object (GcsIngestPipeline)
  },
  "gcsIngestWithDocAiProcessorsPipeline": {
    object (GcsIngestWithDocAiProcessorsPipeline)
  },
  "exportCdwPipeline": {
    object (ExportToCdwPipeline)
  },
  "processWithDocAiPipeline": {
    object (ProcessWithDocAiPipeline)
  }
  // End of list of possible types for union field pipeline.
}
```
| Fields | |
|---|---|
| requestMetadata | object (RequestMetadata). The meta information collected about the end user, used to enforce access control for the service. |
| Union field pipeline. The predefined pipelines. pipeline can be only one of the following: | |
| gcsIngestPipeline | object (GcsIngestPipeline). Cloud Storage ingestion pipeline. |
| gcsIngestWithDocAiProcessorsPipeline | object (GcsIngestWithDocAiProcessorsPipeline). Use DocAI processors to process documents in Cloud Storage and ingest them into Document Warehouse. |
| exportCdwPipeline | object (ExportToCdwPipeline). Export documents from Document Warehouse to CDW for training purposes. |
| processWithDocAiPipeline | object (ProcessWithDocAiPipeline). Use a DocAI processor to process documents in Document Warehouse, and re-ingest the updated results into Document Warehouse. |
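For illustration, a complete request body that selects the gcsIngestPipeline union member might look like the sketch below. The project number, location, bucket, schema ID, and user email are placeholders, and the userInfo shape inside requestMetadata is assumed from the RequestMetadata type rather than spelled out on this page:

```
{
  "requestMetadata": {
    // Assumed RequestMetadata shape: the end user on whose behalf the pipeline runs.
    "userInfo": {
      "id": "user:pipeline-admin@example.com"
    }
  },
  "gcsIngestPipeline": {
    // Placeholder bucket and schema; every file under inputPath is ingested.
    "inputPath": "gs://example-bucket/invoices/",
    "schemaName": "projects/123456789/locations/us/documentSchemas/invoice-schema-id",
    "skipIngestedDocuments": true
  }
}
```

This body would be sent to POST https://contentwarehouse.googleapis.com/v1/projects/123456789/locations/us:runPipeline.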
Response body
If successful, the response body contains an instance of Operation.
Authorization scopes
Requires the following OAuth scope:
https://www.googleapis.com/auth/cloud-platform
For more information, see the Authentication Overview.
IAM Permissions
Requires the following IAM permission on the name resource:
contentwarehouse.documents.create
For more information, see the IAM documentation.
GcsIngestPipeline
The configuration of the Cloud Storage Ingestion pipeline.
JSON representation

```
{
  "inputPath": string,
  "schemaName": string,
  "processorType": string,
  "skipIngestedDocuments": boolean,
  "pipelineConfig": {
    object (IngestPipelineConfig)
  }
}
```
| Fields | |
|---|---|
| inputPath | string. The input Cloud Storage folder. All files under this folder will be imported to Document Warehouse. Format: gs://<bucket-name>/<folder-name>. |
| schemaName | string. The Document Warehouse schema resource name. All documents processed by this pipeline will use this schema. Format: projects/{projectNumber}/locations/{location}/documentSchemas/{document_schema_id}. |
| processorType | string. The Doc AI processor type name. Only used when the ingested files are in the Doc AI Document proto format. |
| skipIngestedDocuments | boolean. Whether to skip documents that have already been ingested. If set to true, documents whose Cloud Storage custom metadata contains the key "status" with the value "status=ingested" will not be ingested again. |
| pipelineConfig | object (IngestPipelineConfig). Optional. The config for the Cloud Storage Ingestion pipeline. It provides additional customization options for running the pipeline and can be skipped if it is not applicable. |
IngestPipelineConfig
The ingestion pipeline config.
JSON representation

```
{
  "documentAclPolicy": {
    object (Policy)
  },
  "enableDocumentTextExtraction": boolean,
  "folder": string,
  "cloudFunction": string
}
```
| Fields | |
|---|---|
| documentAclPolicy | object (Policy). The document-level ACL policy config. This refers to an Identity and Access Management (IAM) policy, which specifies access controls for all documents ingested by the pipeline. The following roles are supported for document-level ACL control: roles/contentwarehouse.documentAdmin, roles/contentwarehouse.documentEditor, roles/contentwarehouse.documentViewer. The following members are supported for document-level ACL control: user:user-email@example.com, group:group-email@example.com. Note that for documents searched with LLM, only a single-level user or group ACL check is supported. |
| enableDocumentTextExtraction | boolean. The document text extraction enabled flag. If the flag is set to true, Document Warehouse will perform text extraction on the raw document. |
| folder | string. Optional. The name of the folder to which all ingested documents will be linked during the ingestion process. Format: projects/{project}/locations/{location}/documents/{folder_id}. |
| cloudFunction | string. The Cloud Function resource name. The Cloud Function needs to live inside the consumer project and be accessible to the Document AI Warehouse P4SA. Only Cloud Functions V2 is supported. Cloud Function execution should complete within 5 minutes or the file ingestion may fail due to timeout. Format: https://{region}-{project_id}.cloudfunctions.net/{cloud_function}. |
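As a sketch, an IngestPipelineConfig that grants document-level viewer access and enables text extraction could look like the following. The group address and folder ID are placeholders; the bindings shape follows the standard IAM Policy format:

```
{
  "documentAclPolicy": {
    // Standard IAM Policy bindings; only the three contentwarehouse document roles are supported here.
    "bindings": [
      {
        "role": "roles/contentwarehouse.documentViewer",
        "members": [
          "group:finance-readers@example.com"
        ]
      }
    ]
  },
  "enableDocumentTextExtraction": true,
  // Placeholder folder to link all ingested documents to.
  "folder": "projects/123456789/locations/us/documents/folder-id"
}
```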
GcsIngestWithDocAiProcessorsPipeline
The configuration of the Cloud Storage Ingestion with DocAI Processors pipeline.
JSON representation

```
{
  "inputPath": string,
  "splitClassifyProcessorInfo": {
    object (ProcessorInfo)
  },
  "extractProcessorInfos": [
    {
      object (ProcessorInfo)
    }
  ],
  "processorResultsFolderPath": string,
  "skipIngestedDocuments": boolean,
  "pipelineConfig": {
    object (IngestPipelineConfig)
  }
}
```
| Fields | |
|---|---|
| inputPath | string. The input Cloud Storage folder. All files under this folder will be imported to Document Warehouse. Format: gs://<bucket-name>/<folder-name>. |
| splitClassifyProcessorInfo | object (ProcessorInfo). The split and classify processor information. The split and classify result will be used to find a matching extract processor. |
| extractProcessorInfos[] | object (ProcessorInfo). The extract processors information. One matching extract processor will be used to process documents based on the classify processor result. If no classify processor is specified, the first extract processor will be used. |
| processorResultsFolderPath | string. The Cloud Storage folder path used to store the raw results from processors. Format: gs://<bucket-name>/<folder-name>. |
| skipIngestedDocuments | boolean. Whether to skip documents that have already been ingested. If set to true, documents whose Cloud Storage custom metadata contains the key "status" with the value "status=ingested" will not be ingested again. |
| pipelineConfig | object (IngestPipelineConfig). Optional. The config for the Cloud Storage Ingestion with DocAI Processors pipeline. It provides additional customization options for running the pipeline and can be skipped if it is not applicable. |
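As a sketch, a gcsIngestWithDocAiProcessorsPipeline configuration that pairs one split/classify processor with two extract processors might look like this; all processor names, document types, schema IDs, and paths are hypothetical:

```
{
  "gcsIngestWithDocAiProcessorsPipeline": {
    "inputPath": "gs://example-bucket/incoming/",
    // Classifier whose result selects a matching extract processor by documentType.
    "splitClassifyProcessorInfo": {
      "processorName": "projects/123456789/locations/us/processors/classifier-id"
    },
    "extractProcessorInfos": [
      {
        "processorName": "projects/123456789/locations/us/processors/invoice-extractor-id",
        "documentType": "invoice",
        "schemaName": "projects/123456789/locations/us/documentSchemas/invoice-schema-id"
      },
      {
        "processorName": "projects/123456789/locations/us/processors/contract-extractor-id",
        "documentType": "contract",
        "schemaName": "projects/123456789/locations/us/documentSchemas/contract-schema-id"
      }
    ],
    "processorResultsFolderPath": "gs://example-bucket/processor-results/"
  }
}
```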
ProcessorInfo
The DocAI processor information.
JSON representation

```
{
  "processorName": string,
  "documentType": string,
  "schemaName": string
}
```
| Fields | |
|---|---|
| processorName | string. The processor resource name. Format: projects/{project}/locations/{location}/processors/{processor}. |
| documentType | string. The processor will process the documents with this document type. |
| schemaName | string. The Document schema resource name. All documents processed by this processor will use this schema. Format: projects/{projectNumber}/locations/{location}/documentSchemas/{document_schema_id}. |
ExportToCdwPipeline
The configuration of exporting documents from the Document Warehouse to CDW pipeline.
JSON representation

```
{
  "documents": [
    string
  ],
  "exportFolderPath": string,
  "docAiDataset": string,
  "trainingSplitRatio": number
}
```
| Fields | |
|---|---|
| documents[] | string. The list of all the resource names of the documents to be processed. Format: projects/{projectNumber}/locations/{location}/documents/{documentId}. |
| exportFolderPath | string. The Cloud Storage folder path used to store the exported documents before they are sent to CDW. Format: gs://<bucket-name>/<folder-name>. |
| docAiDataset | string. Optional. The CDW dataset resource name. If not set, the documents will be exported to Cloud Storage only. Format: projects/{project}/locations/{location}/processors/{processor}/dataset. |
| trainingSplitRatio | number. The ratio of the training dataset split. When importing into Document AI Workbench, documents will be automatically split into training and test sets with the specified ratio. This field is required if docAiDataset is set. |
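A hypothetical exportCdwPipeline body that exports two documents into a Document AI Workbench dataset with an 80/20 training/test split (all resource names are placeholders):

```
{
  "exportCdwPipeline": {
    "documents": [
      "projects/123456789/locations/us/documents/doc-id-1",
      "projects/123456789/locations/us/documents/doc-id-2"
    ],
    // Staging folder in Cloud Storage before the documents reach CDW.
    "exportFolderPath": "gs://example-bucket/cdw-export/",
    "docAiDataset": "projects/123456789/locations/us/processors/processor-id/dataset",
    // Required because docAiDataset is set: 80% training, 20% test.
    "trainingSplitRatio": 0.8
  }
}
```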
ProcessWithDocAiPipeline
The configuration of the pipeline that processes documents in Document Warehouse with DocAI processors.
JSON representation

```
{
  "documents": [
    string
  ],
  "exportFolderPath": string,
  "processorInfo": {
    object (ProcessorInfo)
  },
  "processorResultsFolderPath": string
}
```
| Fields | |
|---|---|
| documents[] | string. The list of all the resource names of the documents to be processed. Format: projects/{projectNumber}/locations/{location}/documents/{documentId}. |
| exportFolderPath | string. The Cloud Storage folder path used to store the exported documents before they are sent to CDW. Format: gs://<bucket-name>/<folder-name>. |
| processorInfo | object (ProcessorInfo). The CDW processor information. |
| processorResultsFolderPath | string. The Cloud Storage folder path used to store the raw results from processors. Format: gs://<bucket-name>/<folder-name>. |
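Finally, a sketch of a processWithDocAiPipeline body that re-processes two warehoused documents with a single extract processor; all names and paths are placeholders:

```
{
  "processWithDocAiPipeline": {
    "documents": [
      "projects/123456789/locations/us/documents/doc-id-1",
      "projects/123456789/locations/us/documents/doc-id-2"
    ],
    "exportFolderPath": "gs://example-bucket/export/",
    "processorInfo": {
      // Processor whose output is re-ingested into Document Warehouse.
      "processorName": "projects/123456789/locations/us/processors/invoice-extractor-id",
      "schemaName": "projects/123456789/locations/us/documentSchemas/invoice-schema-id"
    },
    "processorResultsFolderPath": "gs://example-bucket/processor-results/"
  }
}
```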