The ML.PROCESS_DOCUMENT function
This document describes the ML.PROCESS_DOCUMENT function, which lets you process unstructured documents from an object table by using the Document AI API.
Syntax
ML.PROCESS_DOCUMENT(
  MODEL `project_id.dataset.model_name`,
  { TABLE `project_id.dataset.object_table` | (query_statement) }
  [, PROCESS_OPTIONS => ( JSON 'process_options')]
)
Arguments
ML.PROCESS_DOCUMENT takes the following arguments:
- project_id: Your project ID.
- dataset: The BigQuery dataset that contains the model.
- model_name: The name of a remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1.
- object_table: The name of the object table that contains URIs of the documents. The documents in the object table must be of a supported type. An error is returned for any row that contains a document of an unsupported type.
- query_statement: A GoogleSQL SELECT query that only references the object table. The query can't contain JOIN operations and can't use aliases to rename columns. You must include the uri and content_type columns from the object table in the SELECT statement. Other columns are optional.
- process_options: A STRING value that contains a ProcessOptions resource in JSON format. Use this option to configure custom processing options corresponding to the document processor for your use case. For example, you might configure process options when using the layout parser to perform document chunking. The JSON configuration would look similar to '{"layout_config": {"chunking_config": {"chunk_size": 250, "include_ancestor_headings": true}}}'.
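The arguments above can be combined as in the following sketch, which passes a query statement instead of a table and supplies the chunking configuration shown earlier. The model name `layout_parser`, the object table name, and the `application/pdf` filter are assumptions made for illustration; substitute your own remote model and object table.

```sql
-- Sketch only: `myproject.mydataset.layout_parser` is a hypothetical remote
-- model backed by a layout parser processor, and the content_type filter is
-- an assumed way to skip unsupported files.
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.layout_parser`,
  (
    -- The query references only the object table, uses no JOIN and no
    -- column aliases, and includes the required uri and content_type columns.
    SELECT uri, content_type
    FROM `myproject.mydataset.documents`
    WHERE content_type = 'application/pdf'
  ),
  PROCESS_OPTIONS => (
    JSON '{"layout_config": {"chunking_config": {"chunk_size": 250, "include_ancestor_headings": true}}}'
  )
);
```

Filtering in the query statement is one way to keep rows with unsupported document types from returning errors.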
Output
ML.PROCESS_DOCUMENT returns the following columns:
- ml_process_document_result: A JSON value that contains the entities returned by the Document AI API.
- ml_process_document_status: A STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
- The fields returned by the processor specified in the model.
- The columns from the object table or query referenced in the function input.
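As a rough illustration of working with these columns, the following sketch flattens the entities in ml_process_document_result for rows that were processed successfully. It assumes the result JSON has a top-level `entities` array with `mentionText` and `confidence` fields, as in the sample output in the Example section, and that a successful status is an empty string rather than NULL; the model and table names are placeholders.

```sql
-- Sketch only: field names follow the sample output in this document;
-- other processors may return a different JSON shape.
SELECT
  result.uri,
  JSON_VALUE(entity, '$.mentionText') AS mention_text,
  JSON_VALUE(entity, '$.confidence') AS confidence
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
) AS result,
  -- Expand the entities array so that each entity becomes its own row.
  UNNEST(JSON_QUERY_ARRAY(result.ml_process_document_result, '$.entities')) AS entity
-- Keep only rows where the API call succeeded (assumed: empty string).
WHERE result.ml_process_document_status = '';
```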
Quotas
See Cloud AI service functions quotas and limits.
For quick links to update the quotas for specific Document AI API metrics, see Quotas list.
Known issues
Sometimes after a query job that uses this function finishes successfully, some returned rows contain the following error message:
A retryable error occurred: RESOURCE EXHAUSTED error from <remote endpoint>
This issue occurs because BigQuery query jobs finish successfully even if the function fails for some of the rows. The function fails when the volume of API calls to the remote endpoint exceeds the quota limits for that service. This issue occurs most often when you are running multiple parallel batch queries. BigQuery retries these calls, but if the retries fail, the resource exhausted error message is returned.
To iterate through inference calls until all rows are successfully processed, you can use the BigQuery remote inference SQL scripts or the BigQuery remote inference pipeline Dataform package.
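Before retrying, it can help to see which rows were affected. The following sketch is not the remote inference scripts or the Dataform package mentioned above; it only saves the function output to a table and lists the rows whose status starts with the retryable error text. The table name `processed_documents` is hypothetical.

```sql
-- Sketch only: `myproject.mydataset.processed_documents` is a hypothetical
-- table used to keep the function output for later inspection.
CREATE OR REPLACE TABLE `myproject.mydataset.processed_documents` AS
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);

-- List the documents that failed with the retryable error described above.
SELECT uri, ml_process_document_status
FROM `myproject.mydataset.processed_documents`
WHERE ml_process_document_status LIKE 'A retryable error occurred%';
```

The URIs returned by the second statement are the candidates for reprocessing.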
Locations
ML.PROCESS_DOCUMENT must run in the same region as the remote model that the function references. You can only create models based on Document AI in the US and EU multi-regions.
Limitations
The function can't process documents with more than 100 pages. Any row that contains such a file returns an error.
Example
The following example uses the invoice parser to process the documents represented by the documents table.
Create the model:
# Create model
CREATE OR REPLACE MODEL `myproject.mydataset.invoice_parser`
  REMOTE WITH CONNECTION `myproject.myregion.myconnection`
  OPTIONS (
    remote_service_type = 'cloud_ai_document_v1',
    document_processor = 'processor_id'
  );
Process the documents:
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);
The result is similar to the following:
ml_process_document_result | ml_process_document_status | invoice_type | currency | ... |
---|---|---|---|---|
{"entities":[{"confidence":1,"id":"0","mentionText":"10 105,93 10,59","pageAnchor":{"pageRefs":[{"boundingPoly":{"normalizedVertices":[{"x":0.40452111,"y":0.67199326},{"x":0.74776918,"y":0.67199326},{"x":0.74776918,"y":0.68208581},{"x":0.40452111,"y":0.68208581}]}}]},"properties":[{"confidence":0.66... | | | USD | ... |
What's next
- Get step-by-step instructions on how to process documents using the ML.PROCESS_DOCUMENT function.
- To learn more about model inference, including other functions that you can use to analyze BigQuery data, see Model inference overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.