The ML.PROCESS_DOCUMENT function

This document describes the ML.PROCESS_DOCUMENT function, which lets you process unstructured documents from an object table by using the Document AI API.

Syntax

ML.PROCESS_DOCUMENT(
  MODEL `project_id.dataset.model_name`,
  { TABLE `project_id.dataset.object_table` | (query_statement) }
  [, PROCESS_OPTIONS => (JSON 'process_options')]
)

Arguments

ML.PROCESS_DOCUMENT takes the following arguments:

  • project_id: Your project ID.

  • dataset: The BigQuery dataset that contains the model.

  • model_name: The name of a remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1.

  • object_table: The name of the object table that contains URIs of the documents.

    The documents in the object table must be of a supported type. An error is returned for any row that contains a document of an unsupported type.

  • query_statement: A GoogleSQL SELECT query that only references the object table. The query can't contain JOIN operations and can't use aliases to rename columns. You must include the uri and content_type columns from the object table in the SELECT statement. Other columns are optional.

  • process_options: A STRING value that contains a ProcessOptions resource in JSON format. Use this option to configure custom processing options for the document processor that the model references.

    For example, you might configure process options when using the layout parser to perform document chunking. The JSON configuration would look similar to '{"layout_config": {"chunking_config": {"chunk_size": 250,"include_ancestor_headings": true}}}'.
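
As a rough illustration of how these arguments fit together, the following sketch passes both a query_statement and PROCESS_OPTIONS in one call. The model and table names are placeholders, the model is assumed to reference a layout parser processor, and the WHERE filter is assumed to be acceptable in query_statement because it references only the object table.

```sql
-- Sketch only: `myproject.mydataset.layout_parser` is a hypothetical remote
-- model that is assumed to reference a layout parser processor.
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.layout_parser`,
  (
    -- uri and content_type are required; no JOINs and no column aliases.
    SELECT uri, content_type
    FROM `myproject.mydataset.documents`
    WHERE content_type = 'application/pdf'  -- assumed filter, for illustration
  ),
  PROCESS_OPTIONS => (
    JSON '{"layout_config": {"chunking_config": {"chunk_size": 250, "include_ancestor_headings": true}}}'
  )
);
```

The JSON literal is the chunking configuration described above, so only the model, table, and filter would need to change for a different workload.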

Output

ML.PROCESS_DOCUMENT returns the following columns:

  • ml_process_document_result: a JSON value that contains the entities returned by the Document AI API.
  • ml_process_document_status: a STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
  • The fields returned by the processor specified in the model.
  • The columns from the object table or query referenced in the function input.
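
One way to work with these columns is to save the function output to a table and then query it, which avoids calling the Document AI API again for every follow-up query. The sketch below assumes a hypothetical destination table named `myproject.mydataset.processed_documents` and placeholder model and object table names. The JSON path is also an assumption: it expects the processor to return an entities array whose elements have a mentionText field.

```sql
-- Save the output once. The destination table name is hypothetical.
CREATE OR REPLACE TABLE `myproject.mydataset.processed_documents` AS
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);

-- Read one value out of the JSON result for the rows that succeeded.
-- COALESCE treats a NULL status the same as an empty one.
SELECT
  uri,
  JSON_VALUE(ml_process_document_result, '$.entities[0].mentionText') AS first_entity_text
FROM `myproject.mydataset.processed_documents`
WHERE COALESCE(ml_process_document_status, '') = '';
```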

Quotas

See Cloud AI service functions quotas and limits.

For quick links to update the quotas for specific Document AI API metrics, see Quotas list.

Known issues

After a query job that uses this function finishes successfully, some of the returned rows might contain the following error message:

A retryable error occurred: RESOURCE EXHAUSTED error from <remote endpoint>

This issue occurs because BigQuery query jobs finish successfully even if the function fails for some of the rows. The function fails when the volume of API calls to the remote endpoint exceeds the quota limits for that service. This issue occurs most often when you are running multiple parallel batch queries. BigQuery retries these calls, but if the retries fail, the resource exhausted error message is returned.

To iterate through inference calls until all rows are successfully processed, you can use the BigQuery remote inference SQL scripts or the BigQuery remote inference pipeline Dataform package.
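
Before retrying, it can help to see how many rows were affected. The following sketch is not the remote inference script or Dataform package mentioned above; it is only a diagnostic query. It assumes that the function output was previously saved to a hypothetical table named `myproject.mydataset.processed_documents`, and that failed rows carry a status that begins with the error text shown earlier.

```sql
-- Summarize outcomes in a saved copy of the function output.
-- The table name is hypothetical.
SELECT
  COUNTIF(COALESCE(ml_process_document_status, '') = '') AS succeeded,
  COUNTIF(STARTS_WITH(ml_process_document_status, 'A retryable error occurred')) AS retryable_failures,
  COUNT(*) AS total_rows
FROM `myproject.mydataset.processed_documents`;

-- List the documents that would need to be processed again.
SELECT uri, ml_process_document_status
FROM `myproject.mydataset.processed_documents`
WHERE STARTS_WITH(ml_process_document_status, 'A retryable error occurred');
```

If retryable_failures is greater than zero, those URIs are the candidates for another pass once quota is available.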

Locations

ML.PROCESS_DOCUMENT must run in the same region as the remote model that the function references. You can only create models based on Document AI in the US and EU multi-regions.

Limitations

The function can't process documents with more than 100 pages. Any row that contains such a file returns an error.

Example

The following example uses the invoice parser to process the documents represented by the documents table.

Create the model:

CREATE OR REPLACE MODEL `myproject.mydataset.invoice_parser`
  REMOTE WITH CONNECTION `myproject.myregion.myconnection`
  OPTIONS (
    remote_service_type = 'cloud_ai_document_v1',
    document_processor = 'processor_id'
  );

Process the documents:

SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);

The result is similar to the following:

ml_process_document_result ml_process_document_status invoice_type currency ...
{"entities":[{"confidence":1,"id":"0","mentionText":"10 105,93 10,59","pageAnchor":{"pageRefs":[{"boundingPoly":{"normalizedVertices":[{"x":0.40452111,"y":0.67199326},{"x":0.74776918,"y":0.67199326},{"x":0.74776918,"y":0.68208581},{"x":0.40452111,"y":0.68208581}]}}]},"properties":[{"confidence":0.66... USD
