BigQuery integration

BigQuery's integration with Document AI helps you build document analytics and generative AI use cases. As digital transformation accelerates, organizations generate vast amounts of text and other document data, all of which holds immense potential for insights and for powering novel generative AI use cases. To help you harness this data, we're excited to announce an integration between BigQuery and Document AI that lets you extract insights from document data and build new large language model (LLM) applications.


BigQuery customers can now create Document AI custom extractors, powered by Google's cutting-edge foundation models, which they can customize based on their own documents and metadata. These customized models can then be invoked from BigQuery to extract structured data from documents in a secure, governed manner, using the simplicity and power of SQL. Prior to this integration, some customers tried to construct independent Document AI pipelines, which involved manually curating extraction logic and schema. The lack of built-in integration capabilities left them to develop bespoke infrastructure to synchronize and maintain data consistency. This turned each document analytics project into a substantial undertaking that required significant investment. Now, with this integration, customers can create remote models in BigQuery for their custom extractors in Document AI, and use them to perform document analytics and generative AI at scale, unlocking a new era of data-driven insights and innovation.

A unified, governed data-to-AI experience

You can build a custom extractor in Document AI in three steps:

  1. Define the data you need to extract from your documents. This is called the document schema; it is stored with each version of the custom extractor and is accessible from BigQuery.
  2. Optionally, provide additional annotated documents as samples of the extraction.
  3. Train the model for the custom extractor, based on the foundation models provided in Document AI.

In addition to custom extractors that require manual training, Document AI also provides ready-to-use extractors for expenses, receipts, invoices, tax forms, government IDs, and a multitude of other scenarios, available in the processor gallery.

Then, once you have the custom extractor ready, you can move to BigQuery Studio to analyze the documents using SQL in the following four steps:

  1. Register a BigQuery remote model for the extractor using SQL. The model can understand the document schema (created above), invoke the custom extractor, and parse the results.
  2. Create object tables using SQL for the documents stored in Cloud Storage. You can govern the unstructured data in these tables by setting row-level access policies, which limit users' access to certain documents and thus restrict what the AI can process, for privacy and security.
  3. Use the ML.PROCESS_DOCUMENT function on the object table to extract relevant fields by making inference calls to the API endpoint. You can also filter the documents for extraction with a WHERE clause outside the function. The function returns a structured table in which each column is an extracted field.
  4. Join the extracted data with other BigQuery tables to combine structured and unstructured data, producing business value.

The following example illustrates the user experience:


  # Create an object table in BigQuery that maps to the document files stored in Cloud Storage.
  CREATE OR REPLACE EXTERNAL TABLE `my_dataset.receipt_table`
  WITH CONNECTION `my_project.us.my_connection`  # replace with your connection
  OPTIONS (
    object_metadata = 'SIMPLE',
    uris = ['gs://my_bucket/path/*'],
    metadata_cache_mode = 'AUTOMATIC',
    max_staleness = INTERVAL 1 HOUR);

  # Create a remote model to register your Doc AI processor in BigQuery.
  CREATE OR REPLACE MODEL `my_dataset.invoice_parser`
  REMOTE WITH CONNECTION `my_project.us.my_connection`
  OPTIONS (
    remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
    # replace with your processor's resource name
    document_processor = 'projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID');

  # Invoke the registered model over the object table to parse PDF expense receipts.
  SELECT uri, total_amount, invoice_date
  FROM ML.PROCESS_DOCUMENT(
    MODEL `my_dataset.invoice_parser`,
    TABLE `my_dataset.receipt_table`)
  WHERE content_type = 'application/pdf';

Table of results


Text analytics, summarization, and other document analysis use cases

Once you have extracted text from your documents, you can then perform document analytics in a few ways:

  • Use BigQuery ML to perform text analytics: BigQuery ML supports training and deploying text models in a variety of ways. For example, you can use BigQuery ML to identify customer sentiment in support calls, or to classify product feedback into different categories. If you are a Python user, you can also use BigQuery DataFrames, with its pandas-like and scikit-learn-like APIs, for text analysis on your data.
  • Use the PaLM 2 LLM to summarize documents: BigQuery's ML.GENERATE_TEXT function calls the PaLM 2 model to generate text, which can be used to summarize your documents. For example, you can use a Document AI custom extractor to extract customer feedback and then summarize that feedback with PaLM 2, all with BigQuery SQL.
  • Join document metadata with other structured data stored in BigQuery tables: This lets you combine structured and unstructured data for more powerful use cases. For example, you could identify high customer lifetime value (CLTV) customers with feedback captured from online reviews, or shortlist the most requested product features from customer feedback.
  # Example of document summarization using PaLM 2
  SELECT
    ml_generate_text_result['predictions'][0]['content'] AS generated_text,
    ml_generate_text_result['predictions'][0]['safetyAttributes'] AS safety_attributes,
    * EXCEPT (ml_generate_text_result)
  FROM ML.GENERATE_TEXT(
    MODEL `my_dataset.llm_model`,
    (SELECT CONCAT('Summarize the following text: ', customer_feedback) AS prompt
       FROM ML.PROCESS_DOCUMENT(
         MODEL `my_dataset.customer_feedback_extractor`,
         TABLE `my_dataset.customer_feedback_documents`)),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens));
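
The text-analytics path in the first bullet can be sketched in a similar way. The following is a minimal sketch, assuming a remote model has been registered over the Cloud Natural Language API; the model and table names here are hypothetical:

  # Hypothetical example: sentiment analysis with BigQuery ML's ML.UNDERSTAND_TEXT.
  # `my_dataset.nlp_model` is assumed to be a remote model created with
  # remote_service_type = 'CLOUD_AI_NATURAL_LANGUAGE_V1'; table and column
  # names are placeholders.
  SELECT
    ml_understand_text_result,
    * EXCEPT (ml_understand_text_result)
  FROM ML.UNDERSTAND_TEXT(
    MODEL `my_dataset.nlp_model`,
    (SELECT customer_feedback AS text_content
       FROM `my_dataset.feedback_table`),
    STRUCT('ANALYZE_SENTIMENT' AS nlu_option));

The same pattern as ML.GENERATE_TEXT applies: a remote model fronts the API, and the input query supplies the expected column (here, text_content).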

Implement search and generative AI use cases

Once you've extracted structured text from your documents, you can build indexes optimized for needle-in-a-haystack queries using BigQuery's search and indexing capabilities, unlocking powerful search. This integration also helps unlock new generative LLM applications, such as text-file processing for privacy filtering, content safety checks, and token chunking using SQL and custom Document AI models. The extracted text, combined with other metadata, simplifies curation of the training corpus required to fine-tune large language models. Moreover, you're building LLM use cases on governed, enterprise data that's grounded through BigQuery's embedding generation and vector index management capabilities. By synchronizing this index with Vertex AI, you can implement retrieval-augmented generation (RAG) use cases, for a more governed and streamlined AI experience.
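
The search step can be expressed directly in SQL. A minimal sketch, assuming the extracted fields have been materialized into a table named `my_dataset.extracted_feedback` (a hypothetical name):

  # Hypothetical example: build a search index over extracted document text.
  CREATE SEARCH INDEX feedback_index
  ON `my_dataset.extracted_feedback` (ALL COLUMNS);

  # Needle-in-a-haystack lookup: find documents mentioning a specific term.
  SELECT uri, customer_feedback
  FROM `my_dataset.extracted_feedback`
  WHERE SEARCH(customer_feedback, 'refund');

BigQuery maintains the index automatically as new rows land, so freshly processed documents become searchable without a separate pipeline.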

Sample application

For an example of an end-to-end application using the Document AI Connector:

  • Refer to this expense report demo on GitHub.
  • Read the companion blog post.
  • Watch a deep dive video from Google Cloud Next 2023.