Process documents with Layout Parser

Layout Parser extracts document content elements like text, tables, and lists, and creates context-aware chunks that facilitate information retrieval in generative AI and discovery applications.

Layout Parser features

  • Parse document layouts. You can input HTML or PDF files to Layout Parser to identify content elements such as text blocks, tables, and lists, and structural elements such as titles and headings. These elements define the organization and hierarchy of a document, which provides more context for information retrieval and discovery.

  • Chunk documents. Layout Parser can break documents up into chunks that retain contextual information about the layout hierarchy of the original document. Answer-generating LLMs can use chunks to improve relevance and decrease computational load.

    Taking a document's layout into account during chunking improves semantic coherence and reduces noise in the content when it's used for retrieval and LLM generation. All text in a chunk comes from the same layout entity, such as a heading, subheading, or list.
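To make the idea of layout-aware chunking concrete, the following is a minimal sketch and not the algorithm Layout Parser uses. It assumes a document already split into sections, approximates tokens by whitespace-separated words, and uses a hypothetical helper name (chunk_by_section). The only property it demonstrates is that a chunk never mixes text from two sections.

```python
from typing import List, Tuple


def chunk_by_section(
    sections: List[Tuple[str, List[str]]], max_tokens: int
) -> List[str]:
    """Split (heading, paragraphs) sections into chunks that never span sections.

    Tokens are approximated by whitespace-separated words (an assumption
    made for illustration only).
    """
    chunks: List[str] = []
    for heading, paragraphs in sections:
        current: List[str] = []
        used = 0
        for paragraph in paragraphs:
            size = len(paragraph.split())
            # Close the current chunk before it would exceed the limit.
            if current and used + size > max_tokens:
                chunks.append(heading + "\n" + " ".join(current))
                current, used = [], 0
            current.append(paragraph)
            used += size
        # Flush what is left at the end of the section.
        if current:
            chunks.append(heading + "\n" + " ".join(current))
    return chunks
```

Each chunk carries its own section heading, so a retriever sees where the text came from even when a long section is split into several chunks.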

Limitations

The following limitations apply:

  • Usage limit of 100 document files per project per day
  • Online processing:
    • Input file size maximum of 20 MB for all file types
    • Maximum of 15 pages for PDF files
  • Batch processing:
    • Maximum single file size of 40 MB for all file types
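The size and page limits above can be checked before a request is sent. The following is a sketch under stated assumptions: the function name and return values are hypothetical, 1 MB is taken to mean 1024 * 1024 bytes, and the page count is supplied by the caller. The daily per-project usage limit is not modeled.

```python
from typing import Optional

# Limits quoted in the list above. The byte value of "1 MB" is an assumption.
MB = 1024 * 1024
ONLINE_MAX_BYTES = 20 * MB
ONLINE_MAX_PDF_PAGES = 15
BATCH_MAX_BYTES = 40 * MB


def choose_processing_mode(
    size_bytes: int, mime_type: str, pdf_pages: Optional[int] = None
) -> str:
    """Return "online", "batch", or "too_large" for a single input file."""
    if size_bytes > BATCH_MAX_BYTES:
        return "too_large"
    fits_online = size_bytes <= ONLINE_MAX_BYTES
    # The page limit in the list above applies to PDF files only.
    if mime_type == "application/pdf" and pdf_pages is not None:
        fits_online = fits_online and pdf_pages <= ONLINE_MAX_PDF_PAGES
    return "online" if fits_online else "batch"
```

For example, a 5 MB PDF with 16 pages falls outside the online limits and would be routed to batch processing.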

Layout detection per file type

The following table lists the elements that Layout Parser can detect per document file type.

File type   Detected elements
HTML        paragraph, table, list, title, heading, page header, page footer
PDF         paragraph, table, title, heading, page header, page footer
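If downstream code depends on a particular element type, the table can be encoded as a lookup. This is only an illustrative sketch: the dictionary keys use the standard MIME types for HTML and PDF, and the helper name is hypothetical.

```python
# Element types per file type, copied from the table above.
DETECTED_ELEMENTS = {
    "text/html": {
        "paragraph", "table", "list", "title",
        "heading", "page header", "page footer",
    },
    "application/pdf": {
        "paragraph", "table", "title",
        "heading", "page header", "page footer",
    },
}


def is_detected(mime_type: str, element: str) -> bool:
    """Return True if the table lists `element` for the given file type."""
    return element in DETECTED_ELEMENTS.get(mime_type, set())
```

According to the table, the only difference between the two file types is that lists are detected in HTML but not in PDF.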

Before you begin

To turn on Layout Parser, follow these steps:

  1. Create a Layout Parser processor by following the instructions in Creating and managing processors.

    The processor type name is LAYOUT_PARSER_PROCESSOR.

  2. Enable Layout Parser by following the instructions in Enable a processor.

  3. Configure fields in ProcessOptions.layoutConfig in ProcessRequest.

REST

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor's location, for example:
    • us - United States
    • eu - European Union
  • PROJECT_ID: Your Google Cloud project ID.
  • PROCESSOR_ID: The ID of your processor.
  • MIME_TYPE: One of the valid MIME type options.
  • IMAGE_CONTENT: The inline document content, represented as a stream of bytes. For JSON representations, this is the base64 encoding (ASCII string) of your binary file data. The string should look similar to the following:
    • /9j/4QAYRXhpZgAA...9tAVx/zDQDlGxn//2Q==
    Visit the Base64 encode topic for more information.
  • CHUNK_SIZE: Optional. The chunk size, in tokens, to use when splitting documents.
  • INCLUDE_ANCESTOR_HEADINGS: Optional. Boolean. Whether or not to include ancestor headings when splitting documents.
  • BREAKPOINT_PERCENTILE_THRESHOLD: Optional. Integer. The percentile of cosine dissimilarity that must be exceeded between a group of tokens and the next. The smaller this number is, the more chunks will be generated.

HTTP method and URL:

POST https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process

Request JSON body:

{
  "rawDocument": {
    "mimeType": "MIME_TYPE",
    "content": "IMAGE_CONTENT"
  },
  "processOptions": {
    "layoutConfig": {
      "chunkingConfig": {
        "chunkSize": "CHUNK_SIZE",
        "includeAncestorHeadings": "INCLUDE_ANCESTOR_HEADINGS",
        "breakpointPercentileThreshold": "BREAKPOINT_PERCENTILE_THRESHOLD"
      }
    }
  }
}
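As a rough sketch, the request body above can also be assembled in Python instead of by hand. The helper names (build_process_request, process_url) are hypothetical. The sketch assumes that optional fields can be omitted when unset and that chunkSize and breakpointPercentileThreshold are sent as JSON numbers and includeAncestorHeadings as a JSON boolean.

```python
import base64
import json
from typing import Optional


def build_process_request(
    file_bytes: bytes,
    mime_type: str,
    chunk_size: Optional[int] = None,
    include_ancestor_headings: Optional[bool] = None,
    breakpoint_percentile_threshold: Optional[int] = None,
) -> dict:
    """Build the JSON body shown above, leaving out unset optional fields."""
    chunking = {}
    if chunk_size is not None:
        chunking["chunkSize"] = chunk_size
    if include_ancestor_headings is not None:
        chunking["includeAncestorHeadings"] = include_ancestor_headings
    if breakpoint_percentile_threshold is not None:
        chunking["breakpointPercentileThreshold"] = breakpoint_percentile_threshold
    return {
        "rawDocument": {
            "mimeType": mime_type,
            # JSON cannot carry raw bytes, so the content is base64 text.
            "content": base64.b64encode(file_bytes).decode("ascii"),
        },
        "processOptions": {"layoutConfig": {"chunkingConfig": chunking}},
    }


def process_url(location: str, project_id: str, processor_id: str) -> str:
    """Fill in the URL template shown above."""
    return (
        f"https://{location}-documentai.googleapis.com/v1beta3/"
        f"projects/{project_id}/locations/{location}/"
        f"processors/{processor_id}:process"
    )
```

Passing the result to json.dumps and saving it as request.json yields a file that can be used with the commands that follow.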

To send your request, choose one of these options:

curl

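Save the request body in a file named request.json, and execute the following command. The command assumes that the gcloud CLI is installed and authenticated, and passes an access token in the Authorization header:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"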

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process" | Select-Object -Expand Content

The response includes the processed document with layout and chunking information as Document.documentLayout and Document.chunkedDocument.
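A parsed JSON response can be walked with plain dictionary access. The sketch below is illustrative: the top-level "document" key and the "textBlock", "text", "type", nested "blocks", and "content" field names are assumptions about the Document schema that should be checked against the API reference. SAMPLE_RESPONSE is a hand-written stand-in, not real service output.

```python
from typing import Iterator, List, Tuple

# Hand-written stand-in for a response; field names are assumptions.
SAMPLE_RESPONSE = {
    "document": {
        "documentLayout": {
            "blocks": [
                {
                    "blockId": "1",
                    "textBlock": {
                        "text": "Intro",
                        "type": "heading-1",
                        "blocks": [
                            {
                                "blockId": "2",
                                "textBlock": {"text": "Hello.", "type": "paragraph"},
                            }
                        ],
                    },
                }
            ]
        },
        "chunkedDocument": {"chunks": [{"chunkId": "c1", "content": "Intro Hello."}]},
    }
}


def iter_text_blocks(blocks: list, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, type, text) for every text block, outermost first."""
    for block in blocks:
        text_block = block.get("textBlock")
        if text_block is None:
            continue  # table and list blocks are skipped in this sketch
        yield depth, text_block.get("type", ""), text_block.get("text", "")
        yield from iter_text_blocks(text_block.get("blocks", []), depth + 1)


def chunk_contents(response: dict) -> List[str]:
    """Return the text content of every chunk in the response."""
    document = response.get("document", {})
    chunks = document.get("chunkedDocument", {}).get("chunks", [])
    return [chunk.get("content", "") for chunk in chunks]
```

The depth value reflects how deeply a block is nested, which is one way to recover the heading hierarchy from the layout.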

Python

For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1beta3 as documentai


# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types

def process_document_layout_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    process_options = documentai.ProcessOptions(
        layout_config=documentai.ProcessOptions.LayoutConfig(
            chunking_config=documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                chunk_size=1000,
                include_ancestor_headings=True,
                breakpoint_percentile_threshold=90,
            )
        )
    )

    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    print("Document Layout Blocks")
    for block in document.document_layout.blocks:
        print(block)

    print("Document Chunks")
    for chunk in document.chunked_document.chunks:
        print(chunk)

def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Layout Parser settings, such as the chunking config, are passed here
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

Batch process documents with Layout Parser

Use the following procedure to parse and chunk multiple documents in a single request.

  1. Input documents to Layout Parser to parse and chunk.

    To process documents, use API version v1beta3 when following the instructions for batch processing requests in Send a processing request.

    Configure fields in ProcessOptions.layoutConfig when making a batchProcess request.

    Input

    The following example JSON configures ProcessOptions.layoutConfig.

    "processOptions": {
      "layoutConfig": {
        "chunkingConfig": {
          "chunkSize": "CHUNK_SIZE",
          "includeAncestorHeadings": "INCLUDE_ANCESTOR_HEADINGS_BOOLEAN"
        }
      }
    }
    

    Replace the following:

    • CHUNK_SIZE: The maximum chunk size, in number of tokens, to use when splitting documents.
    • INCLUDE_ANCESTOR_HEADINGS_BOOLEAN: Whether to include ancestor headings when splitting documents. Ancestor headings are the parents of subheadings in the original document. They can provide a chunk with additional context about its position in the original document. Up to two levels of headings can be included with a chunk.
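The effect of including ancestor headings can be illustrated with a small sketch. This is not the service's implementation: the function name is hypothetical, and it assumes that the two headings kept are the ones nearest to the chunk and that they are joined to the chunk text with newlines.

```python
from typing import List

# "Up to two levels of headings can be included with a chunk."
MAX_ANCESTOR_LEVELS = 2


def with_ancestor_headings(
    chunk_text: str, heading_path: List[str], include: bool
) -> str:
    """Prefix a chunk with its nearest ancestor headings.

    `heading_path` is ordered from the outermost heading to the innermost.
    Keeping the nearest headings is an assumption made for illustration.
    """
    if not include or not heading_path:
        return chunk_text
    nearest = heading_path[-MAX_ANCESTOR_LEVELS:]
    return "\n".join(nearest + [chunk_text])
```

With the heading path ["Guide", "Setup", "Linux"], a chunk containing "Body" becomes "Setup", "Linux", and "Body" on separate lines, which gives a retriever more context about where the chunk sits in the original document.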

What's next