Use advanced PDF parsing with LlamaIndex on Vertex AI for RAG

This page shows you how to use the advanced PDF parser with LlamaIndex on Vertex AI for RAG.

LlamaIndex on Vertex AI for RAG implements retrieval-augmented generation (RAG) for various file formats, including PDFs. Parsers extract information from your files so that LlamaIndex on Vertex AI for RAG can ground responses to your prompts. Each supported file format has one or more parsers that can read it. For more information about supported file formats, see Supported document types.

For PDFs, two parsers are available: a basic PDF parser, which is the default for PDF files, and an advanced PDF parser. The basic PDF parser extracts text from a native PDF in the order in which the text is stored in the document. Depending on how the PDF was constructed, that order might differ from the visual order of the page. Native PDFs might also contain other elements, such as images, which the basic PDF parser ignores.

The advanced PDF parser supports both native and scanned PDFs. It analyzes the layout of the document and extracts text according to the logical flow of the document. It also yields higher-quality results than the basic PDF parser, including a substantial improvement in table parsing.

Examples of how to enable advanced parsing

The ImportRagFiles API supports advanced PDF parsing for both native and scanned PDFs. The following samples show how to enable advanced parsing using REST in a curl command and using the Vertex AI SDK for Python.

To use basic PDF parsing, which is the default, omit the use_advanced_pdf_parsing option.

REST

To enable advanced PDF parsing using REST, set the use_advanced_pdf_parsing option to true in the rag_file_parsing_config object of your request.

# JSON does not allow comments. Replace the "// ..." placeholder line with
# your existing import options before you run this command.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles:import" \
-d '{
  "import_rag_files_config": {
    // ... Existing options for import files here.
    "rag_file_parsing_config": {
      "use_advanced_pdf_parsing": true
    }
  }
}'
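
A JSON request body cannot contain comments or unquoted keys, so it can help to generate the body programmatically instead of writing it by hand. The following is a minimal sketch, not an official client: build_import_request is a hypothetical helper, existing_options stands in for whatever import options you already use, and the endpoint, project, and corpus values in the usage section are placeholders.

```python
import json
from typing import Any, Optional


def build_import_request(
    endpoint: str,
    project_id: str,
    location: str,
    rag_corpus_id: str,
    existing_options: Optional[dict[str, Any]] = None,
    use_advanced_pdf_parsing: bool = True,
) -> tuple[str, str]:
    """Return the ragFiles:import URL and a valid JSON request body.

    Hypothetical helper: it only assembles strings and sends nothing.
    """
    url = (
        f"https://{endpoint}/v1beta1/projects/{project_id}"
        f"/locations/{location}/ragCorpora/{rag_corpus_id}/ragFiles:import"
    )
    # Copy the caller's options so the input dict is not modified.
    config = dict(existing_options or {})
    if use_advanced_pdf_parsing:
        # Overwrites any rag_file_parsing_config passed in existing_options.
        config["rag_file_parsing_config"] = {"use_advanced_pdf_parsing": True}
    body = json.dumps({"import_rag_files_config": config}, indent=2)
    return url, body


if __name__ == "__main__":
    # Placeholder values; replace them with your own.
    url, body = build_import_request(
        endpoint="ENDPOINT",
        project_id="PROJECT_ID",
        location="LOCATION",
        rag_corpus_id="RAG_CORPUS_ID",
    )
    print(url)
    print(body)
```

You could write the printed body to a file and pass it to curl with -d @FILE, which avoids quoting issues in the shell.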

Python

To enable advanced PDF parsing using the SDK, set the use_advanced_pdf_parsing option to True.

from vertexai.preview import rag

response = rag.import_files(
    # ... Existing options for import files here.
    use_advanced_pdf_parsing=True,  # Enables the advanced PDF parser.
)
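
Because basic parsing is the default, one way to switch between the two parsers is to add the option only when advanced parsing is requested. The sketch below illustrates that pattern under stated assumptions: build_import_kwargs is a hypothetical helper, the corpus_name and paths arguments are assumed to be the existing options you already pass to rag.import_files, and the corpus name and file path are placeholders.

```python
from typing import Any


def build_import_kwargs(
    corpus_name: str,
    paths: list[str],
    use_advanced_pdf_parsing: bool = False,
) -> dict[str, Any]:
    """Assemble keyword arguments for rag.import_files (hypothetical helper)."""
    kwargs: dict[str, Any] = {"corpus_name": corpus_name, "paths": paths}
    # Basic parsing is the default, so the option is added only when
    # advanced parsing is requested.
    if use_advanced_pdf_parsing:
        kwargs["use_advanced_pdf_parsing"] = True
    return kwargs


if __name__ == "__main__":
    # Placeholder values; replace them with your own corpus and files.
    kwargs = build_import_kwargs(
        corpus_name="projects/PROJECT_ID/locations/LOCATION/ragCorpora/RAG_CORPUS_ID",
        paths=["gs://BUCKET/FILE.pdf"],
        use_advanced_pdf_parsing=True,
    )
    print(kwargs)
    # With the SDK installed and initialized, the call would then be:
    #   from vertexai.preview import rag
    #   response = rag.import_files(**kwargs)
```

Keeping the argument assembly separate from the API call makes it possible to check the chosen parser setting without contacting the service.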

What's next