LlamaIndex on Vertex AI for RAG overview

LlamaIndex on Vertex AI for RAG is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data. This approach implements retrieval-augmented generation (RAG).

A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With LlamaIndex on Vertex AI for RAG, you can enrich the LLM context with additional private information, so that the model can reduce hallucination and answer questions more accurately.

Combining additional knowledge sources with the LLM's existing knowledge produces a better context. The improved context, along with the query, enhances the quality of the LLM's response.

The following concepts are key to understanding LlamaIndex on Vertex AI for RAG. These concepts are listed in the order of the retrieval-augmented generation (RAG) process, and a code sketch of the end-to-end flow follows the list.

  1. Data ingestion: Intake of data from different data sources, such as local files, Cloud Storage, and Google Drive.

  2. Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.

  3. Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.

  4. Data indexing: LlamaIndex on Vertex AI for RAG creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.

  5. Retrieval: When a user asks a question or provides a prompt, the retrieval component in LlamaIndex on Vertex AI for RAG searches through its knowledge base to find information that is relevant to the query.

  6. Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
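
The following minimal sketch walks through these steps end to end, assuming the preview vertexai.preview.rag module of the Vertex AI SDK for Python; PROJECT_ID, REGION, the Cloud Storage path, and the query text are placeholders.

  import vertexai
  from vertexai.preview import rag

  # Placeholder project and region
  vertexai.init(project="PROJECT_ID", location="REGION")

  # Data indexing: create a corpus to hold the knowledge base
  rag_corpus = rag.create_corpus(display_name="my-corpus")

  # Data ingestion and transformation: import files from Cloud Storage;
  # each file is split into chunks, embedded, and indexed
  rag.import_files(rag_corpus.name, ["gs://BUCKET_NAME/my-docs/"])

  # Retrieval: fetch contexts that are relevant to a query
  response = rag.retrieval_query(
      rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
      text="your-query",
      similarity_top_k=10,
  )
  print(response)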

Generative AI models that support RAG

This section lists the Google models and open models that support LlamaIndex on Vertex AI for RAG.

Gemini models

The following Gemini models and their versions support LlamaIndex on Vertex AI for RAG:

  • Gemini 1.5 Flash: gemini-1.5-flash-002, gemini-1.5-flash-001
  • Gemini 1.5 Pro: gemini-1.5-pro-002, gemini-1.5-pro-001
  • Gemini 1.0 Pro: gemini-1.0-pro-001, gemini-1.0-pro-002
  • Gemini 1.0 Pro Vision: gemini-1.0-pro-vision-001
  • Gemini: gemini-experimental
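
As a sketch of how one of these models is grounded with a corpus, assuming the preview vertexai.preview.rag module and an existing corpus (the corpus resource name is a placeholder), you can wrap the corpus in a retrieval tool and pass the tool to the model. The open model samples later in this section reuse the same rag_retrieval_tool.

  from vertexai.preview import rag
  from vertexai.preview.generative_models import GenerativeModel, Tool

  # Wrap an existing corpus (placeholder resource name) in a retrieval tool
  rag_retrieval_tool = Tool.from_retrieval(
      retrieval=rag.Retrieval(
          source=rag.VertexRagStore(
              rag_resources=[rag.RagResource(
                  rag_corpus="projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID"
              )],
              similarity_top_k=10,
          ),
      )
  )

  # Ground a Gemini model with the retrieval tool
  rag_model = GenerativeModel("gemini-1.5-flash-002", tools=[rag_retrieval_tool])
  response = rag_model.generate_content("your-query")
  print(response.text)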

Open models

The Google-operated Llama 3.1 model-as-a-service (MaaS) endpoint and your self-deployed open model endpoints support LlamaIndex on Vertex AI for RAG.

The following code sample demonstrates how to use the Gemini GenerateContent API to create an open model instance.

  from vertexai.preview.generative_models import GenerativeModel

  # rag_retrieval_tool is the retrieval tool created from your corpus,
  # as shown in the earlier Gemini example

  # Create a model instance with the Llama 3.1 MaaS endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/publishers/meta/models/llama3-405b-instruct-maas",
      tools=[rag_retrieval_tool]
  )

  # Create a model instance with your self-deployed open model endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID",
      tools=[rag_retrieval_tool]
  )

The following code sample demonstrates how to use the OpenAI-compatible ChatCompletions API to generate a model response.
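
The sample assumes a client object that points at the Vertex AI OpenAI-compatible endpoint. A minimal sketch of that setup, assuming the openai and google-auth packages (PROJECT_ID and REGION are placeholders):

  import openai
  from google.auth import default
  from google.auth.transport.requests import Request

  # Use an OAuth access token in place of an API key
  credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
  credentials.refresh(Request())

  # Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint
  client = openai.OpenAI(
      base_url="https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/endpoints/openapi",
      api_key=credentials.token,
  )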

  # Generate a response with the Llama 3.1 MaaS endpoint.
  # rag_corpus_resource is the full corpus resource name, for example
  # "projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID".
  response = client.chat.completions.create(
      model="meta/llama3-405b-instruct-maas",
      messages=[{"role": "user", "content": "your-query"}],
      # The OpenAI SDK merges extra_body into the request JSON, and the
      # Vertex AI endpoint expects a top-level extra_body field, so the
      # nesting below is intentional.
      extra_body={
          "extra_body": {
              "google": {
                  "vertex_rag_store": {
                      "rag_resources": {
                          "rag_corpus": rag_corpus_resource
                      },
                      "similarity_top_k": 10
                  }
              }
          }
      },
  )

Embedding models

Embedding models are used when you create a corpus and during search and retrieval at response generation time. This section lists the supported embedding models.

  • textembedding-gecko@003
  • textembedding-gecko-multilingual@001
  • text-embedding-004 (default)
  • text-multilingual-embedding-002
  • textembedding-gecko@002 (fine-tuned versions only)
  • textembedding-gecko@001 (fine-tuned versions only)

For more information about tuning embedding models, see Tune text embeddings.

The following open embedding models are also supported. You can find them in Model Garden.

  • e5-base-v2
  • e5-large-v2
  • e5-small-v2
  • multilingual-e5-large
  • multilingual-e5-small
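
As a sketch of how an embedding model is selected at corpus creation, assuming the preview vertexai.preview.rag module (the display name is illustrative):

  from vertexai.preview import rag

  # Choose a supported embedding model for the corpus
  embedding_model_config = rag.EmbeddingModelConfig(
      publisher_model="publishers/google/models/text-embedding-004"
  )

  # The corpus uses this model both for indexing and for retrieval
  rag_corpus = rag.create_corpus(
      display_name="my-corpus",
      embedding_model_config=embedding_model_config,
  )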

Document types supported for RAG

Only text documents are supported. The following file types and file size limits apply:

  • Google documents: 10 MB when exported from Google Workspace
  • Google drawings: 10 MB when exported from Google Workspace
  • Google slides: 10 MB when exported from Google Workspace
  • HTML file: 10 MB
  • JSON file: 1 MB
  • Markdown file: 10 MB
  • Microsoft PowerPoint slides (PPTX file): 10 MB
  • Microsoft Word documents (DOCX file): 50 MB
  • PDF file: 50 MB
  • Text file: 10 MB

Using LlamaIndex on Vertex AI for RAG with other document types is possible but can generate lower-quality responses.

Data sources supported for RAG

The following data sources are supported:

  • Upload a local file: A single-file upload using upload_file (up to 25 MB), which is a synchronous call. See the sketch after this list.
  • Cloud Storage: Import files from Cloud Storage.
  • Google Drive: Import a directory from Google Drive.

    The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message is displayed. For more information about file size limits, see Document types supported for RAG.

    To authenticate and grant permissions, do the following:

    1. Go to the IAM page of your Google Cloud project.
    2. Select Include Google-provided role grants.
    3. Search for the Vertex AI RAG Data Service Agent service account.
    4. Click Share on the drive folder, and share with the service account.
    5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.
  • Slack: Import files from Slack by using a data connector.

  • Jira: Import files from Jira by using a data connector.
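
A minimal sketch of ingesting from a local file and from Cloud Storage or Google Drive, assuming the preview vertexai.preview.rag module (resource names and paths are placeholders):

  from vertexai.preview import rag

  corpus_name = "projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID"

  # Local file: synchronous single-file upload (up to 25 MB)
  rag.upload_file(
      corpus_name=corpus_name,
      path="/path/to/my_doc.pdf",
      display_name="my_doc.pdf",
  )

  # Cloud Storage or Google Drive: bulk import
  rag.import_files(
      corpus_name,
      [
          "gs://BUCKET_NAME/my-docs/",
          "https://drive.google.com/drive/folders/FOLDER_ID",
      ],
  )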

For more information, see the RAG API reference.

Fine-tune your RAG transformations

After a document is ingested, LlamaIndex on Vertex AI for RAG runs a set of transformations to prepare the data for indexing. You can tune the transformations for your use cases by using the following parameters:

  • chunk_size: When documents are ingested into an index, they are split into chunks. The chunk_size parameter specifies the size of each chunk in tokens. The default chunk size is 1,024 tokens.
  • chunk_overlap: By default, consecutive chunks overlap by a certain number of tokens to improve relevance and retrieval quality. The default chunk overlap is 200 tokens.

A smaller chunk size means the embeddings are more precise. A larger chunk size means that the embeddings might be more general but can miss specific details.

For example, if you convert 200 words as opposed to 1,000 words into an embedding array of the same dimension, you can lose details. Chunk size also interacts with the model's context length limit: a large chunk of text might not fit into a model with a small context window.
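
As a sketch of how these parameters are passed at import time, assuming the preview vertexai.preview.rag module (the resource name, path, and values are illustrative):

  from vertexai.preview import rag

  # Smaller chunks with a modest overlap: more precise embeddings,
  # at the cost of less surrounding context per chunk
  rag.import_files(
      "projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID",
      ["gs://BUCKET_NAME/my-docs/"],
      chunk_size=512,
      chunk_overlap=100,
  )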

Retrieval parameters

The following retrieval parameters are supported:

  • similarity_top_k: Controls the maximum number of contexts that are retrieved.
  • vector_distance_threshold: Only contexts with a distance smaller than the threshold are considered.
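
A sketch of these parameters in a standalone retrieval query, assuming the preview vertexai.preview.rag module (the corpus name and threshold are placeholders):

  from vertexai.preview import rag

  response = rag.retrieval_query(
      rag_resources=[rag.RagResource(
          rag_corpus="projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID"
      )],
      text="your-query",
      similarity_top_k=10,            # return at most 10 contexts
      vector_distance_threshold=0.5,  # discard contexts farther than 0.5
  )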

Manage your RAG knowledge base (corpus)

This section describes how you can manage your corpus for RAG tasks by performing index management and file management.

Corpus management

A corpus, also referred to as an index, is a collection of documents or a source of information. The corpus can be queried to retrieve relevant contexts for response generation. When you create a corpus for the first time, the process might take an additional minute.

When creating a corpus, you can specify a display name and a description. Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
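
A sketch of common corpus operations, assuming the preview vertexai.preview.rag module (the display name and description are illustrative):

  from vertexai.preview import rag

  # Create, list, get, and delete corpora
  rag_corpus = rag.create_corpus(display_name="my-corpus", description="My documents")
  corpora = rag.list_corpora()
  rag_corpus = rag.get_corpus(name=rag_corpus.name)
  rag.delete_corpus(name=rag_corpus.name)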

File management

The following file operations are supported: uploading a file, importing files, getting a file, listing files, and deleting a file.

For more information, see the RAG API reference.
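
A sketch of the list, get, and delete file operations, assuming the preview vertexai.preview.rag module (resource names are placeholders); upload and import are shown in the data sources section:

  from vertexai.preview import rag

  corpus_name = "projects/PROJECT_ID/locations/REGION/ragCorpora/RAG_CORPUS_ID"
  file_name = corpus_name + "/ragFiles/RAG_FILE_ID"

  # List the files in a corpus, get one file, then delete it
  files = rag.list_files(corpus_name=corpus_name)
  rag_file = rag.get_file(name=file_name)
  rag.delete_file(name=file_name)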

What's next