LlamaIndex on Vertex AI for RAG overview

LlamaIndex is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data; this is how retrieval-augmented generation (RAG) is implemented.

A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With LlamaIndex on Vertex AI for RAG, you can enrich the LLM context with additional private information, so that the model can reduce hallucination and answer questions more accurately.

Combining additional knowledge sources with the knowledge that LLMs already have provides a better context. The improved context, together with the query, enhances the quality of the LLM's response.

The following concepts are key to understanding LlamaIndex on Vertex AI. These concepts are listed in the order of the retrieval-augmented generation (RAG) process.

  1. Data ingestion: Intake data from different data sources. For example, local files, Cloud Storage, and Google Drive.

  2. Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.

  3. Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.

  4. Data indexing: LlamaIndex on Vertex AI for RAG creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.

  5. Retrieval: When a user asks a question or provides a prompt, the retrieval component in LlamaIndex on Vertex AI for RAG searches through its knowledge base to find information that is relevant to the query.

  6. Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
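The embedding, retrieval, and generation steps above can be sketched in a few lines of plain Python. This is a toy illustration only, not how the service works internally: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, the two passages are invented sample data, and `build_prompt` only assembles the augmented prompt that would be sent to a model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Indexing: store each (already chunked) passage next to its embedding.
chunks = [
    "the corpus stores indexed chunks of your documents",
    "bananas are rich in potassium",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str) -> str:
    """Retrieval: return the indexed chunk most similar to the query."""
    query_vec = embed(query)
    best_chunk, _ = max(index, key=lambda item: cosine_similarity(query_vec, item[1]))
    return best_chunk


def build_prompt(query: str) -> str:
    """Generation input: prepend the retrieved context to the user query."""
    return f"Context:\n{retrieve(query)}\n\nQuestion: {query}"


prompt = build_prompt("what does the corpus store")
```

Here the query shares words with the first passage only, so that passage becomes the context in `prompt`. A real embedding model would also match text that is related in meaning without sharing exact words.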

This page helps you get started with LlamaIndex on Vertex AI for RAG and provides Python samples that demonstrate how to use the RAG API.

For information about the file size limits, see Supported document types. For information about quotas related to LlamaIndex on Vertex AI for RAG, see LlamaIndex on Vertex AI for RAG quotas. For information about customizing parameters, see Retrieval parameters.

Run LlamaIndex on Vertex AI for RAG using the Vertex AI SDK

To use LlamaIndex on Vertex AI for RAG, do the following:

  1. Install the Vertex AI SDK for Python.

  2. Run this command in the Google Cloud console to set up your project.

    gcloud config set project {project}

  3. Run this command to authorize your login.

    gcloud auth application-default login

  4. Copy and paste this sample code into the Google Cloud console to run LlamaIndex on Vertex AI.


To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool
import vertexai

# Create a RAG Corpus, Import Files, and Generate a response

# TODO(developer): Update and un-comment below lines
# project_id = "PROJECT_ID"
# display_name = "test_corpus"
# paths = ["https://drive.google.com/file/d/123", "gs://my_bucket/my_files_dir"]  # Supports Google Cloud Storage and Google Drive Links

# Initialize Vertex AI API once per session
vertexai.init(project=project_id, location="us-central1")

# Create RagCorpus
# Configure embedding model, for example "text-embedding-004".
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"
)

rag_corpus = rag.create_corpus(
    display_name=display_name,
    embedding_model_config=embedding_model_config,
)

# Import Files to the RagCorpus
response = rag.import_files(
    rag_corpus.name,
    paths,
    chunk_size=512,  # Optional
    chunk_overlap=100,  # Optional
    max_embedding_requests_per_min=900,  # Optional
)

# Direct context retrieval
response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus=rag_corpus.name,
            # Supply IDs from `rag.list_files()`.
            # rag_file_ids=["rag-file-1", "rag-file-2", ...],
        )
    ],
    text="What is RAG and why it is helpful?",
    similarity_top_k=10,  # Optional
    vector_distance_threshold=0.5,  # Optional
)
print(response)

# Enhance generation
# Create a RAG retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[
                rag.RagResource(
                    rag_corpus=rag_corpus.name,  # Currently only 1 corpus is allowed.
                    # Supply IDs from `rag.list_files()`.
                    # rag_file_ids=["rag-file-1", "rag-file-2", ...],
                )
            ],
            similarity_top_k=3,  # Optional
            vector_distance_threshold=0.5,  # Optional
        ),
    )
)
# Create a Gemini model instance that uses the retrieval tool
rag_model = GenerativeModel(
    model_name="gemini-1.5-flash-001", tools=[rag_retrieval_tool]
)

# Generate response
response = rag_model.generate_content("What is RAG and why it is helpful?")
print(response.text)

Supported generation models

The following models and versions support LlamaIndex on Vertex AI for RAG:

Model                  Version
Gemini 1.5 Flash       gemini-1.5-flash-001
Gemini 1.5 Pro         gemini-1.5-pro-001
Gemini 1.0 Pro         gemini-1.0-pro-001
Gemini 1.0 Pro Vision  gemini-1.0-pro-vision-001
Gemini                 gemini-experimental

Supported embedding models

The following Google model versions are supported:

  • textembedding-gecko@003
  • textembedding-gecko-multilingual@001
  • text-embedding-004
  • text-multilingual-embedding-002

Fine-tuned versions of the following Google models are supported:

  • textembedding-gecko@003
  • textembedding-gecko-multilingual@001
  • text-embedding-004
  • text-multilingual-embedding-002
  • textembedding-gecko@002
  • textembedding-gecko@001

If no embedding model configuration is specified, the RagCorpus uses text-embedding-004 by default. For more information about tuning embedding models, see Tune text embeddings.

Supported document types

Only text documents are supported. The following table lists the supported file types and their file size limits:

File type                                File size limit
Google documents                         10 MB when exported from Google Workspace
Google drawings                          10 MB when exported from Google Workspace
Google slides                            10 MB when exported from Google Workspace
HTML file                                10 MB
JSON file                                1 MB
Markdown file                            10 MB
Microsoft PowerPoint slides (PPTX file)  10 MB
Microsoft Word documents (DOCX file)     10 MB
PDF file                                 50 MB
Text file                                10 MB

Using LlamaIndex on Vertex AI for RAG with other document types is possible but can generate lower-quality responses.
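Before importing, it can help to check local files against the limits in the table. The following is a minimal sketch, not part of the RAG API: the `within_size_limit` helper is hypothetical, the mapping from file extensions to the listed file types is an assumption (the table names only PPTX and DOCX explicitly), and 1 MB is assumed to be 1,048,576 bytes. Google Workspace types are left out because their limits apply to the exported file.

```python
import os

MB = 1024 * 1024  # Assumption: the page does not say whether MB means 10^6 or 2^20 bytes.

# Limits from the table above; the extensions are an assumed mapping.
SIZE_LIMITS_MB = {
    ".html": 10,
    ".json": 1,
    ".md": 10,
    ".pptx": 10,
    ".docx": 10,
    ".pdf": 50,
    ".txt": 10,
}


def within_size_limit(filename: str, size_bytes: int):
    """Return True/False for listed file types, or None if the type is not listed."""
    extension = os.path.splitext(filename)[1].lower()
    limit_mb = SIZE_LIMITS_MB.get(extension)
    if limit_mb is None:
        return None  # Not in the table: no documented limit to check against.
    return size_bytes <= limit_mb * MB
```

For a file on disk, the size can be obtained with `os.path.getsize(path)` and passed as `size_bytes`.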

Supported data sources

The following three data sources are supported:

  • A single-file upload using upload_file (up to 25 MB), which is a synchronous call.

  • One or more files imported from Cloud Storage.

  • A directory imported from Google Drive.

    The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message is displayed. For more information on file size limits, see Supported document types.

    To authenticate and grant permissions, do the following:

    1. Go to the IAM page of your Google Cloud project.
    2. Select Include Google-provided role grant.
    3. Search for the Vertex AI RAG Data Service Agent service account.
    4. Click Share on the drive folder, and share with the service account.
    5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.

For more information, see the RAG API reference.
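As a rough illustration of how an application might tell the three data sources apart before deciding between an upload and an import, the sketch below inspects a path string. The `classify_source` helper and its return labels are hypothetical and not part of the RAG API; the `gs://` scheme and the `drive.google.com` host are taken from the sample paths shown earlier on this page.

```python
from urllib.parse import urlparse


def classify_source(path: str) -> str:
    """Guess which supported data source a path refers to (hypothetical helper)."""
    parsed = urlparse(path)
    if parsed.scheme == "gs":
        return "cloud_storage"
    if parsed.scheme in ("http", "https") and parsed.netloc == "drive.google.com":
        return "google_drive"
    # Anything else is treated as a local file for a single-file upload.
    return "local_upload"


sources = [
    classify_source(p)
    for p in ["https://drive.google.com/file/d/123", "gs://my_bucket/my_files_dir", "notes.txt"]
]
```

Paths classified as local uploads are subject to the 25 MB single-file limit listed above.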

Supported data transformations

After a document is ingested, LlamaIndex on Vertex AI for RAG runs a set of transformations to prepare the data for the best retrieval quality. Developers can control some of these transformations through parameters that they tune for their use cases.

These parameters include the following:

Parameter      Description
chunk_size     When documents are ingested into an index, they are split into chunks. The chunk_size parameter (in tokens) specifies the size of the chunk. The default chunk size is 1,024 tokens.
chunk_overlap  By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens.

A smaller chunk size means the embeddings are more precise. A larger chunk size means that the embeddings might be more general but can miss specific details.

For example, if you convert 1,000 words rather than 200 words into an embedding array of the same dimension, you can lose details. Chunk size also matters when you consider the model context length limit: a large chunk might not fit into a small-window model.
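To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping chunking. It is an illustration only, not the service's actual splitter: the `split_into_chunks` helper is hypothetical, and it operates on a list of already tokenized items because the tokenizer used by the service is not described here. The defaults mirror the documented defaults of 1,024 and 200 tokens.

```python
def split_into_chunks(tokens, chunk_size=1024, chunk_overlap=200):
    """Split a token list into chunks that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time.
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reached the end of the input.
    return chunks


# Whitespace-separated words stand in for tokens in this example.
words = "a b c d e".split()
small_chunks = split_into_chunks(words, chunk_size=3, chunk_overlap=1)
```

With these inputs, each chunk starts two words after the previous one, so neighboring chunks share one word. A larger overlap keeps more shared context between neighbors at the cost of producing more chunks to embed.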

Retrieval parameters

The following table includes the retrieval parameters:

Parameter                  Description
similarity_top_k           Controls the maximum number of contexts that are retrieved.
vector_distance_threshold  Only contexts with a distance smaller than the threshold are considered.
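The two parameters can be read as a filter followed by a cutoff. The sketch below shows that reading on a list of (context, distance) pairs. It is an assumption about how the parameters combine, not a description of the service's implementation; the `select_contexts` helper and the sample distances are made up for illustration.

```python
def select_contexts(scored_contexts, similarity_top_k=10, vector_distance_threshold=0.5):
    """Keep contexts closer than the threshold, then return at most top_k of them."""
    # Only contexts with a distance strictly smaller than the threshold are considered.
    candidates = [
        (text, distance)
        for text, distance in scored_contexts
        if distance < vector_distance_threshold
    ]
    # A smaller distance means a closer match, so sort in ascending order.
    candidates.sort(key=lambda pair: pair[1])
    return candidates[:similarity_top_k]


scored = [("chunk A", 0.42), ("chunk B", 0.10), ("chunk C", 0.75), ("chunk D", 0.30)]
selected = select_contexts(scored, similarity_top_k=2, vector_distance_threshold=0.5)
```

In this example, "chunk C" is dropped by the threshold, and of the remaining three only the two closest are returned. Raising similarity_top_k widens the context passed to the model, while lowering vector_distance_threshold makes retrieval stricter.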

Index management

A corpus is a collection of documents or a source of information. That collection is also referred to as an index. The index can then be queried to retrieve relevant contexts for LLM generation. When you create an index for the first time, the process might take an additional minute. Subsequent index creations within the same Google Cloud project take less time.

The following index operations are supported:

Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.

File management

The following file operations are supported:

For more information, see the RAG API reference.

What's next