RAG infrastructure for generative AI using Google Agentspace and Vertex AI

Last reviewed 2025-09-11 UTC

This document provides a reference architecture that you can use to design the infrastructure for a generative AI application with retrieval-augmented generation (RAG) by using Google Agentspace and Vertex AI. This reference architecture demonstrates how to use managed services and deploy a single AI agent to facilitate an end-to-end RAG dataflow. Google Agentspace serves as the unified platform for agent orchestration across the enterprise. Vertex AI accelerates the development and deployment of custom agents and provides managed datastores to facilitate efficient retrieval for RAG.

The intended audience for this document includes architects, developers, and administrators of generative AI applications. The document assumes that you have a basic understanding of AI, machine learning (ML), and large language model (LLM) concepts. This document doesn't provide guidance about how to design and develop a generative AI application. For information about how to design an application, see Develop a generative AI application.

Architecture

The following diagram shows a high-level view of the architecture that this document presents:

A high-level view of the data ingestion and serving flows in the architecture.

The architecture in the preceding diagram has two subsystems: data ingestion and serving.

  • The data ingestion subsystem ingests and prepares data from external sources for use in RAG. The subsystem generates embeddings for the ingested data and uses them to build and maintain a searchable vector index in a managed datastore.
  • The serving subsystem contains the generative AI application's frontend and backend services.
    • The frontend service handles the query-response flow with application users and forwards queries to the backend service.
    • The backend service uses Google Agentspace and Vertex AI to build and deploy your AI agent to orchestrate the RAG process. This process uses the indexed vector data to generate responses that are contextually grounded and adhere to Responsible AI safety filters.

The following diagram shows a detailed view of the architecture:

A detailed view of the data ingestion and serving flows in the architecture.

The following sections describe the data flow within each subsystem of the preceding architecture diagram.

Data ingestion subsystem

The data ingestion subsystem ingests data from external sources and prepares the data for RAG. The following are the steps in the data-ingestion and preparation flow:

  1. Data engineers upload data from external sources to a Cloud Storage bucket. The external sources might be applications, databases, or streaming services.
  2. Upon completion, Cloud Storage publishes a message to a Pub/Sub topic.
  3. The Pub/Sub topic triggers a processing job to run in Cloud Run functions.
  4. Cloud Run functions processes the raw data by generating metadata and storing it as JSON Lines (JSONL) files in a separate Cloud Storage bucket, as shown in the sketch after this list.
  5. Upon completion, Cloud Run functions publishes a message to a Pub/Sub topic.
  6. The Pub/Sub topic triggers a processing job to run in the managed datastore within Google Agentspace. The processing job pulls the raw ingested data and metadata from the Cloud Storage buckets, then parses and chunks the data for efficient retrieval during serving. Google Agentspace automatically generates vector embeddings, with no configuration required.
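The following Python sketch illustrates steps 4 and 5, assuming a Cloud Run function that's triggered by the Pub/Sub notification from step 3. The bucket name, topic path, and metadata fields (such as document_type) are hypothetical placeholders, and the record layout is only one example of metadata that a datastore import can consume:

```python
# A minimal sketch of steps 4 and 5; names are hypothetical placeholders.
import base64
import json

import functions_framework
from google.cloud import pubsub_v1, storage

METADATA_BUCKET = "example-metadata-bucket"                    # hypothetical
DONE_TOPIC = "projects/example-project/topics/metadata-ready"  # hypothetical


@functions_framework.cloud_event
def process_upload(cloud_event):
    # The Pub/Sub message carries the Cloud Storage object notification.
    payload = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))
    bucket_name, object_name = payload["bucket"], payload["name"]

    # Generate metadata for the uploaded object. Real logic would parse the
    # file and apply your own business rules (product line, author, and so on).
    record = {
        "id": object_name.replace("/", "-"),
        "structData": {"source_bucket": bucket_name, "document_type": "guide"},
        "content": {
            "mimeType": "application/pdf",
            "uri": f"gs://{bucket_name}/{object_name}",
        },
    }

    # Step 4: store the metadata as a JSONL file in a separate bucket.
    blob = storage.Client().bucket(METADATA_BUCKET).blob(f"{object_name}.jsonl")
    blob.upload_from_string(json.dumps(record) + "\n",
                            content_type="application/json")

    # Step 5: publish a completion message so that the datastore import can start.
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(DONE_TOPIC,
                      json.dumps({"jsonl": blob.name}).encode("utf-8")).result()
```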

Serving subsystem

The serving subsystem handles the query-response flow between the generative AI application and its users. The following are the steps in the serving flow:

  1. An application user sends a query through one of the Cloud Run frontend services. You can customize these services for different experiences, such as a chatbot UI, a search page, or a mobile application.
  2. The frontend service receives the query, and then forwards the query to a centralized Cloud Run backend service. This backend provides a single, unified endpoint that supports all of the different frontend clients. The backend service also performs necessary preprocessing, which can include constructing filters for the search query. This approach keeps the preprocessing logic invisible to the frontends, as shown in the sketch after this list.
  3. The backend service sends the prepared request to Google Agentspace by using the Google Agentspace API endpoint to initiate the RAG workflow.
  4. To process the query, Google Agentspace uses enterprise search and the custom agent to perform the following tasks:
    1. Create an embedding of the user's query.
    2. Perform a semantic search on the indexed data in the managed datastore to find the most relevant information.
    3. Augment the original query with the retrieved data from the managed datastore to create a detailed, contextual prompt.
    4. Generate a final response that's based on the augmented prompt.
  5. Google Agentspace sends the generated response to the Cloud Run backend service.
  6. The backend service returns the final response to the frontend service that sent the original request. The frontend service presents the answer to the application user.
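The following sketch shows one way to implement steps 2 through 6 in the Cloud Run backend service. It assumes a Flask-based service and the Discovery Engine answer API, which backs Google Agentspace; the project, engine path, and metadata filter field are hypothetical placeholders:

```python
# A minimal sketch of the backend service; the serving config path and the
# document_type filter field are hypothetical placeholders.
from flask import Flask, jsonify, request
from google.cloud import discoveryengine_v1 as discoveryengine

app = Flask(__name__)
client = discoveryengine.ConversationalSearchServiceClient()

SERVING_CONFIG = (
    "projects/example-project/locations/global/collections/default_collection/"
    "engines/example-engine/servingConfigs/default_serving_config"
)


@app.post("/query")
def answer():
    body = request.get_json()

    # Step 2: backend preprocessing, such as constructing a metadata filter
    # from client-supplied attributes; the field name is hypothetical.
    search_spec = discoveryengine.AnswerQueryRequest.SearchSpec(
        search_params=discoveryengine.AnswerQueryRequest.SearchSpec.SearchParams(
            filter=f'document_type: ANY("{body.get("doc_type", "guide")}")'
        )
    )

    # Steps 3-4: Agentspace runs the RAG workflow (embed the query, retrieve
    # relevant chunks, augment the prompt, generate a grounded answer).
    response = client.answer_query(
        discoveryengine.AnswerQueryRequest(
            serving_config=SERVING_CONFIG,
            query=discoveryengine.Query(text=body["query"]),
            search_spec=search_spec,
        )
    )

    # Steps 5-6: return the generated answer to the calling frontend.
    return jsonify({"answer": response.answer.answer_text})
```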

Products used

This reference architecture uses the following Google Cloud products:

  • Google Agentspace: A managed platform that serves as a central registry and interaction hub for all of your AI agents within an enterprise, and enables seamless discovery, governance, and use by applications.
  • Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
  • Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
  • Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.

Use cases

This architecture is designed for enterprise scenarios where your generative AI application needs access to the most current information and requires deep, contextual understanding to provide accurate responses.

The architecture includes a custom data ingestion subsystem to address two key enterprise requirements:

  • Real-time data availability: The event-driven pipeline processes new data as soon as it's available in your organization—for example, a new product guide or an updated report. The pipeline also makes the information available in your managed datastore. This design helps to mitigate information staleness because it ensures that there's a minimal delay between data availability and use.
  • Enriched contextual search: The custom processing job allows your organization to apply its own business logic to enrich data with valuable metadata. The Cloud Run function can tag each document with specific attributes like product line, author, location, or document type. This rich metadata helps the agent to narrow down its search and deliver more precise, context-aware answers.

RAG is an effective technique to improve the quality of output that's generated from an LLM. This section provides examples of use cases for which you can use RAG-capable generative AI applications.

Personalized product recommendations

An online shopping site might use an LLM-powered chatbot to assist customers with finding products or getting shopping-related help. The questions from a user can be augmented by using historical data about the user's buying behavior and website interaction patterns. The data might include user reviews and feedback that's stored in an unstructured datastore or search-related metrics that are stored in a web analytics data warehouse. The augmented question can then be processed by the LLM to generate personalized responses that the user might find more appealing and compelling.

Clinical assistance systems

Doctors in hospitals need to quickly analyze and diagnose a patient's health condition to make decisions about appropriate care and medication. A generative AI application that uses a medical LLM like Med-PaLM can be used to assist doctors in their clinical diagnosis process. The responses that the application generates can be grounded in historical patient records by contextualizing the doctors' prompts with data from the hospital's electronic health record (EHR) database or from an external knowledge base like PubMed.

Efficient legal research

Generative AI-powered legal research lets lawyers quickly query large volumes of statutes and case law to identify relevant legal precedents or summarize complex legal concepts. The output of such research can be enhanced by augmenting a lawyer's prompts with data that's retrieved from the law firm's proprietary corpus of contracts, past legal communication, and internal case records. This design approach ensures that the generated responses are relevant to the legal domain that the lawyer specializes in.

Design alternatives

This section presents alternative design approaches that you can consider for your RAG-capable generative AI application in Google Cloud.

AI infrastructure alternatives

If you need an architecture that uses a fully managed vector search product, you can use Vertex AI and Vector Search, which provide optimized serving infrastructure for large-scale vector searches. For more information, see Infrastructure for a RAG-capable generative AI application using Vertex AI and Vector Search.

If you want to take advantage of the vector store capabilities of a fully managed Google Cloud database like AlloyDB for PostgreSQL or Cloud SQL, then see Infrastructure for a RAG-capable generative AI application using Vertex AI and AlloyDB for PostgreSQL.

If you want to rapidly build and deploy RAG-capable generative AI applications by using open source tools and models such as Ray, Hugging Face, and LangChain, see Infrastructure for a RAG-capable generative AI application using GKE and Cloud SQL.

Application hosting options

In the architecture that's shown in this document, Cloud Run hosts the generative AI application and the data processing jobs. Cloud Run is a developer-focused, fully managed serverless platform. You can also deploy your application to Vertex AI Agent Engine, to GKE clusters, or to Compute Engine VMs.

To choose an application host, consider the following trade-offs between configuration flexibility and management effort:

  • With the serverless Cloud Run option, you deploy your custom services to a preconfigured, managed environment. Because this architecture hosts the frontend services and the custom backend logic for request preprocessing, it requires the ability to deploy custom applications, which Cloud Run provides.
  • With the Vertex AI Agent Engine option, you use a fully managed platform that's designed for agent serving. Vertex AI Agent Engine reduces management overhead and ensures tight integration with Google Agentspace.
  • With Compute Engine VMs and GKE containers, you're responsible for managing the underlying compute resources, but you have greater configuration flexibility and control.

For more information about choosing an appropriate application hosting service, see the Google Cloud documentation about application hosting options.

Other infrastructure options

For information about other infrastructure options, supported models, and grounding techniques that you can use for generative AI applications in Google Cloud, see Choose models and infrastructure for your generative AI application.

Design considerations

This section provides guidance to help you develop a RAG-capable generative AI architecture in Google Cloud that meets your specific requirements for security and compliance, reliability, cost, and performance. The guidance in this section isn't exhaustive. Depending on the specific requirements of your generative AI application and the Google Cloud products and features that you use, you might need to consider additional design factors and trade-offs.

For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Well-Architected Framework.

Security, privacy, and compliance

This section describes design considerations and recommendations to design a topology in Google Cloud that meets your workload's security and compliance requirements.


Product

Design considerations and recommendations

Vertex AI

Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, data encryption, network security, and access transparency.

Google Agentspace Enterprise deletes user-requested data within 60 days. For more information, see Data deletion on Google Cloud.

Generative AI models might produce harmful responses, especially when they are explicitly prompted for such responses. To enhance safety and mitigate potential misuse, you can configure content filters to act as barriers to harmful responses. For more information, see Safety and content filters.

Cloud Run

By default, Cloud Run encrypts data by using Google-owned and Google-managed encryption keys. To protect your containers by using keys that you control, you can use customer-managed encryption keys (CMEKs). For more information, see Using customer managed encryption keys.

To ensure that only authorized container images are deployed to Cloud Run, you can use Binary Authorization.

Cloud Run helps you meet data residency requirements. Your Cloud Run functions run within the selected region.

Cloud Storage

By default, Cloud Storage encrypts the data that it stores by using Google-owned and Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options.

Cloud Storage supports two methods to grant users access to your buckets and objects: Identity and Access Management (IAM) and access control lists (ACLs). In most cases, we recommend that you use IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control.

The data that you load into the data ingestion subsystem through Cloud Storage might include sensitive data. You can use Sensitive Data Protection to discover, classify, and de-identify sensitive data. For more information, see Using Sensitive Data Protection with Cloud Storage.

Cloud Storage helps you meet data residency requirements. Cloud Storage stores or replicates data within the region that you specify.

Pub/Sub

By default, Pub/Sub encrypts all messages, both at rest and in transit, by using Google-owned and Google-managed encryption keys. Pub/Sub supports the use of CMEKs for message encryption at the application layer. For more information, see Configure message encryption.

If you have data residency requirements, you can configure message storage policies to ensure that message data is stored in specific locations.
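As a sketch, the following snippet creates a topic with such a policy; the project ID, topic ID, and region are hypothetical placeholders:

```python
# A minimal sketch of a message storage policy that constrains where
# Pub/Sub persists message data; names and region are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
publisher.create_topic(
    request={
        "name": publisher.topic_path("example-project", "ingestion-events"),
        "message_storage_policy": {
            "allowed_persistence_regions": ["europe-west1"],
        },
    }
)
```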

For security principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Security in the Well-Architected Framework.

Reliability

This section describes design considerations and recommendations to build and operate reliable infrastructure for your deployment in Google Cloud.


Product

Design considerations and recommendations

Vertex AI

Vertex AI ensures data residency at rest: your source data, including the RAG data in the managed datastore, is stored within the Google Cloud location that you select. This separation of processing from storage is a fundamental aspect of how the platform provides both high reliability and compliance.

Cloud Run

Cloud Run is a regional service that stores data synchronously across multiple zones within a region. The service automatically load-balances the traffic across the zones. If a zone outage occurs, Cloud Run jobs continue to run and data isn't lost. If a region outage occurs, the Cloud Run jobs stop running until Google resolves the outage.

Individual Cloud Run jobs or tasks might fail. To handle such failures, you can use task retries and checkpointing. For more information, see Jobs retries and checkpoints best practices.
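The following sketch shows the checkpointing pattern for a Cloud Run jobs task. It relies on the CLOUD_RUN_TASK_INDEX environment variable that Cloud Run jobs inject; the checkpoint bucket, item count, and per-item work are hypothetical placeholders:

```python
# A minimal sketch of a Cloud Run jobs task that resumes from a checkpoint
# after a retried attempt; the bucket name and workload are hypothetical.
import os

from google.cloud import storage


def process_item(i):
    """Hypothetical per-item work."""


task_index = os.environ.get("CLOUD_RUN_TASK_INDEX", "0")
bucket = storage.Client().bucket("example-checkpoint-bucket")
checkpoint = bucket.blob(f"checkpoints/task-{task_index}.txt")

# On a retried attempt, resume from the last checkpoint instead of
# reprocessing from the start.
start = int(checkpoint.download_as_text()) if checkpoint.exists() else 0

for i in range(start, 1000):
    process_item(i)
    if (i + 1) % 100 == 0:
        checkpoint.upload_from_string(str(i + 1))
```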

Cloud Storage

You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. For data in regional buckets, Cloud Storage synchronously replicates that data across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where Cloud Storage replicates data asynchronously across regions. Ensure that your choice aligns with your compliance requirements.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.

Cost optimization

This section provides guidance to optimize the cost of setting up and operating a Google Cloud topology that you build by using this reference architecture.


Product

Design considerations and recommendations

Vertex AI

The underlying AI model that the agent invokes can directly influence the cost to use that agent. Pricing is calculated based on the number of input and output tokens for each request. For more information, see Generative AI on Vertex AI quotas and system limits and Google Cloud's pricing calculator.

For information about how to minimize token count to reduce cost, see Optimize prompt and output length.
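As a sketch, assuming the Vertex AI SDK and hypothetical project and model names, you can estimate token counts before you send a request:

```python
# A minimal sketch of estimating input tokens, which drive per-request
# cost; the project, location, and model name are hypothetical.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="example-project", location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

usage = model.count_tokens("Draft a product description for our new bike.")
print(usage.total_tokens, usage.total_billable_characters)
```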

Cloud Run functions

When you create Cloud Run jobs, you specify the amount of memory and CPU to be allocated to the container instance. To control costs, start with the default CPU and memory allocations. To improve performance, you can increase the allocation by configuring the CPU limit and memory limit.

If you can predict the CPU and memory requirements of your Cloud Run jobs, then you can save money with discounts for committed usage. For more information, see Cloud Run committed use discounts.

Cloud Storage

For the Cloud Storage bucket that you use to load data into the data ingestion subsystem, choose an appropriate storage class based on the data-retention and access-frequency requirements of your workloads. For example, you can choose the Standard storage class, and use Object Lifecycle Management to control storage costs. Object Lifecycle Management automatically downgrades objects to a lower-cost storage class or deletes objects based on conditions that you set.
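For example, the following sketch configures such a policy on a hypothetical ingestion bucket; the age thresholds and target storage class are illustrative assumptions:

```python
# A minimal sketch of Object Lifecycle Management rules: move objects to a
# colder class after 30 days and delete them after one year. The bucket
# name and thresholds are hypothetical.
from google.cloud import storage

bucket = storage.Client().get_bucket("example-ingestion-bucket")
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```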

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.

Performance optimization

This section describes design considerations and recommendations to design a topology in Google Cloud that meets the performance requirements of your workloads.


Product

Design considerations and recommendations

Google Agentspace

To reduce latency during serving, stream responses by sending partial model output before the agent generates the complete answer. Streaming enables real-time processing of the output, so you can immediately update your user interface and perform other concurrent tasks. Streaming enhances perceived responsiveness and creates a more interactive user experience. For more information, see Stream answers.
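The following sketch shows the streaming pattern, with the Vertex AI Gemini API as a stand-in for the Agentspace stream answers endpoint; the project, location, and model name are hypothetical placeholders:

```python
# A minimal sketch of consuming a streamed response; names are hypothetical.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="example-project", location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

# Consume partial responses as they arrive instead of waiting for the full
# answer, so the UI can render tokens immediately.
for chunk in model.generate_content("Summarize our returns policy.", stream=True):
    print(chunk.text, end="", flush=True)
```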

Cloud Run

Adjust the memory and CPU allocation for the Cloud Run instances based on your performance requirements. For more information, see Configure CPU limits for jobs and Configure memory limits for services.

Cloud Storage

To upload large files, you can use a method called parallel composite uploads. With this strategy, the large file is split into chunks. You upload the chunks to Cloud Storage in parallel, and Cloud Storage then reassembles the data in Google Cloud. Parallel composite uploads can be faster than regular upload operations if you have sufficient network bandwidth and disk speed. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
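The following sketch implements this pattern with the Cloud Storage Python client; the names are hypothetical, and production code would also handle failure cleanup:

```python
# A minimal sketch of a parallel composite upload: file chunks are uploaded
# in parallel as temporary objects, then composed into the final object.
# Note that compose accepts at most 32 source objects per call.
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per component


def composite_upload(bucket_name: str, source_file: str, dest_name: str):
    bucket = storage.Client().bucket(bucket_name)

    # Split the local file into chunks.
    chunks = []
    with open(source_file, "rb") as f:
        while data := f.read(CHUNK_SIZE):
            chunks.append(data)

    # Upload the chunks in parallel as temporary component objects.
    def upload_part(args):
        index, data = args
        part = bucket.blob(f"{dest_name}.part-{index:04d}")
        part.upload_from_string(data)
        return part

    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(upload_part, enumerate(chunks)))

    # Compose the components into the final object, then delete them.
    bucket.blob(dest_name).compose(parts)
    for part in parts:
        part.delete()
```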

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.

Deployment

To deploy this reference architecture, use the Terraform example that's available in GitHub. For more information, see RAG Infrastructure for Generative AI Applications using Google Agentspace and Vertex AI.

What's next

Contributors

Author: Samantha He | Technical Writer

Other contributors: