GraphRAG infrastructure for generative AI using Vertex AI and Spanner Graph

Last reviewed 2025-07-01 UTC

This document provides a reference architecture to help you design infrastructure for GraphRAG generative AI applications in Google Cloud. The intended audience includes architects, developers, and administrators who build and manage intelligent information retrieval systems. The document assumes a foundational understanding of AI, graph data management, and knowledge graph concepts. This document doesn't provide specific guidance for designing and developing GraphRAG applications.

GraphRAG is a graph-based approach to retrieval-augmented generation (RAG). RAG helps to ground AI-generated responses by augmenting prompts with contextually relevant data that's retrieved using vector search. GraphRAG combines vector search with a knowledge-graph query to retrieve contextual data that better reflects the interconnectedness of data from diverse sources. Prompts that are augmented using GraphRAG can generate more detailed and relevant AI responses.

Architecture

The following diagram shows an architecture for a GraphRAG-capable generative AI application in Google Cloud:

Diagram: The data ingestion and serving flows in the architecture.

The architecture in the preceding diagram consists of two subsystems: data ingestion and serving. The following sections describe the purpose of the subsystems and the data flow within and across the subsystems.

Data ingestion subsystem

The data ingestion subsystem ingests data from external sources and then prepares the data for GraphRAG. The data ingestion and preparation flow involves the following steps:

  1. Data is ingested into a Cloud Storage bucket. This data can be uploaded by a data analyst, ingested from a database, or streamed from any source.
  2. When data is ingested, a message is sent to a Pub/Sub topic.
  3. Pub/Sub triggers a Cloud Run function to process the uploaded data.
  4. The Cloud Run function builds a knowledge graph from the input files by using the Gemini API in Vertex AI and tools like LangChain's LLMGraphTransformer. (A condensed sketch of steps 4 through 8 appears after this list.)
  5. The function stores the knowledge graph in a Spanner Graph database.
  6. The function segments the textual content of the data files into granular units by using tools like LangChain's RecursiveCharacterTextSplitter or Document AI's Layout Parser.
  7. The function creates vector embeddings of the text segments by using the Vertex AI Embeddings APIs.
  8. The function stores the vector embeddings and associated graph nodes in Spanner Graph.

The vector embeddings serve as the basis for semantic retrieval. The knowledge graph nodes enable traversal and analysis of intricate data relationships and patterns.
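
The following sketch condenses steps 4 through 8 into a single function, assuming the langchain-google-vertexai, langchain-experimental, and langchain-text-splitters packages. The model names, chunking parameters, and the Spanner write helper at the end are illustrative placeholders, not fixed choices.

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def process_file(raw_text: str) -> None:
    # Step 4: extract a knowledge graph from the text by using Gemini.
    llm = ChatVertexAI(model_name="gemini-2.0-flash")
    graph_docs = LLMGraphTransformer(llm=llm).convert_to_graph_documents(
        [Document(page_content=raw_text)]
    )

    # Step 6: segment the text into granular, retrievable chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(raw_text)

    # Step 7: create a vector embedding for each chunk.
    embedder = VertexAIEmbeddings(model_name="text-embedding-005")
    vectors = embedder.embed_documents(chunks)

    # Steps 5 and 8: persist the graph, chunks, and embeddings in Spanner Graph.
    write_to_spanner_graph(graph_docs, chunks, vectors)  # hypothetical helper
```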

Serving subsystem

The serving subsystem manages the query-response lifecycle between the generative AI application and its users. The serving flow involves the following steps:

  1. A user submits a natural-language query to an AI agent, which is deployed on Vertex AI Agent Engine.
  2. The agent processes the query as follows:
    1. Converts the query to vector embeddings by using the Vertex AI Embeddings APIs.
    2. Retrieves graph nodes that are related to the query by performing a vector-similarity search in the embeddings database.
    3. Retrieves data that's related to the query by traversing the knowledge graph. (Sketches of the retrieval and ranking steps appear after this list.)
    4. Augments the prompt by combining the original query with the retrieved graph data.
    5. Uses the AI Applications ranking API to rank the results, which consist of nodes and edges that are retrieved from the graph database. The ranking is based on semantic relevance to the query.
    6. Summarizes the results by calling the Gemini API in Vertex AI.
  3. The agent then sends the summarized result to the user.
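
The following hedged sketch shows steps 2b and 2c: a vector-similarity search over the stored embeddings, followed by a graph traversal in Spanner Graph. It assumes a recent version of the google-cloud-spanner client; the graph name (kg), table (Chunks), labels (Chunk, Entity), and edge label (MENTIONS) are assumptions about your schema, not fixed names.

```python
from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

database = spanner.Client().instance("my-instance").database("my-database")


def retrieve_context(query_embedding: list[float]) -> list:
    with database.snapshot(multi_use=True) as snapshot:
        # Step 2b: find the chunks whose embeddings are closest to the query.
        rows = snapshot.execute_sql(
            "SELECT id FROM Chunks "
            "ORDER BY COSINE_DISTANCE(embedding, @qvec) LIMIT 10",
            params={"qvec": query_embedding},
            param_types={"qvec": param_types.Array(param_types.FLOAT32)},
        )
        chunk_ids = [row[0] for row in rows]

        # Step 2c: traverse the knowledge graph outward from those chunks.
        graph_rows = snapshot.execute_sql(
            """
            GRAPH kg
            MATCH (c:Chunk)-[:MENTIONS]->(e:Entity)-[rel]-(n:Entity)
            WHERE c.id IN UNNEST(@ids)
            RETURN e.name AS entity, n.name AS related_entity
            """,
            params={"ids": chunk_ids},
            param_types={"ids": param_types.Array(param_types.STRING)},
        )
        return list(graph_rows)
```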

You can store and view query-response activity logs in Cloud Logging and set up logs-based monitoring by using Cloud Monitoring.
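
Step 2e of the serving flow uses the AI Applications ranking API, which is served through the Discovery Engine client. The following minimal sketch assumes that the retrieved nodes and edges have been flattened into text records; the project ID, model version, and record contents are placeholders.

```python
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.RankServiceClient()
ranking_config = client.ranking_config_path(
    project="my-project", location="global", ranking_config="default_ranking_config"
)
response = client.rank(
    request=discoveryengine.RankRequest(
        ranking_config=ranking_config,
        model="semantic-ranker-512@latest",
        top_n=5,
        query="What drugs interact with warfarin?",
        records=[
            discoveryengine.RankingRecord(
                id="1", content="warfarin -[INTERACTS_WITH]-> aspirin"
            ),
            # ...one record per retrieved node or edge...
        ],
    )
)
for record in response.records:  # records are returned with relevance scores
    print(record.id, record.score)
```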

Products used

This reference architecture uses the following Google products and tools:

  • Spanner Graph: A graph database that provides the scalability, availability, and consistency features of Spanner.
  • Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
  • Cloud Run functions: A serverless compute platform that lets you run single-purpose functions directly in Google Cloud.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
  • Cloud Logging: A real-time log management system with storage, search, analysis, and alerting.
  • Cloud Monitoring: A service that provides visibility into the performance, availability, and health of your applications and infrastructure.

Use cases

GraphRAG facilitates intelligent data retrieval for use cases in various industries. This section describes a few use cases in healthcare, finance, legal services, and manufacturing.

Healthcare and pharmaceuticals: Clinical decision support

In clinical decision-support systems, GraphRAG integrates vast amounts of data from medical literature, patient electronic health records, drug interaction databases, and clinical trial results into a unified knowledge graph. When clinicians and researchers query a patient's symptoms and current medications, GraphRAG traverses the knowledge graph to identify relevant conditions and potential drug interactions. It can also generate personalized treatment recommendations based on other data such as the patient's genetic profile. This type of information retrieval provides answers that are more contextually rich and evidence-based than keyword matching.

Financial services: Unifying financial data

Financial services firms use knowledge graphs to give their analysts a unified, structured view of data from disparate sources like analyst reports, earnings calls, and risk assessments. Knowledge graphs identify key data entities like companies and executives, and they map the crucial relationships between the entities. This approach provides a rich, interconnected web of data, which enables deeper and more efficient financial analysis. Analysts can discover previously hidden insights, such as intricate supply chain dependencies, board memberships that overlap across competitors, and exposure to complex geopolitical risks.

Legal services: Precedent research and case analysis

In the legal sector, GraphRAG can be used to generate personalized legal recommendations based on precedents, statutes, case law, regulatory updates, and internal documents. When lawyers prepare for cases, they can ask nuanced questions about specific legal arguments, prior rulings on similar cases, or the implications of new legislation. GraphRAG leverages the interconnectedness of the available legal knowledge to identify relevant precedents and explain their applicability. It can also suggest counterarguments by tracing the relationships between legal concepts, statutes, and judicial interpretations. With this approach, legal practitioners can gain more thorough and precise insights than conventional knowledge-retrieval methods.

Manufacturing and supply chain: Unlocking institutional knowledge

Manufacturing and supply chain operations require a high degree of precision. The knowledge that's necessary to maintain the required level of precision is often buried in thousands of dense, static Standard Operating Procedure (SOP) documents. When a production line or machine in a factory fails, or if a logistical issue occurs, engineers and technicians often waste critical time searching through disconnected PDF documents to diagnose and troubleshoot the issue. Knowledge graphs and conversational AI can be combined to turn buried institutional knowledge into an interactive diagnostic partner.

Design alternatives

The architecture that this document describes is modular. You can adapt certain components of the architecture to use alternative products, tools, and technologies depending on your requirements.

Building the knowledge graph

You can use LangChain's LLMGraphTransformer tool to build a knowledge graph from scratch. By specifying the graph schema with LLMGraphTransformer parameters like allowed_nodes, allowed_relationships, node_properties, and relationship_properties, you can improve the quality of the resulting knowledge graph. However, LLMGraphTransformer extracts entities based on generic domain knowledge, so it might not be suitable for niche domains like healthcare or pharmaceuticals. Also, if your organization already has a robust process to build knowledge graphs, then the data ingestion subsystem that's shown in this reference architecture is optional.
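
The following brief sketch shows how these parameters constrain the extracted schema. The node and relationship names are illustrative examples for a clinical domain, not requirements.

```python
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_google_vertexai import ChatVertexAI

transformer = LLMGraphTransformer(
    llm=ChatVertexAI(model_name="gemini-2.0-flash"),
    # Restrict extraction to a domain-specific schema.
    allowed_nodes=["Patient", "Medication", "Condition"],
    allowed_relationships=["TAKES", "TREATS", "INTERACTS_WITH"],
    node_properties=["description"],
    relationship_properties=["evidence"],
)
```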

Storing the knowledge graph and vector embeddings

The architecture in this document uses Spanner as the datastore for the knowledge graph and the vector embeddings. If your enterprise knowledge graphs already exist elsewhere (such as on a platform like Neo4j), then you might consider using a vector database for the embeddings. However, this approach requires additional management effort and it might cost more. Spanner provides a consolidated, globally consistent datastore for both graph structures and vector embeddings. Such a datastore enables unified data management, which helps to optimize cost, performance, security governance, and operational efficiency.
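
The following hedged sketch shows a consolidated Spanner schema that holds both the graph structure and the vector embeddings. The table, column, and graph names are illustrative, and the embedding dimension (768 here) depends on the embedding model that you use.

```python
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
operation = database.update_ddl([
    """CREATE TABLE Entities (
         id STRING(64) NOT NULL,
         name STRING(MAX),
         embedding ARRAY<FLOAT32>(vector_length=>768)
       ) PRIMARY KEY (id)""",
    """CREATE TABLE Relations (
         from_id STRING(64) NOT NULL,
         to_id STRING(64) NOT NULL,
         kind STRING(MAX)
       ) PRIMARY KEY (from_id, to_id)""",
    """CREATE PROPERTY GRAPH kg
         NODE TABLES (Entities LABEL Entity)
         EDGE TABLES (
           Relations
             SOURCE KEY (from_id) REFERENCES Entities (id)
             DESTINATION KEY (to_id) REFERENCES Entities (id)
             LABEL RELATES_TO
         )""",
])
operation.result()  # wait for the schema change to complete
```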

Agent runtime

In this reference architecture, the agent is deployed on Vertex AI Agent Engine, which provides a managed runtime for AI agents. Other options that you can consider include Cloud Run and Google Kubernetes Engine (GKE). A discussion of those options is outside the scope of this document.

Grounding using RAG

As discussed in the Use cases section, GraphRAG enables intelligent data retrieval for grounding in many scenarios. However, if the source data that you use for augmenting prompts doesn't have complex inter-relationships, then RAG might be an appropriate choice for your generative AI application.

The following reference architectures show how you can build the infrastructure required for RAG in Google Cloud by using vector-enabled managed databases or specialized vector search products:

Design considerations

This section describes design factors, best practices, and recommendations to consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost, and performance.

The guidance in this section is not exhaustive. Depending on your workload's requirements and the Google Cloud and third-party products and features that you use, there might be additional design factors and trade-offs that you should consider.

Security, privacy, and compliance

This section describes design considerations and recommendations to design a topology in Google Cloud that meets your workload's security and compliance requirements.

Vertex AI

Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, data encryption, network security, and access transparency. For more information, see the following documentation:

Generative AI models might produce harmful responses, especially when they are explicitly prompted for such responses. To enhance safety and mitigate potential misuse, you can configure content filters to act as barriers to harmful responses. For more information, see Safety and content filters.

Spanner Graph

By default, data that's stored in Spanner Graph is encrypted using Google-owned and Google-managed encryption keys. If you need to use encryption keys that you control and manage, you can use customer-managed encryption keys (CMEKs). For more information, see About CMEK.
Cloud Run functions

By default, Cloud Run encrypts data by using Google-owned and Google-managed encryption keys. To protect your containers by using keys that you control, you can use CMEKs. For more information, see Using customer managed encryption keys.

To ensure that only authorized container images are deployed to Cloud Run, you can use Binary Authorization.

Cloud Run helps you meet data residency requirements. Your Cloud Run functions run within the selected region.

Cloud Storage

By default, the data that's stored in Cloud Storage is encrypted using Google-owned and Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options.

Cloud Storage supports two methods for granting users access to your buckets and objects: Identity and Access Management (IAM) and access control lists (ACLs). In most cases, we recommend using IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control.

The data that you load into the data ingestion subsystem through Cloud Storage might include sensitive data. You can use Sensitive Data Protection to discover, classify, and de-identify sensitive data. For more information, see Using Sensitive Data Protection with Cloud Storage.

Cloud Storage helps you meet data residency requirements. Data is stored or replicated within the region that you specify.

Pub/Sub

By default, Pub/Sub encrypts all messages, both at rest and in transit, by using Google-owned and Google-managed encryption keys. Pub/Sub supports the use of CMEKs for message encryption at the application layer. For more information, see Configure message encryption.

If you have data residency requirements, you can configure message storage policies to ensure that message data is stored in specific locations.

Cloud Logging

Admin Activity audit logs are enabled by default for all the Google Cloud services that are used in this reference architecture. These logs record API calls or other actions that modify the configuration or metadata of Google Cloud resources.

For the Google Cloud services that are used in this architecture, you can enable Data Access audit logs. These logs let you track API calls that read the configuration or metadata of resources or user requests to create, modify, or read user-provided resource data.

To help meet data residency requirements, you can configure Cloud Logging to store log data in the region that you specify. For more information, see Regionalize your logs.

For security principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Security in the Google Cloud Well-Architected Framework.

Reliability

This section describes design considerations and recommendations to build and operate reliable infrastructure for your deployment in Google Cloud.

Vertex AI

Vertex AI supports dynamic shared quota (DSQ) for Gemini models. DSQ helps to flexibly manage pay-as-you-go requests and eliminates the need to manage quota manually or request quota increases. DSQ dynamically allocates the available resources for a given model and region across active customers. With DSQ, there are no predefined quota limits on individual customers.

If the number of requests exceeds the allocated capacity, error code 429 is returned. For workloads that are business critical and consistently require high throughput, you can reserve throughput by using Provisioned Throughput. If data can be shared across multiple regions or countries, you can use a global endpoint.
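
The following minimal backoff sketch handles 429 (resource exhausted) responses, assuming the Vertex AI Python SDK; the model name and retry parameters are placeholders to tune for your workload.

```python
from google.api_core import exceptions, retry
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.0-flash")


@retry.Retry(
    predicate=retry.if_exception_type(exceptions.ResourceExhausted),  # HTTP 429
    initial=1.0, maximum=32.0, multiplier=2.0, timeout=120.0,
)
def generate(prompt: str):
    # Retries with exponential backoff when capacity is temporarily exceeded.
    return model.generate_content(prompt)
```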

Spanner Graph

Spanner is designed for high data availability and global scalability. To help ensure availability even during a region outage, Spanner offers multi-region configurations, which replicate data in multiple zones across multiple regions. In addition to these built-in resilience capabilities, Spanner provides the following features to support comprehensive disaster recovery strategies:

  • Database deletion protection
  • Robust backup and restore capabilities, including scheduled and cross-region copies
  • Point-in-time recovery (PITR) for protection against logical data corruption, operator errors, or accidental writes for up to seven days

For more information, see Disaster recovery overview.

Cloud Run functions

Cloud Run is a regional service. Data is stored synchronously across multiple zones within a region. Traffic is automatically load-balanced across the zones. If a zone outage occurs, Cloud Run continues to run and data isn't lost. If a region outage occurs, the service stops running until Google resolves the outage.
Cloud Storage

You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. Data that's stored in regional buckets is replicated synchronously across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where data is replicated asynchronously across regions.
Pub/Sub

To avoid errors during periods of transient spikes in message traffic, you can limit the rate of publish requests by configuring flow control in the publisher settings.

To handle failed publish attempts, adjust the retry-request variables as necessary. For more information, see Retry requests.
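
The following short sketch shows publisher-side flow control, assuming the google-cloud-pubsub package; the limits are illustrative starting points.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(
        flow_control=pubsub_v1.types.PublishFlowControl(
            message_limit=500,                # max outstanding messages
            byte_limit=10 * 1024 * 1024,      # max outstanding bytes
            limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
        )
    )
)
future = publisher.publish("projects/my-project/topics/ingestion", b"payload")
print(future.result())  # blocks until the service acknowledges the message
```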

All of the products in the architecture

After you deploy your workload in Google Cloud, use Active Assist to get recommendations to further optimize the reliability of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.

Cost optimization

This section provides guidance to optimize the cost of setting up and operating a Google Cloud topology that you build by using this reference architecture.

Vertex AI

To analyze and manage Vertex AI costs, we recommend that you create a baseline of queries per second (QPS) and tokens per second (TPS) and monitor these metrics after deployment. The baseline also helps with capacity planning. For example, the baseline helps you determine when Provisioned Throughput is necessary.

Selecting the appropriate model for your generative AI application is a critical decision that directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

The length of your prompts (input) and the generated responses (output) directly affect performance and cost. Write prompts that are short, direct, and provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.

To reduce the cost of requests that contain repeated content with high input token counts, use context caching.
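
The following hedged sketch shows context caching with the preview caching API in the Vertex AI SDK; model support, minimum token counts, and exact module paths can change while the feature is in preview, and shared_reference_text is a placeholder for the large, repeated part of your prompts.

```python
import datetime

from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

cached = caching.CachedContent.create(
    model_name="gemini-2.0-flash-001",
    contents=[shared_reference_text],  # placeholder: the repeated, token-heavy context
    ttl=datetime.timedelta(hours=1),
)
model = GenerativeModel.from_cached_content(cached_content=cached)
response = model.generate_content("Summarize the key drug interactions.")
```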

When relevant, consider batch prediction. Batched requests are billed at a lower price than standard requests.

Spanner Graph

Use the managed autoscaler to dynamically adjust the compute capacity for Spanner Graph databases based on CPU utilization and storage needs. A minimum capacity is often required, even for small workloads.

For predictable, stable, or baseline compute capacity, purchase committed use discounts (CUDs). CUDs offer significant discounts in exchange for committing to a certain hourly spend on compute capacity.

When you copy backups to different regions for disaster recovery or compliance, consider network egress costs. To help reduce costs, copy only essential backups.

Cloud Run functions

When you create Cloud Run functions, you can specify the amount of memory and CPU to be allocated. To control costs, start with the default (minimum) CPU and memory allocations. To improve performance, you can increase the allocation by configuring the CPU limit and memory limit. For more information, see the following documentation:

If you can predict the CPU and memory requirements, you can save money with CUDs.

Cloud Storage

For the Cloud Storage bucket in the data ingestion subsystem, choose an appropriate storage class based on your workload's requirements for data retention and access frequency. For example, to control storage costs, you can choose the Standard class and use Object Lifecycle Management. This approach enables automatic downgrade of objects to a lower-cost storage class or automatic deletion of objects based on specified conditions.
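
The following small sketch sets lifecycle rules through the Python client; the bucket name and age thresholds are placeholders.

```python
from google.cloud import storage

bucket = storage.Client().get_bucket("my-ingestion-bucket")
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)  # downgrade after 30 days
bucket.add_lifecycle_delete_rule(age=365)                        # delete after a year
bucket.patch()  # persist the updated lifecycle configuration
```
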
Cloud Logging

To control the cost of storing logs, you can do the following:

All of the products in the architecture

After you deploy your workload in Google Cloud, use Active Assist to get recommendations to further optimize the cost of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.

Performance optimization

This section describes design considerations and recommendations to design a topology in Google Cloud that meets the performance requirements of your workloads.

Vertex AI

Selecting the appropriate model for your generative AI application is a critical decision that directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

The length of your prompts (input) and the generated responses (output) directly affect performance and cost. Write prompts that are short, direct, and provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.

The Vertex AI prompt optimizer lets you rapidly improve and optimize prompt performance at scale and eliminates the need for manual rewriting. The optimizer helps you efficiently adapt prompts across different models.

Spanner Graph

For recommendations to optimize Spanner Graph performance, see the following documentation:

Cloud Run functions

By default, each Cloud Run function instance is allocated one CPU and 256 MiB of memory. Depending on your performance requirements, you can configure CPU and memory limits. For more information, see the following documentation:

For more performance optimization guidance, see General Cloud Run development tips.

Cloud Storage

To upload large files, you can use parallel composite uploads. With this strategy, the large file is split into chunks. The chunks are uploaded to Cloud Storage in parallel and then the data is recomposed in the cloud. When network bandwidth and disk speed aren't limiting factors, then parallel composite uploads can be faster than regular upload operations. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
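
The gcloud storage and gsutil tools perform parallel composite uploads automatically when configured. The following hedged sketch shows the underlying idea with the Python client; the bucket name, chunk files, and object names are placeholders, and compose() accepts at most 32 source objects.

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

bucket = storage.Client().bucket("my-bucket")
chunk_paths = ["large-file.part0", "large-file.part1", "large-file.part2"]


def upload_part(path: str) -> storage.Blob:
    blob = bucket.blob(f"tmp/{path}")
    blob.upload_from_filename(path)  # parts upload in parallel threads
    return blob


with ThreadPoolExecutor() as pool:
    parts = list(pool.map(upload_part, chunk_paths))

final = bucket.blob("large-file")
final.compose(parts)   # recompose the chunks into the final object
for part in parts:
    part.delete()      # clean up the temporary part objects
```
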
All of the products in the architecture

After you deploy your workload in Google Cloud, use Active Assist to get recommendations to further optimize the performance of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.

Deployment

To explore how GraphRAG works in Google Cloud, download and run the following Jupyter notebook from GitHub: GraphRAG on Google Cloud With Spanner Graph and Vertex AI Agent Engine.
