Multi-agent AI system in Google Cloud

Last reviewed 2025-09-16 UTC

This document provides a reference architecture to help you design robust multi-agent AI systems in Google Cloud. A multi-agent AI system optimizes complex and dynamic processes by segmenting them into discrete tasks that multiple specialized AI agents collaboratively execute.

The intended audience for this document includes architects, developers, and administrators who build and manage AI infrastructure and applications in the cloud. This document assumes a foundational understanding of AI agents and models. The document doesn't provide specific guidance for designing and coding AI agents.

Architecture

The following diagram shows an architecture for an example of a multi-agent AI system that's deployed in Google Cloud.

Architecture for a multi-agent AI system in Google Cloud.

Architecture components

The example architecture in the preceding section contains the following components:

Component Description
Frontend Users interact with the multi-agent system through a frontend, such as a chat interface, that runs as a serverless Cloud Run service.
Agents A coordinator agent controls the agentic AI system in this example. The coordinator agent invokes an appropriate subagent to trigger the agentic flow. The agents can communicate with each other by using the Agent2Agent (A2A) protocol, which enables interoperability between agents regardless of their programming language and runtime. The example architecture shows a sequential pattern and an iterative refinement pattern. For more information about the subagents in this example, see the Agentic flow section.
Agents runtime AI agents can be deployed as serverless Cloud Run services, as containerized apps on Google Kubernetes Engine (GKE), or on Vertex AI Agent Engine.
ADK The Agent Development Kit (ADK) provides tools and a framework to develop, test, and deploy agents. The ADK abstracts the complexity of agent creation and lets AI developers focus on the agent's logic and capabilities.
AI model and model runtimes For inference serving, the agents in this example architecture use an AI model on Vertex AI. The architecture shows Cloud Run and GKE as alternative runtimes for the AI model that you choose to use.
Model Armor Model Armor enables inspection and sanitization of inputs and responses for models that are deployed in Vertex AI and GKE. For more information, see Model Armor integration with Google Cloud services.
MCP clients, servers, and tools The Model Context Protocol (MCP) facilitates access to tools by standardizing the interaction between agents and tools. For each agent-tool pair, an MCP client sends requests to an MCP server through which the agent accesses a tool such as a database, a file system, or an API.
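To make the agent-tool interaction concrete, the following sketch shows the shape of an MCP tool-call request. MCP is built on JSON-RPC 2.0, and the `tools/call` method follows the MCP specification; the tool name and its arguments here are hypothetical examples.

```python
import json

# Illustrative MCP "tools/call" request that an MCP client sends to an MCP
# server. The method and parameter structure follow the MCP specification;
# the tool name and arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_inventory_db",      # hypothetical tool exposed by an MCP server
        "arguments": {"sku": "ABC-123"},   # arguments defined by that tool's schema
    },
}

print(json.dumps(request, indent=2))
```

The MCP server validates the request against the tool's declared schema, executes the tool (for example, a database query), and returns the result to the client in a JSON-RPC response.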

Agentic flow

The example multi-agent system in the preceding architecture has the following flow:

  1. A user enters a prompt through a frontend, such as a chat interface, which runs as a serverless Cloud Run service.
  2. The frontend forwards the prompt to a coordinator agent.
  3. The coordinator agent starts one of the following agentic flows based on the intent that's expressed in the prompt.

    • Sequential:
      1. The task-A subagent performs a task.
      2. The task-A subagent invokes the task-A.1 subagent.
    • Iterative refinement:

      1. The task-B subagent performs a task.
      2. The quality evaluator subagent reviews the output of the task-B subagent.
      3. If the output is unsatisfactory, the quality evaluator invokes the prompt enhancer subagent to refine the prompt.
      4. The task-B subagent performs its task again by using the enhanced prompt.

      This cycle continues until the output is satisfactory or the maximum number of iterations is reached.

    The example architecture includes a human-in-the-loop path to let human users intervene in the agentic flow when necessary.

  4. The task-A.1 subagent and quality evaluator subagent independently invoke the response generator subagent.

  5. The response generator subagent generates a response, performs validation and grounding checks, and then sends the final response to the user through the coordinator agent.
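The flow above can be sketched in plain Python, with callables standing in for agents. A real implementation would use model-backed agents and the ADK's workflow patterns; the function names, the keyword-based intent routing, and the evaluator's acceptance rule here are all illustrative assumptions.

```python
MAX_ITERATIONS = 3  # upper bound for the iterative refinement loop

def task_a(prompt):
    return f"task-A result for: {prompt}"

def task_a1(intermediate):
    return f"task-A.1 result for: {intermediate}"

def task_b(prompt):
    return f"task-B result for: {prompt}"

def quality_evaluator(output, iteration):
    # Stand-in quality check: accept the output after the second refinement.
    return iteration >= 2

def prompt_enhancer(prompt):
    return prompt + " (refined)"

def response_generator(intermediate):
    # In the architecture, this subagent also performs validation and
    # grounding checks before the response reaches the user.
    return f"final response based on: {intermediate}"

def coordinator(prompt):
    # Route on intent. A model-backed coordinator would infer the intent;
    # this keyword check is a placeholder.
    if "analyze" in prompt:
        # Sequential pattern: task-A, then task-A.1.
        intermediate = task_a1(task_a(prompt))
    else:
        # Iterative refinement pattern: task-B, evaluate, refine, repeat.
        iteration = 0
        intermediate = task_b(prompt)
        while not quality_evaluator(intermediate, iteration) and iteration < MAX_ITERATIONS:
            prompt = prompt_enhancer(prompt)
            intermediate = task_b(prompt)
            iteration += 1
    return response_generator(intermediate)

print(coordinator("analyze quarterly sales"))  # sequential pattern
print(coordinator("draft a summary"))          # iterative refinement pattern
```

The bounded loop mirrors the "maximum number of iterations" safeguard in the flow: without it, an evaluator that never approves the output would cycle indefinitely.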

Products and tools used

This reference architecture uses the following Google Cloud and third-party products and tools:

  • Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
  • Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
  • Google Kubernetes Engine (GKE): A Kubernetes service that you can use to deploy and operate containerized applications at scale using Google's infrastructure.
  • Model Armor: A service that provides protection for your generative and agentic AI resources against prompt injection, sensitive data leaks, and harmful content.
  • Agent Development Kit (ADK): A set of tools and libraries to develop, test, and deploy AI agents.
  • Agent2Agent (A2A) protocol: An open protocol that enables communication and interoperability between agents regardless of their programming language and runtime.
  • Model Context Protocol (MCP): An open-source standard for connecting AI applications to external systems.

Use cases

Multi-agent AI systems are suitable for complex use cases that require collaboration and coordination across multiple specialized skill sets to achieve a business goal. To identify use cases that multi-agent AI systems are suitable for, analyze your business processes and identify specific tasks that AI can augment. Focus on tangible business outcomes, like cost reduction and accelerated processing. This approach helps align your investments in AI with business value.

The following are examples of use cases for multi-agent AI systems.

Financial advisor

Provide personalized stock trading recommendations and execute trades. The following diagram shows an example of an agentic flow for this use case. This example uses a sequential pattern.

Financial advisor use case for a multi-agent system.

The diagram shows the following flow:

  1. A data retriever agent retrieves real-time and historical stock prices, company financial reports, and other relevant data from reliable sources.
  2. A financial analyzer agent applies appropriate analytics and charting techniques to the data, identifies price movement patterns, and makes predictions.
  3. A stock recommender agent uses the analysis and charts to generate personalized recommendations to buy and sell specific stocks based on the user's risk profile and investment goals.
  4. A trade executor agent buys and sells stocks on behalf of the user.

Research assistant

Create a research plan, gather information, evaluate and refine the research, and then compose a report. The following diagram shows an example of an agentic flow for this use case. The main flow in this example uses a sequential pattern. The example also includes an iterative refinement pattern.

Research assistant use case for a multi-agent system.

The diagram shows the following flow:

  1. A planner agent creates a detailed research plan.
  2. A researcher agent completes the following tasks:

    1. Uses the research plan to identify appropriate internal and external data sources.
    2. Gathers and analyzes the required data.
    3. Prepares a research summary and provides the summary to an evaluator agent.

    The researcher agent repeats these tasks until the evaluator agent approves the research.

  3. A report composer agent creates the final research report.

Supply chain optimizer

Optimize inventory, track shipments, and communicate with supply chain partners. The following diagram shows an example of an agentic flow for this use case. This example uses a sequential pattern.

Supply chain optimizer use case for a multi-agent system.

  1. A warehouse manager agent ensures optimal stock levels by creating restock orders based on inventory, demand forecasts, and supplier lead times.

    • The agent interacts with the shipment tracker agent to track deliveries.
    • The agent interacts with the supplier communicator agent to notify suppliers about changes in orders.
  2. A shipment tracker agent ensures timely and efficient fulfillment of orders by integrating with suppliers' logistics platforms and carrier systems.

  3. A supplier communicator agent communicates with external suppliers on behalf of the other agents in the system.

Design considerations

This section describes design factors, best practices, and recommendations to consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost, and performance.

The guidance in this section isn't exhaustive. Depending on your workload's requirements and the Google Cloud and third-party products and features that you use, there might be additional design factors and trade-offs that you should consider.

System design

This section provides guidance to help you choose Google Cloud regions for your deployment and to select appropriate Google Cloud products and tools.

Region selection

When you select Google Cloud regions for your AI applications, consider factors like latency for your users, cost, carbon footprint, and your data residency requirements.

To select appropriate Google Cloud locations for your applications, use the following tools:

  • Google Cloud Region Picker: An interactive web-based tool to select the optimal Google Cloud region for your applications and data based on factors like carbon footprint, cost, and latency.
  • Cloud Location Finder API: A public API that provides a programmatic way to find deployment locations in Google Cloud, Google Distributed Cloud, and other cloud providers.

Agent design

This section provides general recommendations for designing AI agents. Detailed guidance about writing agent code and logic is outside the scope of this document.

Design focus Recommendations
Agent definition and design
  • Clearly define the business goal of the agentic AI system and the task that each agent performs.
  • Use an agent pattern that best meets your requirements.
  • Use the ADK to efficiently create, deploy, and manage your agentic architecture.
Agent interactions
  • Design the human-facing agents in the architecture to support natural language interactions.
  • Ensure that each agent clearly communicates its actions and status to its dependent clients.
  • Design the agents to detect and handle ambiguous queries and nuanced interactions.
Context, tools, and data
  • Ensure that the agents have sufficient context to track multi-turn interactions and session parameters.
  • Clearly describe the purpose, arguments, and usage of the tools that the agents can use.
  • Ensure that the agents' responses are grounded in reliable data sources to reduce hallucinations.
  • Implement logic to handle no-match situations, such as when a prompt is off-topic.
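One way to apply the tool-description recommendation is to write each tool with a descriptive signature and docstring, because agent frameworks such as the ADK can derive a tool's schema from them. The function below is a hypothetical example with a placeholder implementation.

```python
def get_stock_price(ticker: str, currency: str = "USD") -> dict:
    """Returns the latest trading price for a publicly traded stock.

    Use this tool when the user asks for the current price of a stock.

    Args:
        ticker: The stock ticker symbol, for example "GOOG".
        currency: The ISO 4217 currency code for the returned price.

    Returns:
        A dict with "ticker", "price", and "currency" keys.
    """
    # Placeholder implementation; a real tool would call a market-data API.
    return {"ticker": ticker, "price": 123.45, "currency": currency}
```

A clear purpose statement, typed arguments, and a documented return shape help the model decide when to invoke the tool and how to fill in its arguments.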

Security

This section describes design considerations and recommendations to design a topology in Google Cloud that meets your workload's security requirements.

Component Design considerations and recommendations
Agents

AI agents introduce certain unique and critical security risks that conventional, deterministic security practices might not be able to mitigate adequately. Google recommends an approach that combines the strengths of deterministic security controls with dynamic, reasoning-based defenses. This approach is grounded in three core principles: human oversight, carefully defined agent autonomy, and observability. The following are specific recommendations that are aligned with these core principles.

Human oversight: An agentic AI system might sometimes fail or not perform as expected. For example, the model might generate inaccurate content or an agent might select inappropriate tools. In business-critical agentic AI systems, incorporate a human-in-the-loop flow to let human supervisors monitor, override, and pause agents in real time. For example, human users can review the output of agents, approve or reject the outputs, and provide further guidance to correct errors or to make strategic decisions. This approach combines the efficiency of agentic AI systems with the critical thinking and domain expertise of human users.

Access control for agents: Configure agent permissions by using Identity and Access Management (IAM) controls. Grant each agent only the permissions that it needs to perform its tasks and to communicate with tools and with other agents. This approach helps to minimize the potential impact of a security breach, because a compromised agent would have limited access to other parts of the system. For more information, see Set up the identity and permissions for your agent and Managing access for deployed agents.

Monitoring: Monitor agent behavior by using comprehensive tracing capabilities that give you visibility into every action that an agent takes, including its reasoning process, tool selection, and execution paths. For more information, see Logging an agent in Vertex AI Agent Engine and Logging in the ADK.

For more information about securing AI agents, see Safety and Security for AI Agents.

Vertex AI

Shared responsibility: Security is a shared responsibility. Vertex AI secures the underlying infrastructure and provides tools and security controls to help you protect your data, code, and models. You are responsible for properly configuring your services, managing access controls, and securing your applications. For more information, see Vertex AI shared responsibility.

Security controls: Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, customer-managed encryption keys (CMEK), network security using VPC Service Controls, and Access Transparency.

Safety: AI models might produce harmful responses, sometimes in response to malicious prompts.

  • To enhance safety and mitigate potential misuse of the agentic AI system, you can configure content filters to act as barriers to harmful inputs and responses. For more information, see Safety and content filters.
  • To inspect and sanitize inference requests and responses for threats like prompt injection and harmful content, you can use Model Armor. Model Armor helps you prevent malicious input, verify content safety, protect sensitive data, maintain compliance, and enforce safety and security policies consistently.

Model access: You can set up organization policies to limit the type and versions of AI models that can be used in a Google Cloud project. For more information, see Control access to Model Garden models.

Data protection: To discover and de-identify sensitive data in the prompts and responses and in log data, use the Cloud Data Loss Prevention API. For more information, see this video: Protecting sensitive data in AI apps.

MCP See MCP and Security.
A2A

Transport security: The A2A protocol mandates HTTPS for all A2A communication in production environments and it recommends Transport Layer Security (TLS) versions 1.2 or higher.

Authentication: The A2A protocol delegates authentication to standard web mechanisms like HTTP headers and to standards like OAuth2 and OpenID Connect. Each agent advertises the authentication requirements in its Agent Card. For more information, see A2A authentication.
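The following sketch shows an Agent Card that advertises an authentication requirement. The field names follow the A2A Agent Card format, but the exact schema depends on the protocol version, and the agent details shown here are hypothetical.

```python
import json

# Illustrative A2A Agent Card. Field names follow the A2A Agent Card format;
# the exact schema depends on the protocol version, and the agent name, URL,
# and skill are hypothetical.
agent_card = {
    "name": "quality-evaluator",
    "description": "Reviews subagent output and requests prompt refinement when needed.",
    "url": "https://agents.example.com/quality-evaluator",  # HTTPS is mandatory in production
    "version": "1.0.0",
    "capabilities": {"streaming": False},
    "security": [{"oauth2": []}],  # advertises the required authentication scheme
    "skills": [
        {
            "id": "evaluate_output",
            "name": "Evaluate output",
            "description": "Scores a subagent's output against quality criteria.",
        }
    ],
}

print(json.dumps(agent_card, indent=2))
```

A client agent reads the card, authenticates by using the advertised scheme (here, OAuth2), and only then exchanges tasks with the remote agent.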

Cloud Run

Ingress security (for the frontend service): To control access to the application, disable the default run.app URL of the frontend Cloud Run service and set up a regional external Application Load Balancer. In addition to load-balancing incoming traffic to the application, the load balancer handles SSL certificate management. For added protection, you can use Google Cloud Armor security policies to provide request filtering, DDoS protection, and rate limiting for the service.

User authentication: To authenticate user access to the frontend Cloud Run service, use Identity-Aware Proxy (IAP). When a user tries to access an IAP-secured resource, IAP performs authentication and authorization checks. For more information, see Enabling IAP for Cloud Run.

Container image security: To ensure that only authorized container images are deployed to Cloud Run, you can use Binary Authorization. To identify and mitigate security risks in the container images, use Artifact Analysis to automatically run vulnerability scans. For more information, see Container scanning overview.

Data residency: Cloud Run helps you meet data residency requirements. Your Cloud Run services run within the selected region.

For more guidance about container security, see General Cloud Run development tips.

All of the products in the architecture

Data encryption: By default, Google Cloud encrypts data at rest by using Google-owned and Google-managed encryption keys. To protect your agents' data by using encryption keys that you control, you can use CMEKs that you create and manage in Cloud KMS. For information about Google Cloud services that are compatible with Cloud KMS, see Compatible services.

Mitigate data exfiltration risk: To reduce the risk of data exfiltration, create a VPC Service Controls perimeter around the infrastructure. VPC Service Controls supports all of the Google Cloud services that this reference architecture uses.

Access control: When you configure permissions for the resources in your topology, follow the principle of least privilege.

Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize security by using Active Assist Recommendation Hub. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

Cloud environment security: Use the tools in Security Command Center to detect vulnerabilities, identify and mitigate threats, define and deploy a security posture, and export data for further analysis.


Reliability

This section describes design considerations and recommendations to build and operate reliable infrastructure for your deployment in Google Cloud.

Component Design considerations and recommendations
Agents

Fault tolerance: Design the agentic system to tolerate or handle agent-level failures. Where feasible, use a decentralized approach where agents can operate independently.

Simulate failures: Before deploying the agentic AI system to production, validate it by simulating a production environment. Identify and fix inter-agent coordination issues and unexpected behaviors.

Error handling: To enable diagnosis and troubleshooting of errors, implement logging, exception handling, and retry mechanisms.
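A common way to combine the logging and retry recommendations is a wrapper that retries transient failures with exponential backoff and jitter. This is a minimal stdlib-only sketch; the function name, attempt count, and delays are illustrative assumptions, and a production version would catch only the error types that are known to be transient.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def call_with_retry(fn, max_attempts=4, base_delay=0.5):
    """Calls fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in practice, catch only transient error types
            log.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # surface the error after the final attempt
            # Exponential backoff (0.5s, 1s, 2s, ...) plus jitter to avoid
            # synchronized retries across agents.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

Wrapping model calls and inter-agent requests in such a helper keeps a single flaky dependency from failing an entire agentic flow, and the logged attempts make the failure diagnosable afterward.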

Vertex AI

Quota management: Vertex AI supports dynamic shared quota (DSQ) for Gemini models. DSQ helps to flexibly manage pay-as-you-go requests, and it eliminates the need to manage quota manually or to request quota increases. DSQ dynamically allocates the available resources for a given model and region across active customers. With DSQ, there are no predefined quota limits on individual customers.

Capacity planning: If the number of requests to the model exceeds the allocated capacity, then error code 429 is returned. For workloads that are business critical and that require consistently high throughput, you can reserve throughput by using Provisioned Throughput.

Model endpoint availability: If data can be shared across multiple regions or countries, you can use a global endpoint for the model.

Cloud Run Robustness to infrastructure outages: Cloud Run is a regional service. It stores data synchronously across multiple zones within a region and it automatically load-balances traffic across the zones. If a zone outage occurs, Cloud Run continues to run and data isn't lost. If a region outage occurs, the service stops running until Google resolves the outage.
All of the products in the architecture Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize reliability by using Active Assist Recommendation Hub. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.

Operations

This section describes the factors to consider when you use this reference architecture to design a Google Cloud topology that you can operate efficiently.

Component Design considerations and recommendations
Vertex AI

Monitoring using logs: By default, agent logs that are written to the stdout and stderr streams are routed to Cloud Logging. For advanced logging, you can integrate the Python logger with Cloud Logging. If you need full control over logging and structured logs, use the Cloud Logging client. For more information, see Logging an agent and Logging in the ADK.

Continuous evaluation: Regularly perform a qualitative evaluation of the output of the agents and the trajectory or steps taken by the agents to produce the output. To implement agent evaluation, you can use the Gen AI evaluation service or the evaluation methods that the ADK supports.

MCP

Database tools: To efficiently manage database tools for your AI agents and to ensure that the agents securely handle complexities like connection pooling and authentication, use the MCP Toolbox for Databases. It provides a centralized location to store and update database tools. You can share the tools across agents and update the tools without redeploying agents. The toolbox includes a wide range of tools for Google Cloud databases like AlloyDB for PostgreSQL and for third-party databases like MongoDB.

Generative AI models: To enable AI agents to use Google generative AI models like Imagen and Veo, you can use MCP Servers for Google Cloud generative media APIs.

Google security products and tools: To enable your AI agents to access Google security products and tools like Google Security Operations, Google Threat Intelligence, and Security Command Center, use MCP servers for Google security products.

All of the Google Cloud products in the architecture Tracing: Continuously gather and analyze trace data by using Cloud Trace. Trace data lets you rapidly identify and diagnose errors within complex agent workflows. You can perform in-depth analysis through visualizations in the Trace Explorer tool. For more information, see Trace an agent.

For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Well-Architected Framework.

Cost optimization

This section provides guidance to optimize the cost of setting up and operating a Google Cloud topology that you build by using this reference architecture.

Component Design considerations and recommendations
Vertex AI

Cost analysis and management: To analyze and manage Vertex AI costs, we recommend that you create baseline metrics for queries per second (QPS) and tokens per second (TPS). Then, monitor these metrics after deployment. The baseline also helps with capacity planning. For example, the baseline helps you determine when Provisioned Throughput might be necessary.

Model selection: The model that you select for your AI application directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

Cost-effective prompting: The length of your prompts (input) and the generated responses (output) directly affect performance and cost. Write prompts that are short, direct, and provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.
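As a minimal sketch of this recommendation, a prompt template can build the length constraint into every request; the template wording and function name below are illustrative assumptions.

```python
def build_summary_prompt(document: str, max_sentences: int = 2) -> str:
    """Builds a concise, cost-aware summarization prompt."""
    return (
        f"Summarize the following text in at most {max_sentences} sentences. "
        "Be direct and omit preamble.\n\n"
        f"Text:\n{document}"
    )

prompt = build_summary_prompt("Quarterly revenue grew 12% year over year...")
print(prompt)
```

Centralizing prompts in templates like this also makes it easy to measure and tune their token cost in one place.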

Context caching: To reduce the cost of requests that contain repeated content with high input token counts, use context caching.

Batch requests: When relevant, consider batch prediction. Batched requests incur a lower cost than standard requests.

Cloud Run

Resource allocation: When you create a Cloud Run service, you can specify the amount of memory and CPU to be allocated. Start with the default CPU and memory allocations. Observe the resource usage and cost over time, and adjust the allocation as necessary.

Rate optimization: If you can predict the CPU and memory requirements, you can save money with committed use discounts (CUDs).

All of the products in the architecture Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize cost by using Active Assist Recommendation Hub. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.

Performance optimization

This section describes design considerations and recommendations to design a topology in Google Cloud that meets the performance requirements of your workloads.

Component Design considerations and recommendations
Agents

Model selection: When you select models for your agentic AI system, consider the capabilities that are required for the tasks that the agents need to perform.

Prompt optimization: To rapidly improve and optimize prompt performance at scale and to eliminate the need for manual rewriting, use the Vertex AI prompt optimizer. The optimizer helps you efficiently adapt prompts across different models.

Vertex AI

Model selection: The model that you select for your AI application directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

Prompt engineering: The length of your prompts (input) and the generated responses (output) directly affect performance and cost. Write prompts that are short, direct, and provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.

Context caching: To reduce latency for requests that contain repeated content with high input token counts, use context caching.

Cloud Run

Resource allocation: Depending on your performance requirements, configure the memory and CPU to be allocated to the Cloud Run service.

For more performance optimization guidance, see General Cloud Run development tips.

All of the products in the architecture Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize performance by using Active Assist Recommendation Hub. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Recommendation Hub.

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.

Deployment

To learn how to build and deploy multi-agent AI systems, use the following code samples. These code samples are fully functional starting points for learning and experimentation. For optimal operation in production environments, you must customize the code based on your specific business and technical requirements.

  • Financial advisor: Analyze stock market data, create trading strategies, define execution plans, and evaluate risks.
  • Research assistant: Plan and conduct research, evaluate the findings, and compose a research report.
  • Insurance agent: Create memberships, provide roadside assistance, and handle insurance claims.
  • Search optimizer: Find search keywords, analyze web pages, and provide suggestions to optimize search.
  • Data analyzer: Retrieve data, perform complex manipulations, generate visualizations, and run ML tasks.
  • Web-marketing agent: Choose a domain name, design a website, create campaigns, and produce content.
  • Airbnb planner (with A2A and MCP): For a given location and time, find Airbnb listings and get weather information.

For code samples to get started with using the ADK together with MCP servers, see MCP Tools.

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer
