This guide provides an overview of using Cloud Run to host apps, run inference, and build AI workflows.
Cloud Run for hosting AI applications, agents, and scalable API endpoints
Cloud Run provides a fully managed platform that scales your AI apps and workloads.
When you host AI apps on Cloud Run, you typically have the following architectural components:
- Serving and orchestration: You deploy your application code or container to Cloud Run.
- AI models: You use Google's AI models, open-source models, or custom models with your app.
- Integrations: You can connect to Google Cloud services or third-party services for memory, databases, storage, security, and more.
- Tools: You can connect your app or agent to tools that perform additional tasks and operations, such as calling external services or APIs.
The following diagram shows a high-level overview of using Cloud Run as a hosting platform for AI apps:
As shown in the diagram:
Within the serving and orchestration layer, a Cloud Run service acts as a scalable API endpoint for your application's core logic. It handles many concurrent users by rapidly and automatically scaling instances up and down on demand.
You bring your container to deploy to Cloud Run. You can either package your application and its dependencies into a container, or provide your source code and let Cloud Run automatically build it into a container for deployment. For source code deployments, you can use any language, open framework, or SDK to build your AI apps.
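For example, a source-based deployment can be as simple as running `gcloud run deploy` from your project directory. The service name and region below are placeholders; check `gcloud run deploy --help` for the options available in your gcloud version.

```bash
# Deploy from source: Cloud Run builds the container for you.
# "my-ai-app" and the region are placeholder values.
gcloud run deploy my-ai-app \
    --source . \
    --region us-central1 \
    --allow-unauthenticated
```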
Your AI app acts as a scalable API endpoint that handles incoming requests, sends data to a pre-trained AI model for processing, and returns the results.
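As a minimal sketch of that serving layer, assuming a Python app using Flask: Cloud Run sends requests to the port named in the `PORT` environment variable, and the handler forwards the prompt to a model client. The `call_model` helper is a hypothetical placeholder, defined in the next sketch.

```python
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    # Read the user's prompt from the request body.
    data = request.get_json(silent=True) or {}
    prompt = data.get("prompt", "")
    # call_model is a placeholder for your model client
    # (see the Vertex AI sketch below).
    answer = call_model(prompt)
    return jsonify({"answer": answer})

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via the PORT env variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```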
Cloud Run is integrated with Google's models, such as Gemini models served through Vertex AI, and can also integrate with open-source models like Llama and Gemma. If you have a custom model that you've trained yourself, you can use that model with your Cloud Run resource as well.
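As a sketch, a call to a Gemini model through the google-genai SDK might look like the following from your service. The project, location, and model name are placeholder assumptions to adapt, and the service identity is assumed to have Vertex AI permissions.

```python
from google import genai

# Placeholder project, location, and model name; adapt to your environment.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

def call_model(prompt: str) -> str:
    # Send the prompt to a Gemini model on Vertex AI and return the text reply.
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    return response.text
```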
Google Cloud offers a wide variety of solutions to support your AI application's infrastructure. Some Google Cloud integrations that work well with your AI app include:
- Memory and databases
  - Short-term: Memorystore is an in-memory caching service that provides a fast, external cache for short-term, high-access data.
  - Long-term:
    - AlloyDB for PostgreSQL is a PostgreSQL-compatible database designed for demanding transactional and analytical workloads. It offers built-in vector embedding generation and a high-speed vector index, making it faster for semantic search than the standard pgvector implementation.
    - Cloud SQL is a relational database service for MySQL, PostgreSQL, and SQL Server that can also serve as a vector store with the pgvector extension for PostgreSQL.
    - Firestore is a scalable NoSQL document database service that includes built-in vector search capabilities.
- Storage
  - Cloud Storage is an object storage solution for holding large datasets for model training, input and output files for your application, or model artifacts.
- Security
  - Secret Manager is a secrets and credential management service that provides a secure, centralized way to store sensitive data like API keys, passwords, and credentials, which AI applications often need to interact with external services. A sketch of reading a secret at runtime appears after this list.
To learn more, see Connect to Google Cloud services.
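For instance, here is a minimal sketch of reading an API key from Secret Manager at startup, assuming the google-cloud-secret-manager client library and placeholder project and secret names:

```python
from google.cloud import secretmanager

# Placeholder project and secret names; adapt to your environment.
PROJECT_ID = "my-project"
SECRET_ID = "third-party-api-key"

client = secretmanager.SecretManagerServiceClient()
name = f"projects/{PROJECT_ID}/secrets/{SECRET_ID}/versions/latest"

# Fetch the latest version of the secret and decode its payload.
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("utf-8")
```

Cloud Run can also expose secrets directly to your service as environment variables or mounted volumes, which avoids making this API call in application code.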
Tools let your AI apps and models interact with services, APIs, or websites that run externally or on Cloud Run.
For example, if your AI app is an AI agent, your agent might send a request to an MCP server to execute an external tool, or use tools running in your container, such as code execution, computer use, or information retrieval.
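As an illustration, the google-genai SDK can pass plain Python functions to a Gemini model as tools; the order-lookup function here is a hypothetical example, and the SDK's automatic function calling invokes it when the model requests it.

```python
from google import genai
from google.genai import types

# Placeholder project and location; adapt to your environment.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

def get_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order in your own system."""
    return f"Order {order_id} has shipped."

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What's the status of order 1234?",
    # The SDK generates a tool declaration from the function signature
    # and calls the function automatically when the model requests it.
    config=types.GenerateContentConfig(tools=[get_order_status]),
)
print(response.text)
```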
Host models on Cloud Run for AI inference
In addition to building applications and agents that use a large language model (LLM), you can also enable GPUs with Cloud Run to run pre-trained or custom self-deployed models for AI inference.
Cloud Run GPUs make it possible to handle the large number of operations needed for computationally demanding AI inference workloads. You can deploy AI models as container images or from source code, and choose from a variety of deployment methods for your Cloud Run resources.
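For example, attaching a GPU at deploy time might look like the following. The service name, image, and resource settings are placeholder assumptions, and the GPU flags, supported GPU types, and required CPU/memory minimums depend on your gcloud version and region; check `gcloud run deploy --help` before relying on them.

```bash
# Deploy a containerized inference server with one NVIDIA L4 GPU attached.
# Service name, image path, and resource settings are placeholder assumptions.
gcloud run deploy my-inference-service \
    --image us-docker.pkg.dev/my-project/my-repo/inference:latest \
    --region us-central1 \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --cpu 4 \
    --memory 16Gi \
    --no-cpu-throttling
```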