To deploy a model for online prediction, you need an endpoint. Endpoints can be divided into the following types:
- Public endpoints can be accessed over the public internet. They are easier to use because no private network infrastructure is required. There are two types of public endpoints: dedicated and shared. A dedicated public endpoint is faster than a shared public endpoint and provides production isolation, support for larger payload sizes, and longer request timeouts. In addition, prediction requests sent to a dedicated public endpoint are isolated from other users' traffic. For these reasons, dedicated public endpoints are recommended as a best practice. For an example of creating each endpoint type, see the sketch after this list.
- Private Service Connect endpoints provide a secure connection for private communication between on-premises networks and Google Cloud. You can use them to control Google API traffic by using Private Service Connect APIs. They are recommended as a best practice.
- Private endpoints also provide a secure connection to your model and can be used for private communication between on-premises networks and Google Cloud. They use private services access over a VPC Network Peering connection.
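As a rough illustration, the following sketch creates one endpoint of each type with the Vertex AI SDK for Python. The project, region, network, and display names are placeholders, and the `dedicated_endpoint_enabled` and `private_service_connect_config` parameters assume a recent version of the `google-cloud-aiplatform` SDK; check your installed version before relying on them.

```python
from google.cloud import aiplatform

# Placeholders -- replace with your own project, region, and VPC network.
PROJECT_ID = "my-project"
REGION = "us-central1"
VPC_NETWORK = f"projects/{PROJECT_ID}/global/networks/my-vpc"

aiplatform.init(project=PROJECT_ID, location=REGION)

# Dedicated public endpoint: isolated traffic, larger payloads, longer timeouts.
dedicated_endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-endpoint",
    dedicated_endpoint_enabled=True,  # assumes a recent SDK version
)

# Private Service Connect endpoint: only allowlisted projects can connect.
psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-psc-endpoint",
    private_service_connect_config=aiplatform.PrivateEndpoint.PrivateServiceConnectConfig(
        project_allowlist=[PROJECT_ID],
    ),
)

# Private endpoint: private services access over VPC Network Peering.
private_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-private-endpoint",
    network=VPC_NETWORK,
)
```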
For more information, see Deploy a model to an endpoint.
The following table compares the supported endpoint types for serving Vertex AI online predictions.
| | Dedicated public endpoint (recommended) | Shared public endpoint | Private Service Connect endpoint (recommended) | Private endpoint |
|---|---|---|---|---|
| Purpose | Default networking experience. Enables submitting requests from the public internet (if VPC Service Controls isn't enabled). | Default networking experience. Enables submitting requests from the public internet (if VPC Service Controls isn't enabled). | Recommended for production enterprise applications. Improves network latency and security by ensuring that requests and responses are routed privately. | Recommended for production enterprise applications. Improves network latency and security by ensuring that requests and responses are routed privately. |
| Inbound networking | Public internet, using a dedicated networking plane | Public internet, using a shared networking plane | Private networking, using a Private Service Connect endpoint | Private networking, using private services access (VPC Network Peering) |
| Outbound networking | Public internet | Public internet | Not supported | Private networking, using private services access (VPC Network Peering) |
| VPC Service Controls | Not supported. Use a Private Service Connect endpoint instead. | Supported | Supported | Supported |
| Cost | Vertex AI Prediction | Vertex AI Prediction | Vertex AI Prediction + Private Service Connect endpoint | Vertex AI Prediction + private services access (see "Using a Private Service Connect endpoint (forwarding rule) to access a published service") |
| Network latency | Optimized | Unoptimized | Optimized | Optimized |
| Encryption in transit | TLS with a CA-signed certificate | TLS with a CA-signed certificate | Optional TLS with a self-signed certificate | None |
| Inference timeout | Configurable, up to 1 hour | 60 seconds | Configurable, up to 1 hour | 60 seconds |
| Payload size limit | 10 MB | 1.5 MB | 10 MB | 10 MB |
| QPM quota | Unlimited | 30,000 | Unlimited | Unlimited |
| Protocol support | HTTP or gRPC | HTTP | HTTP or gRPC | HTTP |
| Streaming support | Yes (SSE) | No | Yes (SSE) | No |
| Traffic split | Yes | Yes | Yes | No |
| Request and response logging | Yes | Yes | Yes | No |
| Access logging | Yes | Yes | Yes | No |
| Tuned Gemini model deployment | No | Yes | No | No |
| AutoML models and explainability | No | Yes | No | No |
| Client libraries supported | Vertex AI SDK for Python | Vertex AI client libraries, Vertex AI SDK for Python | Vertex AI SDK for Python | Vertex AI SDK for Python |
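To make the table concrete, here is a sketch of an online prediction request against a dedicated public endpoint. The endpoint ID and instance payload are hypothetical, and the `use_dedicated_endpoint` and `timeout` arguments assume a recent `google-cloud-aiplatform` SDK version.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Hypothetical ID of a dedicated public endpoint with a model already deployed.
endpoint = aiplatform.Endpoint("1234567890")

# Dedicated public endpoints allow timeouts beyond the shared 60-second limit,
# configurable up to 1 hour; 600 seconds is used here as an example.
response = endpoint.predict(
    instances=[{"feature_a": 1.0, "feature_b": "value"}],  # placeholder payload
    use_dedicated_endpoint=True,  # assumes a recent SDK version
    timeout=600,
)
print(response.predictions)
```

With a shared public endpoint you would omit `use_dedicated_endpoint`, and per the table above, requests would be capped at 60 seconds and 1.5 MB.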
What's next
- Learn more about deploying a model to an endpoint.