A dedicated public endpoint is a type of public endpoint for online prediction. It offers the following benefits:
- Dedicated networking: When you send a prediction request to a dedicated public endpoint, it is isolated from other users' traffic.
- Optimized network latency: The dedicated network path reduces latency compared to shared public endpoints.
- Larger payload support: Up to 10 MB.
- Longer request timeouts: The inference timeout is configurable up to 1 hour.
- Generative AI-ready: Streaming and gRPC are supported.
For these reasons, dedicated public endpoints are the recommended best practice for serving Vertex AI online predictions.
To learn more, see Choose an endpoint type.
Create a dedicated public endpoint and deploy a model to it
You can create a dedicated endpoint and deploy a model to it by using the Google Cloud console. For details, see Deploy a model by using the Google Cloud console.
You can also create a dedicated public endpoint and deploy a model to it by using the Vertex AI API as follows:
- Create a dedicated public endpoint. You can configure the inference timeout and request-response logging settings when you create the endpoint.
- Deploy the model by using the Vertex AI API. A sketch of both steps with the Vertex AI SDK for Python follows this list.
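The following is a minimal Python sketch of both steps using the Vertex AI SDK for Python. The project, location, model ID, machine type, and timeout values are placeholder assumptions, and the `dedicated_endpoint_enabled` and `inference_timeout` arguments reflect the SDK at the time of writing; check the current SDK reference before relying on them.

```python
from google.cloud import aiplatform

# Placeholder values; replace with your own project and resources.
PROJECT_ID = "my-project"    # assumption: your project ID
LOCATION = "us-central1"     # assumption: your region
MODEL_ID = "1234567890"      # assumption: ID of a Model already uploaded to Vertex AI

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Step 1: Create the dedicated public endpoint. The inference timeout
# (and, optionally, request-response logging) is configured at creation time.
endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-endpoint",
    dedicated_endpoint_enabled=True,
    inference_timeout=1800,  # seconds; configurable up to 1 hour (3600)
)

# Step 2: Deploy the model to the endpoint.
model = aiplatform.Model(model_name=MODEL_ID)
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",  # assumption: a type your model supports
    min_replica_count=1,
)
```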
Get online predictions from a dedicated public endpoint
Dedicated public endpoints support both the HTTP and gRPC protocols. For gRPC requests, you must include the x-vertex-ai-endpoint-id header so that the request is routed to the correct endpoint (see the sketch after this list). The following APIs are supported:
- Predict
- RawPredict
- StreamRawPredict
- Chat Completion (Model Garden only)
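As a minimal sketch of the gRPC path, the following calls Predict through the `aiplatform_v1` `PredictionServiceClient` (which uses gRPC by default) and passes `x-vertex-ai-endpoint-id` as call metadata. The dedicated DNS name, project number, endpoint ID, and instance payload are placeholder assumptions.

```python
from google.cloud import aiplatform_v1
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# Placeholder values; replace with your endpoint's dedicated DNS name and IDs.
API_ENDPOINT = "1234567890.us-central1-111111111111.prediction.vertexai.goog"
ENDPOINT_ID = "1234567890"
ENDPOINT = f"projects/111111111111/locations/us-central1/endpoints/{ENDPOINT_ID}"

# Point the gRPC client at the dedicated endpoint's DNS name.
client = aiplatform_v1.PredictionServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

# Instances are protobuf Values; build one from a plain dict.
instance = json_format.ParseDict({"feature": 1.0}, Value())  # assumption: your model's schema

# The x-vertex-ai-endpoint-id header identifies the target endpoint.
response = client.predict(
    endpoint=ENDPOINT,
    instances=[instance],
    metadata=[("x-vertex-ai-endpoint-id", ENDPOINT_ID)],
)
print(response.predictions)
```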
You can send online prediction requests to a dedicated public endpoint by using the Vertex AI SDK for Python. For details, see Send an online prediction request to a dedicated public endpoint.
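For example, here is a minimal sketch with the SDK, assuming an existing dedicated endpoint; the endpoint ID and instance payload are placeholders, and the `use_dedicated_endpoint` argument reflects the SDK's `predict` signature at the time of writing:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Look up the dedicated endpoint by its resource ID (placeholder value).
endpoint = aiplatform.Endpoint("1234567890")

# Send a synchronous online prediction request over the dedicated network path.
response = endpoint.predict(
    instances=[{"feature": 1.0}],   # assumption: your model's input schema
    use_dedicated_endpoint=True,    # route via the endpoint's dedicated DNS name
)
print(response.predictions)
```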
Tutorial
Limitations
- Deployment of tuned Gemini models isn't supported.
- VPC Service Controls isn't supported. Use a Private Service Connect endpoint instead.
What's next
- Learn about Vertex AI online prediction endpoint types.