Use dedicated public endpoints for online prediction

A dedicated public endpoint is a type of public endpoint for online prediction that offers the following benefits:

  • Dedicated networking: When you send a prediction request to a dedicated public endpoint, your traffic is isolated from other users' traffic.
  • Optimized network latency
  • Larger payload support: Up to 10 MB.
  • Longer request timeouts: Configurable up to 1 hour.
  • Generative AI-ready: Streaming and gRPC are supported.

For these reasons, dedicated public endpoints are recommended as a best practice for serving Vertex AI online predictions.

To learn more, see Choose an endpoint type.

Create a dedicated public endpoint and deploy a model to it

You can create a dedicated endpoint and deploy a model to it by using the Google Cloud console. For details, see Deploy a model by using the Google Cloud console.

You can also create a dedicated public endpoint and deploy a model to it by using the Vertex AI API as follows:

  1. Create a dedicated public endpoint. You can configure the inference timeout and request-response logging settings when you create the endpoint.
  2. Deploy the model by using the Vertex AI API, as shown in the sketch after this list.
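
As an illustration, here is a minimal sketch of both steps using the Vertex AI SDK for Python, which wraps the Vertex AI API. It assumes a recent google-cloud-aiplatform release that supports the dedicated_endpoint_enabled and inference_timeout arguments; the project, region, model ID, and machine type are placeholder values.

```python
from google.cloud import aiplatform

# Placeholder project and region; replace with your own values.
aiplatform.init(project="my-project", location="us-central1")

# Step 1: Create a dedicated public endpoint. The inference timeout and
# request-response logging settings are configured here, at creation time.
endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-endpoint",
    dedicated_endpoint_enabled=True,
    inference_timeout=3600,  # seconds; configurable up to 1 hour
)

# Step 2: Deploy a previously uploaded model to the endpoint.
model = aiplatform.Model("MODEL_ID")  # placeholder model resource ID
endpoint.deploy(
    model=model,
    machine_type="n1-standard-4",  # placeholder machine type
    min_replica_count=1,
)
```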

Get online predictions from a dedicated public endpoint

Dedicated endpoints support both the HTTP and gRPC protocols. gRPC requests must include the x-vertex-ai-endpoint-id header so that the endpoint can be identified (see the gRPC sketch after the following list). The following APIs are supported:

  • Predict
  • RawPredict
  • StreamRawPredict
  • Chat Completion (Model Garden only)
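
The following is a hedged sketch of a gRPC Predict request using the lower-level aiplatform_v1 client. It assumes the dedicated endpoint is reached through its own DNS name in the ENDPOINT_ID.REGION-PROJECT_NUMBER.prediction.vertexai.goog form shown on the endpoint's details page; the identifiers and the instance payload are placeholders.

```python
from google.cloud import aiplatform_v1
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# Placeholder identifiers; replace with your own values.
PROJECT_NUMBER = "123456789012"
REGION = "us-central1"
ENDPOINT_ID = "1234567890123456789"

# Dedicated endpoints are reached through their own DNS name rather than
# the shared regional API endpoint. gRPC is this client's default transport.
client = aiplatform_v1.PredictionServiceClient(
    client_options={
        "api_endpoint": f"{ENDPOINT_ID}.{REGION}-{PROJECT_NUMBER}.prediction.vertexai.goog"
    }
)

# Placeholder instance payload; the schema depends on your model.
instance = json_format.ParseDict({"feature": 1.0}, Value())

response = client.predict(
    endpoint=f"projects/{PROJECT_NUMBER}/locations/{REGION}/endpoints/{ENDPOINT_ID}",
    instances=[instance],
    # Required header for gRPC requests to a dedicated endpoint.
    metadata=[("x-vertex-ai-endpoint-id", ENDPOINT_ID)],
)
print(response.predictions)
```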

You can send online prediction requests to a dedicated public endpoint by using the Vertex AI SDK for Python. For details, see Send an online prediction request to a dedicated public endpoint.
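
As a minimal sketch, assuming a google-cloud-aiplatform version that supports the use_dedicated_endpoint argument (the endpoint ID and instance payload are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Reference the dedicated endpoint by its resource ID (placeholder value).
endpoint = aiplatform.Endpoint("1234567890123456789")

# use_dedicated_endpoint routes the request through the endpoint's
# dedicated DNS name instead of the shared regional endpoint.
response = endpoint.predict(
    instances=[{"feature": 1.0}],  # placeholder; schema depends on your model
    use_dedicated_endpoint=True,
)
print(response.predictions)
```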

Limitations

  • Deployment of tuned Gemini models isn't supported.
  • VPC Service Controls isn't supported. Use a Private Service Connect endpoint instead.
