Use dedicated public endpoints for online prediction

A dedicated public endpoint is a type of public endpoint for online prediction that offers the following benefits:

  • Dedicated networking: When you send a prediction request to a dedicated public endpoint, your traffic is isolated from other users' traffic.
  • Optimized network latency
  • Larger payload support: Up to 10 MB.
  • Longer request timeouts: Configurable up to 1 hour.
  • Generative AI-ready: Streaming and gRPC are supported.

For these reasons, dedicated public endpoints are recommended as a best practice for serving Vertex AI online predictions.

To learn more, see Choose an endpoint type.

Create a dedicated public endpoint and deploy a model to it

You can create a dedicated endpoint and deploy a model to it by using the Google Cloud console. For details, see Deploy a model by using the Google Cloud console.

You can also create a dedicated public endpoint and deploy a model to it by using the Vertex AI API as follows:

  1. Create a dedicated public endpoint. You can configure the inference timeout and request-response logging settings when you create the endpoint.
  2. Deploy the model by using the Vertex AI API, as shown in the sketch after this list.
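
As an illustration, here is a minimal sketch of both steps using the Vertex AI SDK for Python, which wraps the Vertex AI API. It assumes a recent google-cloud-aiplatform release that supports the dedicated_endpoint_enabled and inference_timeout arguments; the project, region, model ID, and machine type are placeholder values.

```python
from google.cloud import aiplatform

# Placeholder project and region; replace with your own values.
aiplatform.init(project="my-project", location="us-central1")

# Step 1: Create a dedicated public endpoint. The inference timeout and
# request-response logging settings are configured here, at creation time.
endpoint = aiplatform.Endpoint.create(
    display_name="my-dedicated-endpoint",
    dedicated_endpoint_enabled=True,
    inference_timeout=3600,  # seconds; configurable up to 1 hour
)

# Step 2: Deploy a previously uploaded model to the endpoint.
model = aiplatform.Model("MODEL_ID")  # placeholder model resource ID
endpoint.deploy(
    model=model,
    machine_type="n1-standard-4",  # placeholder machine type
    min_replica_count=1,
)
```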

Get online predictions from a dedicated public endpoint

Dedicated endpoints support both the HTTP and gRPC protocols. gRPC requests must include the x-vertex-ai-endpoint-id header so that the endpoint can be identified (see the gRPC sketch after the following list). The following APIs are supported:

  • Predict
  • RawPredict
  • StreamRawPredict
  • Chat Completion (Model Garden only)
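
The following is a hedged sketch of a gRPC Predict request using the lower-level aiplatform_v1 client. It assumes the dedicated endpoint is reached through its own DNS name in the ENDPOINT_ID.REGION-PROJECT_NUMBER.prediction.vertexai.goog form shown on the endpoint's details page; the identifiers and the instance payload are placeholders.

```python
from google.cloud import aiplatform_v1
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# Placeholder identifiers; replace with your own values.
PROJECT_NUMBER = "123456789012"
REGION = "us-central1"
ENDPOINT_ID = "1234567890123456789"

# Dedicated endpoints are reached through their own DNS name rather than
# the shared regional API endpoint. gRPC is this client's default transport.
client = aiplatform_v1.PredictionServiceClient(
    client_options={
        "api_endpoint": f"{ENDPOINT_ID}.{REGION}-{PROJECT_NUMBER}.prediction.vertexai.goog"
    }
)

# Placeholder instance payload; the schema depends on your model.
instance = json_format.ParseDict({"feature": 1.0}, Value())

response = client.predict(
    endpoint=f"projects/{PROJECT_NUMBER}/locations/{REGION}/endpoints/{ENDPOINT_ID}",
    instances=[instance],
    # Required header for gRPC requests to a dedicated endpoint.
    metadata=[("x-vertex-ai-endpoint-id", ENDPOINT_ID)],
)
print(response.predictions)
```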

You can send online prediction requests to a dedicated public endpoint by using the Vertex AI SDK for Python. For details, see Send an online prediction request to a dedicated public endpoint.
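
As a minimal sketch, assuming a google-cloud-aiplatform version that supports the use_dedicated_endpoint argument (the endpoint ID and instance payload are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Reference the dedicated endpoint by its resource ID (placeholder value).
endpoint = aiplatform.Endpoint("1234567890123456789")

# use_dedicated_endpoint routes the request through the endpoint's
# dedicated DNS name instead of the shared regional endpoint.
response = endpoint.predict(
    instances=[{"feature": 1.0}],  # placeholder; schema depends on your model
    use_dedicated_endpoint=True,
)
print(response.predictions)
```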

Limitations

  • Deployment of tuned Gemini models isn't supported.
  • VPC Service Controls isn't supported. Use a Private Service Connect endpoint instead.
