Use dedicated public endpoints for online prediction

A dedicated public endpoint is a public endpoint for online prediction. It offers the following benefits:

  • Dedicated networking: Prediction requests sent to a dedicated public endpoint are isolated from other users' traffic.
  • Optimized network latency.
  • Larger payload support: Up to 10 MB.
  • Longer request timeouts: Configurable up to 1 hour.
  • Generative AI-ready: Streaming and gRPC are supported.

For these reasons, dedicated public endpoints are recommended as a best practice for serving Vertex AI online predictions.

To learn more, see Choose an endpoint type.

Create a dedicated public endpoint and deploy a model to it

You can create a dedicated endpoint and deploy a model to it by using the Google Cloud console. For details, see Deploy a model by using the Google Cloud console.

You can also create a dedicated public endpoint and deploy a model to it by using the Vertex AI API as follows:

  1. Create a dedicated public endpoint. When creating the endpoint, select the Enable dedicated DNS checkbox.
  2. Deploy the model by using the Vertex AI API.

Get online predictions from a dedicated public endpoint

You can send online prediction requests to a dedicated public endpoint by using the Vertex AI SDK for Python. For details, see Send an online prediction request to a dedicated public endpoint.
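Beyond the SDK, you can also call the endpoint's dedicated DNS name directly over HTTPS. The sketch below only constructs the request URL and JSON body; the hostname pattern `ENDPOINT_ID.REGION-PROJECT_NUMBER.prediction.vertexai.goog` is an assumption you should verify against your endpoint's actual dedicated DNS value, and all IDs are placeholders.

```python
# Sketch: build a raw :predict request for a dedicated public endpoint.
# The dedicated DNS hostname pattern is an assumption to verify against
# the endpoint you created; all identifiers are placeholders.
import json


def build_predict_request(project_number: str, region: str,
                          endpoint_id: str, instances: list) -> tuple[str, str]:
    """Return (url, body) for an authenticated POST to the endpoint."""
    host = f"{endpoint_id}.{region}-{project_number}.prediction.vertexai.goog"
    url = (f"https://{host}/v1/projects/{project_number}/locations/"
           f"{region}/endpoints/{endpoint_id}:predict")
    body = json.dumps({"instances": instances})
    return url, body


url, body = build_predict_request("123456789", "us-central1", "987",
                                  [{"feature": 1.0}])
```

The actual POST must carry an OAuth 2.0 access token in the `Authorization: Bearer` header. With the SDK, passing `use_dedicated_endpoint=True` to `Endpoint.predict` is assumed to route the request through the dedicated DNS; verify the parameter name for your SDK version.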

Limitations

  • VPC Service Controls isn't supported. Use a Private Service Connect endpoint instead.
