An inference is the output of a trained machine learning model. This page provides an overview of the workflow for getting inferences from your models on Vertex AI.
Vertex AI offers two methods for getting inferences:
- Online inferences are synchronous requests made to a model that is deployed to an Endpoint. Therefore, before sending a request, you must first deploy the Model resource to an endpoint. This associates compute resources with the model so that the model can serve online inferences with low latency. Use online inferences when you are making requests in response to application input or in situations that require timely inference (see the sketch after this list).
- Batch inferences are asynchronous requests made to a model that isn't deployed to an endpoint. You send the request (as a BatchPredictionJob resource) directly to the Model resource. Use batch inferences when you don't require an immediate response and want to process accumulated data by using a single request.
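The following is a minimal sketch of both request paths using the Vertex AI SDK for Python. It is not a production configuration: the project ID, model ID, machine type, instance payload, and Cloud Storage URIs are placeholders that you would replace with your own values.

```python
from google.cloud import aiplatform

# Placeholder project, region, and model resource name.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# Online inference: deploy the Model to an Endpoint, then send synchronous requests.
endpoint = model.deploy(
    deployed_model_display_name="my-deployed-model",
    machine_type="n1-standard-4",
)
# The instance format depends on your model's expected input.
response = endpoint.predict(instances=[[0.1, 0.2, 0.3]])
print(response.predictions)

# Batch inference: no endpoint needed; the request runs as a BatchPredictionJob.
batch_job = model.batch_predict(
    job_display_name="my-batch-job",
    gcs_source="gs://my-bucket/input.jsonl",
    gcs_destination_prefix="gs://my-bucket/output/",
    machine_type="n1-standard-4",
    sync=True,
)
print(batch_job.state)
```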
Test your model locally
Before getting inferences, it's useful to deploy your model to a local endpoint during the development and testing phase. This lets you both iterate more quickly and test your model without deploying it to an online endpoint or incurring inference costs. Local deployment is intended for local development and testing, not for production deployments.
To deploy a model locally, use the Vertex AI SDK for Python to deploy a LocalModel to a LocalEndpoint. For a demonstration, see this notebook.
Even if your client is not written in Python, you can still use the Vertex AI SDK for Python to launch the container and server so that you can test requests from your client.
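As an illustration, the sketch below uses the SDK's LocalModel and LocalEndpoint classes. The serving container image, artifact URI, and request payload are placeholders and depend on your model and serving container; reading model artifacts from Cloud Storage locally may also require passing credentials.

```python
from google.cloud.aiplatform.prediction import LocalModel

# Placeholder serving container image; replace with your prebuilt or custom image.
local_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/my-project/my-repo/my-serving-image:latest",
)

# Deploy to a local endpoint and send a test request against the local container.
with local_model.deploy_to_local_endpoint(
    artifact_uri="gs://my-bucket/model-artifacts/",
) as local_endpoint:
    health = local_endpoint.run_health_check()
    response = local_endpoint.predict(
        request='{"instances": [[0.1, 0.2, 0.3]]}',
        headers={"Content-Type": "application/json"},
    )
    print(health.status_code, response.content)
```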
Get inferences from custom trained models
To get inferences, you must first import your model. After it's imported, it becomes a Model resource that is visible in Vertex AI Model Registry.
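For example, a custom trained model can be imported programmatically with Model.upload. This is a minimal sketch; the display name, artifact location, and serving container image shown here are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Import (upload) the trained model; it then appears in Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="my-custom-model",
    artifact_uri="gs://my-bucket/model-artifacts/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
print(model.resource_name)
```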
Then, read the documentation on getting online inferences and batch inferences to learn how to request each type.
What's next
- Learn about Compute resources for prediction.