Get inferences from a custom trained model
An inference is the output of a trained machine learning model. This page
provides an overview of the workflow for getting inferences from your models on
Vertex AI.
Vertex AI offers two methods for getting inferences:

- Online inferences are synchronous requests made to a model that is deployed to an Endpoint. Therefore, before sending a request, you must first deploy the Model resource to an endpoint. This associates compute resources with the model so that the model can serve online inferences with low latency. Use online inferences when you are making requests in response to application input or in situations that require timely inference.
- Batch inferences are asynchronous requests made to a model that isn't deployed to an endpoint. You send the request (as a BatchPredictionJob resource) directly to the Model resource. Use batch inferences when you don't require an immediate response and want to process accumulated data by using a single request.
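The two paths can be sketched with the Vertex AI SDK for Python. This is a minimal sketch, not code from this page: the project, location, endpoint, model, and Cloud Storage values (and the sample instance) are hypothetical placeholders, and it assumes the google-cloud-aiplatform package is installed.

```python
# Hedged sketch of the two inference paths with the Vertex AI SDK for Python.
# All IDs, URIs, and instance values below are hypothetical placeholders.

def get_online_inferences(project: str, location: str, endpoint_id: str):
    from google.cloud import aiplatform  # requires google-cloud-aiplatform

    aiplatform.init(project=project, location=location)
    # Synchronous request to a model already deployed to an endpoint;
    # predictions are returned in the response.
    endpoint = aiplatform.Endpoint(endpoint_name=endpoint_id)
    return endpoint.predict(instances=[[1.0, 2.0, 3.0]])


def get_batch_inferences(project: str, location: str, model_id: str):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    # Asynchronous job sent against the Model resource itself;
    # no endpoint deployment is needed.
    model = aiplatform.Model(model_name=model_id)
    return model.batch_predict(
        job_display_name="example-batch-job",
        gcs_source="gs://example-bucket/instances.jsonl",
        gcs_destination_prefix="gs://example-bucket/output/",
        machine_type="n1-standard-4",
    )
```

The batch call returns a job object that completes later, which is why batch inference suits accumulated data rather than request-response workloads.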
Test your model locally
Before getting inferences, it's useful to deploy your model to a local
endpoint during the development and testing phase. This lets you both iterate
more quickly and test your model without deploying it to an online endpoint or
incurring inference costs. Local deployment is intended for local development
and testing, not for production deployments.
To deploy a model locally, use the Vertex AI SDK for Python to deploy a
LocalModel to a LocalEndpoint. For a demonstration, see this notebook:
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/find_ideal_machine_type/find_ideal_machine_type.ipynb
Even if your client is not written in Python, you can still use the
Vertex AI SDK for Python to launch the container and server so that you can test
requests from your client.
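The local flow might look like the following sketch with the Vertex AI SDK for Python. The image URI, artifact URI, and request body are hypothetical placeholders, and the sketch assumes the google-cloud-aiplatform package with its prediction extras is installed.

```python
# Hedged sketch of local testing with the Vertex AI SDK for Python.
# The image URI, artifact URI, and request body are hypothetical placeholders.

def run_local_inference(image_uri: str, artifact_uri: str):
    # requires google-cloud-aiplatform
    from google.cloud.aiplatform.prediction import LocalModel

    local_model = LocalModel(serving_container_image_uri=image_uri)
    # Launches the serving container locally; no cloud resources are created
    # and no inference costs are incurred.
    with local_model.deploy_to_local_endpoint(artifact_uri=artifact_uri) as local_endpoint:
        response = local_endpoint.predict(
            request='{"instances": [[1.0, 2.0, 3.0]]}',
            headers={"Content-Type": "application/json"},
        )
    return response.json()
```

Because the local endpoint serves plain HTTP while the container is running, a client written in any language can send it the same requests it would send to an online endpoint.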
Get inferences from custom trained models
To get inferences, you must first import your model. After it's imported, it
becomes a Model resource that is visible in Vertex AI Model Registry.
Then, read the following documentation to learn how to get inferences:
- Get batch inferences, or
- Deploy the model to an endpoint and get online inferences.
What's next
- Learn about compute resources for prediction.
Last updated 2025-08-29 UTC.