To use a custom container to serve predictions from a custom-trained model, you must provide Vertex AI with a Docker container image that runs an HTTP server. This document describes the requirements that a container image must meet to be compatible with Vertex AI. The document also describes how Vertex AI interacts with your custom container once it starts running. In other words, this document describes what you need to consider when designing a container image to use with Vertex AI.
To walk through using a custom container image to serve predictions, read Using a custom container.
Container image requirements
When your Docker container image runs as a container, the container must run an HTTP server. Specifically, the container must listen and respond to liveness checks, health checks, and prediction requests. The following subsections describe these requirements in detail.
You can implement the HTTP server in any way, using any programming language, as long as it meets the requirements in this section. For example, you can write a custom HTTP server using a web framework like Flask or use machine learning (ML) serving software that runs an HTTP server, like TensorFlow Serving, TorchServe, or KServe Python Server.
Run the HTTP server
You can run an HTTP server by using an ENTRYPOINT instruction, a CMD instruction, or both in the Dockerfile that you use to build your container image. Read about the interaction between CMD and ENTRYPOINT.
Alternatively, you can specify the containerSpec.command and containerSpec.args fields when you create your Model resource in order to override your container image's ENTRYPOINT and CMD respectively. Specifying one of these fields lets you use a container image that would otherwise not meet the requirements due to an incompatible (or nonexistent) ENTRYPOINT or CMD.
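For example, the following is a minimal sketch of overriding the image's ENTRYPOINT and CMD at model upload time, assuming you use the Vertex AI SDK for Python; the project, image URI, command, and arguments are placeholders.

```python
# A sketch of uploading a Model with a custom container, overriding the
# image's ENTRYPOINT and CMD. All names and values are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-custom-container-model",
    serving_container_image_uri=(
        "us-central1-docker.pkg.dev/my-project/my-repo/my-image:latest"
    ),
    # These map to containerSpec.command and containerSpec.args, which
    # override the image's ENTRYPOINT and CMD respectively.
    serving_container_command=["python", "server.py"],
    serving_container_args=["--workers", "2"],
    serving_container_ports=[8080],
)
```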
However you determine which command your container runs when it starts, ensure that this command runs indefinitely. For example, don't run a command that starts an HTTP server in the background and then exits; if you do, the container exits immediately after it starts running.
Your HTTP server must listen for requests on 0.0.0.0, on a port of your choice. When you create a Model, specify this port in the containerSpec.ports field. To learn how the container can access this value, read the section of this document about the AIP_HTTP_PORT environment variable.
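For example, a minimal Flask-based server might look like the following sketch. The file name, the echo logic, and the fallback values for the AIP_* variables are assumptions for local testing only.

```python
# minimal_server.py -- a minimal sketch of a prediction server using Flask.
# The routes and port come from the AIP_* environment variables that
# Vertex AI sets; the fallback values are only for local testing.
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")

@app.route(HEALTH_ROUTE, methods=["GET"])
def health():
    # Return 200 OK once the model is loaded and ready to accept traffic.
    return "OK", 200

@app.route(PREDICT_ROUTE, methods=["POST"])
def predict():
    body = request.get_json()
    instances = body["instances"]
    # Placeholder inference logic: echo each instance back.
    predictions = [{"echo": instance} for instance in instances]
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    # Listen on 0.0.0.0 and the port that Vertex AI configures (default 8080).
    app.run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", "8080")))
```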
Liveness checks
Vertex AI performs a liveness check when your container starts to ensure that your server is running. When you deploy a custom-trained model to an Endpoint resource, Vertex AI uses a TCP liveness probe to attempt to establish a TCP connection to your container on the configured port. The probe makes up to 4 attempts to establish a connection, waiting 10 seconds after each failure. If the probe still hasn't established a connection at this point, Vertex AI restarts your container.
Your HTTP server doesn't need to perform any special behavior to handle these checks. As long as it is listening for requests on the configured port, the liveness probe is able to make a connection.
Health checks
You can optionally specify startup_probe or health_probe.
The startup probe checks whether the container application has started. If you don't provide a startup probe, there is no startup probe, and health checks begin immediately. If you do provide a startup probe, health checks aren't performed until the startup probe succeeds.
Legacy applications that might require additional startup time on their first initialization should configure a startup probe. For example, if the application needs to copy the model artifacts from an external source, a startup probe should be configured to return success when that initialization is completed.
The health probe checks whether a container is ready to accept traffic. If health probe isn't provided, Vertex AI uses the default health checks as described in Default health checks.
Legacy applications that don't return 200 OK to indicate the model is loaded and ready to accept traffic should configure a health probe. For example, an application might return 200 OK to indicate success even though the model load status in the response body indicates that the model might not be loaded and might therefore not be ready to accept traffic. In this case, a health probe should be configured to return success only when the model is loaded and ready to serve traffic.
To perform a probe, Vertex AI executes the specified exec command in the target container. If the command succeeds, it returns 0, and the container is considered to be alive and healthy.
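As an illustration, the exec command could run a small script such as the following hypothetical check_health.py, which exits with status 0 only when the local server responds successfully on its health route.

```python
# check_health.py -- hypothetical probe script; exit code 0 means healthy.
import os
import sys
import urllib.request

port = os.environ.get("AIP_HTTP_PORT", "8080")
route = os.environ.get("AIP_HEALTH_ROUTE", "/health")

try:
    with urllib.request.urlopen(f"http://localhost:{port}{route}", timeout=5) as resp:
        sys.exit(0 if resp.status == 200 else 1)
except Exception:
    # Any connection error or non-200 response means the container isn't healthy.
    sys.exit(1)
```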
Default health checks
By default, Vertex AI intermittently performs health checks on your HTTP server while it is running to ensure that it is ready to handle prediction requests. The service uses a health probe to send HTTP GET requests to a configurable health check path on your server. Specify this path in the containerSpec.healthRoute field when you create a Model. To learn how the container can access this value, read the section of this document about the AIP_HEALTH_ROUTE environment variable.
Configure the HTTP server to respond to each health check request as follows:
- If the server is ready to handle prediction requests, respond to the health check request within 10 seconds with status code 200 OK. The contents of the response body don't matter; Vertex AI ignores them. This response signifies that the server is healthy.
- If the server isn't ready to handle prediction requests, don't respond to the health check request within 10 seconds, or respond with any status code except 200 OK. For example, respond with status code 503 Service Unavailable. This response (or lack of a response) signifies that the server is unhealthy.
If the health probe receives an unhealthy response from your server (including no response within 10 seconds), it sends up to 3 additional health checks at 10-second intervals. During this period, Vertex AI still considers your server healthy. If the probe receives a healthy response to any of these checks, the probe immediately returns to its intermittent schedule of health checks. However, if the probe receives 4 consecutive unhealthy responses, Vertex AI stops routing prediction traffic to the container. (If the DeployedModel resource is scaled to use multiple prediction nodes, Vertex AI routes prediction requests to other, healthy containers.)
Vertex AI doesn't restart the container; instead the health probe continues sending intermittent health check requests to the unhealthy server. If it receives a healthy response, it marks that container as healthy and starts to route prediction traffic to it again.
Practical guidance
In some cases, it is sufficient for the HTTP server in your container to always respond with status code 200 OK to health checks. If your container loads resources before starting the server, the container is unhealthy during the startup period and during any periods when the HTTP server fails. At all other times, it responds as healthy.
For a more sophisticated configuration, you might want to purposefully design the HTTP server to respond to health checks with an unhealthy status at certain times. For example, you might want to block prediction traffic to a node for a period so that the container can perform maintenance.
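For instance, a sketch of a health handler that reports unhealthy until model loading finishes might look like the following; the loading logic and route path are placeholders.

```python
# A sketch of a health handler that returns 503 until the model is loaded.
# The loading step and the /health route are placeholders.
import threading
from flask import Flask

app = Flask(__name__)
model_ready = threading.Event()

def load_model():
    # Download and deserialize model artifacts here (placeholder).
    model_ready.set()

threading.Thread(target=load_model, daemon=True).start()

@app.route("/health")
def health():
    if not model_ready.is_set():
        # Any non-200 response (or no response) counts as unhealthy.
        return "model not loaded", 503
    return "OK", 200
```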
Prediction requests
When a client sends a projects.locations.endpoints.predict request to the Vertex AI API, Vertex AI forwards this request as an HTTP POST request to a configurable prediction path on your server. Specify this path in the containerSpec.predictRoute field when you create a Model. To learn how the container can access this value, read the section of this document about the AIP_PREDICT_ROUTE environment variable.
Request requirements
If the model is deployed to a public endpoint, each prediction request must be 1.5 MB or smaller. The HTTP server must accept prediction requests that have the Content-Type: application/json HTTP header and JSON bodies with the following format:
{
"instances": INSTANCES,
"parameters": PARAMETERS
}
In these requests:
- INSTANCES is an array of one or more JSON values of any type. Each value represents an instance that you are providing a prediction for.
- PARAMETERS is a JSON object containing any parameters that your container requires to help serve predictions on the instances. Vertex AI considers the parameters field optional, so you can design your container to require it, only use it when provided, or ignore it.
Learn more about the request body requirements.
Response requirements
If the model is deployed to a public endpoint, each prediction response must be 1.5 MB or smaller. The HTTP server must send responses with JSON bodies that meet the following format:
{
"predictions": PREDICTIONS
}
In these responses, replace PREDICTIONS with an array of JSON values representing the predictions that your container has generated for each of the INSTANCES in the corresponding request.
After your HTTP server sends this response, Vertex AI adds a deployedModelId field to the response before returning it to the client. This field specifies which DeployedModel on an Endpoint is sending the response. Learn more about the response body format.
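As a sanity check during development, you can exercise these request and response formats by calling the container directly, for example with a sketch like the following; the URL, instance values, and parameters are placeholders, and the deployedModelId field won't appear because Vertex AI adds it only when it proxies the response.

```python
# A local smoke test for the prediction route; all values are placeholders.
import json
import urllib.request

payload = {
    "instances": [[1.0, 2.0, 3.0]],    # hypothetical instance
    "parameters": {"threshold": 0.5},  # hypothetical parameters
}

req = urllib.request.Request(
    "http://localhost:8080/predict",   # match your containerSpec.predictRoute
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    # The response body must contain a "predictions" array with one entry
    # per instance in the request.
    print(body["predictions"])
```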
Container image publishing requirements
You must push your container image to Artifact Registry in order to use it with Vertex AI. Learn how to push a container image to Artifact Registry.
In particular, you must push the container image to a repository that meets the following location and permissions requirements.
Location
When you use Artifact Registry, the repository must use a region that matches the regional endpoint where you plan to create a Model. For example, if you plan to create a Model on the us-central1-aiplatform.googleapis.com endpoint, the full name of your container image must start with us-central1-docker.pkg.dev/. Don't use a multi-regional repository for your container image.
Permissions
Vertex AI must have permission to pull the container image when you create a Model. Specifically, the Vertex AI Service Agent for your project must have the permissions of the Artifact Registry Reader role (roles/artifactregistry.reader) for the container image's repository.
Vertex AI uses the Vertex AI Service Agent for your project to interact with other Google Cloud services. This service account has the email address service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com, where PROJECT_NUMBER is replaced with the project number of your Vertex AI project.
If you have pushed your container image to the same Google Cloud project where you are using Vertex AI, you don't have to configure any permissions. The default permissions granted to the Vertex AI Service Agent are sufficient.
On the other hand, if you have pushed your container image to a different Google Cloud project from the one where you are using Vertex AI, you must grant the Artifact Registry Reader role for the Artifact Registry repository to the Vertex AI Service Agent.
Access model artifacts
When you create a custom-trained Model without a custom container, you must specify the URI of a Cloud Storage directory with model artifacts as the artifactUri field. When you create a Model with a custom container, providing model artifacts in Cloud Storage is optional.
If the container image includes the model artifacts that you need to serve predictions, there is no need to load files from Cloud Storage. However, if you do provide model artifacts by specifying the artifactUri field, the container must load these artifacts when it starts running.
When Vertex AI starts your container, it sets the AIP_STORAGE_URI environment variable to a Cloud Storage URI that begins with gs://. Your container's ENTRYPOINT instruction can download the directory specified by this URI in order to access the model artifacts.
Note that the value of the AIP_STORAGE_URI environment variable isn't identical to the Cloud Storage URI that you specify in the artifactUri field when you create the Model. Rather, AIP_STORAGE_URI points to a copy of your model artifact directory in a different Cloud Storage bucket, which Vertex AI manages. Vertex AI populates this directory when you create a Model. You can't update the contents of the directory. If you want to use new model artifacts, you must create a new Model.
The service account that your container uses by default has permission to read from this URI. On the other hand, if you specify a custom service account when you deploy the Model to an Endpoint, Vertex AI automatically grants your specified service account the Storage Object Viewer (roles/storage.objectViewer) role for the URI's Cloud Storage bucket.
Use any library that supports Application Default Credentials (ADC) to load the model artifacts; you don't need to explicitly configure authentication.
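For example, a minimal sketch of downloading the artifacts at startup, assuming the google-cloud-storage client library and a writable /tmp/model destination, might look like this:

```python
# Download everything under AIP_STORAGE_URI to a local directory at startup.
# The destination path is an assumption; ADC supplies the credentials.
import os
from google.cloud import storage

def download_artifacts(destination: str = "/tmp/model") -> str:
    uri = os.environ["AIP_STORAGE_URI"]  # for example, gs://bucket/path
    bucket_name, _, prefix = uri.removeprefix("gs://").partition("/")

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        relative_path = blob.name[len(prefix):].lstrip("/")
        if not relative_path:
            continue  # skip the directory placeholder object, if any
        local_path = os.path.join(destination, relative_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
    return destination
```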
Environment variables available in the container
When running, the container's ENTRYPOINT instruction can reference environment variables that you have configured manually, as well as environment variables set automatically by Vertex AI. This section describes each way that you can set environment variables, and it provides details about the variables set automatically by Vertex AI.
Variables set in the container image
To set environment variables in the container image when you build it, use Docker's ENV instruction. Don't set any environment variables that begin with the prefix AIP_.
The container's ENTRYPOINT instruction can use these environment variables, but you can't reference them in any of your Model's API fields.
Variables set by Vertex AI
When Vertex AI starts running the container, it sets the following environment variables in the container environment. Each variable begins with the prefix AIP_. Don't manually set any environment variables that use this prefix.
The container's ENTRYPOINT instruction can access these variables. To learn which Vertex AI API fields can also reference these variables, read the API reference for ModelContainerSpec.
Variable name | Default value | How to configure value | Details |
---|---|---|---|
AIP_ACCELERATOR_TYPE | Unset | When you deploy a Model as a DeployedModel to an Endpoint resource, set the dedicatedResources.machineSpec.acceleratorType field. | If applicable, this variable specifies the type of accelerator used by the virtual machine (VM) instance that the container is running on. |
AIP_DEPLOYED_MODEL_ID | A string of digits identifying the DeployedModel to which this container's Model has been deployed. | Not configurable | This value is the DeployedModel's id field. |
AIP_ENDPOINT_ID | A string of digits identifying the Endpoint on which the container's Model has been deployed. | Not configurable | This value is the last segment of the Endpoint's name field (following endpoints/). |
AIP_FRAMEWORK | CUSTOM_CONTAINER | Not configurable | |
AIP_HEALTH_ROUTE | /v1/endpoints/ENDPOINT/deployedModels/DEPLOYED_MODEL In this string, replace ENDPOINT with the value of the AIP_ENDPOINT_ID variable and replace DEPLOYED_MODEL with the value of the AIP_DEPLOYED_MODEL_ID variable. | When you create a Model, set the containerSpec.healthRoute field. | This variable specifies the HTTP path on the container that Vertex AI sends health checks to. |
AIP_HTTP_PORT | 8080 | When you create a Model, set the containerSpec.ports field. The first entry in this field becomes the value of AIP_HTTP_PORT. | Vertex AI sends liveness checks, health checks, and prediction requests to this port on the container. Your container's HTTP server must listen for requests on this port. |
AIP_MACHINE_TYPE | No default, must be configured | When you deploy a Model as a DeployedModel to an Endpoint resource, set the dedicatedResources.machineSpec.machineType field. | This variable specifies the type of VM that the container is running on. |
AIP_MODE | PREDICTION | Not configurable | This variable signifies that the container is running on Vertex AI to serve online predictions. You can use this environment variable to add custom logic to your container, so that it can run in multiple computing environments but only use certain code paths when run on Vertex AI. |
AIP_MODE_VERSION | 1.0.0 | Not configurable | This variable signifies the version of the custom container requirements (this document) that Vertex AI expects the container to meet. This document updates according to semantic versioning. |
AIP_MODEL_NAME | The value of the AIP_ENDPOINT_ID variable | Not configurable | See the AIP_ENDPOINT_ID row. This variable exists for compatibility reasons. |
AIP_PREDICT_ROUTE | /v1/endpoints/ENDPOINT/deployedModels/DEPLOYED_MODEL:predict In this string, replace ENDPOINT with the value of the AIP_ENDPOINT_ID variable and replace DEPLOYED_MODEL with the value of the AIP_DEPLOYED_MODEL_ID variable. | When you create a Model, set the containerSpec.predictRoute field. | This variable specifies the HTTP path on the container that Vertex AI forwards prediction requests to. |
AIP_PROJECT_NUMBER | The project number of the Google Cloud project where you are using Vertex AI | Not configurable | |
AIP_STORAGE_URI | A Cloud Storage URI (beginning with gs://) that points to a Vertex AI-managed copy of your model artifacts, if you provided model artifacts when creating the Model. | Not configurable | This variable specifies the directory that contains a copy of your model artifacts, if applicable. |
AIP_VERSION_NAME | The value of the AIP_DEPLOYED_MODEL_ID variable | Not configurable | See the AIP_DEPLOYED_MODEL_ID row. This variable exists for compatibility reasons. |
Variables set in the Model resource
When you create a Model, you can set additional environment variables in the containerSpec.env field.
The container's ENTRYPOINT instruction can access these variables. To learn which Vertex AI API fields can also reference these variables, read the API reference for ModelContainerSpec.
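At runtime, both kinds of variables are visible to your server as ordinary environment variables, as in the following sketch; LOG_LEVEL is a hypothetical variable that you might set in containerSpec.env.

```python
import os

# AIP_* variables are set by Vertex AI; LOG_LEVEL is a hypothetical variable
# set through containerSpec.env.
predict_route = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
log_level = os.environ.get("LOG_LEVEL", "INFO")
```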
What's next
- Learn more about serving predictions using a custom container, including how to specify container-related API fields when you import a model.