To use a custom container to serve predictions from a custom-trained model, you must provide Vertex AI with a Docker container image that runs an HTTP server. This document describes the requirements that a container image must meet to be compatible with Vertex AI. The document also describes how Vertex AI interacts with your custom container once it starts running. In other words, this document describes what you need to consider when designing a container image to use with Vertex AI.
To walk through using a custom container image to serve predictions, read Using a custom container.
Container image requirements
When your Docker container image runs as a container, the container must run an HTTP server. Specifically, the container must listen and respond to liveness checks, health checks, and prediction requests. The following subsections describe these requirements in detail.
You can implement the HTTP server in any way, using any programming language, as long as it meets the requirements in this section. For example, you can write a custom HTTP server using a web framework like Flask or use machine learning (ML) serving software that runs an HTTP server, like TensorFlow Serving, TorchServe, or KServe Python Server.
Run the HTTP server
You can run an HTTP server by using an ENTRYPOINT instruction, a CMD instruction, or both in the Dockerfile that you use to build your container image. Read about the interaction between CMD and ENTRYPOINT.
Alternatively, you can specify the containerSpec.command and containerSpec.args fields when you create your Model resource in order to override your container image's ENTRYPOINT and CMD respectively. Specifying one of these fields lets you use a container image that would otherwise not meet the requirements due to an incompatible (or nonexistent) ENTRYPOINT or CMD.
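For example, the following is a minimal sketch of overriding the image's ENTRYPOINT and CMD at model upload time, assuming you use the Vertex AI SDK for Python; the project, image URI, command, and arguments are placeholders.

```python
# A sketch of uploading a Model with a custom container, overriding the
# image's ENTRYPOINT and CMD. All names and values are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-custom-container-model",
    serving_container_image_uri=(
        "us-central1-docker.pkg.dev/my-project/my-repo/my-image:latest"
    ),
    # These map to containerSpec.command and containerSpec.args, which
    # override the image's ENTRYPOINT and CMD respectively.
    serving_container_command=["python", "server.py"],
    serving_container_args=["--workers", "2"],
    serving_container_ports=[8080],
)
```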
However you determine which command your container runs when it starts, ensure that this command runs indefinitely. For example, don't run a command that starts an HTTP server in the background and then exits; if you do, the container exits immediately after it starts running.
Your HTTP server must listen for requests on 0.0.0.0, on a port of your choice. When you create a Model, specify this port in the containerSpec.ports field. To learn how the container can access this value, read the section of this document about the AIP_HTTP_PORT environment variable.
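For example, a minimal Flask-based server might look like the following sketch. The file name, the echo logic, and the fallback values for the AIP_* variables are assumptions for local testing only.

```python
# minimal_server.py -- a minimal sketch of a prediction server using Flask.
# The routes and port come from the AIP_* environment variables that
# Vertex AI sets; the fallback values are only for local testing.
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")

@app.route(HEALTH_ROUTE, methods=["GET"])
def health():
    # Return 200 OK once the model is loaded and ready to accept traffic.
    return "OK", 200

@app.route(PREDICT_ROUTE, methods=["POST"])
def predict():
    body = request.get_json()
    instances = body["instances"]
    # Placeholder inference logic: echo each instance back.
    predictions = [{"echo": instance} for instance in instances]
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    # Listen on 0.0.0.0 and the port that Vertex AI configures (default 8080).
    app.run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", "8080")))
```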
Liveness checks
Vertex AI performs a liveness check when your container starts to ensure that your server is running. When you deploy a custom-trained model to an Endpoint resource, Vertex AI uses a TCP liveness probe to attempt to establish a TCP connection to your container on the configured port. The probe makes up to 4 attempts to establish a connection, waiting 10 seconds after each failure. If the probe still hasn't established a connection at this point, Vertex AI restarts your container.
Your HTTP server doesn't need to perform any special behavior to handle these checks. As long as it is listening for requests on the configured port, the liveness probe is able to make a connection.
Health checks
You can optionally specify startup_probe or health_probe.
The startup probe checks whether the container application has started. If you don't provide a startup probe, there is no startup probe, and health checks begin immediately. If you do provide a startup probe, health checks aren't performed until the startup probe succeeds.
Legacy applications that might require additional startup time on their first initialization should configure a startup probe. For example, if the application needs to copy the model artifacts from an external source, a startup probe should be configured to return success when that initialization is completed.
The health probe checks whether a container is ready to accept traffic. If health probe isn't provided, Vertex AI uses the default health checks as described in Default health checks.
Legacy applications that don't return 200 OK to indicate the model is loaded and ready to accept traffic should configure a health probe. For example, an application might return 200 OK to indicate success even though the model load status in the response body indicates that the model might not be loaded and might therefore not be ready to accept traffic. In this case, a health probe should be configured to return success only when the model is loaded and ready to serve traffic.
To perform a probe, Vertex AI executes the specified exec command in the target container. If the command succeeds, it returns 0, and the container is considered to be alive and healthy.
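As an illustration, the exec command could run a small script such as the following hypothetical check_health.py, which exits with status 0 only when the local server responds successfully on its health route.

```python
# check_health.py -- hypothetical probe script; exit code 0 means healthy.
import os
import sys
import urllib.request

port = os.environ.get("AIP_HTTP_PORT", "8080")
route = os.environ.get("AIP_HEALTH_ROUTE", "/health")

try:
    with urllib.request.urlopen(f"http://localhost:{port}{route}", timeout=5) as resp:
        sys.exit(0 if resp.status == 200 else 1)
except Exception:
    # Any connection error or non-200 response means the container isn't healthy.
    sys.exit(1)
```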
Default health checks
By default, Vertex AI intermittently performs health checks on your HTTP server while it is running to ensure that it is ready to handle prediction requests. The service uses a health probe to send HTTP GET requests to a configurable health check path on your server. Specify this path in the containerSpec.healthRoute field when you create a Model. To learn how the container can access this value, read the section of this document about the AIP_HEALTH_ROUTE environment variable.
Configure the HTTP server to respond to each health check request as follows:
- If the server is ready to handle prediction requests, respond to the health check request within 10 seconds with status code 200 OK. The contents of the response body don't matter; Vertex AI ignores them. This response signifies that the server is healthy.
- If the server isn't ready to handle prediction requests, don't respond to the health check request within 10 seconds, or respond with any status code except 200 OK. For example, respond with status code 503 Service Unavailable. This response (or lack of a response) signifies that the server is unhealthy.
If the health probe receives an unhealthy response from your server (including no response within 10 seconds), it sends up to 3 additional health checks at 10-second intervals. During this period, Vertex AI still considers your server healthy. If the probe receives a healthy response to any of these checks, the probe immediately returns to its intermittent schedule of health checks. However, if the probe receives 4 consecutive unhealthy responses, Vertex AI stops routing prediction traffic to the container. (If the DeployedModel resource is scaled to use multiple prediction nodes, Vertex AI routes prediction requests to other, healthy containers.)
Vertex AI doesn't restart the container; instead the health probe continues sending intermittent health check requests to the unhealthy server. If it receives a healthy response, it marks that container as healthy and starts to route prediction traffic to it again.
Practical guidance
In some cases, it is sufficient for the HTTP server in your container to always respond with status code 200 OK to health checks. If your container loads resources before starting the server, the container is unhealthy during the startup period and during any periods when the HTTP server fails. At all other times, it responds as healthy.
For a more sophisticated configuration, you might want to purposefully design the HTTP server to respond to health checks with an unhealthy status at certain times. For example, you might want to block prediction traffic to a node for a period so that the container can perform maintenance.
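For instance, a sketch of a health handler that reports unhealthy until model loading finishes might look like the following; the loading logic and route path are placeholders.

```python
# A sketch of a health handler that returns 503 until the model is loaded.
# The loading step and the /health route are placeholders.
import threading
from flask import Flask

app = Flask(__name__)
model_ready = threading.Event()

def load_model():
    # Download and deserialize model artifacts here (placeholder).
    model_ready.set()

threading.Thread(target=load_model, daemon=True).start()

@app.route("/health")
def health():
    if not model_ready.is_set():
        # Any non-200 response (or no response) counts as unhealthy.
        return "model not loaded", 503
    return "OK", 200
```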
Prediction requests
When a client sends a projects.locations.endpoints.predict request to the Vertex AI API, Vertex AI forwards this request as an HTTP POST request to a configurable prediction path on your server. Specify this path in the containerSpec.predictRoute field when you create a Model. To learn how the container can access this value, read the section of this document about the AIP_PREDICT_ROUTE environment variable.
Request requirements
If the model is deployed to a public endpoint, each prediction request must be 1.5 MB or smaller. The HTTP server must accept prediction requests that have the Content-Type: application/json HTTP header and JSON bodies with the following format:
{
"instances": INSTANCES,
"parameters": PARAMETERS
}
In these requests:
- INSTANCES is an array of one or more JSON values of any type. Each value represents an instance that you are providing a prediction for.
- PARAMETERS is a JSON object containing any parameters that your container requires to help serve predictions on the instances. Vertex AI considers the parameters field optional, so you can design your container to require it, only use it when provided, or ignore it.
Learn more about the request body requirements.
Response requirements
If the model is deployed to a public endpoint, each prediction response must be 1.5 MB or smaller. The HTTP server must send responses with JSON bodies that meet the following format:
{
"predictions": PREDICTIONS
}
In these responses, replace PREDICTIONS with an array of JSON values representing the predictions that your container has generated for each of the INSTANCES in the corresponding request.
After your HTTP server sends this response, Vertex AI adds a deployedModelId field to the response before returning it to the client. This field specifies which DeployedModel on an Endpoint is sending the response. Learn more about the response body format.
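As a sanity check during development, you can exercise these request and response formats by calling the container directly, for example with a sketch like the following; the URL, instance values, and parameters are placeholders, and the deployedModelId field won't appear because Vertex AI adds it only when it proxies the response.

```python
# A local smoke test for the prediction route; all values are placeholders.
import json
import urllib.request

payload = {
    "instances": [[1.0, 2.0, 3.0]],    # hypothetical instance
    "parameters": {"threshold": 0.5},  # hypothetical parameters
}

req = urllib.request.Request(
    "http://localhost:8080/predict",   # match your containerSpec.predictRoute
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    # The response body must contain a "predictions" array with one entry
    # per instance in the request.
    print(body["predictions"])
```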
Container image publishing requirements
You must push your container image to Artifact Registry in order to use it with Vertex AI. Learn how to push a container image to Artifact Registry.
In particular, you must push the container image to a repository that meets the following location and permissions requirements.
Location
When you use Artifact Registry, the repository must use a region that matches the regional endpoint where you plan to create a Model. For example, if you plan to create a Model on the us-central1-aiplatform.googleapis.com endpoint, the full name of your container image must start with us-central1-docker.pkg.dev/. Don't use a multi-regional repository for your container image.
Permissions
Vertex AI must have permission to pull the container image when you create a Model. Specifically, the Vertex AI Service Agent for your project must have the permissions of the Artifact Registry Reader role (roles/artifactregistry.reader) for the container image's repository.
Vertex AI uses the Vertex AI Service Agent for your project to interact with other Google Cloud services. This service account has the email address service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com, where PROJECT_NUMBER is replaced with the project number of your Vertex AI project.
If you have pushed your container image to the same Google Cloud project where you are using Vertex AI, you don't have to configure any permissions. The default permissions granted to the Vertex AI Service Agent are sufficient.
On the other hand, if you have pushed your container image to a different Google Cloud project from the one where you are using Vertex AI, you must grant the Artifact Registry Reader role for the Artifact Registry repository to the Vertex AI Service Agent.
Access model artifacts
When you create a custom-trained Model without a custom container, you must specify the URI of a Cloud Storage directory with model artifacts as the artifactUri field. When you create a Model with a custom container, providing model artifacts in Cloud Storage is optional.
If the container image includes the model artifacts that you need to serve predictions, there is no need to load files from Cloud Storage. However, if you do provide model artifacts by specifying the artifactUri field, the container must load these artifacts when it starts running.
When Vertex AI starts your container, it sets the AIP_STORAGE_URI environment variable to a Cloud Storage URI that begins with gs://. Your container's ENTRYPOINT instruction can download the directory specified by this URI in order to access the model artifacts.
Note that the value of the AIP_STORAGE_URI environment variable isn't identical to the Cloud Storage URI that you specify in the artifactUri field when you create the Model. Rather, AIP_STORAGE_URI points to a copy of your model artifact directory in a different Cloud Storage bucket, which Vertex AI manages. Vertex AI populates this directory when you create a Model. You can't update the contents of the directory. If you want to use new model artifacts, you must create a new Model.
The service account that your container uses by default has permission to read from this URI. On the other hand, if you specify a custom service account when you deploy the Model to an Endpoint, Vertex AI automatically grants your specified service account the Storage Object Viewer (roles/storage.objectViewer) role for the URI's Cloud Storage bucket.
Use any library that supports Application Default Credentials (ADC) to load the model artifacts; you don't need to explicitly configure authentication.
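For example, a minimal sketch of downloading the artifacts at startup, assuming the google-cloud-storage client library and a writable /tmp/model destination, might look like this:

```python
# Download everything under AIP_STORAGE_URI to a local directory at startup.
# The destination path is an assumption; ADC supplies the credentials.
import os
from google.cloud import storage

def download_artifacts(destination: str = "/tmp/model") -> str:
    uri = os.environ["AIP_STORAGE_URI"]  # for example, gs://bucket/path
    bucket_name, _, prefix = uri.removeprefix("gs://").partition("/")

    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        relative_path = blob.name[len(prefix):].lstrip("/")
        if not relative_path:
            continue  # skip the directory placeholder object, if any
        local_path = os.path.join(destination, relative_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
    return destination
```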
Environment variables available in the container
When running, the container's ENTRYPOINT instruction can reference environment variables that you have configured manually, as well as environment variables set automatically by Vertex AI. This section describes each way that you can set environment variables, and it provides details about the variables set automatically by Vertex AI.
Variables set in the container image
To set environment variables in the container image when you build it, use Docker's ENV instruction. Don't set any environment variables that begin with the prefix AIP_.
The container's ENTRYPOINT instruction can use these environment variables, but you can't reference them in any of your Model's API fields.
Variables set by Vertex AI
When Vertex AI starts running the container, it sets the following environment variables in the container environment. Each variable begins with the prefix AIP_. Don't manually set any environment variables that use this prefix.
The container's ENTRYPOINT instruction can access these variables. To learn which Vertex AI API fields can also reference these variables, read the API reference for ModelContainerSpec.
Variable name | Default value | How to configure value | Details |
---|---|---|---|
AIP_ACCELERATOR_TYPE | Unset | When you deploy a Model as a DeployedModel to an Endpoint resource, set the dedicatedResources.machineSpec.acceleratorType field. | If applicable, this variable specifies the type of accelerator used by the virtual machine (VM) instance that the container is running on. |
AIP_DEPLOYED_MODEL_ID | A string of digits identifying the DeployedModel to which this container's Model has been deployed. | Not configurable | This value is the DeployedModel's id field. |
AIP_ENDPOINT_ID | A string of digits identifying the Endpoint on which the container's Model has been deployed. | Not configurable | This value is the last segment of the Endpoint's name field (following endpoints/). |
AIP_FRAMEWORK | CUSTOM_CONTAINER | Not configurable | |
AIP_HEALTH_ROUTE | /v1/endpoints/ENDPOINT/deployedModels/DEPLOYED_MODEL In this string, replace ENDPOINT with the value of the AIP_ENDPOINT_ID variable and replace DEPLOYED_MODEL with the value of the AIP_DEPLOYED_MODEL_ID variable. | When you create a Model, set the containerSpec.healthRoute field. | This variable specifies the HTTP path on the container that Vertex AI sends health checks to. |
AIP_HTTP_PORT | 8080 | When you create a Model, set the containerSpec.ports field. The first entry in this field becomes the value of AIP_HTTP_PORT. | Vertex AI sends liveness checks, health checks, and prediction requests to this port on the container. Your container's HTTP server must listen for requests on this port. |
AIP_MACHINE_TYPE | No default, must be configured | When you deploy a Model as a DeployedModel to an Endpoint resource, set the dedicatedResources.machineSpec.machineType field. | This variable specifies the type of VM that the container is running on. |
AIP_MODE | PREDICTION | Not configurable | This variable signifies that the container is running on Vertex AI to serve online predictions. You can use this environment variable to add custom logic to your container, so that it can run in multiple computing environments but only use certain code paths when run on Vertex AI. |
AIP_MODE_VERSION | 1.0.0 | Not configurable | This variable signifies the version of the custom container requirements (this document) that Vertex AI expects the container to meet. This document updates according to semantic versioning. |
AIP_MODEL_NAME | The value of the AIP_ENDPOINT_ID variable | Not configurable | See the AIP_ENDPOINT_ID row. This variable exists for compatibility reasons. |
AIP_PREDICT_ROUTE | /v1/endpoints/ENDPOINT/deployedModels/DEPLOYED_MODEL:predict In this string, replace ENDPOINT with the value of the AIP_ENDPOINT_ID variable and replace DEPLOYED_MODEL with the value of the AIP_DEPLOYED_MODEL_ID variable. | When you create a Model, set the containerSpec.predictRoute field. | This variable specifies the HTTP path on the container that Vertex AI forwards prediction requests to. |
AIP_PROJECT_NUMBER | The project number of the Google Cloud project where you are using Vertex AI | Not configurable | |
AIP_STORAGE_URI | A Cloud Storage URI (beginning with gs://) that points to a Vertex AI-managed copy of your model artifacts, if you provided model artifacts when creating the Model. | Not configurable | This variable specifies the directory that contains a copy of your model artifacts, if applicable. |
AIP_VERSION_NAME | The value of the AIP_DEPLOYED_MODEL_ID variable | Not configurable | See the AIP_DEPLOYED_MODEL_ID row. This variable exists for compatibility reasons. |
Variables set in the Model resource
When you create a Model, you can set additional environment variables in the containerSpec.env field.
The container's ENTRYPOINT instruction can access these variables. To learn which Vertex AI API fields can also reference these variables, read the API reference for ModelContainerSpec.
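At runtime, both kinds of variables are visible to your server as ordinary environment variables, as in the following sketch; LOG_LEVEL is a hypothetical variable that you might set in containerSpec.env.

```python
import os

# AIP_* variables are set by Vertex AI; LOG_LEVEL is a hypothetical variable
# set through containerSpec.env.
predict_route = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
log_level = os.environ.get("LOG_LEVEL", "INFO")
```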
What's next
- Learn more about serving predictions using a custom container, including how to specify container-related API fields when you import a model.