In a rolling deployment, a deployed model is replaced with a new version of the same model. The new model reuses the compute resources from the previous one.
In the rolling deployment request, the traffic split and `dedicatedResources` values are the same as for the previous deployment. After the rolling deployment completes, the traffic split is updated to show that all of the traffic from the previous `DeployedModel` has migrated to the new deployment.
Other configurable fields in `DeployedModel` (such as `serviceAccount`, `disableContainerLogging`, and `enableAccessLogging`) are set to the same values as for the previous `DeployedModel` by default. However, you can optionally specify new values for these fields.
When a model is deployed using a rolling deployment, a new `DeployedModel` is created. The new `DeployedModel` receives a new ID that is different from that of the previous one. It also receives a new `revisionNumber` value in the `rolloutOptions` field.
If there are multiple rolling deployments targeting the same backing resources, the `DeployedModel` with the highest `revisionNumber` is treated as the intended final state.
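The selection rule can be sketched as follows. This is a minimal illustration, not the service's implementation: the field names `rolloutOptions` and `revisionNumber` mirror the API, but representing each `DeployedModel` as a plain dict is an assumption for this sketch.

```python
def intended_final_state(deployed_models):
    """Among DeployedModels targeting the same backing resources,
    the one with the highest rolloutOptions.revisionNumber is the
    intended final state."""
    return max(deployed_models,
               key=lambda m: m["rolloutOptions"]["revisionNumber"])

# Two competing rolling deployments on the same backing resources:
models = [
    {"id": "111", "rolloutOptions": {"revisionNumber": 2}},
    {"id": "222", "rolloutOptions": {"revisionNumber": 3}},
]
winner = intended_final_state(models)  # the model with revisionNumber 3
```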
As the rolling deployment progresses, all the existing replicas for the previous `DeployedModel` are replaced with replicas of the new `DeployedModel`. This happens gradually: replicas are updated whenever the deployment has enough available replicas or enough surge capacity to bring up additional replicas.
Additionally, as the rolling deployment progresses, the traffic for the old `DeployedModel` is gradually migrated to the new `DeployedModel`. The traffic is load-balanced in proportion to the number of ready-to-serve replicas of each `DeployedModel`.
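The proportional load balancing can be sketched as a simple calculation; a hedged illustration only, with the function name and dict shape chosen for this example:

```python
def traffic_split(ready_replicas):
    """Split traffic in proportion to each DeployedModel's
    ready-to-serve replica count."""
    total = sum(ready_replicas.values())
    if total == 0:
        return {model: 0.0 for model in ready_replicas}
    return {model: count / total for model, count in ready_replicas.items()}

# Mid-rollout: 2 replicas of the old model remain, 3 new replicas are ready,
# so the old model serves 40% of traffic and the new model serves 60%.
split = traffic_split({"old": 2, "new": 3})
```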
If the rolling deployment's new replicas never become ready because their health route consistently returns a non-200 response code, traffic isn't sent to those unready replicas. In this case, the rolling deployment eventually fails, and the replicas are reverted to the previous `DeployedModel`.
Start a rolling deployment
To start a rolling deployment, include the `rolloutOptions` field in the model deployment request, as shown in the following example.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- PREVIOUS_DEPLOYED_MODEL: The `DeployedModel` ID of a model on the same endpoint. This specifies the `DeployedModel` whose backing resources are to be reused. You can call `GetEndpoint` to get a list of deployed models on an endpoint along with their numeric IDs.
- MAX_UNAVAILABLE_REPLICAS: The number of model replicas that can be taken down during the rolling deployment.
- MAX_SURGE_REPLICAS: The number of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "rolloutOptions": {
      "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
      "maxUnavailableReplicas": "MAX_UNAVAILABLE_REPLICAS",
      "maxSurgeReplicas": "MAX_SURGE_REPLICAS"
    }
  }
}
You should receive a successful status code (2xx) and an empty response.
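The same request body can be assembled programmatically before sending it to the `deployModel` endpoint. The field names below come from the example above; the helper function name and argument names are assumptions for this sketch, and sending the request (shown in comments) requires an access token you supply.

```python
import json

def build_rolling_deploy_body(model_name, previous_deployed_model,
                              max_unavailable_replicas, max_surge_replicas):
    """Build the deployModel request body for a rolling deployment.

    Setting max_surge_replicas to 0 means only existing capacity is used.
    """
    return {
        "deployedModel": {
            "model": model_name,
            "rolloutOptions": {
                "previousDeployedModel": previous_deployed_model,
                "maxUnavailableReplicas": max_unavailable_replicas,
                "maxSurgeReplicas": max_surge_replicas,
            },
        }
    }

body = build_rolling_deploy_body(
    "projects/my-project/locations/us-central1/models/my-model",  # hypothetical IDs
    "1234567890",
    max_unavailable_replicas=1,
    max_surge_replicas=1,
)
payload = json.dumps(body)
# POST this payload to
# https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
# with an "Authorization: Bearer <access token>" header.
```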
If desired, you can replace `maxSurgeReplicas`, `maxUnavailableReplicas`, or both with percentage values, as shown in the following example.
REST
Before using any of the request data, make the following replacements:
- MAX_UNAVAILABLE_PERCENTAGE: The percentage of model replicas that can be taken down during the rolling deployment.
- MAX_SURGE_PERCENTAGE: The percentage of additional model replicas that can be brought up during the rolling deployment. If this is set to zero, then only the existing capacity is used.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
    "rolloutOptions": {
      "previousDeployedModel": "PREVIOUS_DEPLOYED_MODEL",
      "maxUnavailablePercentage": "MAX_UNAVAILABLE_PERCENTAGE",
      "maxSurgePercentage": "MAX_SURGE_PERCENTAGE"
    }
  }
}
You should receive a successful status code (2xx) and an empty response.
Roll back a rolling deployment
To roll back a rolling deployment, start a new rolling deployment of the previous model, using the ongoing rolling deployment's `DeployedModel` ID as the `previousDeployedModel`.
To get the `DeployedModel` ID for an ongoing deployment, set the parameter `allDeploymentStates=true` in the call to `GetEndpoint`, as shown in the following example.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID?allDeploymentStates=true
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID",
  "displayName": "rolling-deployments-endpoint",
  "deployedModels": [
    {
      "id": "2718281828459045",
      "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID@1",
      "displayName": "rd-test-model",
      "createTime": "2024-09-11T21:37:48.522692Z",
      "dedicatedResources": {
        "machineSpec": {
          "machineType": "e2-standard-2"
        },
        "minReplicaCount": 5,
        "maxReplicaCount": 5
      },
      "modelVersionId": "1",
      "state": "BEING_DEPLOYED"
    }
  ],
  "etag": "AMEw9yMs3TdZMn8CUg-3DY3wS74bkIaTDQhqJ7-Ld_Zp7wgT8gsEfJlrCOyg67lr9dwn",
  "createTime": "2024-09-11T21:22:36.588538Z",
  "updateTime": "2024-09-11T21:27:28.563579Z",
  "dedicatedEndpointEnabled": true,
  "dedicatedEndpointDns": "ENDPOINT_ID.LOCATION_ID-PROJECT_ID.prediction.vertexai.goog"
}
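To pick out the in-progress `DeployedModel` IDs from such a response, you can filter on the `state` field. A hedged sketch: the `BEING_DEPLOYED` value appears in the response above, but the second state value in the sample data below is a placeholder for illustration.

```python
def in_progress_deployed_model_ids(endpoint):
    """From a GetEndpoint response (called with allDeploymentStates=true),
    return the IDs of DeployedModels that are still being deployed."""
    return [m["id"] for m in endpoint.get("deployedModels", [])
            if m.get("state") == "BEING_DEPLOYED"]

# Trimmed-down version of the response shown above:
endpoint = {
    "deployedModels": [
        {"id": "2718281828459045", "state": "BEING_DEPLOYED"},
        {"id": "1414213562373095", "state": "DEPLOYED"},  # hypothetical entry
    ]
}
ids = in_progress_deployed_model_ids(endpoint)
```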
Constraints and limitations
- The previous `DeployedModel` must be on the same endpoint as the new `DeployedModel`.
- You can't create multiple rolling deployments with the same `previousDeployedModel`.
- You can't create rolling deployments on top of a `DeployedModel` that isn't fully deployed. Exception: if `previousDeployedModel` is itself an in-progress rolling deployment, then a new rolling deployment can be created on top of it. This allows for rolling back deployments that start to fail.
- Previous models don't automatically undeploy after a rolling deployment completes successfully. You can undeploy the model manually.
- For rolling deployments on shared public endpoints, the `predictRoute` and `healthRoute` for the new model must be the same as for the previous model.
- Rolling deployments aren't compatible with model cohosting.
- Rolling deployments can't be used for models that require online explanations.