In the Google Cloud console, you can create a public endpoint and deploy a model to it.
You can deploy models from the Online prediction page or the Model Registry page.
Deploy a model from the Online prediction page
On the Online prediction page, you can create an endpoint and deploy one or more models to it as follows:
In the Google Cloud console, in the Vertex AI section, go to the Online prediction page.
Click Create.
In the New endpoint pane:
Enter the Endpoint name.
Select Standard for the access type.
To create a dedicated (not shared) public endpoint, select the Enable dedicated DNS checkbox.
Click Continue.
In the Model settings pane:
Select your model from the drop-down list.
Select the model version from the drop-down list.
Enter the Traffic split percentage for the model.
Click Done.
Repeat these steps for any additional models to be deployed.
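If you'd rather script this flow, the sketch below shows a rough equivalent using the Vertex AI SDK for Python. The project, region, model ID, and display names are placeholders, and the `dedicated_endpoint_enabled` argument is my assumption for the Enable dedicated DNS checkbox, so verify it against your SDK version.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")

# Create a public endpoint. dedicated_endpoint_enabled is assumed to map to
# the "Enable dedicated DNS" checkbox; check your SDK version.
endpoint = aiplatform.Endpoint.create(
    display_name="my-endpoint",
    dedicated_endpoint_enabled=True,
)

# Deploy one model version to the endpoint with 100% of the traffic.
model = aiplatform.Model(model_name="MODEL_ID")
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="my-model",
    traffic_percentage=100,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
```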
Deploy a model from the Model Registry page
On the Model Registry page, you can deploy a model to one or more new or existing endpoints as follows:
In the Google Cloud console, in the Vertex AI section, go to the Models page.
Click the name and version ID of the model you want to deploy to open its details page.
Select the Deploy & Test tab.
If your model is already deployed to any endpoints, they are listed in the Deploy your model section.
Click Deploy to endpoint.
To deploy your model to a new endpoint:
- Select Create new endpoint.
- Provide a name for the new endpoint.
- To create a dedicated (not shared) public endpoint, select the Enable dedicated DNS checkbox.
- Click Continue.
To deploy your model to an existing endpoint:
- Select Add to existing endpoint.
- Select the endpoint from the drop-down list.
- Click Continue.
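The same choice exists in the Vertex AI SDK for Python, sketched below with placeholder IDs: calling `Model.deploy()` without an endpoint creates a new one, while passing an existing `Endpoint` adds the model to it.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")
model = aiplatform.Model(model_name="MODEL_ID")

# Deploy to a new endpoint: omitting `endpoint` makes the SDK create one,
# and the newly created Endpoint object is returned.
new_endpoint = model.deploy(machine_type="n1-standard-4", traffic_percentage=100)

# Deploy to an existing endpoint by referencing its ID; traffic of models
# already on the endpoint is scaled down to make room for the new share.
existing_endpoint = aiplatform.Endpoint("ENDPOINT_ID")
model.deploy(
    endpoint=existing_endpoint,
    machine_type="n1-standard-4",
    traffic_percentage=20,
)
```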
You can deploy multiple models to an endpoint, or deploy the same model to multiple endpoints.
If you're deploying your model to a new endpoint, accept 100 for the Traffic split. If you're deploying to an existing endpoint that already has one or more models deployed to it, update the Traffic split percentage for the model you're deploying and for each already deployed model so that all of the percentages add up to 100.
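The same sum-to-100 rule applies when you set the split programmatically. A minimal sketch, assuming `Endpoint.update()` accepts a `traffic_split` mapping in your SDK version, with placeholder deployed-model IDs:

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Keys are deployed-model IDs (placeholders); values must add up to 100.
endpoint.update(traffic_split={"DEPLOYED_MODEL_A": 80, "DEPLOYED_MODEL_B": 20})
```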
Enter the Minimum number of compute nodes you want to provide for your model.
This is the number of nodes that need to be available to the model at all times.
You are charged for each node in use, whether it's handling prediction traffic or standing by as one of the minimum nodes; you pay for the minimum nodes even when there is no prediction traffic. See the pricing page.
The number of compute nodes can increase when needed to handle prediction traffic, but it never exceeds the maximum number of nodes.
To use autoscaling, enter the Maximum number of compute nodes you want Vertex AI to scale up to.
Select your Machine type.
Larger machine resources increase prediction performance but also increase cost. Compare the available machine types.
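These console fields correspond to the `machine_type`, `min_replica_count`, and `max_replica_count` arguments of `Model.deploy()` in the Vertex AI SDK for Python; the values in the sketch below are illustrative placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")
model = aiplatform.Model(model_name="MODEL_ID")

# min_replica_count nodes stay provisioned (and billed) at all times;
# Vertex AI autoscales between the minimum and maximum with traffic.
model.deploy(
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    machine_type="n1-standard-8",  # larger machines: faster, costlier
    min_replica_count=1,
    max_replica_count=5,
)
```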
Select an Accelerator type and an Accelerator count.
This option appears only if you enabled accelerator use when you imported or created the model.
For the accelerator count, refer to the GPU table to check for valid numbers of GPUs that you can use with each CPU machine type. The accelerator count refers to the number of accelerators per node, not the total number of accelerators in your deployment.
If you want to use a custom service account for the deployment, select a service account in the Service account drop-down box.
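For reference, accelerators and a custom service account map to the `accelerator_type`, `accelerator_count`, and `service_account` arguments of `Model.deploy()`; the sketch below uses placeholder values.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")
model = aiplatform.Model(model_name="MODEL_ID")

# accelerator_count is per node: with min_replica_count=2, this deployment
# provisions 2 nodes x 1 GPU = 2 GPUs in total.
model.deploy(
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=2,
    max_replica_count=4,
    service_account="SA_NAME@PROJECT_ID.iam.gserviceaccount.com",  # optional custom service account
)
```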
Learn how to change the default settings for prediction logging.
Click Done for your model, and when all the Traffic split percentages are correct, click Continue.
The region where your model is deployed is displayed; it must be the region where you created your model.
Click Deploy to deploy your model to the endpoint.
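After the deployment completes, you can confirm it from the SDK. A small sketch, assuming the `list_models()` method and `traffic_split` property behave as in current SDK versions, with a placeholder endpoint ID:

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="REGION")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Inspect what is deployed and how traffic is split across models.
for deployed_model in endpoint.list_models():
    print(deployed_model.id, deployed_model.display_name)
print(endpoint.traffic_split)
```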
What's next
- Learn how to get an online prediction.
- Learn how to change the default settings for prediction logging.