gcloud beta ai model-garden models deploy

gcloud beta ai model-garden models deploy - deploy a model in Model Garden to a Vertex AI endpoint
gcloud beta ai model-garden models deploy --model=MODEL [--accelerator-type=ACCELERATOR_TYPE] [--accept-eula] [--asynchronous] [--container-args=[ARG,…]] [--container-command=[COMMAND,…]] [--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS] [--container-env-vars=[KEY=VALUE,…]] [--container-grpc-ports=[PORT,…]] [--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]] [--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS] [--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS] [--container-health-route=CONTAINER_HEALTH_ROUTE] [--container-image-uri=CONTAINER_IMAGE_URI] [--container-ports=[PORT,…]] [--container-predict-route=CONTAINER_PREDICT_ROUTE] [--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB] [--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]] [--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS] [--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS] [--enable-fast-tryout] [--endpoint-display-name=ENDPOINT_DISPLAY_NAME] [--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN] [--machine-type=MACHINE_TYPE] [--region=REGION] [--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]] [--spot] [--use-dedicated-endpoint] [GCLOUD_WIDE_FLAG]
To deploy a Model Garden model google/gemma2@gemma-2-9b under project example in region us-central1, run:
gcloud ai model-garden models deploy --model=google/gemma2@gemma-2-9b --project=example --region=us-central1

To deploy a Hugging Face model meta-llama/Meta-Llama-3-8B under project example in region us-central1, run:

gcloud ai model-garden models deploy --model=meta-llama/Meta-Llama-3-8B --hugging-face-access-token={hf_token} --project=example --region=us-central1
--model=MODEL
The model to be deployed. For a Model Garden model, use the format {publisher_name}/{model_name}@{model_version_name}, e.g. google/gemma2@gemma-2-2b. For a Hugging Face model, use the Hugging Face naming convention, e.g. meta-llama/Meta-Llama-3-8B.
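The two accepted shapes of the --model value can be told apart by the presence of an @ separator. The following is a minimal sketch of that distinction using POSIX shell pattern matching; model_kind is a hypothetical helper, not part of gcloud, and it checks only the shape of the string, not whether the model exists.

```shell
# Hypothetical helper: classify a --model value by its shape only.
model_kind() {
  case "$1" in
    */*@*) echo "model-garden" ;;  # {publisher_name}/{model_name}@{model_version_name}
    */*)   echo "hugging-face" ;;  # e.g. meta-llama/Meta-Llama-3-8B
    *)     echo "unknown" ;;
  esac
}

MODEL="google/gemma2@gemma-2-2b"
KIND=$(model_kind "$MODEL")

# Split a Model Garden identifier into its three parts.
PUBLISHER=${MODEL%%/*}   # text before the first "/"
REST=${MODEL#*/}         # text after the first "/"
NAME=${REST%%@*}         # text before the "@"
VERSION=${REST#*@}       # text after the "@"

echo "$KIND publisher=$PUBLISHER model=$NAME version=$VERSION"
```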
--accelerator-type=ACCELERATOR_TYPE
The accelerator type used to serve the model. It should be a supported accelerator type from the verified deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported accelerator types.
--accept-eula
When set, the user accepts the End User License Agreement (EULA) of the model.
--asynchronous
If set, the command returns immediately instead of polling the operation status.
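As an illustration, the sketch below combines --accept-eula and --asynchronous in one invocation. It only assembles and prints the command so that it can be reviewed first; the model and region values are the ones used in the examples above, and CMD is a hypothetical variable name.

```shell
MODEL="google/gemma2@gemma-2-9b"
REGION="us-central1"

# Build the argument list without executing it.
set -- gcloud beta ai model-garden models deploy \
  --model="$MODEL" \
  --region="$REGION" \
  --accept-eula \
  --asynchronous

CMD="$*"
echo "$CMD"
# To start the deployment for real, replace the two lines above with: "$@"
```

With --asynchronous set, the command is expected to return before the deployment finishes, so the operation status has to be checked separately.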
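--container-args=[ARG,…]
Comma-separated arguments passed to the command run by the container image. If not specified and no --container-command is provided, the container image's default command is used.
--container-command=[COMMAND,…]
Entrypoint for the container image. If not specified, the container image's default entrypoint is run.
--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS
Deployment timeout in seconds.
--container-env-vars=[KEY=VALUE,…]
List of key-value pairs to set as environment variables.
--container-grpc-ports=[PORT,…]
Container ports to receive gRPC requests at. Each port must be a number between 1 and 65535, inclusive.
--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]
Exec specifies the action to take. Used by the health probe. An example of this argument would be ["cat", "/tmp/healthy"].
--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS
How often (in seconds) to perform the health probe. Defaults to 10 seconds. Minimum value is 1.
--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS
Number of seconds after which the health probe times out. Defaults to 1 second. Minimum value is 1.
--container-health-route=CONTAINER_HEALTH_ROUTE
HTTP path to send health checks to inside the container.
--container-image-uri=CONTAINER_IMAGE_URI
URI of the model serving container image in the Container Registry (e.g. gcr.io/myproject/server:latest).
--container-ports=[PORT,…]
Container ports to receive HTTP requests at. Each port must be a number between 1 and 65535, inclusive.
--container-predict-route=CONTAINER_PREDICT_ROUTE
HTTP path to send prediction requests to inside the container.
--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB
The amount of VM memory, in megabytes, to reserve as shared memory for the model.
--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]
Exec specifies the action to take. Used by the startup probe. An example of this argument would be ["cat", "/tmp/healthy"].
--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS
How often (in seconds) to perform the startup probe. Defaults to 10 seconds. Minimum value is 1.
--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS
Number of seconds after which the startup probe times out. Defaults to 1 second. Minimum value is 1.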
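The container flags can be combined to override the serving container and its probes. The sketch below assembles such a command and prints it without running it. All values are placeholders: the port, routes, and probe timings are assumptions chosen for illustration, and the comma-separated form cat,/tmp/healthy for the exec probes is inferred from the list syntax in the synopsis rather than confirmed.

```shell
MODEL="meta-llama/Meta-Llama-3-8B"
IMAGE_URI="gcr.io/myproject/server:latest"   # placeholder image URI

# Build the argument list without executing it.
set -- gcloud beta ai model-garden models deploy \
  --model="$MODEL" \
  --container-image-uri="$IMAGE_URI" \
  --container-ports=8080 \
  --container-health-route=/health \
  --container-predict-route=/predict \
  --container-startup-probe-exec=cat,/tmp/healthy \
  --container-startup-probe-period-seconds=30 \
  --container-startup-probe-timeout-seconds=5 \
  --container-health-probe-exec=cat,/tmp/healthy \
  --container-health-probe-period-seconds=10 \
  --container-health-probe-timeout-seconds=1

CMD="$*"
echo "$CMD"
# To run it for real, replace the two lines above with: "$@"
```

A longer startup probe period than health probe period is a common choice for large models, since loading weights usually takes much longer than answering a health check.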
--enable-fast-tryout
If set, the model is deployed using a faster deployment path. Useful for quick experiments; not intended for production workloads. Only available for the most popular models with certain machine types.
--endpoint-display-name=ENDPOINT_DISPLAY_NAME
Display name of the endpoint with the deployed model.
--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN
The access token from Hugging Face needed to read the model artifacts of gated models. It is only needed when the Hugging Face model to deploy is gated.
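One way to avoid typing the token on the command line is to read it from an environment variable. The sketch below assumes a hypothetical variable named HF_TOKEN and falls back to a placeholder so that it runs as written; it prints only the token length, not the token itself.

```shell
# HF_TOKEN is a hypothetical variable name; export the real token beforehand.
HF_TOKEN="${HF_TOKEN:-hf_placeholder_token}"
MODEL="meta-llama/Meta-Llama-3-8B"

# Build the argument list without executing it.
set -- gcloud beta ai model-garden models deploy \
  --model="$MODEL" \
  --hugging-face-access-token="$HF_TOKEN" \
  --region=us-central1

CMD="$*"
echo "Prepared deploy of $MODEL (token length: ${#HF_TOKEN})"
# To run it for real, add this line: "$@"
```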
--machine-type=MACHINE_TYPE
The machine type to deploy the model to. It should be a supported machine type from the deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported machine types.
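A typical workflow is to look up the verified configurations first and then pass a matching pair of values to the deploy command. The sketch below prints both commands without running them. The machine type g2-standard-12 and accelerator type NVIDIA_L4 are placeholders, and passing --model to list-deployment-config is an assumption; take the real values from that command's output.

```shell
MODEL="google/gemma2@gemma-2-9b"
MACHINE_TYPE="g2-standard-12"   # placeholder; pick one from the listed configs
ACCELERATOR_TYPE="NVIDIA_L4"    # placeholder; must match the machine type

# Step 1: list the verified deployment configurations for the model.
set -- gcloud ai model-garden models list-deployment-config --model="$MODEL"
LIST_CMD="$*"

# Step 2: deploy with a machine and accelerator type taken from that list.
set -- gcloud beta ai model-garden models deploy \
  --model="$MODEL" \
  --machine-type="$MACHINE_TYPE" \
  --accelerator-type="$ACCELERATOR_TYPE"
DEPLOY_CMD="$*"

echo "$LIST_CMD"
echo "$DEPLOY_CMD"
```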
Region resource - Cloud region to deploy the model. This represents a Cloud resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:

  • provide the argument --region on the command line with a fully specified name;
  • set the property ai/region with a fully specified name;
  • choose one from the prompted list of available regions with a fully specified name;
  • provide the argument --project on the command line;
  • set the property core/project.
--region=REGION
ID of the region or fully qualified identifier for the region.

To set the region attribute:

  • provide the argument --region on the command line;
  • set the property ai/region;
  • choose one from the prompted list of available regions.
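The property-based option can be sketched as follows: setting ai/region once lets later commands omit --region. The commands are only printed here, and the region value is the one used in the examples above.

```shell
REGION="us-central1"

# Step 1: store the region as the ai/region property.
set -- gcloud config set ai/region "$REGION"
CONFIG_CMD="$*"

# Step 2: deploy without --region; the property supplies it.
set -- gcloud beta ai model-garden models deploy --model="google/gemma2@gemma-2-9b"
DEPLOY_CMD="$*"

echo "$CONFIG_CMD"
echo "$DEPLOY_CMD"
```

An explicit --region on the command line is listed first among the options above, so it is the clearest choice in scripts that must not depend on local configuration.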
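--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]
A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a DeployedModel) to draw its Compute Engine resources from a Shared Reservation, or exclusively from on-demand capacity.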
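The sketch below shows how the three sub-keys from the synopsis might be combined to target one specific reservation. The type value specific-reservation, the key compute.googleapis.com/reservation-name, and the reservation path are assumptions based on Compute Engine reservation conventions; the project, zone, and reservation names are hypothetical. The command is printed, not run.

```shell
# Hypothetical reservation path; replace project, zone, and name with real ones.
RESERVATION="projects/example/zones/us-central1-a/reservations/my-reservation"

# Assumed sub-key values, joined in the comma-separated form the synopsis shows.
AFFINITY="reservation-affinity-type=specific-reservation"
AFFINITY="$AFFINITY,key=compute.googleapis.com/reservation-name"
AFFINITY="$AFFINITY,values=$RESERVATION"

# Build the argument list without executing it.
set -- gcloud beta ai model-garden models deploy \
  --model="google/gemma2@gemma-2-9b" \
  --region=us-central1 \
  --reservation-affinity="$AFFINITY"

CMD="$*"
echo "$CMD"
# To run it for real, replace the two lines above with: "$@"
```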
--spot
If set, schedule the deployment workload on Spot VMs.
--use-dedicated-endpoint
If set, the endpoint is exposed through a dedicated DNS. Requests to the dedicated DNS are isolated from other users' traffic and have better performance and reliability.
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

This command is currently in beta and might change without notice. This variant is also available:
gcloud alpha ai model-garden models deploy