gcloud beta ai model-garden models deploy

NAME
gcloud beta ai model-garden models deploy - deploy a model in Model Garden to a Vertex AI endpoint
SYNOPSIS
gcloud beta ai model-garden models deploy --model=MODEL [--accelerator-type=ACCELERATOR_TYPE] [--accept-eula] [--asynchronous] [--container-args=[ARG,…]] [--container-command=[COMMAND,…]] [--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS] [--container-env-vars=[KEY=VALUE,…]] [--container-grpc-ports=[PORT,…]] [--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]] [--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS] [--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS] [--container-health-route=CONTAINER_HEALTH_ROUTE] [--container-image-uri=CONTAINER_IMAGE_URI] [--container-ports=[PORT,…]] [--container-predict-route=CONTAINER_PREDICT_ROUTE] [--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB] [--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]] [--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS] [--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS] [--enable-fast-tryout] [--endpoint-display-name=ENDPOINT_DISPLAY_NAME] [--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN] [--machine-type=MACHINE_TYPE] [--region=REGION] [--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]] [--spot] [--use-dedicated-endpoint] [GCLOUD_WIDE_FLAG]
EXAMPLES
To deploy a Model Garden model google/gemma2/gemma2-9b under project example in region us-central1, run:
gcloud ai model-garden models deploy --model=google/gemma2@gemma-2-9b --project=example --region=us-central1

To deploy a Hugging Face model meta-llama/Meta-Llama-3-8B under project example in region us-central1, run:

gcloud ai model-garden models deploy --model=meta-llama/Meta-Llama-3-8B --hugging-face-access-token={hf_token} --project=example --region=us-central1
REQUIRED FLAGS
--model=MODEL
The model to be deployed. If it is a Model Garden model, it should be in the format of {publisher_name}/{model_name}@{model_version_name}, e.g. google/gemma2@gemma-2-2b. If it is a Hugging Face model, it should be in the convention of Hugging Face models, e.g. meta-llama/Meta-Llama-3-8B.
OPTIONAL FLAGS
--accelerator-type=ACCELERATOR_TYPE
The accelerator type to serve the model. It should be a supported accelerator type from the verified deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported accelerator types.
--accept-eula
When set, the user accepts the End User License Agreement (EULA) of the model.
--asynchronous
If set to true, the command will terminate immediately and not keep polling the operation status.
--container-args=[ARG,…]
Comma-separated arguments passed to the command run by the container image. If not specified and no --command is provided, the container image's default command is used.
--container-command=[COMMAND,…]
Entrypoint for the container image. If not specified, the container image's default entrypoint is run.
--container-deployment-timeout-seconds=CONTAINER_DEPLOYMENT_TIMEOUT_SECONDS
Deployment timeout in seconds.
--container-env-vars=[KEY=VALUE,…]
List of key-value pairs to set as environment variables.
--container-grpc-ports=[PORT,…]
Container ports to receive grpc requests at. Must be a number between 1 and 65535, inclusive.
--container-health-probe-exec=[HEALTH_PROBE_EXEC,…]
Exec specifies the action to take. Used by health probe. An example of this argument would be ["cat", "/tmp/healthy"].
--container-health-probe-period-seconds=CONTAINER_HEALTH_PROBE_PERIOD_SECONDS
How often (in seconds) to perform the health probe. Default to 10 seconds. Minimum value is 1.
--container-health-probe-timeout-seconds=CONTAINER_HEALTH_PROBE_TIMEOUT_SECONDS
Number of seconds after which the health probe times out. Defaults to 1 second. Minimum value is 1.
--container-health-route=CONTAINER_HEALTH_ROUTE
HTTP path to send health checks to inside the container.
--container-image-uri=CONTAINER_IMAGE_URI
URI of the Model serving container file in the Container Registry (e.g. gcr.io/myproject/server:latest).
--container-ports=[PORT,…]
Container ports to receive http requests at. Must be a number between 1 and 65535, inclusive.
--container-predict-route=CONTAINER_PREDICT_ROUTE
HTTP path to send prediction requests to inside the container.
--container-shared-memory-size-mb=CONTAINER_SHARED_MEMORY_SIZE_MB
The amount of the VM memory to reserve as the shared memory for the model in megabytes.
--container-startup-probe-exec=[STARTUP_PROBE_EXEC,…]
Exec specifies the action to take. Used by startup probe. An example of this argument would be ["cat", "/tmp/healthy"].
--container-startup-probe-period-seconds=CONTAINER_STARTUP_PROBE_PERIOD_SECONDS
How often (in seconds) to perform the startup probe. Default to 10 seconds. Minimum value is 1.
--container-startup-probe-timeout-seconds=CONTAINER_STARTUP_PROBE_TIMEOUT_SECONDS
Number of seconds after which the startup probe times out. Defaults to 1 second. Minimum value is 1.
--enable-fast-tryout
If True, model will be deployed using faster deployment path. Useful for quick experiments. Not for production workloads. Only available for most popular models with certain machine types.
--endpoint-display-name=ENDPOINT_DISPLAY_NAME
Display name of the endpoint with the deployed model.
--hugging-face-access-token=HUGGING_FACE_ACCESS_TOKEN
The access token from Hugging Face needed to read the model artifacts of gated models. It is only needed when the Hugging Face model to deploy is gated.
--machine-type=MACHINE_TYPE
The machine type to deploy the model to. It should be a supported machine type from the deployment configurations of the model. Use gcloud ai model-garden models list-deployment-config to check the supported machine types.
Region resource - Cloud region to deploy the model. This represents a Cloud resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:

  • provide the argument --region on the command line with a fully specified name;
  • set the property ai/region with a fully specified name;
  • choose one from the prompted list of available regions with a fully specified name;
  • provide the argument --project on the command line;
  • set the property core/project.
--region=REGION
ID of the region or fully qualified identifier for the region.

To set the region attribute:

  • provide the argument --region on the command line;
  • set the property ai/region;
  • choose one from the prompted list of available regions.
--reservation-affinity=[key=KEY],[reservation-affinity-type=RESERVATION-AFFINITY-TYPE],[values=VALUES]
A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a DeployedModel) to draw its Compute Engine resources from a Shared Reservation, or exclusively from on-demand capacity.
--spot
If true, schedule the deployment workload on Spot VM.
--use-dedicated-endpoint
If true, the endpoint will be exposed through a dedicated DNS. Your request to the dedicated DNS will be isolated from other users' traffic and will have better performance and reliability.
GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

NOTES
This command is currently in beta and might change without notice. This variant is also available:
gcloud alpha ai model-garden models deploy