In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models:
- Gemma 3 27B-it
- Llama 4 Scout 17B-16E
This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Verify that billing is enabled for your Google Cloud project.
- Enable the required APIs.
- Make sure that you have the following role or roles on the project: roles/artifactregistry.admin, roles/browser, roles/compute.networkAdmin, roles/container.clusterAdmin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/serviceusage.serviceUsageAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account.
- Ensure your project has sufficient GPU quota. For more information, see Allocation quotas.
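You can inspect a region's GPU quota from the command line before you begin. The following is a minimal sketch, assuming us-central1 as an example region and PROJECT_ID as a placeholder for your project ID; adjust both as needed:

# List NVIDIA GPU quota metrics for an example region.
gcloud compute regions describe us-central1 \
    --project=PROJECT_ID \
    --format=json | jq '.quotas[] | select(.metric | test("NVIDIA"))'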
Get access to the model
To access the model through Hugging Face, you need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
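Optionally, you can verify that the token is valid before continuing. This minimal sketch calls Hugging Face's whoami-v2 API endpoint; replace HF_TOKEN with the token you just copied:

# A successful response returns your Hugging Face account details.
curl --silent --show-error \
    --header "Authorization: Bearer HF_TOKEN" \
    https://huggingface.co/api/whoami-v2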
Provision the GKE inference environment
In this section, you deploy the necessary infrastructure to serve your model.
Launch Cloud Shell
This guide uses Cloud Shell to execute commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.
In the Google Cloud console, start a Cloud Shell instance:
This action launches a session in the bottom pane of Google Cloud console.
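Optionally, confirm that the preinstalled tools are available in your Cloud Shell session. These version commands are a quick sanity check:

# Quick sanity check of the preinstalled tools used in this guide.
gcloud --version
kubectl version --client
git --version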
Deploy the base architecture
To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:
In Cloud Shell, clone the following repository:
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms && \
export ACP_REPO_DIR="$(pwd)"
Set your environment variables:
export TF_VAR_platform_default_project_id=PROJECT_ID
export HF_TOKEN_READ=HF_TOKEN
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- HF_TOKEN: the Hugging Face token you generated earlier.
This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.
To update the Terraform version in Cloud Shell, you can run the following script. This script installs the terraform-switcher tool and makes changes to your shell environment.
"${ACP_REPO_DIR}/tools/bin/install_terraform.sh"
source ~/.bashrc
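After the script finishes, you can verify that the active Terraform binary meets the minimum version required by this guide:

# Should report Terraform v1.8.0 or later.
terraform version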
Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.
You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.
Autopilot
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"
Standard
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"
After this script completes, you will have a GKE cluster ready for inference workloads.
Run the following command to set environment variables from the shared configuration:
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before deploying the model. In Cloud Shell, run the following command to add the token to Secret Manager:
echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --data-file=- \
    --project=${huggingface_secret_manager_project_id}
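Optionally, confirm that the secret now contains an enabled version. This quick check uses the same environment variables from the shared configuration:

# At least one version in the ENABLED state should be listed.
gcloud secrets versions list ${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --project=${huggingface_secret_manager_project_id}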
Deploy an open model
You are now ready to download and deploy the model.
Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
Set the environment variables for the model you want to deploy:
Gemma 3 27B-it
export ACCELERATOR_TYPE="l4"
export MODEL_ID="google/gemma-3-27b-it"
MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
Llama 4 Scout 17B-16E
export ACCELERATOR_TYPE="h100"
export MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"
MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
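In either case, MODEL_NAME is the lowercased final segment of MODEL_ID, and it is used later to select the matching manifest directory (vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}). You can confirm the values before continuing:

# Confirm the derived values; for Gemma 3 27B-it this prints "l4" and "gemma-3-27b-it".
echo "${ACCELERATOR_TYPE}"
echo "${MODEL_NAME}"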
Run the following script to configure the Kubernetes Job that downloads the model to Cloud Storage:
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
Deploy the model download Job:
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
Wait for the download to complete. Monitor the Job's status, and when COMPLETIONS is 1/1, press Ctrl+C to exit.
watch --color --interval 5 --no-title "kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/hf-model-to-gcs"
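If the Job appears stuck or fails, its logs usually show the cause. This sketch uses the Job and namespace names from the previous command:

# Stream the logs of the model download Job.
kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} \
    logs job/hf-model-to-gcs --follow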
Deploy the inference workload to your GKE cluster.
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/configure_deployment.sh" kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
Test your deployment
Wait for the inference server Pod to be ready. When the READY column is 1/1, press Ctrl+C to exit.
watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME}"
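If the Deployment does not become ready, the vLLM server logs usually explain why. This sketch uses the same Deployment and namespace names:

# Show the most recent vLLM server logs for the inference Deployment.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} \
    logs deployment/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} --tail=100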
Run the following script to set up port forwarding and send a sample request to the model. This example uses the payload format for a Gemma 3 27b-it model.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
curl http://127.0.0.1:8000/v1/chat/completions \
    --data '{
      "model": "/gcs/'${MODEL_ID}'",
      "messages": [
        {
          "role": "user",
          "content": "What is GKE?"
        }
      ]
    }' \
    --header "Content-Type: application/json" \
    --request POST \
    --show-error \
    --silent | jq
kill -9 ${PF_PID}
You should see a JSON response from the model answering the question.
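vLLM serves an OpenAI-compatible chat completion schema, so the generated text is at .choices[0].message.content in the response. To print only that text, you can use a narrower jq filter; a minimal, self-contained sketch that assumes this standard response shape:

# Re-establish the port-forward, send the same request, and print only the generated text.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
sleep 5  # give the port-forward a moment to establish
curl http://127.0.0.1:8000/v1/chat/completions \
    --data '{ "model": "/gcs/'${MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ] }' \
    --header "Content-Type: application/json" \
    --request POST \
    --show-error \
    --silent | jq -r '.choices[0].message.content'
kill -9 ${PF_PID}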
Clean up
To avoid incurring charges, delete all the resources you created.
Delete the inference workload:
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
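Before tearing down the cluster, you can optionally confirm that the workload's objects are gone; a quick check in the inference namespace:

# The vLLM Deployment and Service should no longer be listed.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment,service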
Remove the foundational GKE cluster:
Autopilot
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"
Standard
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"
What's next
- Learn more about AI/ML model inference on GKE.
- Analyze model inference performance and costs with the GKE Inference Quickstart tool.
- Explore other use cases and patterns on GitHub built on this architecture.