In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models:
- Gemma 3 27B-it
- Llama 4 Scout 17B-16E
This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Verify that billing is enabled for your Google Cloud project.
- Enable the required APIs.
- Make sure that you have the following role or roles on the project: roles/artifactregistry.admin, roles/browser, roles/compute.networkAdmin, roles/container.clusterAdmin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/serviceusage.serviceUsageAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account.
- Ensure your project has sufficient GPU quota. For more information, see Allocation quotas.
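You can inspect a region's GPU quota from the command line before you begin. The following is a minimal sketch, assuming us-central1 as an example region and PROJECT_ID as a placeholder for your project ID; adjust both as needed:

# List NVIDIA GPU quota metrics for an example region.
gcloud compute regions describe us-central1 \
    --project=PROJECT_ID \
    --format=json | jq '.quotas[] | select(.metric | test("NVIDIA"))'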
Get access to the model
To access the model through Hugging Face, you need a Hugging Face token.
Follow these steps to generate a new token if you don't have one already:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
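Optionally, you can verify that the token is valid before continuing. This minimal sketch calls Hugging Face's whoami-v2 API endpoint; replace HF_TOKEN with the token you just copied:

# A successful response returns your Hugging Face account details.
curl --silent --show-error \
    --header "Authorization: Bearer HF_TOKEN" \
    https://huggingface.co/api/whoami-v2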
Provision the GKE inference environment
In this section, you deploy the necessary infrastructure to serve your model.
Launch Cloud Shell
This guide uses Cloud Shell to execute commands. Cloud Shell comes preinstalled with the necessary tools, including gcloud, kubectl, and git.
In the Google Cloud console, start a Cloud Shell instance:
This action launches a session in the bottom pane of Google Cloud console.
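Optionally, confirm that the preinstalled tools are available in your Cloud Shell session. These version commands are a quick sanity check:

# Quick sanity check of the preinstalled tools used in this guide.
gcloud --version
kubectl version --client
git --version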
Deploy the base architecture
To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:
In Cloud Shell, clone the following repository:
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms && \
export ACP_REPO_DIR="$(pwd)"
Set your environment variables:
export TF_VAR_platform_default_project_id=PROJECT_ID
export HF_TOKEN_READ=HF_TOKEN
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- HF_TOKEN: the Hugging Face token you generated earlier.
This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.
To update the Terraform version in Cloud Shell, you can run the following script. This script installs the terraform-switcher tool and makes changes to your shell environment.
"${ACP_REPO_DIR}/tools/bin/install_terraform.sh"
source ~/.bashrc
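After the script finishes, you can verify that the active Terraform binary meets the minimum version required by this guide:

# Should report Terraform v1.8.0 or later.
terraform version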
Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.
You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.
Autopilot
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"
Standard
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"
After this script completes, you will have a GKE cluster ready for inference workloads.
Run the following command to set environment variables from the shared configuration:
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before deploying the model. In Cloud Shell, run the following command to add the token to Secret Manager:
echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --data-file=- \
    --project=${huggingface_secret_manager_project_id}
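Optionally, confirm that the secret now contains an enabled version. This quick check uses the same environment variables from the shared configuration:

# At least one version in the ENABLED state should be listed.
gcloud secrets versions list ${huggingface_hub_access_token_read_secret_manager_secret_name} \
    --project=${huggingface_secret_manager_project_id}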
Deploy an open model
You are now ready to download and deploy the model.
Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
Set the environment variables for the model you want to deploy:
Gemma 3 27B-it
export ACCELERATOR_TYPE="l4"
export MODEL_ID="google/gemma-3-27b-it"
MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
Llama 4 Scout 17B-16E
export ACCELERATOR_TYPE="h100"
export MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"
MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
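In either case, MODEL_NAME is the lowercased final segment of MODEL_ID, and it is used later to select the matching manifest directory (vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}). You can confirm the values before continuing:

# Confirm the derived values; for Gemma 3 27B-it this prints "l4" and "gemma-3-27b-it".
echo "${ACCELERATOR_TYPE}"
echo "${MODEL_NAME}"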
Run the following script to configure the Kubernetes Job that downloads the model to Cloud Storage:
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
Deploy the model download Job:
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
Wait for the download to complete. Monitor the Job's status, and when COMPLETIONS is 1/1, press Ctrl+C to exit.
watch --color --interval 5 --no-title "kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/hf-model-to-gcs"
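If the Job appears stuck or fails, its logs usually show the cause. This sketch uses the Job and namespace names from the previous command:

# Stream the logs of the model download Job.
kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} \
    logs job/hf-model-to-gcs --follow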
Deploy the inference workload to your GKE cluster.
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/configure_deployment.sh" kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
Test your deployment
Wait for the inference server Pod to be ready. When the READY column is 1/1, press Ctrl+C to exit.
watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME}"
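If the Deployment does not become ready, the vLLM server logs usually explain why. This sketch uses the same Deployment and namespace names:

# Show the most recent vLLM server logs for the inference Deployment.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} \
    logs deployment/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} --tail=100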
Run the following script to set up port forwarding and send a sample request to the model. This example uses the payload format for a Gemma 3 27b-it model.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
curl http://127.0.0.1:8000/v1/chat/completions \
    --data '{
      "model": "/gcs/'${MODEL_ID}'",
      "messages": [
        {
          "role": "user",
          "content": "What is GKE?"
        }
      ]
    }' \
    --header "Content-Type: application/json" \
    --request POST \
    --show-error \
    --silent | jq
kill -9 ${PF_PID}
You should see a JSON response from the model answering the question.
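vLLM serves an OpenAI-compatible chat completion schema, so the generated text is at .choices[0].message.content in the response. To print only that text, you can use a narrower jq filter; a minimal, self-contained sketch that assumes this standard response shape:

# Re-establish the port-forward, send the same request, and print only the generated text.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
PF_PID=$!
sleep 5  # give the port-forward a moment to establish
curl http://127.0.0.1:8000/v1/chat/completions \
    --data '{ "model": "/gcs/'${MODEL_ID}'", "messages": [ { "role": "user", "content": "What is GKE?" } ] }' \
    --header "Content-Type: application/json" \
    --request POST \
    --show-error \
    --silent | jq -r '.choices[0].message.content'
kill -9 ${PF_PID}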
Clean up
To avoid incurring charges, delete all the resources you created.
Delete the inference workload:
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
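Before tearing down the cluster, you can optionally confirm that the workload's objects are gone; a quick check in the inference namespace:

# The vLLM Deployment and Service should no longer be listed.
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment,service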
Remove the foundational GKE cluster:
Autopilot
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"
Standard
"${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"
What's next
- Learn more about AI/ML model inference on GKE.
- Analyze model inference performance and costs with the GKE Inference Quickstart tool.
- Explore other use cases and patterns on GitHub built on this architecture.