Serve open LLMs on GKE with a pre-configured architecture


This page shows you how to quickly deploy and serve popular open large language models (LLMs) on GKE for inference by using a pre-configured, production-ready reference architecture. This approach uses Infrastructure as Code (IaC), with Terraform wrapped in CLI scripts, to create a standardized, secure, and scalable GKE environment designed for AI inference workloads.

In this guide, you deploy and serve LLMs using single-host GPU nodes on GKE with the vLLM serving framework. This guide provides instructions and configurations for deploying the following open models:

  • Gemma 3 27B-it
  • Llama 4 Scout 17B-16E Instruct

This guide is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities for serving open models for inference. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Verify that billing is enabled for your Google Cloud project.

  • Enable the required APIs.

    Enable the APIs

  • Make sure that you have the following role or roles on the project: roles/artifactregistry.admin, roles/browser, roles/compute.networkAdmin, roles/container.clusterAdmin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin, and roles/serviceusage.serviceUsageAdmin

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.
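
Alternatively, you can check your granted roles from the command line. The following command is a minimal sketch that uses the gcloud CLI; replace PROJECT_ID with your project ID and USER_EMAIL with the email address of your Google Account.

    gcloud projects get-iam-policy PROJECT_ID \
        --flatten="bindings[].members" \
        --filter="bindings.members:user:USER_EMAIL" \
        --format="value(bindings.role)"

This lists only the roles granted directly to your user account on the project; roles inherited through groups or folders don't appear in the output.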

Get access to the model

To access the model through Hugging Face, you need a Hugging Face token. The models in this guide are gated on Hugging Face, so you must also review and accept the model's license terms on its Hugging Face model page before you can download it.

If you don't already have a token, follow these steps on the Hugging Face website to generate one:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard.
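
Optionally, you can verify that the token works before you continue. The following check is a minimal sketch that calls the Hugging Face whoami endpoint; replace HF_TOKEN with the token you just copied.

    curl --silent --show-error \
        --header "Authorization: Bearer HF_TOKEN" \
        https://huggingface.co/api/whoami-v2

If the token is valid, the response is a JSON object that describes your Hugging Face account.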

Provision the GKE inference environment

In this section, you deploy the necessary infrastructure to serve your model.

Launch Cloud Shell

This guide uses Cloud Shell to run commands. Cloud Shell comes preinstalled with the necessary tools, including the gcloud CLI, kubectl, and git.

In the Google Cloud console, start a Cloud Shell instance:

Open Cloud Shell

This action launches a session in the bottom pane of the Google Cloud console.
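
Optionally, you can confirm that the preinstalled tools are available in your Cloud Shell session by printing their versions:

    gcloud --version
    kubectl version --client
    git --version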

Deploy the base architecture

To provision the GKE cluster and the necessary resources for accessing models from Hugging Face, follow these steps:

  1. In Cloud Shell, clone the following repository:

    git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
    cd accelerated-platforms && \
    export ACP_REPO_DIR="$(pwd)"
    
  2. Set your environment variables:

    export TF_VAR_platform_default_project_id=PROJECT_ID
    export HF_TOKEN_READ=HF_TOKEN
    

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • HF_TOKEN: the Hugging Face token you generated earlier.
  3. This guide requires Terraform version 1.8.0 or later. Cloud Shell has Terraform v1.5.7 installed by default.

    To update the Terraform version in Cloud Shell, you can run the following script. This script installs the terraform-switcher tool and makes changes to your shell environment.

    "${ACP_REPO_DIR}/tools/bin/install_terraform.sh"
    source ~/.bashrc
    
  4. Run the following deployment script. The deployment script enables the required Google Cloud APIs and provisions the necessary infrastructure for this guide. This includes a new VPC network, a GKE cluster with private nodes, and other supporting resources. The script can take several minutes to complete.

    You can serve models using GPUs in a GKE Autopilot or Standard cluster. An Autopilot cluster provides a fully managed Kubernetes experience. For more information about choosing the GKE mode of operation that's the best fit for your workloads, see About GKE modes of operation.

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-ap.sh"
    

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/deploy-standard.sh"
    

    After this script completes, you will have a GKE cluster ready for inference workloads.

  5. Run the following command to set environment variables from the shared configuration:

    source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
    
  6. The deployment script creates a secret in Secret Manager to store your Hugging Face token. You must manually add your token to this secret before you download the model. In Cloud Shell, run the following command to add the token to Secret Manager:

    echo ${HF_TOKEN_READ} | gcloud secrets versions add ${huggingface_hub_access_token_read_secret_manager_secret_name} \
        --data-file=- \
        --project=${huggingface_secret_manager_project_id}
    
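
Optionally, you can confirm that the cluster was created and that the Hugging Face secret now contains a version. The following checks are a minimal sketch that reuses environment variables already set earlier in this guide:

    # List the GKE clusters in the project to confirm that the cluster exists.
    gcloud container clusters list --project=${TF_VAR_platform_default_project_id}

    # Confirm that the Hugging Face token secret has at least one version.
    gcloud secrets versions list ${huggingface_hub_access_token_read_secret_manager_secret_name} \
        --project=${huggingface_secret_manager_project_id}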

Deploy an open model

You are now ready to download and deploy the model.

  1. Source the environment variables from your deployment. These environment variables contain the necessary configuration details from the infrastructure you provisioned.

    source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
    
  2. Set the environment variables for the model you want to deploy:

    Gemma 3 27B-it

    export ACCELERATOR_TYPE="l4"
    export MODEL_ID="google/gemma-3-27b-it"
    MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
    

    Llama 4 Scout 17B-16E

    export ACCELERATOR_TYPE="h100"
    export MODEL_ID="meta-llama/llama-4-scout-17b-16e-instruct"
    MODEL_NAME="${MODEL_ID##*/}" && export MODEL_NAME="${MODEL_NAME,,}"
    
  3. Run the following script to configure the Kubernetes Job that downloads the model to Cloud Storage:

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
    
  4. Deploy the model download Job:

    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
    
  5. Wait for the download to complete. Monitor the Job's status, and when COMPLETIONS is 1/1, press Ctrl+C to exit. If the Job doesn't complete, you can inspect its logs by using the optional check that follows this list.

    watch --color --interval 5 --no-title "kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/hf-model-to-gcs"
    
  6. Deploy the inference workload to your GKE cluster:

    "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/configure_deployment.sh"
    
    kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
    
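
If the download Job doesn't reach 1/1 completions, or if you want to see what was downloaded, you can inspect the Job's logs. This optional check reuses the namespace variable that the shared configuration script sets:

    # Print the logs of the model download Job.
    kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} \
        logs job/hf-model-to-gcs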

Test your deployment

  1. Wait for the inference server Pod to be ready. When the READY column is 1/1, press Ctrl+C to exit.

    watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME}"
    
  2. Run the following commands to set up port forwarding and send a sample request to the model. The request uses the OpenAI-compatible chat completions format, and the model field is populated from the MODEL_ID environment variable that you exported earlier.

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    curl http://127.0.0.1:8000/v1/chat/completions \
      --data '{
        "model": "/gcs/'${MODEL_ID}'",
        "messages": [ { "role": "user", "content": "What is GKE?" } ]
      }' \
      --header "Content-Type: application/json" \
      --request POST \
      --show-error \
      --silent | jq
    kill -9 ${PF_PID}
    

    You should see a JSON response from the model answering the question.
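
You can also list the models that the vLLM server has registered, which is a quick way to confirm the value to use in the model field of a request. This optional check follows the same port-forwarding pattern as the previous step:

    kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-${ACCELERATOR_TYPE}-${MODEL_NAME} 8000:8000 >/dev/null &
    PF_PID=$!
    # Query the OpenAI-compatible models endpoint that vLLM exposes.
    curl --silent --show-error http://127.0.0.1:8000/v1/models | jq
    kill -9 ${PF_PID}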

Clean up

To avoid incurring charges, delete all the resources you created.

  1. Delete the inference workload:

    kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm/${ACCELERATOR_TYPE}-${MODEL_NAME}"
    
  2. Remove the foundational GKE cluster:

    Autopilot

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-ap.sh"
    

    Standard

    "${ACP_REPO_DIR}/platforms/gke/base/tutorials/hf-gpu-model/teardown-standard.sh"
    
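
To confirm that the teardown removed the cluster, you can list the GKE clusters that remain in your project. This optional check should show no clusters from this guide:

    gcloud container clusters list --project=${TF_VAR_platform_default_project_id}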

What's next