Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy


This guide shows you how to optimize workload costs when you deploy a large language model (LLM) on GKE. The GKE infrastructure combines flex-start provisioning mode, Spot VMs, and custom compute class profiles to lower GPU costs while keeping capacity available.

This guide uses Mixtral 8x7B as an example LLM that you can deploy.

This guide is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Background

This section describes the available techniques that you can use to obtain computing resources, including GPU accelerators, based on the requirements of your AI/ML workloads. These techniques are called accelerator obtainability strategies in GKE.

GPUs

Graphics processing units (GPUs) let you accelerate specific workloads, such as machine learning and data processing. GKE offers nodes that are equipped with these powerful GPUs to optimize the performance of machine learning and data processing tasks. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, A100, and L4 GPUs.

For more information, see About GPUs in GKE.
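
Before you choose a machine type, you can check which GPU accelerator types are available in a given zone. The following command is a minimal example; the zone is only an illustration:

    gcloud compute accelerator-types list --filter="zone:us-central1-a"

The output lists the accelerator types that are available in that zone, such as nvidia-l4.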

Flex-start provisioning mode

Flex-start provisioning mode is a way to request GPUs in which GKE persists your GPU request and automatically provisions the resources as soon as capacity becomes available. Consider using flex-start provisioning mode for workloads that need GPU capacity for a limited time, up to seven days, and that don't have a fixed start date. For more information, see flex-start provisioning mode.
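
Outside of a compute class, a workload can also opt in to flex-start directly through a node selector. The following Job is a minimal sketch: the cloud.google.com/gke-flex-start label is taken from the flex-start documentation and the container image is illustrative, so verify both against the linked page before relying on them.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flex-start-example
    spec:
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-flex-start: "true"    # opt this workload in to flex-start provisioning
          containers:
          - name: gpu-task
            image: nvidia/cuda:12.4.1-base-ubuntu22.04 # illustrative image
            command: ["nvidia-smi"]
            resources:
              limits:
                nvidia.com/gpu: "1"                    # number of GPUs for this container
          restartPolicy: OnFailure

This sketch assumes that node auto-provisioning or an existing flex-start node pool can satisfy the request.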

Spot VMs

You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs or flex-start provisioning mode reduces the price of running GPUs. Combining Spot VMs with flex-start provisioning mode also gives you a fallback option for when Spot VM capacity is unavailable.

For more information, see Using Spot VMs with GPU node pools.
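
As a sketch of how a disruption-tolerant workload can target Spot capacity directly, GKE labels Spot VM nodes with cloud.google.com/gke-spot. The Pod below is illustrative; depending on how the node pool is configured, the matching toleration may or may not be required:

    apiVersion: v1
    kind: Pod
    metadata:
      name: spot-gpu-example
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"              # schedule onto Spot VM nodes only
      tolerations:
      - key: cloud.google.com/gke-spot                 # tolerate the Spot taint, if the node pool applies one
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: gpu-task
        image: nvidia/cuda:12.4.1-base-ubuntu22.04     # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: "1"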

Custom compute classes

You can request GPUs by using custom compute classes. Custom compute classes let you define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware. For more information, see About custom compute classes.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Make sure that billing is enabled for your Google Cloud project.

  • Make sure that you have the following role or roles on the project:

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Get access to the model

If you don't already have one, generate a new Hugging Face token:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a name of your choice and a role of at least Read.
  4. Select Generate a token.
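
You can optionally confirm that the token works before you store it in the cluster. The following command calls the Hugging Face whoami endpoint; HF_TOKEN is a placeholder for the token that you just generated:

    curl -s -H "Authorization: Bearer HF_TOKEN" https://huggingface.co/api/whoami-v2

If the token is valid, the response includes your Hugging Face account name.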

Create a custom compute class profile

In this section, you create a custom compute class profile. Custom compute class profiles define the types of compute resources that your workload can use and the priority relationships between them.

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the Google Cloud console.
  2. Create a dws-flex-start.yaml manifest file:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: dws-model-inference-class
    spec:
      priorities:
        - machineType: g2-standard-24
          spot: true
        - machineType: g2-standard-24
          flexStart:
            enabled: true
            nodeRecycling:
              leadTimeSeconds: 3600
      nodePoolAutoCreation:
        enabled: true
    
  3. Apply the dws-flex-start.yaml manifest:

    kubectl apply -f dws-flex-start.yaml
    

With this compute class, GKE deploys g2-standard-24 machines, which have L4 accelerators attached. Based on the compute class priorities, GKE tries Spot VMs first and falls back to flex-start provisioning mode second.
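
To confirm that the compute class was created, you can list the ComputeClass resources in the cluster:

    kubectl get computeclasses.cloud.google.com

The output should include dws-model-inference-class.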

Deploy the LLM workload

  1. Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

    kubectl create secret generic model-inference-secret \
        --from-literal=HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN \
        --dry-run=client -o yaml | kubectl apply -f -
    

    Replace the second HUGGING_FACE_TOKEN (the value after the equals sign) with your Hugging Face access token. The first HUGGING_FACE_TOKEN is the Secret key and stays as-is.

  2. Create a file named mixtral-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inference-mixtral-ccc
    spec:
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: model-inference-secret
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 100Gi
    

    In this manifest, the mountPath field of the ephemeral volume is set to /tmp because the HF_HOME environment variable in the Deep Learning Container (DLC) for Text Generation Inference (TGI) points to /tmp, instead of the default /data path that's set in the standard TGI image. The downloaded model is stored in this directory.

  3. Deploy the model:

    kubectl apply -f mixtral-deployment.yaml
    

    GKE schedules the new Pod, which triggers node pool auto-creation to provision a node that matches the compute class before the model replica can start.

  4. Verify the status of the model:

    watch kubectl get deploy inference-mixtral-ccc
    

    If the model was deployed successfully, the output is similar to the following:

    NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
    inference-mixtral-ccc  1/1     1            1           10m
    

    To exit the watch, press CTRL + C.

  5. View the node pools that GKE provisioned:

    kubectl get nodes -L cloud.google.com/gke-nodepool
    

    The output is similar to the following:

      NAME                                                  STATUS   ROLES    AGE   VERSION               GKE-NODEPOOL
      gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
      gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
      gke-flex-nap-zo-default-pool-09f6fe53-lv2v   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
      gke-flex-nap-zo-default-pool-09f6fe53-pq6m   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
    

    The name of the auto-created node pool indicates the machine type and the provisioning model. In this case, GKE provisioned Spot VMs. You can also confirm the provisioning model from the node labels, as shown after this procedure.
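
Because GKE labels Spot VM nodes with cloud.google.com/gke-spot, you can verify the provisioning model (and the attached accelerator type) directly from the node labels instead of inferring it from the node pool name:

    kubectl get nodes -L cloud.google.com/gke-spot -L cloud.google.com/gke-accelerator

Nodes that were provisioned as Spot VMs show true in the GKE-SPOT column.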

Interact with the model using curl

This section shows how you can perform a basic inference test to verify your deployed model.
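
The commands in this section and in the cleanup steps refer to a Service named llm-service that exposes the Deployment on port 8080. If you haven't created one yet, the following is a minimal sketch of such a ClusterIP Service; it targets the app: llm label and the container port from the Deployment manifest in the previous section:

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm            # matches the label on the inference Pods
      ports:
      - protocol: TCP
        port: 8080          # Service port used by kubectl port-forward
        targetPort: 8080    # container PORT set in the Deployment

Save the manifest, for example as llm-service.yaml, and apply it with kubectl apply -f llm-service.yaml before you continue.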

  1. Set up port forwarding to the model:

    kubectl port-forward service/llm-service 8080:8080
    

    The output is similar to the following:

    Forwarding from 127.0.0.1:8080 -> 8080
    
  2. In a new terminal session, chat with your model by using curl:

    curl http://localhost:8080/v1/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mixtral-8x7b-instruct-gptq",
        "prompt": "<s>[INST]Who was the first president of the United States?[/INST]",
        "max_tokens": 40}'
    

    The generated text in the response is similar to the following:

    George Washington was a Founding Father and the first president of the United States, serving from 1789 to 1797.
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

  1. Delete the Kubernetes resources that you created from this guide:

    kubectl delete deployment inference-mixtral-ccc
    kubectl delete service llm-service
    kubectl delete computeclass dws-model-inference-class
    kubectl delete secret model-inference-secret
    
  2. Delete the cluster:

    gcloud container clusters delete CLUSTER_NAME
    

What's next