Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy


This guide shows you how to optimize workload costs when you deploy a large language model (LLM) on GKE. The GKE infrastructure combines flex-start provisioning mode, Spot VMs, and custom compute class profiles to lower GPU costs while keeping capacity available.

This guide uses Mixtral 8x7B as an example LLM that you can deploy.

This guide is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Background

This section describes the available techniques that you can use to obtain computing resources, including GPU accelerators, based on the requirements of your AI/ML workloads. These techniques are called accelerator obtainability strategies in GKE.

GPUs

Graphics processing units (GPUs) let you accelerate specific workloads, such as machine learning and data processing. GKE offers nodes that are equipped with these powerful GPUs to optimize the performance of machine learning and data processing tasks. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, A100, and L4 GPUs.

For more information, see About GPUs in GKE.
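
Before you choose a machine type, you can check which GPU accelerator types are available in a given zone. The following command is a minimal example; the zone is only an illustration:

    gcloud compute accelerator-types list --filter="zone:us-central1-a"

The output lists the accelerator types that are available in that zone, such as nvidia-l4.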

Flex-start provisioning mode

Flex-start provisioning mode is a way to request GPUs in which GKE persists your GPU request and automatically provisions the resources as soon as capacity becomes available. Consider using flex-start provisioning mode for workloads that need GPU capacity for a limited time, up to seven days, and that don't have a fixed start date. For more information, see flex-start provisioning mode.
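
Outside of a compute class, a workload can also opt in to flex-start directly through a node selector. The following Job is a minimal sketch: the cloud.google.com/gke-flex-start label is taken from the flex-start documentation and the container image is illustrative, so verify both against the linked page before relying on them.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flex-start-example
    spec:
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-flex-start: "true"    # opt this workload in to flex-start provisioning
          containers:
          - name: gpu-task
            image: nvidia/cuda:12.4.1-base-ubuntu22.04 # illustrative image
            command: ["nvidia-smi"]
            resources:
              limits:
                nvidia.com/gpu: "1"                    # number of GPUs for this container
          restartPolicy: OnFailure

This sketch assumes that node auto-provisioning or an existing flex-start node pool can satisfy the request.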

Spot VMs

You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs or flex-start provisioning mode reduces the price of running GPUs. Combining Spot VMs with flex-start provisioning mode also gives you a fallback option for when Spot VM capacity is unavailable.

For more information, see Using Spot VMs with GPU node pools.
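
As a sketch of how a disruption-tolerant workload can target Spot capacity directly, GKE labels Spot VM nodes with cloud.google.com/gke-spot. The Pod below is illustrative; depending on how the node pool is configured, the matching toleration may or may not be required:

    apiVersion: v1
    kind: Pod
    metadata:
      name: spot-gpu-example
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"              # schedule onto Spot VM nodes only
      tolerations:
      - key: cloud.google.com/gke-spot                 # tolerate the Spot taint, if the node pool applies one
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: gpu-task
        image: nvidia/cuda:12.4.1-base-ubuntu22.04     # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: "1"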

Custom compute classes

You can request GPUs by using custom compute classes. Custom compute classes let you define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware. For more information, see About custom compute classes.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Make sure that billing is enabled for your Google Cloud project.

  • Make sure that you have the following role or roles on the project:

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Get access to the model

If you don't already have one, generate a new Hugging Face token:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a name of your choice and a role of at least Read.
  4. Select Generate a token.
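
You can optionally confirm that the token works before you store it in the cluster. The following command calls the Hugging Face whoami endpoint; HF_TOKEN is a placeholder for the token that you just generated:

    curl -s -H "Authorization: Bearer HF_TOKEN" https://huggingface.co/api/whoami-v2

If the token is valid, the response includes your Hugging Face account name.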

Create a custom compute class profile

In this section, you create a custom compute class profile. Custom compute class profiles define the types of compute resources that your workload can use and the priority relationships between them.

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the Google Cloud console.
  2. Create a dws-flex-start.yaml manifest file:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: dws-model-inference-class
    spec:
      priorities:
        - machineType: g2-standard-24
          spot: true
        - machineType: g2-standard-24
          flexStart:
            enabled: true
            nodeRecycling:
              leadTimeSeconds: 3600
      nodePoolAutoCreation:
        enabled: true
    
  3. Apply the dws-flex-start.yaml manifest:

    kubectl apply -f dws-flex-start.yaml
    

With this compute class, GKE deploys g2-standard-24 machines, which have L4 accelerators attached. Based on the compute class priorities, GKE tries Spot VMs first and falls back to flex-start provisioning mode second.
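
To confirm that the compute class was created, you can list the ComputeClass resources in the cluster:

    kubectl get computeclasses.cloud.google.com

The output should include dws-model-inference-class.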

Deploy the LLM workload

  1. Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

    kubectl create secret generic model-inference-secret \
        --from-literal=HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN \
        --dry-run=client -o yaml | kubectl apply -f -
    

    Replace the second HUGGING_FACE_TOKEN (the value after the equals sign) with your Hugging Face access token. The first HUGGING_FACE_TOKEN is the Secret key and stays as-is.

  2. Create a file named mixtral-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inference-mixtral-ccc
    spec:
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: model-inference-secret
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 100Gi
    

    In this manifest, the mountPath field of the ephemeral volume is set to /tmp because the HF_HOME environment variable in the Deep Learning Container (DLC) for Text Generation Inference (TGI) points to /tmp, instead of the default /data path that's set in the standard TGI image. The downloaded model is stored in this directory.

  3. Deploy the model:

    kubectl apply -f mixtral-deployment.yaml
    

    GKE schedules the new Pod, which triggers node pool auto-creation to provision a node that matches the compute class before the model replica can start.

  4. Verify the status of the model:

    watch kubectl get deploy inference-mixtral-ccc
    

    If the model was deployed successfully, the output is similar to the following:

    NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
    inference-mixtral-ccc  1/1     1            1           10m
    

    To exit the watch, press CTRL + C.

  5. View the node pools that GKE provisioned:

    kubectl get nodes -L cloud.google.com/gke-nodepool
    

    The output is similar to the following:

      NAME                                                  STATUS   ROLES    AGE   VERSION               GKE-NODEPOOL
      gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
      gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
      gke-flex-nap-zo-default-pool-09f6fe53-lv2v   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
      gke-flex-nap-zo-default-pool-09f6fe53-pq6m   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
    

    The name of the auto-created node pool indicates the machine type and the provisioning model. In this case, GKE provisioned Spot VMs. You can also confirm the provisioning model from the node labels, as shown after this procedure.
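
Because GKE labels Spot VM nodes with cloud.google.com/gke-spot, you can verify the provisioning model (and the attached accelerator type) directly from the node labels instead of inferring it from the node pool name:

    kubectl get nodes -L cloud.google.com/gke-spot -L cloud.google.com/gke-accelerator

Nodes that were provisioned as Spot VMs show true in the GKE-SPOT column.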

Interact with the model using curl

This section shows how you can perform a basic inference test to verify your deployed model.
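
The commands in this section and in the cleanup steps refer to a Service named llm-service that exposes the Deployment on port 8080. If you haven't created one yet, the following is a minimal sketch of such a ClusterIP Service; it targets the app: llm label and the container port from the Deployment manifest in the previous section:

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm            # matches the label on the inference Pods
      ports:
      - protocol: TCP
        port: 8080          # Service port used by kubectl port-forward
        targetPort: 8080    # container PORT set in the Deployment

Save the manifest, for example as llm-service.yaml, and apply it with kubectl apply -f llm-service.yaml before you continue.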

  1. Set up port forwarding to the model:

    kubectl port-forward service/llm-service 8080:8080
    

    The output is similar to the following:

    Forwarding from 127.0.0.1:8080 -> 8080
    
  2. In a new terminal session, chat with your model by using curl:

    curl http://localhost:8080/v1/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mixtral-8x7b-instruct-gptq",
        "prompt": "<s>[INST]Who was the first president of the United States?[/INST]",
        "max_tokens": 40}'
    

    The generated text in the response is similar to the following:

    George Washington was a Founding Father and the first president of the United States, serving from 1789 to 1797.
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

  1. Delete the Kubernetes resources that you created from this guide:

    kubectl delete deployment inference-mixtral-ccc
    kubectl delete service llm-service
    kubectl delete computeclass dws-model-inference-class
    kubectl delete secret model-inference-secret
    
  2. Delete the cluster:

    gcloud container clusters delete CLUSTER_NAME
    

What's next