This guide shows you how to optimize workload costs when you deploy a large language model (LLM) on GKE. The deployment uses a combination of flex-start provisioning mode, Spot VMs, and custom compute class profiles to obtain GPU capacity at a lower cost.
This guide uses Mixtral 8x7B as an example LLM that you can deploy.
This guide is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Background
This section describes the available techniques that you can use to obtain computing resources, including GPU accelerators, based on the requirements of your AI/ML workloads. These techniques are called accelerator obtainability strategies in GKE.
GPUs
Graphics processing units (GPUs) let you accelerate specific workloads such as machine learning and data processing. GKE offers nodes that are equipped with these powerful GPUs to optimize the performance of machine learning and data processing tasks. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, A100, and L4 GPUs.
For more information, see About GPUs in GKE.
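If you want to check which GPU accelerator types are offered in a given zone before you pick a machine type, you can list them with the gcloud CLI. The zone in this example is an arbitrary placeholder, not a value required by this guide:

# List GPU accelerator types available in a specific zone
gcloud compute accelerator-types list \
    --filter="zone:( us-central1-a )"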
Flex-start provisioning mode
Flex-start provisioning mode is a type of GPU reservation where GKE persists your GPU request and automatically provisions resources when capacity becomes available. Consider using flex-start provisioning mode for workloads that need GPU capacity for a limited time, up to seven days, and don't have a fixed start date. For more information, see flex-start provisioning mode.
Spot VMs
You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs or flex-start provisioning mode reduces the price of running GPUs. Using Spot VMs combined with flex-start provisioning mode provides a fallback option for when Spot VM capacity is unavailable.
For more information, see Using Spot VMs with GPU node pools.
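As a point of reference, you can also create a GPU node pool that uses Spot VMs directly with the gcloud CLI, outside of the compute class flow that this guide uses later. The cluster name, location, and node pool name below are placeholders:

# Create a Spot VM node pool with g2-standard-24 nodes (two NVIDIA L4 GPUs each)
gcloud container node-pools create l4-spot-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --spot \
    --machine-type=g2-standard-24 \
    --accelerator=type=nvidia-l4,count=2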
Custom compute classes
You can request GPUs by using custom compute classes. Custom compute classes let you define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware. For more information, see About custom compute classes.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Make sure that you have the following role or roles on the project:

  Check for the roles

  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  - For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
  Grant the roles

  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
  - In the Select a role list, select a role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.
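If you prefer the gcloud CLI to the console for granting roles, the following command is equivalent to the Grant access flow. The project ID, email address, and role are placeholders rather than values defined by this guide:

# Grant one role to one principal; repeat the command for each required role
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role=ROLE_NAME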
- Ensure that you have a GKE Autopilot or Standard cluster that runs version 1.32.2-gke.1652000 or later. Your cluster must have node auto-provisioning enabled and GPU limits configured (a sample gcloud command follows this list).
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for NVIDIA L4 GPUs. For more information, see About GPUs and Allocation quotas.
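If node auto-provisioning isn't yet enabled on your cluster, you can turn it on and set accelerator limits with the gcloud CLI. The CPU, memory, and GPU limits below are illustrative placeholders, not values required by this guide; adjust them to your own capacity needs:

# Enable node auto-provisioning with limits that include NVIDIA L4 GPUs
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --enable-autoprovisioning \
    --min-cpu=0 --max-cpu=100 \
    --min-memory=0 --max-memory=1000 \
    --min-accelerator=type=nvidia-l4,count=0 \
    --max-accelerator=type=nvidia-l4,count=4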
Get access to the model
If you don't already have one, generate a new Hugging Face token:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name of your choice and a role of at least Read.
- Select Generate a token.
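To avoid pasting the token into later commands, you can keep it in an environment variable for your current Cloud Shell session. The variable name here is just a convention used for illustration:

# Store the Hugging Face token for the current shell session (the value is a placeholder)
export HUGGING_FACE_TOKEN=hf_YOUR_TOKEN

You can then pass "$HUGGING_FACE_TOKEN" in place of the literal placeholder when you create the Kubernetes Secret later in this guide.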
Create a custom compute class profile
In this section, you create a custom compute class profile. Custom compute class profiles define the types and relationships between multiple compute resources used by your workload.
- In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the Google Cloud console.
- Create a dws-flex-start.yaml manifest file:

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: dws-model-inference-class
spec:
  priorities:
  - machineType: g2-standard-24
    spot: true
  - machineType: g2-standard-24
    flexStart:
      enabled: true
      nodeRecycling:
        leadTimeSeconds: 3600
  nodePoolAutoCreation:
    enabled: true
- Apply the dws-flex-start.yaml manifest:

kubectl apply -f dws-flex-start.yaml
GKE deploys g2-standard-24 machines with L4 accelerators. GKE uses compute classes to prioritize Spot VMs first, and flex-start provisioning mode second.
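To confirm that the compute class was created and to review its priority rules, you can inspect the ComputeClass resource that the manifest defines by using standard kubectl commands:

kubectl get computeclass dws-model-inference-class
kubectl describe computeclass dws-model-inference-class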
Deploy the LLM workload
Create a Kubernetes Secret that contains the Hugging Face token by using the following command:
kubectl create secret generic model-inference-secret \
    --from-literal=HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
Replace HUGGING_FACE_TOKEN with your Hugging Face access token.

Create a file named mixtral-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-mixtral-ccc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: model-inference-secret
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 100Gi
In this manifest, the mountPath field is set to /tmp because that's the path that the HF_HOME environment variable is set to in the Deep Learning Container (DLC) for Text Generation Inference (TGI), instead of the default /data path that's set in the default TGI image. The downloaded model is stored in this directory.
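The port-forwarding and clean-up steps later in this guide refer to a Service named llm-service, which isn't defined in the manifest above. A minimal ClusterIP Service that matches the Deployment's app: llm label and the container's port 8080 might look like the following; you can append it to mixtral-deployment.yaml after a --- separator or apply it as a separate file:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080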
Deploy the model:

kubectl apply -f mixtral-deployment.yaml
GKE schedules the new Pod, which triggers node auto-provisioning to create a GPU node pool that matches the compute class before the Pod can start.
Verify the status of the model:
watch kubectl get deploy inference-mixtral-ccc
If the model was deployed successfully, the output is similar to the following:
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
inference-mixtral-ccc   1/1     1            1           10m
To exit the watch, press CTRL + C.

View the node pools that GKE provisioned:
kubectl get nodes -L cloud.google.com/gke-nodepool
The output is similar to the following:
NAME                                         STATUS   ROLES    AGE   VERSION               GKE-NODEPOOL
gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
gke-flex-nap-zo-default-pool-09f6fe53-lv2v   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
gke-flex-nap-zo-default-pool-09f6fe53-pq6m   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
The name of the created node pool indicates the type of machine. In this case, GKE provisioned Spot VMs.
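If Spot capacity isn't available when the Pod is scheduled, the compute class falls back to flex-start provisioning mode, and the Pod stays in Pending until GKE obtains capacity. The following standard kubectl commands, which use the app: llm label from the Deployment, are one way to watch that process:

# Inspect scheduling events for the inference Pod
kubectl describe pod -l app=llm

# Check whether the provisioned nodes are Spot VMs (label value "true")
kubectl get nodes -L cloud.google.com/gke-spot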
Interact with the model using curl
This section shows how you can perform a basic inference test to verify your deployed model.
Set up port forwarding to the model:
kubectl port-forward service/llm-service 8080:8080
The output is similar to the following:
Forwarding from 127.0.0.1:8080 -> 8080
In a new terminal session, chat with your model by using curl:

curl http://localhost:8080/v1/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
    "model": "mixtral-8x7b-instruct-gptq",
    "prompt": "<s>[INST]Who was the first president of the United States?[/INST]",
    "max_tokens": 40}'
The output looks similar to the following:
George Washington was a Founding Father and the first president of the United States, serving from 1789 to 1797.
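TGI 2.x images such as the one used in this guide typically also expose an OpenAI-compatible chat endpoint. If your image version supports it, a request like the following returns a chat-style response; the endpoint path and payload shape follow the OpenAI Chat Completions convention and aren't taken from this guide:

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Who was the first president of the United States?"}],
    "max_tokens": 40}'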
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
Delete the Kubernetes resources that you created from this guide:
kubectl delete deployment inference-mixtral-ccc
kubectl delete service llm-service
kubectl delete computeclass dws-model-inference-class
kubectl delete secret model-inference-secret
Delete the cluster:
gcloud container clusters delete CLUSTER_NAME
What's next
- Learn how to Train a small workload with flex-start provisioning mode.
- Learn more about GPUs in GKE.