This tutorial shows you how to efficiently share accelerator resources between training- and inference-serving workloads within a single Google Kubernetes Engine (GKE) cluster. By distributing your mixed workloads across a single cluster, you improve resource utilization, simplify cluster management, reduce issues from accelerator quantity limitations, and enhance overall cost-effectiveness.
In this tutorial, you create a high-priority serving Deployment that uses the Gemma 2 large language model (LLM) for inference and the Hugging Face Text Generation Inference (TGI) serving framework, along with a low-priority LLM fine-tuning Job. Both workloads run on a single cluster that uses NVIDIA L4 GPUs. You use Kueue, an open source Kubernetes-native Job queueing system, to manage and schedule your workloads. Kueue lets you prioritize serving tasks and preempt lower-priority training Jobs to optimize resource utilization. As serving demands decrease, you reallocate the freed-up accelerators to resume training Jobs. You use Kueue and priority classes to manage resource quotas throughout the process.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to train and host an ML model on a GKE cluster, and who also want to reduce costs and management overhead, especially when dealing with a limited number of accelerators. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Objectives
By the end of this guide, you should be able to perform the following steps:
- Configure a high-priority serving Deployment.
- Set up lower-priority training Jobs.
- Implement preemption strategies to address varying demand.
- Manage resource allocation between training and serving tasks using Kueue.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required APIs.
- Make sure that you have the following roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin.

Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for GPUs. To learn more, see About GPUs and Allocation quotas.
Prepare the environment
In this section, you provision the resources that you need to deploy TGI and the model for your inference and training workloads.
Get access to the model
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, then generate a Hugging Face access token.
- Sign the license consent agreement. Access the model consent page, verify consent using your Hugging Face account, and accept the model terms.
- Generate an access token. To access the model through Hugging Face, you need a Hugging Face token. Follow these steps to generate a new token if you don't have one already:
  - Click Your Profile > Settings > Access Tokens.
  - Select New Token.
  - Specify a Name of your choice and a Role of at least Read.
  - Select Generate a token.
  - Copy the generated token to your clipboard.
Launch Cloud Shell
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl, the gcloud CLI, and Terraform.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
Replace PROJECT_ID with your Google Cloud project ID.
Clone the sample code from GitHub. In Cloud Shell, run the following commands:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/
cd kubernetes-engine-samples/ai-ml/mix-train-and-inference
export EXAMPLE_HOME=$(pwd)
Create a GKE cluster
You can use an Autopilot or Standard cluster for your mixed workloads. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
Set the default environment variables in Cloud Shell:
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
Replace the following values:
- HF_TOKEN: the Hugging Face token you generated earlier.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.

You can adjust the MODEL_BUCKET variable, which represents the Cloud Storage bucket where you store your trained model weights.
Create an Autopilot cluster:
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --release-channel=rapid
Create the Cloud Storage bucket for the fine-tuning job:
gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access
To grant access to the Cloud Storage bucket, run this command:
gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None
To get authentication credentials for the cluster, run this command:
gcloud container clusters get-credentials llm-cluster \
    --region=$REGION \
    --project=$PROJECT_ID
Create a namespace for your Deployments. In Cloud Shell, run the following command:
kubectl create ns llm
Standard
Set the default environment variables in Cloud Shell:
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export GPU_POOL_MACHINE_TYPE="g2-standard-24"
export GPU_POOL_ACCELERATOR_TYPE="nvidia-l4"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
Replace the following values:
- HF_TOKEN: the Hugging Face token you generated earlier.
- REGION: the region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.
You can adjust these variables:
- GPU_POOL_MACHINE_TYPE: the node pool machine series that you want to use in your selected region. This value depends on the accelerator type you selected. To learn more, see Limitations of using GPUs on GKE. For example, this tutorial uses g2-standard-24 with two GPUs attached per node. For the most up-to-date list of available GPUs, see GPUs for Compute Workloads.
- GPU_POOL_ACCELERATOR_TYPE: the accelerator type that's supported in your selected region. For example, this tutorial uses nvidia-l4. For the latest list of available GPUs, see GPUs for Compute Workloads.
- MODEL_BUCKET: the Cloud Storage bucket where you store your trained model weights.
Create a Standard cluster:
gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --release-channel=rapid \
    --machine-type=e2-standard-4 \
    --addons GcsFuseCsiDriver \
    --num-nodes=1
Create the GPU node pool for inference and fine-tuning workloads:
gcloud container node-pools create gpupool \
    --accelerator type=${GPU_POOL_ACCELERATOR_TYPE},count=2,gpu-driver-version=latest \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --node-locations=${REGION}-a \
    --cluster=${CLUSTER_NAME} \
    --machine-type=${GPU_POOL_MACHINE_TYPE} \
    --num-nodes=3
Create the Cloud Storage bucket for the fine-tuning job:
gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access
To grant access to the Cloud Storage bucket, run this command:
gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None
To get authentication credentials for the cluster, run this command:
gcloud container clusters get-credentials llm-cluster \
    --region=$REGION \
    --project=$PROJECT_ID
Create a namespace for your Deployments. In Cloud Shell, run the following command:
kubectl create ns llm
Create a Kubernetes Secret for Hugging Face credentials
To create a Kubernetes Secret that contains the Hugging Face token, run the following command:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=$HF_TOKEN \
--dry-run=client -o yaml | kubectl apply --namespace=llm --filename=-
Configure Kueue
In this tutorial, Kueue is the central resource manager, enabling efficient sharing of GPUs between your training and serving workloads. Kueue achieves this by defining resource requirements ("flavors"), prioritizing workloads through queues (with serving tasks prioritized over training), and dynamically allocating resources based on demand and priority. This tutorial uses the Workload resource type to group the inference and fine-tuning workloads.
Kueue's preemption feature ensures that high-priority serving workloads always have the necessary resources by pausing or evicting lower-priority training Jobs when resources are scarce.
To control the inference server Deployment with Kueue, you enable the v1/pod integration by applying a custom configuration using Kustomize, to ensure that the server Pods are labeled with "kueue-job: true".
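Conceptually, the patched Kueue configuration resembles the following minimal sketch. It assumes that Kueue reads its settings from a kueue-manager-config ConfigMap in the kueue-system namespace; the actual kustomization.yaml and patch.yaml files in the sample repository may differ.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config   # assumed name of Kueue's configuration ConfigMap
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    integrations:
      frameworks:
      - "batch/job"
      - "pod"                  # enable the v1/pod integration for the serving Deployment
      podOptions:
        # Skip system namespaces and manage only Pods that carry the kueue-job label.
        namespaceSelector:
          matchExpressions:
          - key: kubernetes.io/metadata.name
            operator: NotIn
            values: [kube-system, kueue-system]
        podSelector:
          matchLabels:
            kueue-job: "true"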
- In the /kueue directory, view the code in kustomization.yaml. This manifest installs the Kueue resource manager with custom configurations.

- In the /kueue directory, view the code in patch.yaml. This ConfigMap customizes Kueue to manage Pods with the "kueue-job: true" label.

- In Cloud Shell, run the following command to install Kueue:
cd ${EXAMPLE_HOME}
kubectl kustomize kueue | kubectl apply --server-side --filename=-
Wait until the Kueue Pods are ready:
watch kubectl --namespace=kueue-system get pods
The output should look similar to the following:
NAME                                       READY   STATUS    RESTARTS   AGE
kueue-controller-manager-bdc956fc4-vhcmx   2/2     Running   0          3m15s
- In the /workloads directory, view the flavors.yaml, cluster-queue.yaml, and local-queue.yaml files. These manifests specify how Kueue manages resource quotas. A hedged sketch of the objects they define appears at the end of this section.

  ResourceFlavor

  This manifest defines a default ResourceFlavor in Kueue for resource management.

  ClusterQueue

  This manifest sets up a Kueue ClusterQueue with resource limits for CPU, memory, and GPU.

  This tutorial uses nodes with two NVIDIA L4 GPUs attached, with the corresponding node type of g2-standard-24, offering 24 vCPU and 96 GB RAM. The example code shows how to limit your workload's resource usage to a maximum of six GPUs.

  The preemption field in the ClusterQueue configuration references the PriorityClasses to determine which Pods can be preempted when resources are scarce.

  LocalQueue

  This manifest creates a Kueue LocalQueue named lq in the llm namespace.

- View the default-priorityclass.yaml, low-priorityclass.yaml, and high-priorityclass.yaml files. These manifests define the PriorityClass objects for Kubernetes scheduling.

  Default priority

  Low priority

  High priority
Create the Kueue and Kubernetes objects by running these commands to apply the corresponding manifests.
cd ${EXAMPLE_HOME}/workloads
kubectl apply --filename=flavors.yaml
kubectl apply --filename=default-priorityclass.yaml
kubectl apply --filename=high-priorityclass.yaml
kubectl apply --filename=low-priorityclass.yaml
kubectl apply --filename=cluster-queue.yaml
kubectl apply --filename=local-queue.yaml --namespace=llm
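For reference, the following is a minimal sketch of the kinds of objects that these manifests define, based on the names and the six-GPU quota described above. The CPU and memory quotas, the PriorityClass values, and the preemption policy shown here are assumptions; the actual files in the /workloads directory may differ.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}                 # admit Workloads from any namespace
  preemption:
    withinClusterQueue: LowerPriority   # high-priority serving can preempt low-priority training
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 72                # assumed: 3 x g2-standard-24 nodes
      - name: "memory"
        nominalQuota: 288Gi             # assumed: 3 x 96 GB
      - name: "nvidia.com/gpu"
        nominalQuota: 6                 # at most six L4 GPUs admitted at a time
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq
  namespace: llm
spec:
  clusterQueue: cluster-queue
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority                   # assumed name; used by the inference Deployment
value: 1000
description: "High priority for the TGI inference server."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority                    # assumed name; used by the fine-tuning Job
value: 100
description: "Low priority for the fine-tuning Job."

With a quota of six GPUs and two GPUs per serving replica, at most three inference replicas, or one inference replica plus the four GPUs of the fine-tuning Job, can be admitted at a time.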
Deploy the TGI inference server
In this section, you deploy the TGI container to serve the Gemma 2 model.
- In the /workloads directory, view the tgi-gemma-2-9b-it-hp.yaml file. This manifest defines a Kubernetes Deployment that deploys the TGI serving runtime and the gemma-2-9b-it model.

  The Deployment prioritizes inference tasks and uses two GPUs for the model. It uses tensor parallelism, by setting the NUM_SHARD environment variable, to fit the model into GPU memory. A hedged sketch of the key parts of this manifest appears at the end of this section.

- Apply the manifest by running the following command:
kubectl apply --filename=tgi-gemma-2-9b-it-hp.yaml --namespace=llm
The deployment operation will take a few minutes to complete.
To check if GKE successfully created the Deployment, run the following command:
kubectl --namespace=llm get deployment
The output should look similar to the following:
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
tgi-gemma-deployment   1/1     1            1           5m13s
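For reference, here is a minimal sketch of the parts of this Deployment that matter for Kueue and GPU scheduling. The container image (shown as a hypothetical TGI_IMAGE placeholder), label names, probes, and exact resource requests are assumptions; check tgi-gemma-2-9b-it-hp.yaml in the repository for the actual values.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        kueue-job: "true"                 # lets Kueue manage these Pods through the v1/pod integration
        kueue.x-k8s.io/queue-name: lq     # submit the Pods through the LocalQueue
    spec:
      priorityClassName: high-priority    # assumed name of the high PriorityClass
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: inference-server
        image: TGI_IMAGE                  # hypothetical placeholder for the TGI serving image
        env:
        - name: MODEL_ID
          value: google/gemma-2-9b-it
        - name: NUM_SHARD
          value: "2"                      # tensor parallelism across the two L4 GPUs
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "2"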
Verify Kueue quota management
In this section, you confirm that Kueue is correctly enforcing the GPU quota for your Deployment.
To check if Kueue is aware of your Deployment, run this command to retrieve the status of the Workload objects:
kubectl --namespace=llm get workloads
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6bf9ffdc9b-zcfrh-84f19   lq      cluster-queue   True                  8m23s
To test what happens when you exceed the quota limits, scale the Deployment to four replicas:
kubectl scale --replicas=4 deployment/tgi-gemma-deployment --namespace=llm
Run the following command to see the number of replicas that GKE deploys:
kubectl get workloads --namespace=llm
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6cb95cc7f5-5thgr-3f7d4   lq      cluster-queue   True                  14s
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  5m41s
pod-tgi-gemma-deployment-6cb95cc7f5-tznkl-80f6b   lq                                            13s
pod-tgi-gemma-deployment-6cb95cc7f5-wd4q9-e4302   lq      cluster-queue   True                  13s
The output shows that only three Pods are admitted due to the resource quota that Kueue enforces.
Run the following command to display the Pods in the llm namespace:
kubectl get pod --namespace=llm
The output should look similar to the following:
NAME                                    READY   STATUS            RESTARTS   AGE
tgi-gemma-deployment-7649884d64-6j256   1/1     Running           0          4m45s
tgi-gemma-deployment-7649884d64-drpvc   0/1     SchedulingGated   0          7s
tgi-gemma-deployment-7649884d64-thdkq   0/1     Pending           0          7s
tgi-gemma-deployment-7649884d64-znvpb   0/1     Pending           0          7s
Now, scale the Deployment back down to one replica. This step is required before you deploy the fine-tuning Job; otherwise, the Job isn't admitted because the inference Deployment has priority for the available GPU quota.
kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
Explanation of the behavior
The scaling example results in only three replicas (despite scaling to four) because of the GPU quota limit that you set in the ClusterQueue configuration. The ClusterQueue's spec.resourceGroups section defines a nominalQuota of "6" for nvidia.com/gpu. The Deployment specifies that each Pod requires "2" GPUs. Therefore, the ClusterQueue can only accommodate a maximum of three replicas of the Deployment at a time (since 3 replicas * 2 GPUs per replica = 6 GPUs, which is the total quota).

When you attempt to scale to four replicas, Kueue recognizes that this action would exceed the GPU quota and prevents the fourth replica from being scheduled. This is indicated by the SchedulingGated status of the fourth Pod. This behavior demonstrates Kueue's resource quota enforcement.
Deploy the training Job
In this section, you deploy a lower-priority fine-tuning Job for a Gemma 2 model that requires four GPUs across two Pods. This Job uses the remaining GPU quota in the ClusterQueue. The Job uses a prebuilt image and saves checkpoints to allow restarting from intermediate results.
The fine-tuning Job uses the b-mc2/sql-create-context dataset. You can find the source code for the fine-tuning Job in the repository.
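For reference, the following is a minimal sketch of the Job-level fields that let Kueue manage this workload. The container image (shown as a hypothetical FINETUNE_IMAGE placeholder), the checkpoint volume configuration, and the dataset handling are omitted or assumed; see fine-tune-l4.yaml in the repository for the actual manifest.

apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-gemma-l4
  labels:
    kueue.x-k8s.io/queue-name: lq      # submit the Job through the LocalQueue
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed              # two workers fine-tune the model together
  suspend: true                        # Kueue unsuspends the Job when quota is available
  template:
    metadata:
      labels:
        app: finetune-job
    spec:
      priorityClassName: low-priority  # assumed name of the low PriorityClass
      restartPolicy: OnFailure
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: finetuner
        image: FINETUNE_IMAGE          # hypothetical placeholder for the prebuilt fine-tuning image
        resources:
          limits:
            nvidia.com/gpu: "2"        # two GPUs per Pod, four GPUs in total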
- View the fine-tune-l4.yaml file. This manifest defines the fine-tuning Job.

- Apply the manifest to create the fine-tuning Job:
cd ${EXAMPLE_HOME}/workloads
sed -e "s/<MODEL_BUCKET>/$MODEL_BUCKET/g" \
    -e "s/<PROJECT_ID>/$PROJECT_ID/g" \
    -e "s/<REGION>/$REGION/g" \
    fine-tune-l4.yaml | kubectl apply --filename=- --namespace=llm
Verify that your Deployments are running. To check the status of the Workload objects, run the following command:
kubectl get workloads --namespace=llm
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  29m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  68m
Next, view the Pods in the llm namespace by running this command:
kubectl get pod --namespace=llm
The output should look similar to the following:
NAME                                    READY   STATUS    RESTARTS   AGE
finetune-gemma-l4-0-vcxpz               2/2     Running   0          31m
finetune-gemma-l4-1-9ppt9               2/2     Running   0          31m
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running   0          70m
The output shows that Kueue admits both your fine-tune Job and inference server Pods to run, reserving the correct resources based on your specified quota limits.
View the output logs to verify that your fine-tuning Job saves checkpoints to the Cloud Storage bucket. The fine-tuning Job takes around 10 minutes before it starts saving the first checkpoint.
kubectl logs --namespace=llm --follow --selector=app=finetune-job
The output for the first saved checkpoint looks similar to the following:
{"name": "finetune", "thread": 133763559483200, "threadName": "MainThread", "processName": "MainProcess", "process": 33, "message": "Fine tuning started", "timestamp": 1731002351.0016131, "level": "INFO", "runtime": 451579.89835739136}
…
{"name": "accelerate.utils.fsdp_utils", "thread": 136658669348672, "threadName": "MainThread", "processName": "MainProcess", "process": 32, "message": "Saving model to /model-data/model-gemma2/experiment/checkpoint-10/pytorch_model_fsdp_0", "timestamp": 1731002386.1763802, "level": "INFO", "runtime": 486753.8924217224}
Test Kueue preemption and dynamic allocation on your mixed workload
In this section, you simulate a scenario where the inference server's load increases, requiring it to scale up. This scenario demonstrates how Kueue prioritizes the high-priority inference server by suspending and preempting the lower-priority fine-tuning Job when resources are constrained.
Run the following command to scale the inference server's replicas to two:
kubectl scale --replicas=2 deployment/tgi-gemma-deployment --namespace=llm
Check the status of the Workload objects:
kubectl get workloads --namespace=llm
The output looks similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq                      False                 32m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  70m
pod-tgi-gemma-deployment-6cb95cc7f5-p49sh-167de   lq      cluster-queue   True                  14s
The output shows that the fine-tuning Job is no longer admitted because the increased inference server replicas are using the available GPU quota.
Check the status of the fine-tune Job:
kubectl get job --namespace=llm
The output looks similar to the following, indicating that the fine-tune Job status is now suspended:
NAME                STATUS      COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Suspended   0/2                      33m
Run the following command to inspect your Pods:
kubectl get pod --namespace=llm
The output looks similar to the following, indicating that Kueue terminated the fine-tune Job Pods to free resources for the higher-priority inference server Deployment.
NAME                                    READY   STATUS              RESTARTS   AGE
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running             0          72m
tgi-gemma-deployment-6cb95cc7f5-p49sh   0/1     ContainerCreating   0          91s
Next, test the scenario where the inference server load decreases and its Pods scale down. Run the following command:
kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
Run the following command to display the Workload objects:
kubectl get workloads --namespace=llm
The output looks similar to the following, indicating that one of the inference server replicas was terminated, and that the fine-tune Job is re-admitted.
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  37m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  75m
Run this command to display the Jobs:
kubectl get job --namespace=llm
The output looks similar to the following, indicating that the fine-tune Job is running again, resuming from the latest available checkpoint.
NAME                STATUS    COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Running   0/2           2m11s      38m
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:
gcloud storage rm --recursive gs://${MODEL_BUCKET}
gcloud container clusters delete ${CLUSTER_NAME} --location ${REGION}