This tutorial shows you how to efficiently share accelerator resources between training- and inference-serving workloads within a single Google Kubernetes Engine (GKE) cluster. By distributing your mixed workloads across a single cluster, you improve resource utilization, simplify cluster management, reduce issues from accelerator quantity limitations, and enhance overall cost-effectiveness.
In this tutorial, you create a high-priority serving Deployment that uses the Gemma 2 large language model (LLM) for inference and the Hugging Face Text Generation Inference (TGI) serving framework, along with a low-priority LLM fine-tuning Job. Both workloads run on a single cluster that uses NVIDIA L4 GPUs. You use Kueue, an open source Kubernetes-native Job queueing system, to manage and schedule your workloads. Kueue lets you prioritize serving tasks and preempt lower-priority training Jobs to optimize resource utilization. As serving demands decrease, you reallocate the freed-up accelerators to resume training Jobs. You use Kueue and priority classes to manage resource quotas throughout the process.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to train and host an ML model on a GKE cluster, and who also want to reduce costs and management overhead, especially when dealing with a limited number of accelerators. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Objectives
By the end of this guide, you should be able to perform the following steps:
- Configure a high-priority serving Deployment.
- Set up lower-priority training Jobs.
- Implement preemption strategies to address varying demand.
- Manage resource allocation between training and serving tasks using Kueue.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required APIs.
- Make sure that you have the following roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin.

Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for GPUs. To learn more, see About GPUs and Allocation quotas.
Prepare the environment
In this section, you provision the resources that you need to deploy TGI and the model for your inference and training workloads.
Get access to the model
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, then generate a Hugging Face access token.
- Sign the license consent agreement. Access the model consent page, verify consent using your Hugging Face account, and accept the model terms.
- Generate an access token. To access the model through Hugging Face, you need a Hugging Face token. Follow these steps to generate a new token if you don't have one already:
  - Click Your Profile > Settings > Access Tokens.
  - Select New Token.
  - Specify a Name of your choice and a Role of at least Read.
  - Select Generate a token.
  - Copy the generated token to your clipboard.
Launch Cloud Shell
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl, the gcloud CLI, and Terraform.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
Replace PROJECT_ID with your Google Cloud project ID.
Clone the sample code from GitHub. In Cloud Shell, run the following commands:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/
cd kubernetes-engine-samples/ai-ml/mix-train-and-inference
export EXAMPLE_HOME=$(pwd)
Create a GKE cluster
You can use an Autopilot or Standard cluster for your mixed workloads. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
Set the default environment variables in Cloud Shell:
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
Replace the following values:
- HF_TOKEN: the Hugging Face token you generated earlier.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.

You can adjust the MODEL_BUCKET variable, which represents the Cloud Storage bucket where you store your trained model weights.
Create an Autopilot cluster:
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --release-channel=rapid
Create the Cloud Storage bucket for the fine-tuning job:
gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access
To grant access to the Cloud Storage bucket, run this command:
gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None
To get authentication credentials for the cluster, run this command:
gcloud container clusters get-credentials llm-cluster \
    --region=$REGION \
    --project=$PROJECT_ID
Create a namespace for your Deployments. In Cloud Shell, run the following command:
kubectl create ns llm
Standard
Set the default environment variables in Cloud Shell:
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export GPU_POOL_MACHINE_TYPE="g2-standard-24"
export GPU_POOL_ACCELERATOR_TYPE="nvidia-l4"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
Replace the following values:
- HF_TOKEN: the Hugging Face token you generated earlier.
- REGION: the region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.
You can adjust these variables:
- GPU_POOL_MACHINE_TYPE: the node pool machine series that you want to use in your selected region. This value depends on the accelerator type you selected. To learn more, see Limitations of using GPUs on GKE. For example, this tutorial uses g2-standard-24 with two GPUs attached per node. For the most up-to-date list of available GPUs, see GPUs for Compute Workloads.
- GPU_POOL_ACCELERATOR_TYPE: the accelerator type that's supported in your selected region. For example, this tutorial uses nvidia-l4. For the latest list of available GPUs, see GPUs for Compute Workloads.
- MODEL_BUCKET: the Cloud Storage bucket where you store your trained model weights.
Create a Standard cluster:
gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --release-channel=rapid \
    --machine-type=e2-standard-4 \
    --addons GcsFuseCsiDriver \
    --num-nodes=1
Create the GPU node pool for inference and fine-tuning workloads:
gcloud container node-pools create gpupool \
    --accelerator type=${GPU_POOL_ACCELERATOR_TYPE},count=2,gpu-driver-version=latest \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --node-locations=${REGION}-a \
    --cluster=${CLUSTER_NAME} \
    --machine-type=${GPU_POOL_MACHINE_TYPE} \
    --num-nodes=3
Create the Cloud Storage bucket for the fine-tuning job:
gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access
To grant access to the Cloud Storage bucket, run this command:
gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None
To get authentication credentials for the cluster, run this command:
gcloud container clusters get-credentials llm-cluster \
    --region=$REGION \
    --project=$PROJECT_ID
Create a namespace for your Deployments. In Cloud Shell, run the following command:
kubectl create ns llm
Create a Kubernetes Secret for Hugging Face credentials
To create a Kubernetes Secret that contains the Hugging Face token, run the following command:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=$HF_TOKEN \
--dry-run=client -o yaml | kubectl apply --namespace=llm --filename=-
Configure Kueue
In this tutorial, Kueue is the central resource manager, enabling efficient sharing of GPUs between your training and serving workloads. Kueue achieves this by defining resource requirements ("flavors"), prioritizing workloads through queues (with serving tasks prioritized over training), and dynamically allocating resources based on demand and priority. This tutorial uses the Workload resource type to group the inference and fine-tuning workloads.
Kueue's preemption feature ensures that high-priority serving workloads always have the necessary resources by pausing or evicting lower-priority training Jobs when resources are scarce.
To control the inference server Deployment with Kueue, you enable the v1/pod integration by applying a custom configuration using Kustomize, to ensure that the server Pods are labeled with "kueue-job: true".
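Conceptually, the patched Kueue configuration resembles the following minimal sketch. It assumes that Kueue reads its settings from a kueue-manager-config ConfigMap in the kueue-system namespace; the actual kustomization.yaml and patch.yaml files in the sample repository may differ.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config   # assumed name of Kueue's configuration ConfigMap
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    integrations:
      frameworks:
      - "batch/job"
      - "pod"                  # enable the v1/pod integration for the serving Deployment
      podOptions:
        # Skip system namespaces and manage only Pods that carry the kueue-job label.
        namespaceSelector:
          matchExpressions:
          - key: kubernetes.io/metadata.name
            operator: NotIn
            values: [kube-system, kueue-system]
        podSelector:
          matchLabels:
            kueue-job: "true"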
- In the /kueue directory, view the code in kustomization.yaml. This manifest installs the Kueue resource manager with custom configurations.

- In the /kueue directory, view the code in patch.yaml. This ConfigMap customizes Kueue to manage Pods with the "kueue-job: true" label.

- In Cloud Shell, run the following command to install Kueue:
cd ${EXAMPLE_HOME}
kubectl kustomize kueue | kubectl apply --server-side --filename=-
Wait until the Kueue Pods are ready:
watch kubectl --namespace=kueue-system get pods
The output should look similar to the following:
NAME                                       READY   STATUS    RESTARTS   AGE
kueue-controller-manager-bdc956fc4-vhcmx   2/2     Running   0          3m15s
- In the /workloads directory, view the flavors.yaml, cluster-queue.yaml, and local-queue.yaml files. These manifests specify how Kueue manages resource quotas. A hedged sketch of the objects they define appears at the end of this section.

  ResourceFlavor

  This manifest defines a default ResourceFlavor in Kueue for resource management.

  ClusterQueue

  This manifest sets up a Kueue ClusterQueue with resource limits for CPU, memory, and GPU.

  This tutorial uses nodes with two NVIDIA L4 GPUs attached, with the corresponding node type of g2-standard-24, offering 24 vCPU and 96 GB RAM. The example code shows how to limit your workload's resource usage to a maximum of six GPUs.

  The preemption field in the ClusterQueue configuration references the PriorityClasses to determine which Pods can be preempted when resources are scarce.

  LocalQueue

  This manifest creates a Kueue LocalQueue named lq in the llm namespace.

- View the default-priorityclass.yaml, low-priorityclass.yaml, and high-priorityclass.yaml files. These manifests define the PriorityClass objects for Kubernetes scheduling.

  Default priority

  Low priority

  High priority
Create the Kueue and Kubernetes objects by running these commands to apply the corresponding manifests.
cd ${EXAMPLE_HOME}/workloads
kubectl apply --filename=flavors.yaml
kubectl apply --filename=default-priorityclass.yaml
kubectl apply --filename=high-priorityclass.yaml
kubectl apply --filename=low-priorityclass.yaml
kubectl apply --filename=cluster-queue.yaml
kubectl apply --filename=local-queue.yaml --namespace=llm
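For reference, the following is a minimal sketch of the kinds of objects that these manifests define, based on the names and the six-GPU quota described above. The CPU and memory quotas, the PriorityClass values, and the preemption policy shown here are assumptions; the actual files in the /workloads directory may differ.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}                 # admit Workloads from any namespace
  preemption:
    withinClusterQueue: LowerPriority   # high-priority serving can preempt low-priority training
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 72                # assumed: 3 x g2-standard-24 nodes
      - name: "memory"
        nominalQuota: 288Gi             # assumed: 3 x 96 GB
      - name: "nvidia.com/gpu"
        nominalQuota: 6                 # at most six L4 GPUs admitted at a time
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq
  namespace: llm
spec:
  clusterQueue: cluster-queue
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority                   # assumed name; used by the inference Deployment
value: 1000
description: "High priority for the TGI inference server."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority                    # assumed name; used by the fine-tuning Job
value: 100
description: "Low priority for the fine-tuning Job."

With a quota of six GPUs and two GPUs per serving replica, at most three inference replicas, or one inference replica plus the four GPUs of the fine-tuning Job, can be admitted at a time.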
Deploy the TGI inference server
In this section, you deploy the TGI container to serve the Gemma 2 model.
- In the /workloads directory, view the tgi-gemma-2-9b-it-hp.yaml file. This manifest defines a Kubernetes Deployment that deploys the TGI serving runtime and the gemma-2-9b-it model.

  The Deployment prioritizes inference tasks and uses two GPUs for the model. It uses tensor parallelism, by setting the NUM_SHARD environment variable, to fit the model into GPU memory. A hedged sketch of the key parts of this manifest appears at the end of this section.

- Apply the manifest by running the following command:
kubectl apply --filename=tgi-gemma-2-9b-it-hp.yaml --namespace=llm
The deployment operation will take a few minutes to complete.
To check if GKE successfully created the Deployment, run the following command:
kubectl --namespace=llm get deployment
The output should look similar to the following:
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
tgi-gemma-deployment   1/1     1            1           5m13s
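For reference, here is a minimal sketch of the parts of this Deployment that matter for Kueue and GPU scheduling. The container image (shown as a hypothetical TGI_IMAGE placeholder), label names, probes, and exact resource requests are assumptions; check tgi-gemma-2-9b-it-hp.yaml in the repository for the actual values.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        kueue-job: "true"                 # lets Kueue manage these Pods through the v1/pod integration
        kueue.x-k8s.io/queue-name: lq     # submit the Pods through the LocalQueue
    spec:
      priorityClassName: high-priority    # assumed name of the high PriorityClass
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: inference-server
        image: TGI_IMAGE                  # hypothetical placeholder for the TGI serving image
        env:
        - name: MODEL_ID
          value: google/gemma-2-9b-it
        - name: NUM_SHARD
          value: "2"                      # tensor parallelism across the two L4 GPUs
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: "2"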
Verify Kueue quota management
In this section, you confirm that Kueue is correctly enforcing the GPU quota for your Deployment.
To check if Kueue is aware of your Deployment, run this command to retrieve the status of the Workload objects:
kubectl --namespace=llm get workloads
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6bf9ffdc9b-zcfrh-84f19   lq      cluster-queue   True                  8m23s
To test what happens when you exceed the quota limits, scale the Deployment to four replicas:
kubectl scale --replicas=4 deployment/tgi-gemma-deployment --namespace=llm
Run the following command to see the number of replicas that GKE deploys:
kubectl get workloads --namespace=llm
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6cb95cc7f5-5thgr-3f7d4   lq      cluster-queue   True                  14s
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  5m41s
pod-tgi-gemma-deployment-6cb95cc7f5-tznkl-80f6b   lq                                            13s
pod-tgi-gemma-deployment-6cb95cc7f5-wd4q9-e4302   lq      cluster-queue   True                  13s
The output shows that only three Pods are admitted due to the resource quota that Kueue enforces.
Run the following command to display the Pods in the llm namespace:
kubectl get pod --namespace=llm
The output should look similar to the following:
NAME                                    READY   STATUS            RESTARTS   AGE
tgi-gemma-deployment-7649884d64-6j256   1/1     Running           0          4m45s
tgi-gemma-deployment-7649884d64-drpvc   0/1     SchedulingGated   0          7s
tgi-gemma-deployment-7649884d64-thdkq   0/1     Pending           0          7s
tgi-gemma-deployment-7649884d64-znvpb   0/1     Pending           0          7s
Now, scale the Deployment back down to one replica. This step is required before you deploy the fine-tuning Job; otherwise, the Job isn't admitted because the inference Deployment has priority for the available GPU quota.
kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
Explanation of the behavior
The scaling example results in only three replicas (despite scaling to four) because of the GPU quota limit that you set in the ClusterQueue configuration. The ClusterQueue's spec.resourceGroups section defines a nominalQuota of "6" for nvidia.com/gpu. The Deployment specifies that each Pod requires "2" GPUs. Therefore, the ClusterQueue can only accommodate a maximum of three replicas of the Deployment at a time (since 3 replicas * 2 GPUs per replica = 6 GPUs, which is the total quota).

When you attempt to scale to four replicas, Kueue recognizes that this action would exceed the GPU quota and prevents the fourth replica from being scheduled. This is indicated by the SchedulingGated status of the fourth Pod. This behavior demonstrates Kueue's resource quota enforcement.
Deploy the training Job
In this section, you deploy a lower-priority fine-tuning Job for a Gemma 2 model that requires four GPUs across two Pods. This Job uses the remaining GPU quota in the ClusterQueue. The Job uses a prebuilt image and saves checkpoints to allow restarting from intermediate results.
The fine-tuning Job uses the b-mc2/sql-create-context dataset. You can find the source code for the fine-tuning Job in the repository.
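For reference, the following is a minimal sketch of the Job-level fields that let Kueue manage this workload. The container image (shown as a hypothetical FINETUNE_IMAGE placeholder), the checkpoint volume configuration, and the dataset handling are omitted or assumed; see fine-tune-l4.yaml in the repository for the actual manifest.

apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-gemma-l4
  labels:
    kueue.x-k8s.io/queue-name: lq      # submit the Job through the LocalQueue
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed              # two workers fine-tune the model together
  suspend: true                        # Kueue unsuspends the Job when quota is available
  template:
    metadata:
      labels:
        app: finetune-job
    spec:
      priorityClassName: low-priority  # assumed name of the low PriorityClass
      restartPolicy: OnFailure
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
      - name: finetuner
        image: FINETUNE_IMAGE          # hypothetical placeholder for the prebuilt fine-tuning image
        resources:
          limits:
            nvidia.com/gpu: "2"        # two GPUs per Pod, four GPUs in total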
- View the fine-tune-l4.yaml file. This manifest defines the fine-tuning Job.

- Apply the manifest to create the fine-tuning Job:
cd ${EXAMPLE_HOME}/workloads
sed -e "s/<MODEL_BUCKET>/$MODEL_BUCKET/g" \
    -e "s/<PROJECT_ID>/$PROJECT_ID/g" \
    -e "s/<REGION>/$REGION/g" \
    fine-tune-l4.yaml | kubectl apply --filename=- --namespace=llm
Verify that your Deployments are running. To check the status of the Workload objects, run the following command:
kubectl get workloads --namespace=llm
The output should look similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  29m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  68m
Next, view the Pods in the llm namespace by running this command:
kubectl get pod --namespace=llm
The output should look similar to the following:
NAME                                    READY   STATUS    RESTARTS   AGE
finetune-gemma-l4-0-vcxpz               2/2     Running   0          31m
finetune-gemma-l4-1-9ppt9               2/2     Running   0          31m
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running   0          70m
The output shows that Kueue admits both your fine-tune Job and inference server Pods to run, reserving the correct resources based on your specified quota limits.
View the output logs to verify that your fine-tuning Job saves checkpoints to the Cloud Storage bucket. The fine-tuning Job takes around 10 minutes before it starts saving the first checkpoint.
kubectl logs --namespace=llm --follow --selector=app=finetune-job
The output for the first saved checkpoint looks similar to the following:
{"name": "finetune", "thread": 133763559483200, "threadName": "MainThread", "processName": "MainProcess", "process": 33, "message": "Fine tuning started", "timestamp": 1731002351.0016131, "level": "INFO", "runtime": 451579.89835739136}
…
{"name": "accelerate.utils.fsdp_utils", "thread": 136658669348672, "threadName": "MainThread", "processName": "MainProcess", "process": 32, "message": "Saving model to /model-data/model-gemma2/experiment/checkpoint-10/pytorch_model_fsdp_0", "timestamp": 1731002386.1763802, "level": "INFO", "runtime": 486753.8924217224}
Test Kueue preemption and dynamic allocation on your mixed workload
In this section, you simulate a scenario where the inference server's load increases, requiring it to scale up. This scenario demonstrates how Kueue prioritizes the high-priority inference server by suspending and preempting the lower-priority fine-tuning Job when resources are constrained.
Run the following command to scale the inference server's replicas to two:
kubectl scale --replicas=2 deployment/tgi-gemma-deployment --namespace=llm
Check the status of the Workload objects:
kubectl get workloads --namespace=llm
The output looks similar to the following:
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq                      False                 32m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  70m
pod-tgi-gemma-deployment-6cb95cc7f5-p49sh-167de   lq      cluster-queue   True                  14s
The output shows that the fine-tuning Job is no longer admitted because the increased inference server replicas are using the available GPU quota.
Check the status of the fine-tune Job:
kubectl get job --namespace=llm
The output looks similar to the following, indicating that the fine-tune Job status is now suspended:
NAME                STATUS      COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Suspended   0/2                      33m
Run the following command to inspect your Pods:
kubectl get pod --namespace=llm
The output looks similar to the following, indicating that Kueue terminated the fine-tune Job Pods to free resources for the higher-priority inference server Deployment.
NAME                                    READY   STATUS              RESTARTS   AGE
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running             0          72m
tgi-gemma-deployment-6cb95cc7f5-p49sh   0/1     ContainerCreating   0          91s
Next, test the scenario where the inference server load decreases and its Pods scale down. Run the following command:
kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
Run the following command to display the Workload objects:
kubectl get workloads --namespace=llm
The output looks similar to the following, indicating that one of the inference server replicas was terminated, and that the fine-tune Job is re-admitted.
NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  37m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  75m
Run this command to display the Jobs:
kubectl get job --namespace=llm
The output looks similar to the following, indicating that the fine-tune Job is running again, resuming from the latest available checkpoint.
NAME                STATUS    COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Running   0/2           2m11s      38m
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:
gcloud storage rm --recursive gs://${MODEL_BUCKET}
gcloud container clusters delete ${CLUSTER_NAME} --location ${REGION}