This guide shows you how to optimize GPU provisioning for medium- and small-scale training workloads by using flex-start provisioning mode. In this guide, you use flex-start provisioning mode to deploy a workload that consists of two Kubernetes Jobs, each requiring one GPU. GKE automatically provisions a single node with two A100 GPUs to run both Jobs.
If your workload requires multi-node distributed processing, consider using flex-start provisioning mode with queued provisioning. For more information, see Run a large-scale workload with flex-start with queued provisioning.
This guide is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities for running batch workloads. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`.
- Ensure that you have an Autopilot cluster or a Standard cluster running version 1.32.2-gke.1652000 or later. You can check the cluster version with the command shown after this list.
- Ensure that you're familiar with limitations of flex-start provisioning mode.
- When using a Standard cluster, ensure that you maintain at least one node pool without flex-start provisioning mode enabled for the cluster to function correctly.
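To check which version your cluster's control plane is running, you can query it with gcloud; for example:

```sh
gcloud container clusters describe CLUSTER_NAME \
    --location LOCATION_NAME \
    --format="value(currentMasterVersion)"
```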
Create a node pool with flex-start provisioning mode
To create a node pool with flex-start provisioning mode enabled on an existing Standard cluster, you can use the gcloud CLI or Terraform.
If you use a cluster in Autopilot mode, skip this section and go to the Run a training workload section.
gcloud
Create a node pool with flex-start provisioning mode:
```sh
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location LOCATION_NAME \
    --project CLUSTER_PROJECT_ID \
    --accelerator=type=nvidia-a100-80gb,count=2 \
    --machine-type=a2-ultragpu-2g \
    --flex-start \
    --num-nodes=0 \
    --enable-autoscaling \
    --total-min-nodes=0 \
    --total-max-nodes=5 \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair
```
Replace the following:

- NODE_POOL_NAME: the name that you choose for your node pool.
- CLUSTER_NAME: the name of the Standard cluster that you want to modify.
- LOCATION_NAME: the compute region for the cluster control plane.
- CLUSTER_PROJECT_ID: the ID of the project that contains the cluster.
In this command, the `--flex-start` flag instructs `gcloud` to create a node pool with flex-start provisioning mode enabled.

GKE creates a node pool with nodes that contain two A100 GPUs (`a2-ultragpu-2g`). The node pool automatically scales from zero nodes to a maximum of five nodes.

Verify the status of flex-start provisioning mode in the node pool:
```sh
gcloud container node-pools describe NODE_POOL_NAME \
    --cluster CLUSTER_NAME \
    --location LOCATION_NAME \
    --format="get(config.flexStart)"
```
If flex-start provisioning mode is enabled in the node pool, the `flexStart` field is set to `True`.
Terraform
You can enable flex-start provisioning mode with GPUs by using Terraform.
Add the following block to your Terraform configuration:
resource "google_container_node_pool" " "gpu_dws_pool" { name = "gpu-dws-pool" queued_provisioning { enabled = false } } node_config { machine_type = "a3-highgpu-8g" flex_start = true }
Terraform calls Google Cloud APIs to create a node pool that uses flex-start provisioning mode with GPUs. The node pool initially has zero nodes and autoscaling is enabled. To learn more about Terraform, see the `google_container_node_pool` resource spec on terraform.io.
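If you apply this configuration yourself, the standard Terraform workflow applies; for example, from the directory that contains your configuration:

```sh
# Download the Google provider and initialize the working directory.
terraform init

# Preview the changes, then create the node pool.
terraform plan
terraform apply
```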
Run a training workload
In this section, you create two Kubernetes Jobs that require one GPU each. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the console.
Create a file named `dws-flex-start.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-1
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
      containers:
      - name: gpu-container-1
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        args: ["10s"] # Sleep for 10 seconds
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-2
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
      containers:
      - name: gpu-container-2
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        args: ["10s"] # Sleep for 10 seconds
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
```
Apply the `dws-flex-start.yaml` manifest:

```sh
kubectl apply -f dws-flex-start.yaml
```
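While GKE provisions capacity for the flex-start node, the Pods remain in the Pending state. To watch the Pods transition to Running, you can use:

```sh
kubectl get pods -w
```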
Verify that the Jobs are running on the same node:
```sh
kubectl get pods -l "job-name in (gpu-job-1,gpu-job-2)" -o wide
```
The output is similar to the following:
```
NAME        READY   STATUS      RESTARTS   AGE   IP        NODE                NOMINATED NODE   READINESS GATES
gpu-job-1   0/1     Completed   0          19m   10.(...)  gke-flex-zonal-a2   <none>           <none>
gpu-job-2   0/1     Completed   0          19m   10.(...)  gke-flex-zonal-a2   <none>           <none>
```
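Optionally, to block until both Jobs have finished (for example, in a script), you can use `kubectl wait`:

```sh
kubectl wait --for=condition=complete job/gpu-job-1 job/gpu-job-2 --timeout=600s
```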
Clean up
To avoid incurring charges to your Google Cloud account for the resources that you used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
Delete the Jobs:

```sh
kubectl delete job -l "job-name in (gpu-job-1,gpu-job-2)"
```
Delete the node pool:

```sh
gcloud container node-pools delete NODE_POOL_NAME \
    --cluster CLUSTER_NAME \
    --location LOCATION_NAME
```
Delete the cluster:

```sh
gcloud container clusters delete CLUSTER_NAME \
    --location LOCATION_NAME
```
What's next
- Learn more about GPUs in GKE.
- Learn more about node auto-provisioning.
- Read Best practices for running batch workloads on GKE.