Orchestrate Multislice workloads using JobSet and Kueue

Autopilot Standard

This tutorial shows how to orchestrate multiple Multislice workloads on Google Kubernetes Engine (GKE) to improve resource utilization. You deploy a JAX workload as an example, run it on TPU Multislice, and implement Job queueing with JobSet and Kueue. Kueue determines when Jobs should run based on available resources, quotas, and a hierarchy for fair sharing among teams.

This tutorial is for Machine Learning (ML) engineers and Platform admins and operators who are interested in the container orchestration capabilities of Kubernetes for training LLMs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, make sure that you're familiar with the following:

Current TPU version availability with Cloud TPU system architecture
TPU Multislice in GKE

Objectives

Prepare your environment with a GKE cluster that has three TPU v5e slices. Each TPU slice has a 2x4 topology with 8 chips, for a total of 24 TPU v5e chips.
Create Kueue resources to ensure that quota is shared fairly between workloads.
Run your Multislice workloads.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. By setting default locations, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location.

Install JobSet v0.2.3 or later.
Install Kueue v0.4.1 or later.

Prepare the environment

In the Google Cloud console, start a Cloud Shell instance:
Open Cloud Shell
Set the default environment variables by using the gcloud config set command:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project ID.

Autopilot clusters that run version 1.29.2-gke.1521000 or later enable TPUs by default. TPUs on Autopilot clusters are configured in the workload specification. For more information, see the Define your Multislice workloads with JobSet section.

Create a GKE cluster

In Cloud Shell, create a GKE cluster:

Autopilot

gcloud container clusters create-auto multislice-cluster \
    --location=LOCATION \
    --cluster-version 1.29.2-gke.1521000 \
    --release-channel rapid

In this command:

The --location flag specifies the Compute Engine location of the cluster.
The --cluster-version flag specifies the Kubernetes version for your cluster.
The --release-channel flag specifies the release channel for your cluster. In this case, the rapid channel supports the latest versions available in GKE.

Standard

gcloud container clusters create multislice-cluster \
    --location=LOCATION

Replace LOCATION with the location where you want to create your cluster. Make sure that the location has capacity for the ct5lp-hightpu-4t machine type. Cluster creation might take several minutes.

If you use GKE Autopilot mode, skip to the Create Kueue resources section. Autopilot clusters that run version 1.29.2-gke.1521000 or later enable TPUs by default.

Create three Standard mode TPU slice node pools

In this section, you create TPU node pools by using the gcloud beta container node-pools create command.

Create the first node pool named nodepool1:

gcloud beta container node-pools create nodepool1 \
    --location=LOCATION \
    --cluster=multislice-cluster \
    --node-locations=NODE_LOCATION \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=2x4 \
    --project=PROJECT_ID

Replace NODE_LOCATION with one or more zones in the cluster region in which you want to create the nodes.

Create the second node pool named nodepool2:

gcloud beta container node-pools create nodepool2 \
    --location=LOCATION \
    --cluster=multislice-cluster \
    --node-locations=NODE_LOCATION \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=2x4 \
    --project=PROJECT_ID

Create the third node pool named nodepool3:

gcloud beta container node-pools create nodepool3 \
    --location=LOCATION \
    --cluster=multislice-cluster \
    --node-locations=NODE_LOCATION \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=2x4 \
    --project=PROJECT_ID

GKE creates three node pools. Each node pool is a separate TPU slice.

In the preceding steps, you used the gcloud beta container node-pools create command to create the node pools. The command uses the following flags:

--node-locations: a comma-separated list of one or more zones where GKE creates the node pools.
--machine-type: the type of machine to use for the nodes. In this case, you used ct5lp-hightpu-4t. For more information about TPU-compatible machine types, use the table in Choose the TPU version.
--tpu-topology: the TPU topology to use for the node pool. In this case, you used 2x4. For more information about TPU topologies, see Choose the TPU topology.
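
The slice arithmetic behind these flags can be checked with a short Python sketch. It assumes, as the JobSet manifests in this tutorial do with their google.com/tpu: 4 limit, that each ct5lp-hightpu-4t node exposes 4 TPU chips. The helper name slice_shape is hypothetical and exists only for illustration.

```python
from math import prod

CHIPS_PER_NODE = 4  # assumption: one ct5lp-hightpu-4t VM exposes 4 TPU chips


def slice_shape(topology: str, chips_per_node: int = CHIPS_PER_NODE) -> tuple[int, int]:
    """Return (chips, nodes) for a TPU slice with the given topology string."""
    chips = prod(int(dim) for dim in topology.split("x"))
    if chips % chips_per_node:
        raise ValueError(f"{topology} does not divide evenly into {chips_per_node}-chip nodes")
    return chips, chips // chips_per_node


chips, nodes = slice_shape("2x4")
total_chips = 3 * chips  # three node pools, one slice each
print(chips, nodes, total_chips)  # 8 2 24
```

Under that assumption, a 2x4 slice spans two nodes, which is why each replicated Job in the manifests runs two Pods.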

Create Kueue resources

Create the following kueue.yaml manifest:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "vlp-24"
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: "vlp-24"
      resources:
      - name: "google.com/tpu"
        nominalQuota: 24

---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: multislice-queue
spec:
  clusterQueue: cluster-queue

Apply the kueue.yaml manifest:
```
kubectl apply -f kueue.yaml
```

GKE creates the following Kueue resources:

ResourceFlavor: An abstraction of the resources in a cluster. In this example, GKE creates three TPU slices with a 2x4 topology. Each TPU slice has 8 chips (24 TPU chips in total).
ClusterQueue: A global queue that manages workloads and cluster resources.
LocalQueue: Groups closely related workloads that are typically run by a single tenant (user). Each LocalQueue points to a ClusterQueue from which resources are allocated to run its workloads. A Kueue Workload is an abstraction that represents a batch workload; in this case, each workload is a JobSet.

Define your Multislice workloads with JobSet

In this section, you create three JobSets. A JobSet is a workload API that lets you manage a group of Kubernetes Jobs as a unit. The most common use case for a JobSet is distributed training, but you can also use it to run batch workloads.

The following JobSets run a JAX workload that outputs the global number of TPU chips in the slice, then sleeps for 60 seconds to simulate model training time, and then exits.

Install the JobSet API on your cluster:

VERSION=v0.8.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml

Create the following jobsets-multislice.yaml manifest:

Autopilot

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-1slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                command:
                - bash
                - -c
                - |
                  pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                  python -c 'import jax; print("Global device count:", jax.device_count())'
                resources:
                  limits:
                    google.com/tpu: 4

---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-2slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 2
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                command:
                - bash
                - -c
                - |
                  pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                  python -c 'import jax; print("Global device count:", jax.device_count())'
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-3slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 3
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                command:
                - bash
                - -c
                - |
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4

Standard

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-1slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                securityContext:
                  privileged: true
                command:
                - bash
                - -c
                - |
                  pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                  python -c 'import jax; print("Global device count:", jax.device_count())'
                resources:
                  limits:
                    google.com/tpu: 4

---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-2slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 2
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                securityContext:
                  privileged: true
                command:
                - bash
                - -c
                - |
                  pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                  python -c 'import jax; print("Global device count:", jax.device_count())'
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-3slice
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 3
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                securityContext:
                  privileged: true
                command:
                - bash
                - -c
                - |
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4

Apply the jobsets-multislice.yaml manifest:

kubectl apply -f jobsets-multislice.yaml

GKE creates the Jobs with the following resource requests:

The multislice-1slice JobSet creates one Job that requires one TPU slice in total.
The multislice-2slice JobSet creates two Jobs that require two TPU slices in total.
The multislice-3slice JobSet creates three Jobs that require three TPU slices in total.

Because the cluster has only three TPU slices, not all JobSets can run at once. When Kueue admits the multislice-3slice JobSet, its three Jobs occupy all three slices and run alone to completion. The multislice-1slice and multislice-2slice JobSets wait and then run together afterwards.
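
The queueing behavior described above can be approximated with a small Python simulation. This is a deliberately simplified sketch, not Kueue's actual admission algorithm: it assumes that multislice-3slice is considered first (as in the sample output of this tutorial), that a JobSet is admitted only if its full TPU demand fits in the remaining quota, and that JobSets that do not fit are skipped rather than blocking the queue. The function names demand and admit_round are hypothetical.

```python
from collections import deque

QUOTA = 24          # nominalQuota for google.com/tpu in cluster-queue
CHIPS_PER_POD = 4   # google.com/tpu limit per Pod in the manifests
PODS_PER_SLICE = 2  # parallelism of each replicated Job

# (name, number of slices); the ordering is an assumption taken from the sample output.
pending = deque([
    ("multislice-3slice", 3),
    ("multislice-2slice", 2),
    ("multislice-1slice", 1),
])


def demand(slices: int) -> int:
    """Total TPU chips that a JobSet with the given number of slices requests."""
    return slices * PODS_PER_SLICE * CHIPS_PER_POD


def admit_round(pending: deque, free: int) -> list[str]:
    """Admit every queued JobSet that still fits, skipping the ones that do not."""
    admitted = []
    for name, slices in list(pending):  # iterate over a copy while removing
        if demand(slices) <= free:
            free -= demand(slices)
            pending.remove((name, slices))
            admitted.append(name)
    return admitted


first = admit_round(pending, QUOTA)
second = admit_round(pending, QUOTA)  # after the first round finishes and frees the quota
print(first)   # ['multislice-3slice']
print(second)  # ['multislice-2slice', 'multislice-1slice']
```

The three-slice JobSet requests all 24 chips, so nothing else fits until it completes; the remaining two JobSets request 16 and 8 chips and can then share the quota.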

Verify that Kueue admitted the workloads

Check the enqueued workloads in Kueue:

kubectl get workloads

The output is similar to the following:

NAME                             QUEUE              ADMITTED BY     AGE
jobset-multislice-1slice-2530a   multislice-queue                   3s
jobset-multislice-2slice-ffb02   multislice-queue                   4s
jobset-multislice-3slice-8c695   multislice-queue   cluster-queue   10s

Kueue enqueues one or more workloads, depending on the TPU resources they require.

Monitor the workloads

JobSet and node pool observability metrics and dashboards in the Google Cloud console are generally available.

Dashboards

To view the status of your TPU multi-host node pools on GKE, go to the GKE TPU Node Pool Status dashboard provided by Cloud Monitoring:

Go to GKE TPU Node Pool Status

For more information, see Monitor health metrics for TPU nodes and node pools.

On the Kubernetes Engine AI/ML page in the Google Cloud console, the AI deployment > Jobs tab displays a JobSet monitoring dashboard with comprehensive information about the health and performance of JobSets and their underlying infrastructure, such as JobSet status, replica readiness, and replica state. The dashboard also includes infrastructure metrics, including CPU, GPU, TPU, memory, and storage metrics. For more information, see Monitor JobSet health with metrics.

Monitor which Pods are running

kubectl get pods

The output is similar to the following:

NAME                                READY   STATUS      RESTARTS   AGE
multislice-1slice-slice-0-0-pf2ll   1/1     Running     0          1s
multislice-1slice-slice-0-1-55g62   1/1     Running     0          1s
multislice-2slice-slice-0-0-f4hf7   1/1     Running     0          3s
multislice-2slice-slice-0-1-c8kv7   1/1     Running     0          3s
multislice-2slice-slice-1-0-7h46t   1/1     Running     0          3s
multislice-2slice-slice-1-1-lj9hb   1/1     Running     0          3s
multislice-3slice-slice-0-0-wzq9t   0/1     Completed   0          2m31s
multislice-3slice-slice-0-1-zf4dp   0/1     Completed   0          2m30s
multislice-3slice-slice-1-0-hbfn5   0/1     Completed   0          2m31s
multislice-3slice-slice-1-1-45fgl   0/1     Completed   0          2m30s
multislice-3slice-slice-2-0-wjbp4   0/1     Completed   0          2m30s
multislice-3slice-slice-2-1-lwnvs   0/1     Completed   0          2m30s

Verify that GKE scheduled, created, and ran the Pods for multislice-3slice first. Then, GKE ran the Pods from the multislice-1slice and multislice-2slice JobSets.

Monitor JobSet health with metrics

To understand whether a JobSet is running as expected, or to infer whether it was interrupted, you can use Prometheus metrics from the JobSet metrics package, such as kube_jobset_succeeded_replicas.

Note that JobSet health metrics are supported only on GKE version 1.32.1-gke.135700 or later. JobSet health metrics are enabled by default in newly created clusters with supported versions. For existing clusters that are upgraded to supported versions, you must enable the JobSet metrics package manually. For more information, see the documentation.

For this tutorial, check the JobSet completion with this PromQL query:

kube_jobset_succeeded_replicas{
  cluster="multislice-cluster",
  jobset_name=~"multislice-.*"}

Monitor JobSet uptime, times to recover (TTR), and times between interruptions (TBI)

The following metrics are useful for monitoring the availability of a JobSet:

kubernetes.io/jobset/uptime: Total time that the JobSet has been available.
kubernetes.io/jobset/times_to_recover: Distribution of the recovery period for a JobSet. Each sample indicates a single recovery event from a downtime period for the JobSet.
kubernetes.io/jobset/times_between_interruptions: Distribution of the interval between the end of the previous interruption and the beginning of the current interruption for a JobSet. Each sample indicates a single duration between the previous and the current interruption.

These metrics apply to JobSets that have exactly one GPU or TPU replicated job. The calculation of the metrics is based only on the availability of that single replicated job. The metrics are supported in all GKE versions.

To view the uptime for the JobSets that you used in this tutorial, run the following PromQL query:

avg_over_time(
  kubernetes_io:jobset_uptime{
    monitored_resource="k8s_entity", entity_type="jobset",
    entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}])

To view the TBI distributions for the JobSets from this tutorial, run the following PromQL query:

histogram_quantile(0.50,
  sum_over_time(
    kubernetes_io:jobset_times_between_interruptions_bucket{
      monitored_resource="k8s_entity",entity_type="jobset",
      entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))
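
The median that histogram_quantile(0.50, ...) reports is an estimate interpolated from bucket counts, not an exact sample value. The following Python sketch mirrors the linear interpolation that open-source Prometheus documents for classic histograms; the PromQL implementation in Cloud Monitoring might differ in edge cases. The bucket boundaries and counts are hypothetical values chosen for illustration, not data from this tutorial.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound, contain at least one finite
    bucket, and end with a +Inf bucket that holds the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative TBI buckets, in seconds.
tbi_buckets = [(600, 2), (3600, 6), (86400, 10), (math.inf, 10)]
print(histogram_quantile(0.50, tbi_buckets))  # 2850.0
```

With these sample buckets, the median falls in the 600 to 3600 second bucket, so the estimate is 2850 seconds even if no interruption interval was exactly that long.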

You can extend the interval of the query to a longer time horizon, such as 7 days, and compute the mean time between interruptions (MTBI) over this period:

sum(sum_over_time(
  kubernetes_io:jobset_times_between_interruptions_sum{
    monitored_resource="k8s_entity",entity_type="jobset",
    entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))
/
sum(sum_over_time(
  kubernetes_io:jobset_times_between_interruptions_count{
    monitored_resource="k8s_entity",entity_type="jobset",
    entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))
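
The query above divides the summed interval durations by the number of recorded intervals. A minimal Python sketch of the same ratio follows; the per-JobSet totals are hypothetical 7-day values, and the function name mean_time_between_interruptions is illustrative only.

```python
# Hypothetical 7-day totals of the *_sum and *_count series for two JobSets.
tbi_sum_seconds = {"multislice-2slice": 172_800.0, "multislice-3slice": 86_400.0}
tbi_count = {"multislice-2slice": 4, "multislice-3slice": 2}


def mean_time_between_interruptions(sums: dict[str, float], counts: dict[str, int]) -> float:
    """Aggregate mean across all JobSets: total observed seconds / total samples."""
    total_count = sum(counts.values())
    if total_count == 0:
        raise ValueError("no interruptions recorded in the window")
    return sum(sums.values()) / total_count


mtbi = mean_time_between_interruptions(tbi_sum_seconds, tbi_count)
print(mtbi / 3600)  # 12.0 (hours)
```

Because the sums and counts are aggregated before dividing, a JobSet with many interruptions weighs more heavily in the result than one with few. The same ratio applies to the times_to_recover series when you compute the mean time to recover.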

To view the TTR distributions, you can run the following PromQL query:

histogram_quantile(0.50,
  sum_over_time(
    kubernetes_io:jobset_times_to_recover_bucket{
      monitored_resource="k8s_entity",entity_type="jobset",
      entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))

After you increase the query interval to a longer time horizon, such as 7 days, you can compute the mean time to recover (MTTR) over this period:

sum(sum_over_time(
  kubernetes_io:jobset_times_to_recover_sum{
    monitored_resource="k8s_entity",entity_type="jobset",
    entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))
/
sum(sum_over_time(
  kubernetes_io:jobset_times_to_recover_count{
    monitored_resource="k8s_entity",entity_type="jobset",
    entity_name=~"multislice-.*",cluster_name="multislice-cluster"}[${__interval}]))

Enable Kueue workload priorities and preemption

Optionally, you can assign Kueue workload priorities, which determine the order in which Kueue admits enqueued workloads.

Update your ClusterQueue to have a preemption policy:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "vlp-24"
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
    cloud.google.com/gke-tpu-topology: 2x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: "vlp-24"
      resources:
      - name: "google.com/tpu"
        nominalQuota: 24
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: multislice-queue
spec:
  clusterQueue: cluster-queue

Create a PriorityClass for each distinct priority level that you want to assign to workloads:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "This low priority class should be used for some Pods only."

Assign the priorityClassName to your JobSet:

Autopilot

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-priority
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              priorityClassName: low-priority
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                command:
                - bash
                - -c
                - |
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4 # Number of TPU chips per worker

Standard

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-priority
  labels:
    kueue.x-k8s.io/queue-name: multislice-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: slice
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 2x4
              priorityClassName: low-priority
              containers:
              - name: jax-tpu
                image: python:3.8
                ports:
                - containerPort: 8471
                - containerPort: 8080
                securityContext:
                  privileged: true
                command:
                - bash
                - -c
                - |
                  sleep 60
                resources:
                  limits:
                    google.com/tpu: 4 # Number of TPU chips per worker

GKE includes a preemption policy, which defines how Kueue assigns the available resources. The policy specifies that a workload can be preempted if a higher-priority workload needs the resources. Workloads with a lower priority value are more likely to be preempted by higher-priority workloads.
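
To make the effect of withinClusterQueue: LowerPriority more concrete, the following Python sketch picks which admitted workloads would have to yield so that an incoming workload fits within the quota. It is a simplified model under stated assumptions, not Kueue's actual preemption algorithm, which takes more factors into account. The Workload class, the pick_victims function, and the workload names training and urgent with priority 1000 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int
    chips: int


def pick_victims(admitted: list[Workload], incoming: Workload, quota: int) -> list[Workload]:
    """Choose lower-priority workloads to preempt so that `incoming` fits.

    Returns an empty list when no preemption is needed, or when evicting
    every lower-priority workload would still not free enough quota.
    """
    free = quota - sum(w.chips for w in admitted)
    needed = incoming.chips - free
    if needed <= 0:
        return []
    victims = []
    # Consider the lowest-priority workloads first.
    for w in sorted(admitted, key=lambda w: w.priority):
        if w.priority >= incoming.priority:
            break  # only strictly lower priorities may be preempted
        victims.append(w)
        needed -= w.chips
        if needed <= 0:
            return victims
    return []


admitted = [Workload("low-priority", 100, 8), Workload("training", 1000, 16)]
urgent = Workload("urgent", 1000, 8)
print([w.name for w in pick_victims(admitted, urgent, quota=24)])  # ['low-priority']
```

In this sketch, the low-priority JobSet is the only candidate because the other admitted workload has the same priority as the incoming one.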

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

Delete the JobSets and the Kueue resources:

kubectl delete -f jobsets-multislice.yaml
kubectl delete -f kueue.yaml

Delete the cluster:

gcloud container clusters delete multislice-cluster --location=LOCATION

What's next

Learn more about Kueue.
Learn how to Implement a Job queueing system with quota sharing between namespaces on GKE.
