Create a GKE Cluster with Pathways

You can use the Accelerated Processing Kit (XPK) to create pre-configured Google Kubernetes Engine (GKE) clusters for Pathways-based workloads. Alternatively, you can create GKE clusters for Pathways-based workloads manually with gcloud.

Before you begin

Make sure you have:

Installed Kubernetes tools
Installed XPK
Enabled the TPU API
Enabled the Google Kubernetes Engine API
Had your Google Cloud project allowlisted for Pathways

Set up your local environment

gcloud auth application-default login

Define the following environment variables with values appropriate to your workload.

Required variables

Create a GKE cluster

In the following example, you create a cluster with two v5e 2x4 node pools. You can create a cluster using XPK or the gcloud command.

XPK

Set the following environment variables:
```
CLUSTER_NODEPOOL_COUNT=CLUSTER_NODEPOOL_COUNT
PROJECT=PROJECT_ID
ZONE=ZONE
CLUSTER=GKE_CLUSTER_NAME
TPU_TYPE="v5litepod-8"
PW_CPU_MACHINE_TYPE="n2-standard-64"
NETWORK=NETWORK
SUBNETWORK=SUB_NETWORK
```
Replace the following:
- CLUSTER_NODEPOOL_COUNT: the maximum number of node pools a workload can use
- PROJECT_ID: your Google Cloud project ID
- ZONE: the zone where you are creating resources
- GKE_CLUSTER_NAME: the GKE cluster name
- TPU_TYPE: the TPU type. For more information, see supported types in XPK
- PW_CPU_MACHINE_TYPE: the CPU node type for the Pathways controller
- NETWORK: [Optional] the name of a Virtual Private Cloud network. The network must exist before you create the cluster
- SUB_NETWORK: [Optional] the name of a subnetwork. The subnetwork must exist before you create the cluster
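
Before running a command that provisions capacity, it can help to confirm that every variable it reads is actually set. The following is a minimal sketch, not part of XPK or gcloud; `require_vars` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper (not part of XPK or gcloud): report every variable
# named in the argument list that is unset or empty.
# The body runs in a subshell so its working variables do not leak.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  exit "${missing}"
)
```

For example, `require_vars CLUSTER_NODEPOOL_COUNT PROJECT ZONE CLUSTER TPU_TYPE PW_CPU_MACHINE_TYPE` returns a non-zero status and names each missing variable on standard error.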

Use XPK to create a GKE Pathways cluster. This command can take several minutes to provision the capacity. When the command completes, the capacity is allocated and you begin incurring charges.

xpk cluster create-pathways \
    --num-slices=${CLUSTER_NODEPOOL_COUNT} \
    --tpu-type=${TPU_TYPE} \
    --pathways-gce-machine-type=${PW_CPU_MACHINE_TYPE} \
    --on-demand \
    --project=${PROJECT} \
    --zone=${ZONE} \
    --cluster=${CLUSTER} \
    --custom-cluster-arguments="--network=${NETWORK} --subnetwork=${SUBNETWORK} --enable-ip-alias"
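
NETWORK and SUBNETWORK are described as optional, yet the command above always interpolates them, which yields empty `--network=` and `--subnetwork=` values when they are unset. One way to handle this is to build the argument string conditionally. This is a sketch under assumptions: `custom_cluster_args` is a hypothetical helper, and leaving the string empty when no network is given (rather than keeping `--enable-ip-alias` on its own) is a choice made here, not something the source specifies.

```shell
# Hypothetical helper: print the value for --custom-cluster-arguments.
# Prints the network flags only when both NETWORK and SUBNETWORK are set;
# otherwise prints nothing (an assumption, see the note above).
custom_cluster_args() {
  if [ -n "${NETWORK:-}" ] && [ -n "${SUBNETWORK:-}" ]; then
    echo "--network=${NETWORK} --subnetwork=${SUBNETWORK} --enable-ip-alias"
  fi
}
```

The result can then be passed as `--custom-cluster-arguments="$(custom_cluster_args)"`.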

Once the cluster is created, you can create and delete workloads as needed. You don't need to re-provision the TPU capacity.

gcloud

Set the following environment variables:
```
CLUSTER=GKE_CLUSTER_NAME
PROJECT=PROJECT_ID
ZONE=ZONE
REGION=REGION
CLUSTER_VERSION=GKE_CLUSTER_VERSION
PW_CPU_MACHINE_TYPE="n2-standard-64"
NETWORK=NETWORK
SUBNETWORK=SUB_NETWORK
CLUSTER_NODEPOOL_COUNT=2
TPU_MACHINE_TYPE="ct5lp-hightpu-4t"
WORKERS_PER_SLICE=2
TOPOLOGY="2x4"
NUM_CPU_NODES=1
```
Replace the following:
- GKE_CLUSTER_NAME: the GKE cluster name
- PROJECT_ID: your Google Cloud project ID
- ZONE: the zone where you are creating resources
- REGION: the region where you are creating resources
- GKE_CLUSTER_VERSION: [Optional] the GKE cluster version. Use 1.32.2-gke.1475000 or later
- PW_CPU_MACHINE_TYPE: the CPU node type for the Pathways controller
- NETWORK: [Optional] the name of a Virtual Private Cloud network. The network must exist before you create the cluster
- SUB_NETWORK: [Optional] the name of a subnetwork. The subnetwork must exist before you create the cluster
- CLUSTER_NODEPOOL_COUNT: the maximum number of node pools a workload can use
- TPU_MACHINE_TYPE: the TPU machine type you want to use
- WORKERS_PER_SLICE: the number of nodes per node pool
- GKE_ACCELERATOR_TYPE: the Google Kubernetes Engine accelerator type, see Choose a TPU version
- TOPOLOGY: the TPU topology
- NUM_CPU_NODES: the Pathways CPU node pool size
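
Some of these values depend on each other: REGION is the zone name without its final suffix, and WORKERS_PER_SLICE follows from the topology and the number of TPU chips on each node. The sketch below derives both. The helper names are hypothetical, it handles only two-dimensional topologies such as 2x4, and the default of 4 chips per node is an assumption that matches the ct5lp-hightpu-4t machine type used in this example.

```shell
# Derive a region from a zone by dropping the final "-<suffix>",
# for example "us-west4-a" becomes "us-west4".
zone_to_region() {
  echo "${1%-*}"
}

# Compute the number of nodes in one slice from a 2D topology such as "2x4".
# $2 is the number of TPU chips per node; 4 is assumed when it is omitted.
# The body runs in a subshell so its working variables do not leak.
workers_per_slice() (
  topology="$1"
  chips_per_node="${2:-4}"
  rows="${topology%x*}"
  cols="${topology#*x}"
  echo $(( ${rows} * ${cols} / ${chips_per_node} ))
)
```

With these helpers, `REGION="$(zone_to_region "${ZONE}")"` and `WORKERS_PER_SLICE="$(workers_per_slice "${TOPOLOGY}")"` keep the derived values in step with ZONE and TOPOLOGY.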

The following steps explain how to create a GKE cluster and set it up for running Pathways workloads.

Create a GKE cluster:

gcloud beta container clusters create ${CLUSTER} \
    --project=${PROJECT} \
    --zone=${ZONE} \
    --cluster-version=${CLUSTER_VERSION} \
    --scopes=storage-full,gke-default,cloud-platform \
    --machine-type=${PW_CPU_MACHINE_TYPE} \
    --network=${NETWORK} \
    --subnetwork=${SUBNETWORK}

Create TPU node pools:

for i in $(seq 1 ${CLUSTER_NODEPOOL_COUNT}); do
    gcloud container node-pools create "tpu-np-${i}" \
        --project=${PROJECT} \
        --zone=${ZONE} \
        --cluster=${CLUSTER} \
        --machine-type=${TPU_MACHINE_TYPE} \
        --num-nodes=${WORKERS_PER_SLICE} \
        --placement-type=COMPACT \
        --tpu-topology=${TOPOLOGY} \
        --scopes=storage-full,gke-default,cloud-platform \
        --workload-metadata=GCE_METADATA
done
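
Because each iteration of this loop provisions billable TPU capacity, you may want to preview the exact commands first. The following is a minimal sketch of a dry-run wrapper; `run` and `DRY_RUN` are hypothetical names introduced here, not gcloud features. The wrapper prints by default and executes only when DRY_RUN is explicitly set to 0.

```shell
# Hypothetical wrapper (not part of gcloud): print the command instead of
# running it, unless DRY_RUN is explicitly set to 0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}
```

To use it, prefix the gcloud command inside the loop with `run`, review the printed commands, and then repeat the loop with `DRY_RUN=0`.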

Create a CPU node pool:

gcloud container node-pools create "cpu-pathways-np" \
    --project=${PROJECT} \
    --zone=${ZONE} \
    --cluster=${CLUSTER} \
    --machine-type=${PW_CPU_MACHINE_TYPE} \
    --num-nodes=${NUM_CPU_NODES} \
    --scopes=storage-full,gke-default,cloud-platform \
    --workload-metadata=GCE_METADATA

Install the JobSet and PathwaysJob APIs

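Get credentials for the cluster and add them to your local kubectl context. In the following command, specify either --zone or --region, whichever matches the cluster's location type, but not both.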

gcloud container clusters get-credentials ${CLUSTER} \
    [--zone=${ZONE} | --region=${REGION}] \
    --project=${PROJECT} \
    && kubectl config set-context --current --namespace=default
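
The bracketed `[--zone=... | --region=...]` notation is not literal shell syntax; exactly one of the two flags is passed. As a sketch of how a script might make that choice explicit, the hypothetical helper below (`location_flag` is not a gcloud feature) prints the flag for a zonal or regional cluster from the ZONE and REGION variables.

```shell
# Hypothetical helper: print the gcloud location flag for the given
# cluster location type, using the ZONE and REGION variables.
location_flag() {
  case "$1" in
    zonal)    echo "--zone=${ZONE}" ;;
    regional) echo "--region=${REGION}" ;;
    *)
      echo "usage: location_flag zonal|regional" >&2
      return 1
      ;;
  esac
}
```

For a zonal cluster, the call would look like `gcloud container clusters get-credentials ${CLUSTER} "$(location_flag zonal)" --project=${PROJECT}`.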

To use the Pathways architecture on your GKE cluster, you need to install the JobSet API and the PathwaysJob API.

kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.2/install.yaml