You can use the Accelerated Processing Kit (XPK) to create pre-configured Google Kubernetes Engine (GKE) clusters for Pathways-based workloads. You can also use gcloud to create GKE clusters for Pathways-based workloads manually.
Before you begin
Make sure you have:
- Installed Kubernetes tools
- Installed XPK
- Enabled the TPU API
- Enabled the Google Kubernetes Engine API
- Confirmed that your Google Cloud project is allowlisted for Pathways
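As a quick way to check the tool prerequisites above, the following sketch reports which command-line tools are missing from your PATH. It assumes the tools are invoked as gcloud, kubectl, and xpk. The helper name missing_tools is hypothetical, and the service names tpu.googleapis.com and container.googleapis.com in the printed hint are assumptions that do not appear on this page.

```shell
# Print every tool from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not part of gcloud or XPK.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed command names for the prerequisites listed above.
missing="$(missing_tools gcloud kubectl xpk)"

if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  # Assumed service names for the TPU API and the GKE API.
  echo "Tools found. To enable the APIs, you could run:"
  echo "  gcloud services enable tpu.googleapis.com container.googleapis.com"
fi
```

The sketch only prints a suggestion and never changes your project. The allowlist requirement can't be checked this way.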
Set up your local environment
Log in with your Google Cloud credentials.
gcloud auth application-default login
Define the following environment variables with values appropriate to your workload.
Required variables
Create a GKE cluster
In the following example, you create a cluster with two v5e 2x4 node pools.
You can create a cluster using XPK or the gcloud command.
XPK
Set some environment variables:
CLUSTER_NODEPOOL_COUNT=CLUSTER_NODEPOOL_COUNT
PROJECT=PROJECT_ID
ZONE=ZONE
CLUSTER=GKE_CLUSTER_NAME
TPU_TYPE="v5litepod-8"
PW_CPU_MACHINE_TYPE="n2-standard-64"
NETWORK=NETWORK
SUBNETWORK=SUB_NETWORK
Replace the following:
- CLUSTER_NODEPOOL_COUNT: the maximum number of node pools a workload can use
- PROJECT_ID: your Google Cloud project ID
- ZONE: the zone where you are creating resources
- CLUSTER: the GKE cluster name
- TPU_TYPE: the TPU type. For more information, see supported types in XPK
- PW_CPU_MACHINE_TYPE: the CPU node type for the Pathways controller
- NETWORK: [Optional] the name of a Virtual Private Cloud network. The network must exist before you create your cluster
- SUBNETWORK: [Optional] the name of a subnetwork. The subnetwork must exist before you create your cluster
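Because the cluster creation command reads all of these values from the environment, an empty variable can produce a confusing error only after the command starts. The following is a minimal sketch of a pre-flight check. The helper name require_vars is hypothetical, and the check treats NETWORK and SUBNETWORK as optional, as described above.

```shell
# Return non-zero if any of the named variables is unset or empty.
# require_vars is a hypothetical helper, not part of XPK or gcloud.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    # Only pass trusted variable names, because eval is used here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_vars CLUSTER_NODEPOOL_COUNT PROJECT ZONE CLUSTER TPU_TYPE PW_CPU_MACHINE_TYPE; then
  echo "All required variables are set."
else
  echo "Set the variables listed above before running xpk."
fi
```

The check only confirms that the values are non-empty. It does not verify that the project, zone, or TPU type exist.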
Use XPK to create a GKE Pathways cluster. This command can take several minutes to provision the capacity. After the command completes, your capacity is allocated and you start incurring charges.
xpk cluster create-pathways \
  --num-slices=${CLUSTER_NODEPOOL_COUNT} \
  --tpu-type=${TPU_TYPE} \
  --pathways-gce-machine-type=${PW_CPU_MACHINE_TYPE} \
  --on-demand \
  --project=${PROJECT} \
  --zone=${ZONE} \
  --cluster=${CLUSTER} \
  --custom-cluster-arguments="--network=${NETWORK} --subnetwork=${SUBNETWORK} --enable-ip-alias"
Once the cluster is created, you can create and delete workloads as needed. You don't need to re-provision the TPU capacity.
gcloud
Set some environment variables:
CLUSTER=GKE_CLUSTER_NAME
PROJECT=PROJECT_ID
ZONE=ZONE
REGION=REGION
CLUSTER_VERSION=GKE_CLUSTER_VERSION
PW_CPU_MACHINE_TYPE="n2-standard-64"
NETWORK=NETWORK
SUBNETWORK=SUB_NETWORK
CLUSTER_NODEPOOL_COUNT=2
TPU_MACHINE_TYPE="ct5lp-hightpu-4t"
WORKERS_PER_SLICE=2
TOPOLOGY="2x4"
NUM_CPU_NODES=1
Replace the following:
- CLUSTER: the GKE cluster name
- PROJECT_ID: your Google Cloud project ID
- ZONE: the zone where you are creating resources
- REGION: the region where you are creating resources
- CLUSTER_VERSION: [Optional] the GKE cluster version. Use 1.32.2-gke.1475000 or later
- PW_CPU_MACHINE_TYPE: the CPU node type for the Pathways controller
- NETWORK: [Optional] the name of a Virtual Private Cloud network. The network must exist before you create your cluster
- SUBNETWORK: [Optional] the name of a subnetwork. The subnetwork must exist before you create your cluster
- CLUSTER_NODEPOOL_COUNT: the maximum number of node pools a workload can use
- TPU_MACHINE_TYPE: the TPU machine type you want to use
- WORKERS_PER_SLICE: the number of nodes per node pool
- GKE_ACCELERATOR_TYPE: the Google Kubernetes Engine accelerator type, see Choose a TPU version
- TOPOLOGY: the TPU topology
- NUM_CPU_NODES: the Pathways CPU node pool size
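The values of TOPOLOGY, TPU_MACHINE_TYPE, and WORKERS_PER_SLICE have to agree with each other. As a worked check, the sketch below multiplies the topology dimensions to get the chip count and divides by the chips per VM. It assumes that ct5lp-hightpu-4t exposes 4 TPU chips per VM, which the 4t suffix suggests but this page does not state. The helper names are hypothetical.

```shell
# Multiply the dimensions of a topology string such as "2x4" or "4x4x4".
chips_in_topology() {
  product=1
  for dim in $(echo "$1" | tr 'x' ' '); do
    product=$((product * dim))
  done
  echo "$product"
}

# Nodes per node pool = chips in the topology / chips per VM.
# $1 is the topology, $2 is the assumed number of chips per VM.
workers_per_slice() {
  echo $(( $(chips_in_topology "$1") / $2 ))
}

# 2x4 topology with an assumed 4 chips per VM: 8 / 4 = 2 nodes.
workers_per_slice "2x4" 4
```

Under that assumption the result is 2, which matches the WORKERS_PER_SLICE=2 value used in this example.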
The following steps explain how to create a GKE cluster and set it up for running Pathways workloads.
Create a GKE cluster:
gcloud beta container clusters create ${CLUSTER} \
  --project=${PROJECT} \
  --zone=${ZONE} \
  --cluster-version=${CLUSTER_VERSION} \
  --scopes=storage-full,gke-default,cloud-platform \
  --machine-type ${PW_CPU_MACHINE_TYPE} \
  --network=${NETWORK} \
  --subnetwork=${SUBNETWORK}
Create TPU node pools:
for i in $(seq 1 ${CLUSTER_NODEPOOL_COUNT}); do
  gcloud container node-pools create "tpu-np-${i}" \
    --project=${PROJECT} \
    --zone=${ZONE} \
    --cluster=${CLUSTER} \
    --machine-type=${TPU_MACHINE_TYPE} \
    --num-nodes=${WORKERS_PER_SLICE} \
    --placement-type=COMPACT \
    --tpu-topology=${TOPOLOGY} \
    --scopes=storage-full,gke-default,cloud-platform \
    --workload-metadata=GCE_METADATA
done
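Each iteration of the loop provisions TPU capacity, so it can be useful to preview the node pool names first. The following sketch prints the names the loop above would generate without calling gcloud. The function name tpu_node_pool_names is hypothetical, and the default of 2 mirrors the CLUSTER_NODEPOOL_COUNT value used in this example.

```shell
# Print the node pool names that a loop over 1..COUNT would create.
# tpu_node_pool_names is a hypothetical helper; it creates nothing.
tpu_node_pool_names() {
  for i in $(seq 1 "$1"); do
    echo "tpu-np-${i}"
  done
}

# Falls back to 2 when CLUSTER_NODEPOOL_COUNT is not set.
tpu_node_pool_names "${CLUSTER_NODEPOOL_COUNT:-2}"
```

With a count of 2, the names are tpu-np-1 and tpu-np-2. After the real loop finishes, you could compare them against the output of gcloud container node-pools list for your cluster.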
Create a CPU node pool:
gcloud container node-pools create "cpu-pathways-np" \
  --project ${PROJECT} \
  --zone ${ZONE} \
  --cluster ${CLUSTER} \
  --machine-type ${PW_CPU_MACHINE_TYPE} \
  --num-nodes ${NUM_CPU_NODES} \
  --scopes=storage-full,gke-default,cloud-platform \
  --workload-metadata=GCE_METADATA
Install the JobSet and PathwaysJob APIs
Get credentials for the cluster and add them to your local kubectl context. The cluster in this example is zonal, so the command passes --zone. For a regional cluster, pass --region=${REGION} instead.
gcloud container clusters get-credentials ${CLUSTER} \
  --zone=${ZONE} \
  --project=${PROJECT} \
  && kubectl config set-context --current --namespace=default
To use the Pathways architecture on your GKE cluster, you need to install the JobSet API and the PathwaysJob API.
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.1/install.yaml
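After installing the APIs, one way to confirm that they registered is to look for their custom resource definitions (CRDs). The sketch below filters the CRD list by name instead of hard-coding exact CRD names, because this page does not give them. The helper name filter_crds is hypothetical, and the check assumes the CRD names contain "jobset" or "pathways".

```shell
# Keep only lines that mention JobSet or Pathways (case-insensitive).
# filter_crds is a hypothetical helper that reads from standard input.
filter_crds() {
  grep -i -E 'jobset|pathways'
}

if command -v kubectl >/dev/null 2>&1; then
  # List the CRDs in the current context and keep the matching ones.
  kubectl get crd --request-timeout=10s 2>/dev/null | filter_crds \
    || echo "No JobSet or PathwaysJob CRDs found in the current context."
else
  echo "kubectl not found on PATH."
fi
```

An empty result can also mean that kubectl points at a different cluster, so check the current context before reinstalling.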
What's next
- Run a batch workload with Pathways
- Pathways interactive mode
- Multihost inference with Pathways
- Resilient training with Pathways
- Porting JAX workloads to Pathways
- Troubleshoot Pathways