This guide shows you how to serve state-of-the-art large language models (LLMs) such as Llama 3.1 405B on Google Kubernetes Engine (GKE) using tensor processing units (TPUs) across multiple nodes.
This guide demonstrates how to use portable open-source technologies—Kubernetes, JetStream, Pathways on Cloud, and the LeaderWorkerSet (LWS) API—to deploy and serve AI/ML workloads on GKE, by taking advantage of GKE's granular control, scalability, resilience, portability, and cost-effectiveness.
Background
Large language models have grown to the point where they no longer fit on a single-host TPU slice. For ML inference, you can use Pathways on Cloud to run large-scale multi-host inference on GKE across multiple interconnected TPU nodes. In this guide, you walk through how to provision a GKE cluster with multi-host TPU slices, use the Pathways on Cloud binaries, launch the JetStream server with the MaxText framework, and make multi-host inference requests.
By serving an LLM using TPUs on GKE with JetStream, MaxText, and Pathways, you can build a robust, production-ready serving solution with all the benefits of managed Kubernetes, including cost-efficiency, scalability, and higher availability. This section describes the key technologies used in this tutorial.
About TPUs
TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models that are built using frameworks such as TensorFlow, PyTorch, and JAX.
Before you use TPUs in GKE, we recommend that you complete the following learning path:
- Learn about current TPU version availability with the Cloud TPU system architecture.
- Learn about TPUs in GKE.
This tutorial covers serving the Llama 3.1-405B model. GKE deploys the model on multi-host TPU v6e nodes with TPU topologies that are configured based on the model requirements for serving prompts with low latency.
Pathways on Cloud
Pathways is a large-scale orchestration layer for accelerators. Pathways is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. Pathways enables a single JAX client process to coordinate computation across one or more large TPU slices, streamlining ML computations that span hundreds or thousands of TPU chips.
JetStream
JetStream is an open source inference serving framework developed by Google. JetStream enables high-performance, high-throughput, and memory-optimized inference on TPUs and GPUs. JetStream provides advanced performance optimizations, including continuous batching, KV cache optimizations, and quantization techniques, to facilitate LLM deployment. JetStream supports PyTorch/XLA and JAX TPU serving to optimize performance.
MaxText
MaxText is a performant, scalable, and adaptable JAX LLM implementation, built on open source JAX libraries such as Flax, Orbax, and Optax. MaxText's decoder-only LLM implementation is written in Python. It leverages the XLA compiler heavily to achieve high performance without needing to build custom kernels.
For more information about the latest models and parameter sizes that MaxText supports, see the MaxText project repository.
Llama 3.1 405B
Llama 3.1 405B is a large language model by Meta that's designed for a range of natural language processing tasks, including text generation, translation, and question answering. GKE offers the infrastructure required to support the distributed training and serving needs of models of this scale.
For more information, see the Llama documentation.
Architecture
This section describes the GKE architecture used in this tutorial. The architecture includes a GKE Standard cluster that provisions TPUs and hosts JetStream and Pathways components to deploy and serve the model.
The following diagram shows you the components of this architecture:
This architecture includes the following components:
- A GKE Standard regional cluster.
- A multi-host TPU slice node pool that hosts the JetStream deployment and Pathways components.
- The Pathways resource manager manages accelerator resources and coordinates the allocation of accelerators for user jobs.
- The Pathways client coordinates with the Pathways resource manager to determine where the compiled programs are placed for execution.
- The Pathways worker runs and performs computations on accelerator machines, and sends data back to your workload through the IFRT proxy server.
- The IFRT proxy client implements the open source Interim Framework Runtime (IFRT) API and acts as the communication bridge between your workload and Pathways components.
- The IFRT proxy server receives requests from the IFRT proxy client and forwards them to the Pathways client, distributing the work.
- The JetStream-Pathways container provides a JAX-based inference server that receives inference requests and delegates their execution to the Pathways workers.
- The Service component spreads inbound traffic to all JetStream HTTP replicas.
- JetStream HTTP is an HTTP server that accepts requests as a wrapper in JetStream's required format and sends them to JetStream's gRPC client.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required API.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, and roles/resourcemanager.projectIamAdmin.

  Check for the roles
  1. In the Google Cloud console, go to the IAM page.
  2. Select the project.
  3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

  Grant the roles
  1. In the Google Cloud console, go to the IAM page.
  2. Select the project.
  3. Click Grant access.
  4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
  5. In the Select a role list, select a role.
  6. To grant additional roles, click Add another role and add each additional role.
  7. Click Save.
- Ensure that you have sufficient quota for sixteen TPU v6e PodSlice Lite chips. In this tutorial, you use on-demand instances.
- Ensure your Google Cloud project is allowlisted for Pathways.
Get access to the model
To get access to the Meta Llama 3.1-405B checkpoint for deployment to GKE, follow these steps:
- Sign the license consent agreement.
- Access the Meta Llama downloads page.
- Review and accept the model Terms and Conditions to obtain the URL needed to download the model.
- To download the model checkpoint, find the model ID for the appropriate model. For a list of supported models and their IDs, see the llama CLI documentation. For example, use Llama 3.1-405B-Instruct:bf16-mp16 for the Llama 3.1-405B model.
Prepare the environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export BUCKET_NAME=BUCKET_NAME
export REGION=REGION
export LOCATION=LOCATION
export CLUSTER_VERSION=CLUSTER_VERSION
export MACHINE_TYPE=ct6e-standard-4t
export TPU_TYPE=v6e
export TOPOLOGY=4x4
export WORKERS_PER_SLICE=4
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- BUCKET_NAME: the name of your Cloud Storage bucket. You don't need to specify the gs:// prefix.
- REGION: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region contains zones where TPU v6e machine types are available (for example, us-east1, us-east5, europe-west4, asia-northeast1, or us-south1).
- LOCATION: the zone where the TPU resources are available (for example, us-east1-d).
- CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. For a list of minimum GKE versions available by TPU machine type, see TPU availability in GKE.
- MACHINE_TYPE: the v6e machine type.
- TPU_TYPE: a prefix used for naming node pools (v6e).
- TOPOLOGY: the TPU v6e topology.
- WORKERS_PER_SLICE: the number of nodes per node pool or TPU slice.
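For example, a filled-in environment might look like the following. The project, cluster, and bucket names are illustrative; the region and zone must be ones where TPU v6e is available, and the cluster version must support your TPU machine type:

gcloud config set project my-gke-project                 # illustrative project ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=jetstream-pathways-cluster           # illustrative cluster name
export BUCKET_NAME=my-llama-checkpoints                  # illustrative bucket name (no gs:// prefix)
export REGION=us-east5
export LOCATION=us-east5-b                               # a zone in REGION with TPU v6e capacity
export CLUSTER_VERSION=CLUSTER_VERSION                   # a GKE version that supports ct6e-standard-4t
export MACHINE_TYPE=ct6e-standard-4t
export TPU_TYPE=v6e
export TOPOLOGY=4x4
export WORKERS_PER_SLICE=4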
Create and configure Google Cloud resources
To create the required resources, follow these instructions:
Create a GKE cluster
Create a regional GKE Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --cluster-version=CLUSTER_VERSION \
    --region=REGION \
    --scopes=cloud-platform \
    --machine-type=n2-standard-32
The cluster creation might take several minutes.
Replace CLUSTER_VERSION with the appropriate cluster version.

Create one TPU v6e node pool with a 4x4 topology and four nodes:

gcloud container node-pools create multihost-np \
    --project=PROJECT_ID \
    --region=REGION \
    --node-locations=LOCATION \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --num-nodes=WORKERS_PER_SLICE \
    --tpu-topology=TOPOLOGY \
    --scopes=cloud-platform \
    --placement-type=COMPACT \
    --workload-metadata=GCE_METADATA
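Optionally, confirm that the node pool was created with the expected machine type and node count:

gcloud container node-pools list \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --project=PROJECT_ID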
Configure a service account for Storage Object access
Configure a Kubernetes service account to act as an IAM service account.
Create an IAM service account for your application:
gcloud iam service-accounts create jetstream-pathways
Add an IAM policy binding for your IAM service account to manage Cloud Storage. This is to enable your IAM service account to access the storage bucket where your checkpoint will be stored:
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member "serviceAccount:jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role roles/storage.objectUser

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member "serviceAccount:jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role roles/storage.insightsCollectorService
Annotate the Kubernetes service account with the email address of the IAM service account.
kubectl annotate serviceaccount default \
    iam.gke.io/gcp-service-account=jetstream-pathways@${PROJECT_ID}.iam.gserviceaccount.com
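To confirm that the annotation was applied, you can describe the Kubernetes service account and check for the iam.gke.io/gcp-service-account annotation:

kubectl get serviceaccount default -o yaml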
Configure Docker to authenticate to Artifact Registry
Configure Docker to authenticate to Artifact Registry so it can pull the allowlisted Pathways images:
gcloud auth login
gcloud auth configure-docker
Checkpoint conversion
To convert a Meta Llama 3.1-405B checkpoint to a MaxText-compatible int8 inference checkpoint, complete the steps in Checkpoint conversion with Llama3.1-405B. Your deployment uses the checkpoint with the load_parameters_path flag.
Create a Cloud Storage bucket to store Pathways temporary files
Create a Cloud Storage bucket to store your Pathways temporary files, like the compilation cache:
export PATHWAYS_BUCKET=PATHWAYS_BUCKET
gcloud storage buckets create gs://$PATHWAYS_BUCKET
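You can confirm that the bucket exists before moving on:

gcloud storage buckets describe gs://$PATHWAYS_BUCKET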
Deploy JetStream-MaxText and Pathways
Deploy the JetStream-MaxText and Pathways model server.
Connect to the GKE cluster
gcloud container clusters get-credentials "${CLUSTER_NAME}" --project "${PROJECT_ID}" --location "${REGION}"
Deploy the LeaderWorkerSet (LWS) API
LWS is a custom resource designed for deploying and managing stateful, distributed applications, particularly those with a leader-worker architecture. It's especially well-suited for AI/ML workloads where a large model is sharded and served across multiple devices on multiple nodes.
VERSION=v0.6.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
Wait for the LeaderWorkerSet controller to be fully available:
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m
The output should be similar to the following:
deployment.apps/lws-controller-manager condition met
Validate that the LeaderWorkerSet controller is running in the lws-system
namespace:
kubectl get pod -n lws-system
The output should be similar to the following:
NAME READY STATUS RESTARTS AGE
lws-controller-manager-abcd 1/1 Running 0 40s
lws-controller-manager-efgh 1/1 Running 0 40s
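Optionally, with the controller installed, you can inspect the LeaderWorkerSet API schema to see the leader and worker Pod templates that the workload manifest in the next section populates. The exact fields shown can vary by LWS release:

kubectl explain leaderworkersets --api-version=leaderworkerset.x-k8s.io/v1
kubectl explain leaderworkersets.spec.leaderWorkerTemplate --api-version=leaderworkerset.x-k8s.io/v1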
Deploy the workload manifest
- Save the following manifest as jetstream-pathways-llama-3-1-405b-4x4.yaml:

- Set the value of the load_parameters_path field to the checkpoint path that was produced in the checkpoint conversion process.

  - For a bf16 checkpoint, the path should be similar to gs://OUTPUT_BUCKET_DIRECTORY/bf16/unscanned/checkpoints/0/items.
  - For an int8 checkpoint, it should be similar to gs://OUTPUT_BUCKET_DIRECTORY/int8.

- Set the value of the gcs_scratch_location field to the Pathways bucket that you created earlier.

  perl -pi -e 's|CHECKPOINT_PATH|gs://OUTPUT_BUCKET_DIRECTORY/int8|g' jetstream-pathways-llama-3-1-405b-4x4.yaml
  perl -pi -e 's|PATHWAYS_BUCKET|gs://PATHWAYS_BUCKET|g' jetstream-pathways-llama-3-1-405b-4x4.yaml
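After running the substitutions, you can spot-check that the manifest now references your converted checkpoint and your Pathways bucket:

grep -E 'load_parameters_path|gcs_scratch_location' jetstream-pathways-llama-3-1-405b-4x4.yaml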
Apply the Deployment manifest
Apply the manifest to deploy the server:
kubectl apply -f jetstream-pathways-llama-3-1-405b-4x4.yaml
The model server should start.
Verify model server start
A 405B model might take about 10 to 20 minutes to restore the checkpoint. You might also need to wait for additional time during model warmup if you enabled the enable_model_warmup flag. To monitor progress, view the model server logs:
kubectl logs -f jetstream-pathways-0 -c jax-tpu
The output is similar to the following:
2025-03-02 02:15:07,682 - JetstreamLogger - INFO - Initializing the driver with 1 prefill engines and 1 generate engines in interleaved mode
2025-03-02 02:15:07,683 - JetstreamLogger - INFO - Spinning up prefill thread 0.
2025-03-02 02:15:07,683 - JetstreamLogger - INFO - Spinning up transfer thread 0.
2025-03-02 02:15:07,684 - JetstreamLogger - INFO - Spinning up generate thread 0.
2025-03-02 02:15:07,684 - JetstreamLogger - INFO - Spinning up detokenize thread 0.
2025-03-02 02:15:07,685 - JetstreamLogger - INFO - Driver initialized.
...
...
...
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9999 (Press CTRL+C to quit)
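You can also confirm that the leader and Pathways worker Pods are all in the Running state. Pod names vary with the LeaderWorkerSet name and replica indices:

kubectl get pods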
Serve Llama 3.1-405b
To serve the Llama 3.1-405b model, set up port forwarding:
kubectl port-forward svc/jetstream-svc 8000:8000
Port forwarding lets you access the Service from outside the cluster. The JetStream-Pathways Deployment is exposed through a GKE ClusterIP Service, which is reachable only from within the cluster.
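If the port-forward command fails, check that the Service exists and exposes port 8000:

kubectl get svc jetstream-svc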
Interact with the model
In a new terminal, run the following command:
curl --request POST \
--header "Content-type: application/json" \
-s \
localhost:8000/generate \
--data \
'{
"prompt": "What are the top 5 programming languages",
"max_tokens": 200
}'
The initial request can take several seconds to complete due to model warmup. The output should be similar to the following:
{
"response": " for web development?\nThe top 5 programming languages for web development are:\n1. **JavaScript**: JavaScript is the most popular language for web development, used by over 90% of websites for client-side scripting. It's also popular for server-side programming with technologies like Node.js.\n2. **HTML/CSS**: HTML (Hypertext Markup Language) and CSS (Cascading Style Sheets) are not programming languages, but are essential for building websites. HTML is used for structuring content, while CSS is used for styling and layout.\n3. **Python**: Python is a popular language for web development, especially with frameworks like Django and Flask. It's known for its simplicity, flexibility, and large community of developers.\n4. **Java**: Java is a popular language for building enterprise-level web applications, especially with frameworks like Spring and Hibernate. It's known for its platform independence, strong security features, and large community of developers.\n5. **PHP**: PHP is a mature language for web"
}
You've successfully done the following:
- Deployed the JetStream model server with MaxText and Pathways on GKE by using TPUs.
- Created a Llama 3.1-405B int8 checkpoint at gs://BUCKET_NAME.
- Served and interacted with the model.
Disaggregated serving
Disaggregated serving is a technique for serving LLMs that splits the prefill and decode stages across different hosts. This approach optimizes resource utilization, which can lead to improved throughput and latency.

- Prefill: a forward pass over the input prompt that initializes the key-value (KV) cache.
- Decode: a procedure that generates output tokens incrementally, one token per step, appending one KV cache value per iteration.
Set the default environment variables:
export NODE_POOL_NAME=dis-v6e-8
export NODE_POOL_SIZE=2
export MACHINE_TYPE=ct6e-standard-4t
export TOPOLOGY=2x4
export WORKERS_PER_SLICE=2
Create two node pools that use v6e-8 nodes:

for i in $(seq 1 NODE_POOL_SIZE); do
  gcloud container node-pools create NODE_POOL_NAME-${i}-np \
    --project=PROJECT_ID \
    --zone=LOCATION \
    --cluster=CLUSTER_NAME \
    --machine-type=MACHINE_TYPE \
    --num-nodes=WORKERS_PER_SLICE \
    --tpu-topology=TOPOLOGY \
    --scopes=cloud-platform \
    --workload-metadata=GCE_METADATA
done
Checkpoint conversion
To convert a Meta Llama 2-70B checkpoint to a MaxText-compatible int8 inference checkpoint, complete the steps in Checkpoint conversion with Llama2-70B. Select Llama2-70B as your model when you acknowledge the Meta Terms and Conditions. Your deployment uses the checkpoint with the load_parameters_path flag.
Replace the following parameters in your checkpoint-job.yaml file:
- --meta_url=META_URL
- --model_name=llama-2
- --model_path=Llama-2-70b-chat
- --output_directory=gs://BUCKET_NAME/maxtext/llama-2-70b
Deploy JetStream Pathways with disaggregated serving
- Save the following manifest as jetstream-pathways-disagg-llama-2-70b-2-2x4.yaml:

- Set the value of the load_parameters_path field to the checkpoint path that was produced in the checkpoint conversion process.

  - For a bf16 checkpoint, the path should be similar to gs://OUTPUT_BUCKET_DIRECTORY/bf16/unscanned/checkpoints/0/items.
  - For an int8 checkpoint, it should be similar to gs://OUTPUT_BUCKET_DIRECTORY/int8.

- Set the value of the gcs_scratch_location field to the Pathways bucket that you created earlier.

  perl -pi -e 's|CHECKPOINT_PATH|gs://BUCKET_NAME/maxtext/llama-2-70b/int8|g' jetstream-pathways-disagg-llama-2-70b-2-2x4.yaml
  perl -pi -e 's|PATHWAYS_BUCKET|gs://PATHWAYS_BUCKET|g' jetstream-pathways-disagg-llama-2-70b-2-2x4.yaml
Apply the manifest:
kubectl apply -f jetstream-pathways-disagg-llama-2-70b-2-2x4.yaml
The model server might take some time to restore the checkpoint, depending on the checkpoint size. A 70B model might take about 8 minutes to restore the checkpoint, including model warmup. To determine readiness, observe the logs to verify that the model server has started, then serve the model by setting up port forwarding so you can interact with it.
You've successfully done the following:
- Deployed the JetStream model server with MaxText and Pathways on GKE by using TPUs and disaggregated serving.
- Created a Llama 2-70B int8 checkpoint at gs://BUCKET_NAME.
- Served and interacted with the model.
Troubleshoot issues
- If you get the Empty reply from server message, it's possible the container hasn't finished downloading the model data. Check the Pod's logs again for the Connected message, which indicates that the model is ready to serve.
- If you see the Connection refused message, verify that your port forwarding is active.
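The following commands can help you diagnose these issues; the Pod, container, and Service names match the resources created earlier in this guide:

# Check overall Pod status and recent events.
kubectl get pods
kubectl describe pod jetstream-pathways-0

# Re-check the model server logs for the Connected message.
kubectl logs -f jetstream-pathways-0 -c jax-tpu

# Confirm the Service is present before re-running kubectl port-forward.
kubectl get svc jetstream-svc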
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands and follow the prompts:
gcloud container clusters delete CLUSTER_NAME --region=REGION
gcloud iam service-accounts delete jetstream-pathways@PROJECT_ID.iam.gserviceaccount.com
gcloud storage rm --recursive gs://BUCKET_NAME
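If you also created a separate bucket for Pathways temporary files, delete it as well:

gcloud storage rm --recursive gs://PATHWAYS_BUCKET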
What's next
- Discover how you can run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.
- Explore the JetStream GitHub repository.
- Explore the Vertex AI Model Garden.