This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.
This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.
This tutorial is intended for Generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving LLMs.
As a reminder, you have multiple options for LLM inference on Google Cloud spanning offerings like Vertex AI, GKE, and Google Compute Engine where you can incorporate serving libraries like JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.
Optimum TPU supports the following features:
- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers.
Objectives
- Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy Optimum TPU on GKE.
- Use Optimum TPU to serve the supported models through curl.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required API.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for Cloud TPU in GKE.
Prepare the environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region contains zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
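Before you continue, it can help to confirm that the variables are set and that the zone belongs to the region you chose. The following is a minimal sketch, not part of the official tutorial: check_env is a hypothetical helper name, and the check assumes that a zone name is its region name followed by a suffix such as -a.

```shell
# Hypothetical helper: verify that the tutorial's variables are set and consistent.
check_env() {
  for name in PROJECT_ID CLUSTER_NAME REGION ZONE; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
  # Assumption: a zone name is its region plus a suffix such as "-a".
  case "$ZONE" in
    "$REGION"-*) return 0 ;;
    *) echo "ZONE $ZONE is not in REGION $REGION" >&2; return 1 ;;
  esac
}
```

After exporting the variables, run check_env && echo "environment looks consistent" to see whether anything is missing.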
Clone the Optimum TPU repository:
git clone https://github.com/huggingface/optimum-tpu.git
Get access to the model
You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.
Gemma 2B
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page.
- Verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Click New Token.
- Specify a Name of your choice and a Role of at least Read.
- Click Generate a token.
- Copy the generated token to your clipboard.
Llama3 8B
To use Llama3 8B, you must sign the consent agreement in the model's Hugging Face repository.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
Create a GKE cluster
Create a GKE Standard cluster with 1 CPU node:
gcloud container clusters create CLUSTER_NAME \
--project=PROJECT_ID \
--num-nodes=1 \
--location=ZONE
Create TPU node pool
Create a v5e TPU node pool with 1 node and 8 chips:
gcloud container node-pools create tpunodepool \
--location=ZONE \
--num-nodes=1 \
--machine-type=ct5lp-hightpu-8t \
--cluster=CLUSTER_NAME
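The node pool above provides 8 TPU chips, which matches the 2x4 topology and the google.com/tpu: 8 limit used in the manifests later in this tutorial. As a rough sketch of that arithmetic, the chip count of a topology string is the product of its dimensions. The helper name tpu_chips is hypothetical and is not part of gcloud or kubectl.

```shell
# Hypothetical helper: multiply the dimensions of a topology string such as "2x4".
tpu_chips() {
  chips=1
  old_ifs=$IFS
  IFS=x                      # split the argument on the letter "x"
  for dim in $1; do
    chips=$((chips * dim))
  done
  IFS=$old_ifs               # restore the original field separator
  echo "$chips"
}

tpu_chips 2x4
```

If you change the topology in the manifest, the google.com/tpu limit and the machine type would need to change to match the new chip count.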
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}
Build the container
Run the make command to build the image:
cd optimum-tpu && make tpu-tgi
Push the image to Artifact Registry:
gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
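The image path in the commands above follows the pattern REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest, and the same path appears in the Deployment manifests. One way to avoid typing it twice is to build it from the environment variables. This is a sketch only; image_uri is a hypothetical helper, and the repository and image names are the ones used in this tutorial.

```shell
# Hypothetical helper: build the Artifact Registry image path used in this tutorial.
# $1 = region, $2 = project ID, $3 = tag (defaults to "latest")
image_uri() {
  printf '%s-docker.pkg.dev/%s/optimum-tpu/tgi-tpu:%s\n' "$1" "$2" "${3:-latest}"
}

image_uri us-west4 my-project
```

With the variables from the environment setup, image_uri "$REGION" "$PROJECT_ID" gives the value to use for the image field in the manifest.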
Create a Kubernetes Secret for Hugging Face credentials
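Create a Kubernetes Secret that contains the Hugging Face token. The command reads the token from the HF_TOKEN environment variable, so set that variable to the token you generated earlier before you run it: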
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
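Kubernetes stores the values in a Secret's data field as base64-encoded strings, so reading the Secret back returns an encoded value rather than the token itself. The sketch below illustrates the round trip with a made-up token; decode_secret_value is a hypothetical helper name, and the token shown is not a real credential.

```shell
# Hypothetical helper: decode a base64-encoded Secret value read from standard input.
decode_secret_value() {
  base64 --decode
}

# Illustration with a made-up token (not a real credential).
example_token="hf_example_token"
encoded=$(printf '%s' "$example_token" | base64)
decoded=$(printf '%s' "$encoded" | decode_secret_value)
echo "$decoded"
```

Against the cluster, the same helper could be used as kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | decode_secret_value to confirm that the stored token is the one you intended.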
Deploy Optimum TPU
Deploy Optimum TPU:
Gemma 2B
Save the following manifest as optimum-tpu-gemma-2b-2x4.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
          privileged: true
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
  - name: http
    protocol: TCP
    port: 8080
    targetPort: 80
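This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080. Because the Service does not set a type, Kubernetes creates it as a ClusterIP Service.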
Apply the manifest:
kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml
Llama3 8B
Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
  - name: http
    protocol: TCP
    port: 8080
    targetPort: 80
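This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080. Because the Service does not set a type, Kubernetes creates it as a ClusterIP Service.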
Apply the manifest:
kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml
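Both manifests start the server with --max-input-length=32 and --max-total-tokens=64. The sketch below shows how those two limits might bound a request, assuming the server applies the usual Text Generation Inference rule that prompt tokens plus max_new_tokens cannot exceed --max-total-tokens. That rule is an assumption here, since this tutorial does not state it, and fits_token_budget is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when a request fits the configured token limits.
# $1 = prompt tokens, $2 = max_new_tokens,
# $3 = --max-input-length, $4 = --max-total-tokens
fits_token_budget() {
  # Assumed rule: the prompt fits the input limit, and prompt plus new tokens fits the total.
  [ "$1" -le "$3" ] && [ $(( $1 + $2 )) -le "$4" ]
}

if fits_token_budget 6 40 32 64; then
  echo "request fits"
fi
```

Under that assumption, a short prompt with max_new_tokens of 40 fits the limits in the manifests, while a 32-token prompt with the same setting would not.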
View the logs from the running Deployment:
kubectl logs -f -l app=tgi-tpu
The output should be similar to the following:
2024-07-09T22:39:34.365472Z WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0
Make sure the model is fully downloaded before proceeding to the next section.
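One way to check this without reading the log by eye is to search the log text for the router's Connected line shown in the sample output above. This is a sketch that assumes the Connected line is a reasonable signal that the server is ready; router_connected is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when the log text on standard input
# contains the router's "Connected" line.
router_connected() {
  grep -q 'text_generation_router.*Connected'
}
```

For example, kubectl logs -l app=tgi-tpu --tail=-1 | router_connected && echo "server is ready" prints a message once the line has appeared. The --tail=-1 flag asks kubectl for the full log, because a label selector otherwise limits the output to the most recent lines.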
Serve the model
Set up port forwarding to the model:
kubectl port-forward svc/service 8080:8080
Interact with the model server using curl
Verify your deployed models:
In a new terminal session, use curl to chat with the model:
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' -H 'Content-Type: application/json'
The output should be similar to the following:
{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
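To send different prompts without editing the JSON by hand, the request body can be assembled from a prompt and a token count. The following is a minimal sketch: build_payload is a hypothetical helper, and it does no JSON escaping, so it assumes the prompt contains no double quotes, backslashes, or newlines.

```shell
# Hypothetical helper: build the JSON body for the /generate endpoint.
# $1 = prompt (must not contain quotes, backslashes, or newlines)
# $2 = max_new_tokens
build_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}\n' "$1" "$2"
}

build_payload "What is Deep Learning?" 40
```

The result can be passed to the same curl command used above, for example curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' -d "$(build_payload 'What is Deep Learning?' 40)".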
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:
gcloud container clusters delete CLUSTER_NAME \
--location=ZONE
What's next
- Explore the Optimum TPU documentation.
- Discover how you can run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.