Serve open source models using TPUs on GKE with Optimum TPU


This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.

This tutorial is intended for Generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving LLMs.

As a reminder, you have multiple options for LLM inference on Google Cloud, spanning offerings like Vertex AI, GKE, and Google Compute Engine, where you can incorporate serving libraries like JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

Continuous batching
Token streaming
Greedy search and multinomial sampling using transformers.

Objectives

Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy Optimum TPU on GKE.
Use Optimum TPU to serve the supported models through curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/artifactregistry.admin

Check for the roles

1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Create a Hugging Face account, if you don't already have one.
Ensure your project has sufficient quota for Cloud TPU in GKE.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
export HF_TOKEN=HF_TOKEN
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region contains zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
- HF_TOKEN: your Hugging Face access token.
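
Because the later commands depend on these values, it can help to confirm that each variable is set before you continue. The following is a minimal sketch, not part of the official tutorial: `check_required_vars` is a hypothetical helper name, and it only checks that the variables are non-empty, not that their values are valid.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
check_required_vars() {
  missing=""
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing:$missing"
    return 1
  fi
  echo "All required variables are set"
}
```

For example, running `check_required_vars PROJECT_ID CLUSTER_NAME REGION ZONE HF_TOKEN` after the export commands lists any variable that still needs a value.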

Clone the Optimum TPU repository:

git clone https://github.com/huggingface/optimum-tpu.git

Get access to the model

You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

Access the model consent page.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

Click Your Profile > Settings > Access Tokens.
Click New Token.
Specify a Name of your choice and a Role of at least Read.
Click Generate a token.
Copy the generated token to your clipboard.

Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repository.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a Name of your choice and a Role of at least Read.
Select Generate a token.
Copy the generated token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with 1 CPU node:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=ZONE

Create TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=ZONE \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --cluster=CLUSTER_NAME

If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To help ensure TPU availability, you can use TPU reservations.
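
The 8 chips in this node pool correspond to the 2x4 topology that the Deployment manifests later select, and to the `google.com/tpu: 8` limit they request. As a rough sketch of that arithmetic, the chip count is the product of the topology dimensions. The helper name `chips_in_topology` is hypothetical and is only meant to illustrate the relationship.

```shell
# Hypothetical helper: multiply the dimensions of a topology string such as "2x4".
chips_in_topology() {
  total=1
  # Split the string on "x" and multiply each dimension into the total.
  for dim in $(echo "$1" | tr 'x' ' '); do
    total=$((total * dim))
  done
  echo "$total"
}
```

Calling `chips_in_topology 2x4` prints 8, which matches both the machine type suffix in `ct5lp-hightpu-8t` and the TPU limit in the manifests.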

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}

Build the container

Run the make command to build the image:

cd optimum-tpu && make tpu-tgi

Push the image to the Artifact Registry

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
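
The same image path appears in the tag command, the push command, and later in the Deployment manifests, so a mismatch between them is an easy mistake to make. One possible way to keep them consistent is to build the path in one place. This is a sketch under the assumption that you keep the repository name `optimum-tpu` and image name `tgi-tpu` used above; `image_uri` is a hypothetical helper name.

```shell
# Hypothetical helper: build the Artifact Registry image path used in this tutorial.
# Arguments: the region and the project ID.
image_uri() {
  region=$1
  project=$2
  echo "${region}-docker.pkg.dev/${project}/optimum-tpu/tgi-tpu:latest"
}
```

For example, `docker push "$(image_uri "${REGION}" "${PROJECT_ID}")"` would push to the same path that the manifests reference, provided the environment variables from the setup step are still exported.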

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

Deploy Optimum TPU

To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        securityContext:
          privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

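This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080, forwarding traffic to container port 80. Because the Service does not set a type, Kubernetes creates it as a ClusterIP Service.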
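
The image field in the manifest still contains the REGION_NAME and PROJECT_ID placeholders, which need real values before the manifest is applied. As an alternative to editing the file by hand, the following sketch substitutes them from the environment. It assumes that REGION and PROJECT_ID are exported as described in "Prepare the environment"; `render_manifest` is a hypothetical helper name.

```shell
# Hypothetical helper: print a manifest with its placeholders filled in.
# Assumes REGION and PROJECT_ID are set in the current shell.
render_manifest() {
  sed -e "s|REGION_NAME|${REGION}|g" \
      -e "s|PROJECT_ID|${PROJECT_ID}|g" \
      "$1"
}
```

The rendered output can be inspected first, or piped straight to kubectl with `render_manifest optimum-tpu-gemma-2b-2x4.yaml | kubectl apply -f -`. The same helper works for the Llama3 8B manifest.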

Apply the manifest:

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

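This manifest describes an Optimum TPU Deployment and a Service that exposes it inside the cluster on TCP port 8080, forwarding traffic to container port 80. Because the Service does not set a type, Kubernetes creates it as a ClusterIP Service.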

Apply the manifest:

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output should be similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

Make sure the model is fully downloaded and warmed up before proceeding to the next section. In the sample output above, the Connected line appears after warmup completes.

Serve the model

Set up port forwarding to the model:

kubectl port-forward svc/service 8080:8080
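
If the server is still warming up, the first requests through the forwarded port can fail. One way to wait for it is to poll until a check succeeds. The following is a sketch rather than part of the tutorial: `wait_for` is a hypothetical helper, and the example after it assumes that the `/health` path used by the liveness probe in the manifests returns a successful response once the server is ready.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Arguments: maximum attempts, delay in seconds, then the command to run.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}
```

Because the port-forward command keeps running in the foreground, you would run something like `wait_for 30 10 curl -sf 127.0.0.1:8080/health` in a second terminal session. The attempt count and delay here are arbitrary values chosen for illustration.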

Interact with the model server using curl

Verify your deployed models:

In a new terminal session, use curl to chat with the model:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' \
    -H 'Content-Type: application/json'

The output should be similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To delete the cluster that you created in this guide, including its TPU node pool, run the following command:

gcloud container clusters delete CLUSTER_NAME \
  --location=ZONE

What's next

Explore the Optimum TPU documentation.
Discover how you can run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn more about TPUs in GKE.
