Overview
This tutorial shows you how to serve Llama 3.1 405B using Graphics Processing Units (GPUs) across multiple nodes on Google Kubernetes Engine (GKE), using the vLLM serving framework and the LeaderWorkerSet (LWS) API.
This document is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.
LeaderWorkerSet (LWS)
LWS is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. LWS enables treating multiple Pods as a group.
Multi-Host Serving with vLLM
When deploying exceptionally large language models that cannot fit into a single GPU node, use multiple GPU nodes to serve the model. vLLM supports both tensor parallelism and pipeline parallelism to run workloads across GPUs.
Tensor parallelism splits the matrix multiplications in the transformer layer across multiple GPUs. However, this strategy requires a fast network due to the communication needed between the GPUs, making it less suitable for running workloads across nodes.
Pipeline parallelism splits the model by layer, or vertically. This strategy does not require constant communication between GPUs, making it a better option when running models across nodes.
You can use both strategies in multi-node serving. For example, when using two nodes with 8 H100 GPUs each, you can use two-way pipeline parallelism to shard the model across the two nodes, and eight-way tensor parallelism to shard the model across the eight GPUs on each node.
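For example, that two-node, eight-GPU-per-node layout corresponds to the following vLLM server arguments. This is only a sketch of the relevant flags; in this tutorial, the actual arguments are set inside the deployment manifest you apply later.
# Sketch: eight-way tensor parallelism within a node,
# two-way pipeline parallelism across the two nodes.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --port 8080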
Objectives
- Prepare a GKE Standard cluster.
- Deploy vLLM across multiple nodes in your cluster.
- Use vLLM to serve the Llama 3.1 405B model through curl.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required API.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient quota for GPUs. To learn more, see About GPUs and Allocation quotas.
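One hedged way to check GPU quota from Cloud Shell is to inspect the quota entries for your region (REGION is a placeholder; the exact quota metric name for H100 GPUs can vary by project):
# Look for H100-related quota metrics and their limits in the region.
gcloud compute regions describe REGION | grep -i -B 1 -A 1 h100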
Get access to the model
Generate an access token
If you don't already have one, generate a new Hugging Face token:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
Prepare the environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- ZONE: a zone that supports H100 GPUs.
- HUGGING_FACE_TOKEN: the Hugging Face access token you generated earlier.
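If you're not sure which zones offer H100 GPUs, one way to check is to list the matching accelerator types (availability also depends on your quota):
gcloud compute accelerator-types list \
  --filter="name=nvidia-h100-80gb"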
Create a GKE cluster
Create a GKE Standard cluster with two CPU nodes:
gcloud container clusters create CLUSTER_NAME \
--project=PROJECT_ID \
--num-nodes=2 \
--location=ZONE \
--machine-type=e2-standard-16
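Optionally, confirm that the cluster reached the RUNNING state before continuing, for example:
gcloud container clusters describe CLUSTER_NAME \
  --location=ZONE \
  --format="value(status)"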
Create a GPU node pool
Create an A3 node pool with two nodes, each with eight H100 GPUs:
gcloud container node-pools create gpu-nodepool \
--location=ZONE \
--num-nodes=2 \
--machine-type=a3-highgpu-8g \
--accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
--placement-type=COMPACT \
--cluster=CLUSTER_NAME
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials CLUSTER_NAME --location=ZONE
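To confirm that the GPU nodes joined the cluster, you can filter nodes by the accelerator label that GKE typically applies to GPU node pools:
kubectl get nodes \
  -l cloud.google.com/gke-accelerator=nvidia-h100-80gb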
Create a Kubernetes Secret for Hugging Face credentials
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
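You can verify that the Secret exists and contains the hf_api_token key:
kubectl describe secret hf-secret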
Install LeaderWorkerSet
To install LWS, run the following command:
VERSION=v0.4.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
Validate that the LeaderWorkerSet controller is running in the lws-system namespace:
kubectl get pod -n lws-system
The output is similar to the following:
NAME READY STATUS RESTARTS AGE
lws-controller-manager-5c4ff67cbd-9jsfc 2/2 Running 0 6d23h
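You can also confirm that the LeaderWorkerSet CustomResourceDefinition was registered; the CRD name below is the one the LWS project installs:
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io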
Deploy vLLM Model Server
To deploy the vLLM model server, follow these steps:
Inspect the vllm-llama3-405b-A3.yaml manifest.
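The manifest isn't reproduced here. As a rough sketch only, a multi-node vLLM LeaderWorkerSet generally has the following shape; the image placeholder, labels, comments, and Service wiring below are illustrative assumptions, so rely on the actual vllm-llama3-405b-A3.yaml for the real values.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                      # one leader Pod plus one worker Pod per group
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
        - name: vllm-leader
          image: VLLM_IMAGE      # placeholder for the multi-node vLLM image
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          # The leader coordinates the distributed runtime and starts the
          # OpenAI-compatible server with --tensor-parallel-size 8 and
          # --pipeline-parallel-size 2.
          ports:
          - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: VLLM_IMAGE
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
          resources:
            limits:
              nvidia.com/gpu: "8"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  ports:
  - name: http
    port: 8080
    targetPort: 8080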
Apply the manifest by running the following command:
kubectl apply -f vllm-llama3-405b-A3.yaml
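Downloading and sharding a 405B model takes a while. You can watch the Pods until the leader (vllm-0) and its worker report Running:
kubectl get pods -w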
View the logs from the running model server:
kubectl logs vllm-0 -c vllm-leader
The output should look similar to the following:
INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Serve the model
Run the following command to set up port forwarding to the model:
kubectl port-forward svc/vllm-leader 8080:8080
Interact with the model using curl
In a new terminal, send a request to the server:
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
The output should be similar to the following:
{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
"object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:
gcloud container clusters delete CLUSTER_NAME \
--location=ZONE
What's next
- Learn more about GPUs in GKE.
- Explore the vLLM GitHub repository and documentation.
- Explore the LWS GitHub repository.