vLLM inference on v6e TPUs
This tutorial shows you how to run vLLM inference on v6e TPUs. It also shows you how to run the benchmark script for the Meta Llama-3.1 8B model.
To get started with vLLM on v6e TPUs, see the vLLM quickstart.
If you are using GKE, also see the GKE tutorial.
Before you begin
You must sign the consent agreement to use the Llama 3 family of models in the Hugging Face repo. Go to https://huggingface.co/meta-llama/Llama-3.1-8B, fill out the consent agreement, and wait until your access is approved.
Prepare to provision a TPU v6e with 4 chips:
Follow the Set up the Cloud TPU environment guide to ensure you have appropriate access to use Cloud TPUs.
Create a service identity for the TPU VM.
gcloud alpha compute tpus tpu-vm service-identity create --zone=ZONE
Create a TPU service account and grant access to Google Cloud services.
Service accounts allow the Google Cloud TPU service to access other Google Cloud services. A user-managed service account is recommended. You can create a service account from the Google Cloud console or with the gcloud command-line tool.

Create a service account using the gcloud command-line tool:

gcloud iam service-accounts create your-service-account-name \
  --description="your-sa-description" \
  --display-name="your-sa-display-name"
export SERVICE_ACCOUNT_NAME=your-service-account-name
Create a service account from the Google Cloud console:
- Go to the Service Accounts page in the Google Cloud console.
- Click Create service account.
- Enter the service account name.
- (Optional) Enter a description for the service account.
- Click Create and continue.
- Choose the roles you want to grant to the service account.
- Click Continue.
- (Optional) Specify users or groups that can manage the service account.
- Click Done to finish creating the service account.
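Whichever method you used, you can confirm the service account exists by describing it. This is a quick optional check, not part of the official steps; it assumes SERVICE_ACCOUNT_NAME and PROJECT_ID are set in your shell:

gcloud iam service-accounts describe ${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com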
After creating your service account, follow these steps to grant service account roles.
The following roles are necessary:
- TPU Admin: Needed to create a TPU
- Storage Admin: Needed for accessing Cloud Storage
- Logs Writer: Needed for writing logs
- Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring
Your administrator must grant you the roles/resourcemanager.projectIamAdmin role in order for you to assign IAM roles to users. A user with the Project IAM Admin (roles/resourcemanager.projectIamAdmin) role can also grant this role.

Use the following gcloud commands to add service account roles:

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/tpu.admin

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/storage.admin

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/logging.logWriter

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/monitoring.metricWriter
You can also assign roles using the Google Cloud console:
- Select your service account and click Add Principal.
- In the New Principals field, enter the email address of your service account.
- In the Select a role drop-down, search for the role (for example, Storage Admin) and select it.
- Click Save.
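Whichever method you use, you can verify that the roles were granted by filtering the project's IAM policy for your service account. This is an optional verification sketch using standard gcloud flags:

gcloud projects get-iam-policy ${PROJECT_ID} \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

The output should include roles/tpu.admin, roles/storage.admin, roles/logging.logWriter, and roles/monitoring.metricWriter.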
Authenticate with Google Cloud and configure the default project and zone for the Google Cloud CLI.
gcloud auth login
gcloud config set project PROJECT_ID
gcloud config set compute/zone ZONE
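Before continuing, you can confirm the active account and configuration:

gcloud auth list
gcloud config list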
Secure capacity
When you are ready to secure TPU capacity, review the quotas page to learn about the Cloud Quotas system. If you have additional questions about securing capacity, contact your Cloud TPU sales or account team.
Provision the Cloud TPU environment
You can provision TPU VMs with GKE, with GKE and XPK, or as queued resources.
Prerequisites
- This tutorial has been tested with Python 3.10 or later.
- Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
- Verify that your project has enough TPU quota for:
- TPU VM quota
- IP Address quota
- Hyperdisk balanced quota
- User project permissions
- If you are using GKE with XPK, see Cloud Console Permissions on the user or service account for the permissions needed to run XPK.
Provision a TPU v6e
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e \
  --service-account SERVICE_ACCOUNT
| Variable | Description |
| --- | --- |
| TPU_NAME | The user-assigned ID of the TPU that is created when the queued resource request is allocated. |
| PROJECT_ID | Your Google Cloud project name. Use an existing project or create a new one. |
| ZONE | See the TPU regions and zones document for the supported zones. |
| ACCELERATOR_TYPE | See the Accelerator Types documentation for the supported accelerator types. |
| RUNTIME_VERSION | v2-alpha-tpuv6e |
| SERVICE_ACCOUNT | The email address for your service account, which you can find in the Google Cloud console under IAM > Service Accounts. For example: tpu-service-account@your-project-ID.iam.gserviceaccount.com |
Use the list or describe commands to query the status of your queued resource:

gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}
For a complete list of queued resource request statuses, see the Queued Resources documentation.
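If you prefer to wait in a loop rather than re-running describe manually, a minimal polling sketch follows. It assumes the request state is exposed at the nested state.state field of the describe output, so verify that field path against your gcloud version:

# Poll until the queued resource reaches ACTIVE (field path is an assumption).
while true; do
  STATE=$(gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID} \
    --project ${PROJECT_ID} --zone ${ZONE} --format="value(state.state)")
  echo "Queued resource state: ${STATE:-unknown}"
  if [[ "${STATE}" == "ACTIVE" ]]; then break; fi
  sleep 30
done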
Connect to the TPU using SSH
gcloud compute tpus tpu-vm ssh TPU_NAME
Install dependencies
Create a directory for Miniconda:
mkdir -p ~/miniconda3
Download the Miniconda installer script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
Install Miniconda:
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
Remove the Miniconda installer script:
rm -rf ~/miniconda3/miniconda.sh
Add Miniconda to your PATH variable:

export PATH="$HOME/miniconda3/bin:$PATH"

Reload ~/.bashrc to apply the changes to the PATH variable:

source ~/.bashrc
Create a Conda environment:
conda create -n vllm python=3.11 -y
conda activate vllm
Clone the vLLM repository and navigate to the vLLM directory:
git clone https://github.com/vllm-project/vllm.git && cd vllm
Clean up the existing torch and torch-xla packages:
pip uninstall torch torch-xla -y
Install the TPU requirements, build and install vLLM, and install the required system libraries:

pip install -r requirements-tpu.txt
VLLM_TARGET_DEVICE="tpu" python setup.py develop
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
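As a quick sanity check that the build succeeded (optional, not part of the official steps), import vLLM and print its version:

python -c "import vllm; print(vllm.__version__)"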
Get access to the model
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role with at least Read permissions.
- Select Generate a token.
Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:

export TOKEN=YOUR_TOKEN
git config --global credential.helper store
huggingface-cli login --token $TOKEN
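You can verify that authentication succeeded with the whoami command, which prints the account associated with your token:

huggingface-cli whoami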
Download benchmarking data
Create a ~/data directory and download the ShareGPT dataset from Hugging Face:

mkdir ~/data && cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
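As an optional sanity check, you can confirm the file parses as JSON and count its records from the ~/data directory (the exact count doesn't matter for the benchmark):

python3 -c "import json; print(len(json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json'))), 'records')"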
Launch the vLLM server
The following command downloads the model weights from the Hugging Face Model Hub to the TPU VM's /tmp directory, pre-compiles a range of input shapes, and writes the compilation cache to ~/.cache/vllm/xla_cache. For more details, refer to the vLLM docs.
cd ~/vllm
vllm serve "meta-llama/Meta-Llama-3.1-8B" \
  --download_dir /tmp \
  --num-scheduler-steps 4 \
  --swap-space 16 \
  --disable-log-requests \
  --tensor_parallel_size=4 \
  --max-model-len=2048 &> serve.log &
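Model download and compilation can take several minutes, so watch serve.log until the server reports it is running. Once it is up, you can send a test request to vLLM's OpenAI-compatible API, which listens on port 8000 by default (adjust the port if you changed it):

tail -f serve.log

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "prompt": "San Francisco is a", "max_tokens": 16}'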
Run vLLM benchmarks
Run the vLLM benchmarking script:
python benchmarks/benchmark_serving.py \
--backend vllm \
--model "meta-llama/Meta-Llama-3.1-8B" \
--dataset-name sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000
Clean up
Delete the TPU:
gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE \
  --force \
  --async