JetStream PyTorch inference on v6e TPU VMs

This tutorial shows how to use JetStream to serve PyTorch models on TPU v6e. JetStream is a throughput- and memory-optimized engine for large language model (LLM) inference on XLA devices (TPUs). In this tutorial, you run an inference benchmark for the Llama2-7B model.

Before you begin

Prepare to provision a TPU v6e with 4 chips:

  1. Follow the Set up the Cloud TPU environment guide to ensure you have appropriate access to use Cloud TPUs.

  2. Create a service identity for the TPU VM.

    gcloud alpha compute tpus tpu-vm service-identity create --zone=ZONE
  3. Create a TPU service account and grant access to Google Cloud services.

    Service accounts allow the Cloud TPU service to access other Google Cloud services. A user-managed service account is recommended. You can create a service account in the Google Cloud console or with the gcloud CLI.

    Create a service account using the gcloud command-line tool:

    gcloud iam service-accounts create your-service-account-name \
    --description="your-sa-description" \
    --display-name="your-sa-display-name"
    export SERVICE_ACCOUNT_NAME=your-service-account-name

    Create a service account from the Google Cloud console:

    1. Go to the Service Accounts page in the Google Cloud console.
    2. Click Create service account.
    3. Enter the service account name.
    4. (Optional) Enter a description for the service account.
    5. Click Create and continue.
    6. Choose the roles you want to grant to the service account.
    7. Click Continue.
    8. (Optional) Specify users or groups that can manage the service account.
    9. Click Done to finish creating the service account.

    After creating your service account, follow these steps to grant service account roles.

    The following roles are necessary:

    • TPU Admin: Needed to create a TPU
    • Storage Admin: Needed for accessing Cloud Storage
    • Logs Writer: Needed for writing logs to Cloud Logging
    • Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring

    Your administrator must grant you the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) before you can assign IAM roles to users. A user who has this role can also grant it to others.

    Use the following gcloud commands to add service account roles:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
       --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
       --role roles/tpu.admin
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
       --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
       --role roles/storage.admin
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
       --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
       --role roles/logging.logWriter
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
       --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
       --role roles/monitoring.metricWriter

    You can also assign roles in the Google Cloud console:

    1. Select your service account and click Add Principal.
    2. In the New principals field, enter the email address of your service account.
    3. In the Select a role drop-down, search for a role (for example, Storage Admin) and select it.
    4. Click Save.
  4. Authenticate with Google Cloud and configure the default project and zone for Google Cloud CLI.

    gcloud auth login
    gcloud config set project PROJECT_ID
    gcloud config set compute/zone ZONE

Secure capacity

When you are ready to secure TPU capacity, review the quotas page to learn about the Cloud Quotas system. If you have additional questions about securing capacity, contact your Cloud TPU sales or account team.

Provision the Cloud TPU environment

You can provision TPU VMs with GKE, with GKE and XPK, or as queued resources.

Prerequisites

  • This tutorial has been tested with Python 3.10 or later.
  • Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
  • Verify that your project has enough TPU quota for:
    • TPU VM quota
    • IP Address quota
    • Hyperdisk Balanced quota
  • User project permissions

Create environment variables

In a Cloud Shell, create the following environment variables:

export NODE_ID=TPU_NODE_ID # TPU name
export PROJECT_ID=PROJECT_ID
export ACCELERATOR_TYPE=v6e-4
export ZONE=us-east5-b # a zone where TPU v6e is available
export RUNTIME_VERSION=v2-alpha-tpuv6e
export SERVICE_ACCOUNT=YOUR_SERVICE_ACCOUNT
export QUEUED_RESOURCE_ID=QUEUED_RESOURCE_ID
export VALID_DURATION=VALID_DURATION

# Additional environment variable needed for Multislice:
export NUM_SLICES=NUM_SLICES

# Use a custom network for better performance and to avoid the default
# network becoming overloaded.
export NETWORK_NAME=${PROJECT_ID}-mtu9k
export NETWORK_FW_NAME=${NETWORK_NAME}-fw

Command flag descriptions

Variable Description
NODE_ID The user-assigned ID of the TPU that is created when the queued resource request is allocated.
PROJECT_ID Google Cloud project name. Use an existing project or create a new one.
ZONE See the TPU regions and zones document for the supported zones.
ACCELERATOR_TYPE See the Accelerator types documentation for the supported accelerator types.
RUNTIME_VERSION The Cloud TPU software version. Use v2-alpha-tpuv6e for TPU v6e.
SERVICE_ACCOUNT The email address for your service account, which you can find in Google Cloud console -> IAM -> Service Accounts.
For example: tpu-service-account@<your_project_ID>.iam.gserviceaccount.com
NUM_SLICES The number of slices to create (needed for Multislice only)
QUEUED_RESOURCE_ID The user-assigned text ID of the queued resource request.
VALID_DURATION The duration for which the queued resource request is valid.
NETWORK_NAME The name of a secondary network to use.
NETWORK_FW_NAME The name of a secondary network firewall to use.
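The custom network and firewall referenced by NETWORK_NAME and NETWORK_FW_NAME must exist before you provision the TPU. A minimal sketch of creating them, assuming an auto-mode subnet and a jumbo-frame MTU of 8896 (these values are illustrative; adjust them for your environment):

```shell
# Create a custom VPC network with a larger MTU (8896 is an assumed
# jumbo-frame value; the default MTU is 1460).
gcloud compute networks create ${NETWORK_NAME} \
    --mtu=8896 \
    --subnet-mode=auto

# Allow TCP, ICMP, and UDP traffic on the new network so that the
# TPU VMs can communicate with each other.
gcloud compute firewall-rules create ${NETWORK_FW_NAME} \
    --network=${NETWORK_NAME} \
    --allow=tcp,icmp,udp
```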

Provision a TPU v6e

    gcloud alpha compute tpus queued-resources create ${QUEUED_RESOURCE_ID} \
        --node-id ${NODE_ID} \
        --project ${PROJECT_ID} \
        --zone ${ZONE} \
        --accelerator-type ${ACCELERATOR_TYPE} \
        --runtime-version ${RUNTIME_VERSION} \
        --service-account ${SERVICE_ACCOUNT}

Use the list or describe commands to query the status of your queued resource.

   gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID}  \
      --project ${PROJECT_ID} --zone ${ZONE}
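To see all queued resource requests in a project and zone, use the list command:

```shell
# List every queued resource request and its current status.
gcloud alpha compute tpus queued-resources list \
    --project ${PROJECT_ID} --zone ${ZONE}
```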

For a complete list of queued resource request statuses, see the Queued Resources documentation.

Connect to the TPU using SSH

  gcloud compute tpus tpu-vm ssh ${NODE_ID} --project ${PROJECT_ID} --zone ${ZONE}

Run the JetStream PyTorch Llama2-7B benchmark

To set up JetStream-PyTorch, convert the model checkpoints, and run the inference benchmark, follow the instructions in the JetStream-PyTorch GitHub repository.
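The authoritative commands live in the repository's README; as a rough sketch, the setup typically starts like the following (the repository URL and script name are assumptions based on the repository at the time of writing and may have changed):

```shell
# Clone JetStream-PyTorch on the TPU VM (assumed repository location).
git clone https://github.com/google/jetstream-pytorch.git
cd jetstream-pytorch

# Install JetStream, PyTorch/XLA, and other dependencies
# (assumed install script name; check the README for the current steps).
source install_everything.sh
```

From there, the repository's instructions cover downloading the Llama2-7B weights, converting the checkpoints, starting the server, and running the benchmark script against it.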

When the inference benchmark is complete, be sure to clean up the TPU resources.

Clean up

Delete the TPU and the queued resource request. The --force flag deletes the queued resource request and any TPU VMs it created:

   gcloud compute tpus queued-resources delete ${QUEUED_RESOURCE_ID} \
      --project ${PROJECT_ID} \
      --zone ${ZONE} \
      --force \
      --async