Train Llama2 with Megatron-LM on A3 Mega virtual machines


Overview

In this quickstart, you learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in this GitHub repository: megatron-gke.

Before you begin

Take the following steps to set up your Google Cloud project and enable the Google Kubernetes Engine (GKE) API:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the GKE API.

  5. Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.
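
That guide covers the full setup. As a rough sketch, cluster and node pool creation look like the following; the network flags and names here (NETWORK_1, SUBNET_1, the node pool name) are illustrative assumptions, and in practice you repeat --additional-node-network once per GPU NIC:

    # Create a cluster with multi-networking (requires Dataplane V2).
    gcloud container clusters create CLUSTER_NAME \
        --zone=ZONE \
        --enable-dataplane-v2 \
        --enable-ip-alias \
        --enable-multi-networking

    # Create an A3 Mega node pool with eight H100 Mega GPUs per node.
    gcloud container node-pools create a3mega-pool \
        --cluster=CLUSTER_NAME \
        --zone=ZONE \
        --machine-type=a3-megagpu-8g \
        --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=latest \
        --additional-node-network=network=NETWORK_1,subnetwork=SUBNET_1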

Set up your environment

  1. Create environment variables for some common parameters:

    export CLUSTER_NAME=CLUSTER_NAME
    export REGION=REGION
    export ZONE=ZONE
    export PROJECT_ID=PROJECT_ID
    

    Replace the following:

    • CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
    • REGION: the region where you created your cluster.
    • ZONE: the zone where you created your cluster.
    • PROJECT_ID: your Google Cloud project ID.
  2. Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:

    gcloud auth login
    

    For more information, see Authenticate for using the Google Cloud CLI.

  3. Install kubectl and the GKE gcloud CLI plugin:

    sudo apt-get install kubectl
    sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
    
  4. Fetch credentials for your GKE cluster:

    gcloud container clusters get-credentials ${CLUSTER_NAME} \
      --zone=${ZONE} \
      --project=${PROJECT_ID}
    
  5. If not already installed, install Helm:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    chmod 700 get_helm.sh
    ./get_helm.sh && rm get_helm.sh
    sudo chmod +x /usr/local/bin/helm
    
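
To confirm that the tools from the preceding steps are ready, list the cluster nodes and print the Helm version:

    # Verify cluster credentials and node visibility.
    kubectl get nodes

    # Verify the Helm installation.
    helm version --short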

Use topology-aware scheduler to deploy your Pods

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.
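
Pods opt in to topology-aware placement through a Kubernetes scheduling gate, which the scheduler removes once it assigns nodes. The following minimal Pod sketch assumes the gate name documented in the repository; the Pod name, container name, and image are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: topology-aware-workload   # placeholder name
    spec:
      schedulingGates:
      - name: "gke.io/topology-aware-auto-scheduling"
      containers:
      - name: main                    # placeholder container
        image: IMAGE                  # replace with your workload image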

The following kubectl commands use files directly from the repository. Alternatively, you can clone the repository locally and reference the local files in the kubectl commands.

For more information, see Topology scheduler.

  1. Set up the service account:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
    
  2. Install the topology scheduler scripts in a configmap:

    curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
    curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py
    
    kubectl -n kube-system create configmap topology-scheduler-scripts \
        --from-file=schedule-daemon.py=schedule-daemon.py \
        --from-file=label-nodes-daemon.py=label-nodes-daemon.py
    
  3. Install the topology label daemonset and topology scheduler Pod:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
    
  4. Observe the actions of the topology scheduler:

    kubectl -n kube-system logs topology-scheduler-pod
    
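
To verify that the daemonset is labeling nodes, you can list the topology-related labels on your nodes; the exact label keys are defined by the label-nodes-daemon.py script:

    # Print topology-related labels applied to the cluster nodes.
    kubectl get nodes --show-labels | tr ',' '\n' | grep topology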

Run the workload

Build the Docker image and push it to Artifact Registry

  1. Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones you created, and then run the script:

    bash scripts/setup-and-configure-resources.sh
    
  2. Build and push the pytorch-megatron:23.11-py3 image to your repository. Ensure the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.

    bash scripts/build-and-push-docker-image.sh
    
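
Optionally, confirm that the image was pushed. LOCATION and REPOSITORY are placeholders for the Artifact Registry location and repository name that you configured in the setup script:

    gcloud artifacts docker images list LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY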

Launch Megatron-LM Llama2 benchmark

  1. Edit the helm/values.yaml file to specify the Cloud Storage bucket and the Docker image that you created in the previous sections. For some example configurations, see sample-configurations.

  2. Optional: Edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.

  3. Install the Helm chart:

    helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml

    Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, under the megatron-experiments directory.
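
To monitor the run and confirm that results were written, check the release and Pod status, then list the output directory. BUCKET_NAME is a placeholder for the bucket you configured in helm/values.yaml:

    # Check the Helm release and the workload Pods.
    helm status HELM_EXPERIMENT_NAME
    kubectl get pods

    # List the profiling output in Cloud Storage.
    gcloud storage ls gs://BUCKET_NAME/megatron-experiments/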

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster:

  1. In the Google Cloud console, go to the Clusters page.
  2. Select the checkbox for CLUSTER_NAME.
  3. Click Delete.
  4. To confirm deletion, type CLUSTER_NAME and click Delete.

Delete the Cloud Storage bucket:

  1. In the Google Cloud console, go to the Buckets page.
  2. Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
  3. Click Delete.
  4. To confirm deletion, type DELETE and click Delete.
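
Alternatively, you can delete both resources from the command line; BUCKET_NAME is a placeholder for your bucket:

    # Delete the GKE cluster.
    gcloud container clusters delete CLUSTER_NAME --zone=ZONE --project=PROJECT_ID

    # Delete the Cloud Storage bucket and its contents.
    gcloud storage rm --recursive gs://BUCKET_NAME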

What's next