本頁面由 Cloud Translation API 翻譯而成。

使用 TPU 加速器進行訓練

Vertex AI 支援使用 TPU VM，透過各種架構和程式庫進行訓練。設定運算資源時，您可以指定 TPU v2、TPU v3 或 TPU v5e VM。TPU v5e 支援 JAX 0.4.6 以上版本、TensorFlow 2.15 以上版本和 PyTorch 2.1 以上版本。TPU v6e 支援 Python 3.10 以上版本、JAX 0.4.37 以上版本和 PyTorch 2.1 以上版本，並以 PJRT 做為預設執行階段。

如要瞭解如何設定 TPU VM 進行自訂訓練，請參閱「設定自訂訓練工作的運算資源」。

TensorFlow 訓練

預先建構的容器

使用支援 TPU 的預建訓練容器，並建立 Python 訓練應用程式。

自訂容器

使用自訂容器，其中已安裝專為 TPU VM 建構的 tensorflow 和 libtpu 版本。這些程式庫由 Cloud TPU 服務維護，並列於「支援的 TPU 設定」說明文件中。

選取所需 tensorflow 版本和對應的 libtpu 程式庫。接著，在建構容器時，將這些項目安裝在 Docker 容器映像檔中。

舉例來說，如要使用 TensorFlow 2.12，請在 Dockerfile 中加入下列指令：

  # Download and install `tensorflow`.
  RUN pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.15.0/tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  # Download and install `libtpu`.
  # You must save `libtpu.so` in the '/lib' directory of the container image.
  RUN curl -L https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/1.9.0/libtpu.so -o /lib/libtpu.so

  # TensorFlow training on TPU v5e requires the PJRT runtime. To enable the PJRT
  # runtime, configure the following environment variables in your Dockerfile.
  # For details, see https://cloud.google.com/tpu/docs/runtimes#tf-pjrt-support.
  # ENV NEXT_PLUGGABLE_DEVICE_USE_C_API=true
  # ENV TF_PLUGGABLE_DEVICE_LIBRARY_PATH=/lib/libtpu.so

TPU Pod

tensorflow訓練，則需要在訓練容器中進行額外設定。TPU PodVertex AI 會維護處理初始設定的基礎 Docker 映像檔。

圖片 URI	Python 版本和 TPU 版本
`us-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest` `europe-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest` `asia-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest`	3.8
`us-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest` `europe-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest` `asia-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest`	3.10

如要建立自訂容器，請按照下列步驟操作：

選擇所選 Python 版本的基礎映像檔。TPU TensorFlow 2.12 以下版本的 TensorFlow 輪子支援 Python 3.8。TensorFlow 2.13 以上版本支援 Python 3.10 以上版本。如需特定 TensorFlow 輪子，請參閱「Cloud TPU 設定」。
使用訓練師程式碼和啟動指令擴充映像檔。

# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest
WORKDIR /root

# Download and install `tensorflow`.
RUN pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.12.0/tensorflow-2.12.0-cp38-cp38-linux_x86_64.whl

# Download and install `libtpu`.
# You must save `libtpu.so` in the '/lib' directory of the container image.
RUN curl -L https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/1.6.0/libtpu.so -o /lib/libtpu.so

# Copies the trainer code to the docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/trainer.py /root/trainer.py

# The base image is setup so that it runs the CMD that you provide.
# You can provide CMD inside the Dockerfile like as follows.
# Alternatively, you can pass it as an `args` value in ContainerSpec:
# (https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#containerspec)
CMD ["python3", "trainer.py"]

PyTorch 訓練

使用 TPU 訓練時，您可以針對 PyTorch 使用預建或自訂容器。

預先建構的容器

使用支援 TPU 的預建訓練容器，並建立 Python 訓練應用程式。

自訂容器

使用已安裝 PyTorch 程式庫的自訂容器。

舉例來說，您的 Dockerfile 可能如下所示：

FROM python:3.10

# v5e, v6e specific requirement - enable PJRT runtime
ENV PJRT_DEVICE=TPU

# install pytorch and torch_xla
RUN pip3 install torch~=2.1.0 torchvision torch_xla[tpu]~=2.1.0
 -f https://storage.googleapis.com/libtpu-releases/index.html

# Add your artifacts here
COPY trainer.py .

# Run the trainer code
CMD ["python3", "trainer.py"]

TPU Pod

訓練會在 TPU Pod 的所有主機上執行 (請參閱「在 TPU Pod 配量上執行 PyTorch 程式碼」)。

Vertex AI 會等待所有主機的回應，再決定是否完成工作。

JAX 訓練

預先建構的容器

JAX 沒有預先建構的容器。

自訂容器

使用已安裝 JAX 程式庫的自訂容器。

舉例來說，您的 Dockerfile 可能如下所示：

# Install JAX.
RUN pip install 'jax[tpu]>=0.4.6' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Add your artifacts here
COPY trainer.py trainer.py

# Set an entrypoint.
ENTRYPOINT ["python3", "trainer.py"]

TPU Pod

訓練會在 TPU Pod 的所有主機上執行 (請參閱「在 TPU Pod 配量上執行 JAX 程式碼」)。

Vertex AI 會監控 TPU Pod 的第一個主機，以決定工作是否完成。您可以使用下列程式碼片段，確保所有主機同時結束：

# Your training logic
...

if jax.process_count() > 1:
  # Make sure all hosts stay up until the end of main.
  x = jnp.ones([jax.local_device_count()])
  x = jax.device_get(jax.pmap(lambda x: jax.lax.psum(x, 'i'), 'i')(x))
  assert x[0] == jax.device_count()

環境變數

下表詳細列出可在容器中使用的環境變數：

名稱	值
TPU_NODE_NAME	my-first-tpu-node
TPU_CONFIG	{"project": "tenant-project-xyz", "zone": "us-central1-b", "tpu_node_name": "my-first-tpu-node"}

自訂服務帳戶

自訂服務帳戶可用於 TPU 訓練。如要瞭解如何使用自訂服務帳戶，請參閱這篇文章。

訓練用的私人 IP (虛擬私有雲網路對等互連)

私人 IP 可用於 TPU 訓練。請參閱如何使用私人 IP 進行自訂訓練頁面。

VPC Service Controls

啟用 VPC Service Controls 的專案可以提交 TPU 訓練工作。

限制

使用 TPU VM 訓練時，有下列限制：

TPU 僅適用於特定 Vertex AI 區域。

TPU 類型

如要進一步瞭解 TPU 加速器 (例如記憶體限制)，請參閱「TPU 類型」。

使用 TPU 加速器進行訓練 透過集合功能整理內容 你可以依據偏好儲存及分類內容。

TensorFlow 訓練

預先建構的容器

自訂容器

TPU Pod

PyTorch 訓練

預先建構的容器

自訂容器

TPU Pod

JAX 訓練

預先建構的容器

自訂容器

TPU Pod

環境變數

自訂服務帳戶

訓練用的私人 IP (虛擬私有雲網路對等互連)

VPC Service Controls

限制

TPU 類型

使用 TPU 加速器進行訓練