Visualizing jobs with Vertex AI TensorBoard

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

With Managed Training, you can visualize your training logs in near real time using Vertex AI TensorBoard. Configure your workload to save logs to a Cloud Storage bucket, and the logs are automatically streamed to the TensorBoard interface for analysis.

Prerequisites

Before you begin, ensure you have the following:

  • A running Managed Training cluster.
  • A Cloud Storage bucket to store your TensorBoard logs. This bucket must be in the same region as your TensorBoard instance. For setup instructions, see Create a Cloud Storage bucket.
  • A Vertex AI TensorBoard instance. For creation instructions, see Create a Vertex AI TensorBoard instance.
  • The correct IAM permissions. To allow Cloud Storage FUSE to read from and write to the storage bucket, the service account used by your cluster's VMs requires the Storage Object User (roles/storage.objectUser) role. For an example of granting this role, see the sketch after this list.
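
For example, you can grant the role on the log bucket with the Google Cloud CLI. The bucket name and service account below are placeholders:

gcloud storage buckets add-iam-policy-binding gs://<your-log-bucket> \
    --member="serviceAccount:<your-cluster-vm-service-account>" \
    --role="roles/storage.objectUser"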

Enabling TensorBoard upload

To configure the TensorBoard integration for your job, pass the following arguments using the --extra flag in your Slurm job submission:

  • tensorboard_base_output_dir: Specifies the Cloud Storage path to upload logs to. For example, gs://my-bucket/my-logs.

  • tensorboard_url: Specifies the Vertex AI TensorBoard instance, experiment, or run URL. If only an instance is provided, a new experiment and run are created. If omitted, the default TensorBoard instance for the project is used. For example, projects/123/locations/us-central1/tensorboards/456.

Example

# Using a specific TensorBoard instance
sbatch --extra="tensorboard_base_output_dir=<your-cloud-storage-dir>,tensorboard_url=projects/<project-id>/locations/<location>/tensorboards/<tensorboard-instance-id>" your_script.sbatch
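
To use the project's default TensorBoard instance instead, you can omit tensorboard_url and pass only the output directory. The bucket path here is a placeholder:

# Using the project's default TensorBoard instance
sbatch --extra="tensorboard_base_output_dir=<your-cloud-storage-dir>" your_script.sbatch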

Writing logs from your training job

Within your training script, access the AIP_TENSORBOARD_LOG_DIR environment variable. This variable provides the unique Cloud Storage path where your script should write its TensorBoard logs.

The path follows this structure:

gs://<your-cloud-storage-path>/<cluster-id>-<cluster-uuid>/tensorboard/job-<job-id>/

The following example shows a complete workflow with two key components: the Slurm submission script that configures the job, and the Python training script that reads the environment variable to write its logs.

Slurm Job Script (simple_job.sbatch):

#!/bin/bash
#SBATCH --job-name=tensorboard-simple-test
#SBATCH --output=tensorboard-simple-test-%j.out
# Activate your Python virtual environment if needed
# source /path/to/your/venv/bin/activate
python3 simple_logger.py

Python Script (simple_logger.py):

import tensorflow as tf
import os

# Get the log directory from the environment variable
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")

print(f"Writing TensorBoard logs to: {log_dir}")
writer = tf.summary.create_file_writer(log_dir)

with writer.as_default():
    for step in range(10):
        # Simulate some metrics
        loss = 1.0 - (step * 0.1)
        accuracy = 0.6 + (step * 0.04)

        # Log the metrics
        tf.summary.scalar('loss', loss, step=step)
        tf.summary.scalar('accuracy', accuracy, step=step)
        writer.flush()
        print(f"Step {step}: loss={loss:.4f}, accuracy={accuracy:.4f}")

writer.close()
print(f"--- Finished writing metrics to {log_dir} ---")

Real-time log synchronization

To visualize metrics from a running job, periodically close and recreate the summary writer in your training code. This is necessary because gcsfuse only syncs log files to Cloud Storage once they are closed. Closing and recreating the writer ensures that intermediate results are visible in the TensorBoard console before the job completes.
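
The following sketch shows one way to apply this technique to the simple_logger.py example above; the interval of 5 steps is an arbitrary choice:

import os
import tensorflow as tf

log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
FLUSH_EVERY = 5  # arbitrary interval; tune for your job

writer = tf.summary.create_file_writer(log_dir)
for step in range(100):
    loss = 1.0 / (step + 1)  # simulated metric
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)

    # Closing the writer finalizes the current event file so it can be
    # synced to Cloud Storage; recreate it to keep logging later steps.
    if (step + 1) % FLUSH_EVERY == 0:
        writer.close()
        writer = tf.summary.create_file_writer(log_dir)

writer.close()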

Viewing Vertex AI TensorBoard

After your job is submitted, you can monitor its progress by going to the Vertex AI Experiments page in the Google Cloud console.