OS and Docker images

Google Cloud provides pre-configured images that contain common operating systems, frameworks, libraries, and drivers, and optimizes these images to support your artificial intelligence (AI) and machine learning (ML) workloads.

This document provides an overview of the images that you use to deploy, manage, and run workloads in your AI Hypercomputer environment.

Understand the image categories

Images are grouped into the following categories:

  • AI and ML frameworks and libraries: Docker images that are pre-configured with binaries for ML frameworks and libraries to simplify the creation, training, and use of ML models. On AI Hypercomputer, you can use Deep Learning Software Layer (DLSL) Docker images to run ML models such as NeMo and MaxText on a Google Kubernetes Engine (GKE) cluster.
  • Cluster deployment and orchestration: operating system (OS) images that you use to deploy and manage the performance-optimized infrastructure on which your AI workloads run. You can deploy your AI workloads on GKE clusters, Slurm clusters, or Compute Engine instances. For more information, see VM and cluster creation overview. The following operating system images are available for cluster or instance deployment:
    • GKE node images: you can use these images to deploy GKE clusters.
    • Slurm OS images: Cluster Toolkit builds and deploys these images, which install the necessary system software for Slurm nodes.
    • Accelerator OS images: you can use these images to create individual or groups of VM instances.

AI and ML frameworks and libraries

Google Cloud provides Docker images that package popular AI and ML frameworks and libraries. These images provide the software needed to simplify the development, training, and deployment of models on your AI-optimized clusters running on AI Hypercomputer.

Deep Learning Software Layer (DLSL) Docker images

These images package NVIDIA CUDA, NCCL, an ML framework, and a model. They provide a ready-to-use environment for deep learning workloads. These prebuilt DLSL Docker images work seamlessly with your GKE clusters because we test and verify these images during internal reproducibility and regression testing.

DLSL Docker images provide the following benefits:

  • Pre-configured software: DLSL Docker images replicate the setup used in internal reproducibility and regression testing. These images provide a pre-configured, tested, and optimized environment, which saves significant time and effort in installation and configuration.
  • Version management: DLSL Docker images are frequently updated. These version updates provide the latest stable version of frameworks and drivers, and the updates also address security patches.
  • Infrastructure compatibility: DLSL Docker images are built and tested to work seamlessly with the GPU machine types available on AI Hypercomputer.
  • Quickstart instructions: some DLSL Docker images have accompanying sample recipes that show you how to run workloads that use these pre-configured images.

NeMo + PyTorch + NCCL gIB plugin

These Docker images are based on the NVIDIA NeMo NGC image. They contain Google's NCCL gIB plugin and bundle all NCCL binaries required to run workloads on each supported accelerator machine. These images also include Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine. For an example of pulling one of these images, see the commands that follow the table.

The following DLSL images are available:

nemo25.04-gib1.0.6-A4
  • Dependencies: NeMo NGC 25.04.01; NCCL gIB plugin 1.0.6
  • Machine series: A4
  • Release date: July 3, 2025
  • End of support date: July 3, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.04-gib1.0.6-A4

nemo25.02-gib1.0.5-A4
  • Dependencies: NeMo NGC 25.02; NCCL gIB plugin 1.0.5
  • Machine series: A4
  • Release date: March 14, 2025
  • End of support date: March 14, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4

nemo24.07-gib1.0.2-A3U
  • Dependencies: NeMo NGC 24.07; NCCL gIB plugin 1.0.2
  • Machine series: A3 Ultra
  • Release date: February 2, 2025
  • End of support date: February 2, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.2-A3U

nemo24.07-gib1.0.3-A3U
  • Dependencies: NeMo NGC 24.07; NCCL gIB plugin 1.0.3
  • Machine series: A3 Ultra
  • Release date: February 2, 2025
  • End of support date: February 2, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.3-A3U

nemo24.12-gib1.0.3-A3U
  • Dependencies: NeMo NGC 24.12; NCCL gIB plugin 1.0.3
  • Machine series: A3 Ultra
  • Release date: February 7, 2025
  • End of support date: February 7, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.12-gib1.0.3-A3U

nemo24.07-tcpx1.0.5-A3Mega
  • Dependencies: NeMo NGC 24.07; GPUDirect-TCPX 1.0.5
  • Machine series: A3 Mega
  • Release date: March 12, 2025
  • End of support date: March 12, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-tcpx1.0.5-A3Mega
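
To use one of these images, reference its full DLSL image name in your GKE workload or recipe. As a minimal sketch, the following commands authenticate Docker to the Artifact Registry host and pull one of the images listed in the preceding table; they assume that you have Docker and the gcloud CLI installed and that your account can read the repository.

  # Authenticate Docker to the Artifact Registry host that serves the DLSL images.
  gcloud auth configure-docker us-central1-docker.pkg.dev

  # Pull a DLSL image by its full name, for example the A4 NeMo image.
  docker pull us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.04-gib1.0.6-A4

In most cases, you don't pull the image manually; instead, you set the image name in the container specification of your GKE workload or in the sample recipe that accompanies the image.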

NeMo + PyTorch

This Docker image is based on the NVIDIA NeMo NGC image and includes Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine.

nemo24.07-A3U
  • Dependencies: NeMo NGC 24.07
  • Machine series: A3 Ultra
  • Release date: December 19, 2024
  • End of support date: December 19, 2025
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo:nemo24.07-A3U

MaxText + JAX toolbox

This Docker image is based on the NVIDIA JAX toolbox image and includes Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine.

toolbox-maxtext-2025-01-10-A3U
  • Dependencies: JAX toolbox maxtext-2025-01-10
  • Machine series: A3 Ultra
  • Release date: March 11, 2025
  • End of support date: March 11, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:toolbox-maxtext-2025-01-10-A3U

MaxText + JAX stable stack

This Docker image is based on the JAX stable stack and MaxText. This image also includes dependencies such as dnsutils for running workloads on Google Kubernetes Engine.

jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317
  • Dependencies: JAX stable stack jax0.5.1-cuda_dl25.02-rev1; MaxText commit 54e98c9e62caa426cf5902be068533ddb4fb79f5
  • Machine series: A4
  • Release date: March 17, 2025
  • End of support date: March 17, 2026
  • DLSL image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317

Cluster deployment and orchestration

OS images include all the software components that are necessary to deploy an operating system on a Compute Engine virtual machine instance or GKE node. The operating system manages the underlying hardware resources, such as accelerators and networking. These instances and nodes provide the compute resources for your AI workloads.

GKE node images

GKE deploys clusters using node images. These node images are available for various operating systems, such as Container-Optimized OS, Ubuntu, and Windows Server. The Container-Optimized OS with containerd (cos_containerd) node image, which you need to deploy GKE Autopilot clusters, includes optimizations that support your AI and ML workloads.
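
For example, the following command is a minimal sketch that creates a GKE Standard cluster whose default node pool uses the cos_containerd node image. The cluster name and location are placeholders, and AI and ML workloads typically also need accelerator-specific node pool settings that this command omits.

  # Create a GKE Standard cluster whose default node pool uses the
  # Container-Optimized OS with containerd (cos_containerd) node image.
  gcloud container clusters create CLUSTER_NAME \
      --location=LOCATION \
      --image-type=COS_CONTAINERD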

For more information about these node images, see Node images.

Slurm OS images

Slurm clusters deploy compute and controller nodes as virtual machine instances on Compute Engine.

To provision AI-optimized Slurm clusters, you must use Cluster Toolkit. During Slurm cluster deployment, the cluster blueprint automatically builds a custom OS image that installs the required system software for cluster and workload management on the Slurm nodes. You can modify the default blueprints before you deploy them to customize some of the software that your images include.
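
As a sketch of that workflow, the following commands clone Cluster Toolkit, build the gcluster binary, and deploy a cluster from a blueprint. The blueprint file and the deployment variables are placeholders; use the A4 or A3 Ultra blueprint from the Cluster Toolkit repository and supply your own project and deployment settings.

  # Clone Cluster Toolkit and build the gcluster binary.
  git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
  cd cluster-toolkit
  make

  # Deploy a Slurm cluster from a blueprint. During deployment, the blueprint
  # builds the custom Slurm OS image described in this section.
  ./gcluster deploy BLUEPRINT_FILE --vars project_id=PROJECT_ID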

The following sections summarize the software that the cluster blueprint installs on your A4 and A3 Ultra Slurm nodes; a sketch of commands that you can use to verify this software on a deployed cluster follows the A3 Ultra list. Cluster blueprints extend the Ubuntu LTS accelerator OS images.

A4

The A4 blueprint available on GitHub includes the following software by default:

  • Ubuntu 22.04 LTS
  • Slurm: version 24.11.2
  • The following Slurm dependencies:
    • munge
    • mariadb
    • libjwt
    • lmod
  • Open MPI: the latest release of 4.1.x
  • PMIx: version 4.2.9
  • NFS client and server
  • NVIDIA 570 series drivers
  • NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
  • NVIDIA pyxis
  • The following NVIDIA tools:
    • Data Center GPU Manager (dcgmi)
    • nvidia-utils-570
    • nvidia-container-toolkit
    • libnvidia-nscq-570
  • CUDA Toolkit: version 12.8
  • InfiniBand support, including ibverbs-utils
  • Ops Agent
  • Cloud Storage FUSE

A3 Ultra

The A3 Ultra blueprint available on GitHub includes the following software by default:

  • Ubuntu 22.04 LTS
  • Slurm: version 24.11.2
  • The following Slurm dependencies:
    • munge
    • mariadb
    • libjwt
    • lmod
  • Open MPI: the latest release of 4.1.x
  • PMIx: version 4.2.9
  • NFS client and server
  • NVIDIA 570 series drivers
  • NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
  • NVIDIA pyxis
  • The following NVIDIA tools:
    • Data Center GPU Manager (dcgmi)
    • libnvidia-cfg1-570-server
    • libnvidia-nscq-570
    • nvidia-compute-utils-570-server
    • nsight-compute
    • nsight-systems
  • CUDA Toolkit: version 12.8
  • InfiniBand support, including ibverbs-utils
  • Ops Agent
  • Cloud Storage FUSE
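
After you deploy an A4 or A3 Ultra Slurm cluster, you can confirm that some of this software is present by logging in to the login node and running commands like the following. This is an illustrative check only, and it assumes that the CUDA Toolkit binaries are on your PATH.

  # List the Slurm partitions and nodes that the blueprint configured.
  sinfo

  # Confirm that the NVIDIA driver and GPUs are visible on a compute node.
  srun --nodes=1 nvidia-smi

  # Check the CUDA Toolkit and enroot versions that the blueprint installed.
  nvcc --version
  enroot version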

Accelerator OS images

AI Hypercomputer lets you provision individual instances or groups of instances. When you create these instances, you must specify an OS image, as shown in the example at the end of this section.

Google Cloud offers a suite of OS images for instance creation. Google Cloud also offers a specialized set of accelerator OS images for AI-optimized instances. These OS images include core drivers for GPU and networking functionality, such as NVIDIA drivers, Mellanox drivers, and their dependencies.

For more information about each OS, see the Operating system details page in the Compute Engine documentation.

Accelerator OS images are available for the Rocky Linux and Ubuntu LTS operating systems.

Rocky Linux accelerator

The following Rocky Linux accelerator OS images are available for each machine series:

Rocky Linux 9 accelerator
  • Image family: rocky-linux-9-optimized-gcp-nvidia-570
  • Machine series: A4, A3 Ultra
  • Image project: rocky-linux-accelerator-cloud

Rocky Linux 8 accelerator
  • Image family: rocky-linux-8-optimized-gcp-nvidia-570
  • Machine series: A4, A3 Ultra
  • Image project: rocky-linux-accelerator-cloud

Ubuntu LTS accelerator

The following Ubuntu LTS accelerator OS images are available for each machine series:

Ubuntu 24.04 LTS accelerator
  • Image family: ubuntu-accelerator-2404-arm64-with-nvidia-570 (Arm architecture; A4X machine series)
  • Image family: ubuntu-accelerator-2404-amd64-with-nvidia-570 (x86 architecture; A4 and A3 Ultra machine series)
  • Image project: ubuntu-os-accelerator-images

Ubuntu 22.04 LTS accelerator
  • Image family: ubuntu-accelerator-2204-arm64-with-nvidia-570 (Arm architecture; A4X machine series)
  • Image family: ubuntu-accelerator-2204-amd64-with-nvidia-570 (x86 architecture; A4 and A3 Ultra machine series)
  • Image project: ubuntu-os-accelerator-images
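
For example, the following command is a minimal sketch that creates a single A3 Ultra instance from the Ubuntu 22.04 LTS accelerator image family. The instance name and zone are placeholders, and production A4 or A3 Ultra instances typically require additional settings, such as reservation and network configuration, that this command omits.

  # Create an A3 Ultra instance from an Ubuntu LTS accelerator OS image family.
  gcloud compute instances create INSTANCE_NAME \
      --zone=ZONE \
      --machine-type=a3-ultragpu-8g \
      --image-family=ubuntu-accelerator-2204-amd64-with-nvidia-570 \
      --image-project=ubuntu-os-accelerator-images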

What's next