AI Hypercomputer release notes

This page documents production updates to AI Hypercomputer. Check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

You can see the latest product updates for all of Google Cloud on the Google Cloud release notes page, browse and filter all release notes in the Google Cloud console, or programmatically access release notes in BigQuery.
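The BigQuery route can be scripted. Below is a minimal sketch using the bq CLI, assuming you have authenticated to Google Cloud; the dataset name comes from the BigQuery public datasets program, and the column names (product_name, published_at, and so on) are assumptions about the public schema that you should verify in the console:

```shell
# Sketch: pull the most recent AI Hypercomputer release notes from the
# public dataset. Column names are assumptions -- check the table schema.
bq query --use_legacy_sql=false '
SELECT product_name, release_note_type, published_at, description
FROM `bigquery-public-data.google_cloud_release_notes.release_notes`
WHERE product_name LIKE "%Hypercomputer%"
ORDER BY published_at DESC
LIMIT 20'
```

Because the dataset is public, this works from any authenticated project; only query costs apply.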
July 18, 2025
Generally available: You can troubleshoot workloads with slow performance by using straggler detection metrics and logs.

Stragglers are single-point, non-crashing failures that eventually slow down your entire workload. Large-scale ML workloads are highly susceptible to stragglers, and without straggler detection, the affected VMs are difficult to notice and pinpoint.

For more information, see Monitor VMs and Slurm clusters and Troubleshoot slow performance.
July 10, 2025

Generally available: You can now manage the Collective Communication Analyzer (CoMMA), a library that uses the NVIDIA Collective Communication Library (NCCL) profiler plugin to collect detailed NCCL telemetry for GPU machine types. The collected performance metrics and operational events are used for analyzing and optimizing large-scale AI and ML training workloads.

CoMMA is automatically installed and enabled on A4X, A4 High, and A3 Ultra machine types when using specific images. You can manage this data collection by disabling the plugin, adjusting its data granularity levels, or manually installing it on other GPU machine types. For more information, see Enable, disable, and configure CoMMA.
July 07, 2025
Preview: You can use future reservations in calendar mode to obtain resources for up to 90 days. By creating a request in calendar mode, you can reserve up to 80 GPU VMs for a future date and time. Then, you can use that capacity to run the following workloads:

- Model pre-training
- Model fine-tuning
- Simulations
- Inference

For more information, see Choose a consumption option.
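As a sketch, a calendar-mode request could be created with the gcloud CLI. The command below uses the general future-reservations surface; the machine type, the dates, and any calendar-mode-specific flags are assumptions to adapt from the linked documentation:

```shell
# Sketch: request 8 GPU VMs for a 14-day window. The machine type and
# times are placeholders; calendar mode may require additional flags
# beyond the general future-reservations surface shown here.
gcloud compute future-reservations create my-calendar-request \
    --machine-type=a3-ultragpu-8g \
    --total-count=8 \
    --start-time=2025-10-01T00:00:00Z \
    --end-time=2025-10-15T00:00:00Z \
    --zone=europe-west1-b
```

When the start time arrives, the delivered capacity becomes consumable like any other reservation.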
June 11, 2025

Generally available: You can apply a workload policy in a managed instance group (MIG) to specify the type of workload to run on the MIG. Workload policies help improve workload performance by optimizing the underlying infrastructure. The supported type, high-throughput, is ideal for workloads that require high networking performance. For more information, see Workload policy for MIGs.
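The flow has two steps: create the policy, then reference it when creating the MIG. The sketch below assumes the flag spellings (--type, --workload-policy), which may differ in your gcloud version; verify them against the linked page:

```shell
# 1. Sketch: create a workload policy of the supported high-throughput
#    type. Flag names here are assumptions -- verify against the docs.
gcloud compute resource-policies create workload-policy my-policy \
    --type=HIGH_THROUGHPUT \
    --region=us-central1

# 2. Sketch: attach the policy when creating the managed instance group.
gcloud compute instance-groups managed create my-mig \
    --template=my-template \
    --size=16 \
    --zone=us-central1-b \
    --workload-policy=my-policy
```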
May 22, 2025
Generally available: You can proactively manage upcoming maintenance host events on your reserved blocks of capacity, whether VMs are running on them or not. This approach helps you minimize disruptions and maintain optimal performance. For more information, see Manage host events across reservations.
May 15, 2025
Preview: You can use the Flex-start consumption option to obtain resources for up to seven days. Flex-start provisions capacity from a secured resource pool. Using this feature increases your chances of obtaining high-demand resources like GPUs. For more information, see Choose a consumption option.
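As a sketch, Flex-start corresponds to the FLEX_START provisioning model in the gcloud CLI; the machine type and exact flag names below are assumptions to verify against the linked page:

```shell
# Sketch: request a GPU VM through Flex-start for a bounded run.
# Flex-start VMs run for a limited duration (up to seven days) and
# need a termination action; flag names are assumptions.
gcloud compute instances create my-flex-vm \
    --zone=us-central1-b \
    --machine-type=a3-ultragpu-8g \
    --provisioning-model=FLEX_START \
    --max-run-duration=7d \
    --instance-termination-action=DELETE
```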
March 18, 2025
Generally available: The A4 accelerator-optimized machine type is now generally available. A4 VMs are powered by NVIDIA B200 GPUs and provide up to 3x the performance of previous GPU machine types for most GPU-accelerated workloads. A4 is especially recommended for large-scale ML training workloads. A4 is available in the following region and zone:

- Council Bluffs, Iowa: us-central1-b

When provisioning A4 machine types, you can use Hypercompute Cluster to request capacity and create VMs or clusters. To get started, see Overview of creating VMs and clusters.

Software stack updates

The following new Docker images are also released to support workloads running on your A4 GKE clusters that are deployed using Hypercompute Cluster:

- NeMo Docker image: nemo25.02-gib1.0.5-A4
- MaxText Docker image: jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317

For more information, see AI Hypercomputer images.
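You can confirm what the A4 series offers in the documented zone before requesting capacity. This sketch uses the standard machine-types surface; the a4 name prefix is an assumption about the series naming:

```shell
# Sketch: list A4 machine types in the documented zone to confirm
# availability and GPU counts before requesting capacity.
gcloud compute machine-types list \
    --filter="name~'^a4'" \
    --zones=us-central1-b
```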
December 31, 2024

Generally available: The A3 Ultra accelerator-optimized machine type is now generally available. A3 Ultra VMs are powered by NVIDIA H200 Tensor Core GPUs and support the new Titanium ML network adapter, which delivers non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). A3 Ultra VMs are ideal for foundation ML model training and serving. The A3 Ultra machine type is available in the following region and zone:

- St. Ghislain, Belgium: europe-west1-b

When provisioning A3 Ultra machine types, you must use Hypercompute Cluster to request capacity and create VMs or clusters. To get started, see Overview of creating VMs and clusters in the AI Hypercomputer documentation.
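Once capacity has been obtained through Hypercompute Cluster, creating a single VM against that reservation might look like the following sketch; the machine type name a3-ultragpu-8g and the reservation name are assumptions to replace with your own values:

```shell
# Sketch: create one A3 Ultra VM against an existing reservation.
# All names below are placeholders; the reservation must already exist.
gcloud compute instances create my-a3u-vm \
    --zone=europe-west1-b \
    --machine-type=a3-ultragpu-8g \
    --reservation-affinity=specific \
    --reservation=my-reservation
```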
Preview: Hypercompute Cluster is now available in preview. With Hypercompute Cluster, you can streamline the provisioning of up to tens of thousands of A3 Ultra accelerator-optimized machines.
With features such as dense co-location of resources, ultra-low latency networking, targeted workload placement, and advanced maintenance controls to minimize workload disruptions, Hypercompute Cluster is built to deliver exceptional performance and resilience, so you can run your most demanding AI, ML, and HPC workloads with confidence.

To get started, review the overview for VM and cluster creation.