This document describes the Collective Communication Analyzer (CoMMA), a library that collects NCCL telemetry for Google Cloud services. NCCL telemetry consists of the performance metrics and operational events that NCCL generates during execution. The NVIDIA Collective Communication Library (NCCL) accelerates high-performance communication between GPUs in parallel and distributed computing systems. This high-performance communication is especially useful for deep learning and high performance computing (HPC).
For NCCL versions 2.23 and later, NVIDIA introduced the NCCL profiler plugin API, which lets developers register function callbacks to collect telemetry during NCCL collective operations. Google provides the Collective Communication Analyzer (CoMMA), a library that uses NVIDIA's NCCL profiler plugin API to collect NCCL telemetry for Google Cloud services. CoMMA is automatically installed and enabled for some images, but you can also disable, re-enable, or manually install and enable CoMMA to control data collection.
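To illustrate the callback-registration pattern that the profiler plugin API is built around, the following C sketch defines a simplified callback table of the kind a profiler plugin exports. The struct, field, and function names here are illustrative placeholders, not the exact types in the NCCL headers; the authoritative definitions ship with the NCCL source.

```c
/*
 * Illustrative sketch only: the struct layout and callback names below are
 * simplified placeholders for the kind of table a profiler plugin exports,
 * not the exact NCCL API. At runtime, NCCL discovers the exported table and
 * invokes its callbacks around collective operations.
 */
typedef int pluginResult_t;  /* placeholder for the library's result type */

typedef struct {
  const char* name;
  /* Called once per communicator so the plugin can set up its own state. */
  pluginResult_t (*init)(void** context);
  /* Called when a profiled event (for example, a collective) starts and stops. */
  pluginResult_t (*startEvent)(void* context, void** eventHandle);
  pluginResult_t (*stopEvent)(void* eventHandle);
  /* Called at teardown to flush or export the collected telemetry. */
  pluginResult_t (*finalize)(void* context);
} exampleProfiler_t;

static pluginResult_t exampleInit(void** context)       { *context = NULL; return 0; }
static pluginResult_t exampleStart(void* ctx, void** e)  { (void)ctx; *e = NULL; return 0; }
static pluginResult_t exampleStop(void* e)               { (void)e; return 0; }
static pluginResult_t exampleFinalize(void* ctx)         { (void)ctx; return 0; }

/* A plugin exports a table like this; the host library looks it up by symbol name. */
exampleProfiler_t exampleProfiler = {
  "ExampleProfiler", exampleInit, exampleStart, exampleStop, exampleFinalize
};
```

A library built from a sketch like this would be compiled as a shared object; CoMMA is Google's production implementation of this plugin pattern.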
Images that have CoMMA enabled
For A4X, A4 High, and A3 Ultra machine types, CoMMA is installed and automatically enabled when you use any image that packages the NCCL Google Infrastructure Bundle (gIB) plugin. The following images contain the NCCL gIB plugin:
- Container-Optimized OS with containerd (cos_containerd) node images: Google Kubernetes Engine (GKE) uses these images for creating GKE Autopilot clusters. The CoMMA binaries are available in the /home/kubernetes/bin/gib directory.
- Deep Learning Software Layer container images: you use these images to deploy and configure AI and ML frameworks and libraries on GKE clusters.
If you use any of these images and want to stop CoMMA from collecting NCCL telemetry, see Disable CoMMA. If you don't use these images and want to enable CoMMA to collect NCCL telemetry, see Install CoMMA.
Benefits
The NCCL telemetry that CoMMA collects helps identify performance bottlenecks, specifically stragglers, in GPU communication. CoMMA collects fine-grained data, such as latency histograms for collective communication operations. A diagnostic service can then process and use this data to pinpoint stragglers.
Using CoMMA to collect telemetry offers the following benefits:
- Low-overhead tracing: CoMMA uses minimal computational resources during active NCCL telemetry collection, making it ideal for performance-sensitive and long-running machine learning workloads like large language model (LLM) training.
- Broader NCCL telemetry scope: CoMMA uses the NCCL profiler plugin API, which collects a broader range of NCCL telemetry than transport-based plugins. Transport-based plugins primarily collect telemetry about the underlying network transport, including data transfers over network hardware and network protocols. The profiler plugin collects telemetry for NCCL's communication operations, including the timing of collective communications, proxy operations, and data transfers.
Understand how CoMMA works
During application runtime, NCCL automatically loads the CoMMA libraries
that are installed in the location specified by the LD_LIBRARY_PATH
environment variable. CoMMA then collects NCCL telemetry, which other Google
services can use. You can optionally export this data to your local
file system.
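As a rough illustration of that loading step, the following C sketch uses dlopen and dlsym the way a host process can resolve a plugin library found on LD_LIBRARY_PATH. The library name libexample-profiler.so and the symbol exampleProfiler are assumptions made for this example only; they are not the actual names that NCCL or CoMMA use.

```c
/*
 * Minimal sketch of runtime plugin loading. The library name
 * "libexample-profiler.so" and the symbol "exampleProfiler" are illustrative
 * assumptions; NCCL performs an equivalent lookup internally for libraries it
 * finds through LD_LIBRARY_PATH.
 */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  /* dlopen searches LD_LIBRARY_PATH when given a bare library name. */
  void* handle = dlopen("libexample-profiler.so", RTLD_NOW | RTLD_LOCAL);
  if (handle == NULL) {
    fprintf(stderr, "plugin not found: %s\n", dlerror());
    return 1;
  }

  /* Resolve the exported callback table by its symbol name. */
  void* profiler = dlsym(handle, "exampleProfiler");
  if (profiler == NULL) {
    fprintf(stderr, "symbol not found: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }

  printf("profiler plugin loaded\n");
  dlclose(handle);
  return 0;
}
```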
What's next
- Learn how to enable, disable, and configure CoMMA.
- Learn how to troubleshoot issues with CoMMA.