This document describes the Collective Communication Analyzer (CoMMA), a library that collects NCCL telemetry for Google Cloud services. NCCL telemetry consists of the performance metrics and operational events that NCCL generates during execution. The NVIDIA Collective Communication Library (NCCL) accelerates high-performance communication between GPUs in parallel and distributed computing systems. This high-performance communication is especially useful for deep learning and high performance computing (HPC).
For NCCL versions 2.23 and later, NVIDIA introduced the NCCL profiler plugin API, which lets developers register function callbacks to collect telemetry during NCCL collective operations. Google provides the Collective Communication Analyzer (CoMMA), a library that uses NVIDIA's NCCL profiler plugin API to collect NCCL telemetry for Google Cloud services. CoMMA is automatically installed and enabled for some images, but you can also disable, re-enable, or manually install and enable CoMMA to control data collection.
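To illustrate the callback-registration pattern that the profiler plugin API is built around, the following C sketch defines a simplified callback table of the kind a profiler plugin exports. The struct, field, and function names here are illustrative placeholders, not the exact types in the NCCL headers; the authoritative definitions ship with the NCCL source.

```c
/*
 * Illustrative sketch only: the struct layout and callback names below are
 * simplified placeholders for the kind of table a profiler plugin exports,
 * not the exact NCCL API. At runtime, NCCL discovers the exported table and
 * invokes its callbacks around collective operations.
 */
typedef int pluginResult_t;  /* placeholder for the library's result type */

typedef struct {
  const char* name;
  /* Called once per communicator so the plugin can set up its own state. */
  pluginResult_t (*init)(void** context);
  /* Called when a profiled event (for example, a collective) starts and stops. */
  pluginResult_t (*startEvent)(void* context, void** eventHandle);
  pluginResult_t (*stopEvent)(void* eventHandle);
  /* Called at teardown to flush or export the collected telemetry. */
  pluginResult_t (*finalize)(void* context);
} exampleProfiler_t;

static pluginResult_t exampleInit(void** context)       { *context = NULL; return 0; }
static pluginResult_t exampleStart(void* ctx, void** e)  { (void)ctx; *e = NULL; return 0; }
static pluginResult_t exampleStop(void* e)               { (void)e; return 0; }
static pluginResult_t exampleFinalize(void* ctx)         { (void)ctx; return 0; }

/* A plugin exports a table like this; the host library looks it up by symbol name. */
exampleProfiler_t exampleProfiler = {
  "ExampleProfiler", exampleInit, exampleStart, exampleStop, exampleFinalize
};
```

A library built from a sketch like this would be compiled as a shared object; CoMMA is Google's production implementation of this plugin pattern.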
Images that have CoMMA enabled
For A4X, A4 High, and A3 Ultra machine types, CoMMA is installed and automatically enabled when you use any image that packages the NCCL Google Infrastructure Bundle (gIB) plugin. The following images contain the NCCL gIB plugin:
- Container-Optimized OS with containerd (cos_containerd) node images: Google Kubernetes Engine (GKE) uses these images for creating GKE Autopilot clusters. The CoMMA binaries are available in the /home/kubernetes/bin/gib directory.
- Deep Learning Software Layer container images: you use these images to deploy and configure AI and ML frameworks and libraries on GKE clusters.
If you use any of these images and want to stop CoMMA from collecting NCCL telemetry, see Disable CoMMA. If you don't use these images and want to enable CoMMA to collect NCCL telemetry, see Install CoMMA.
Benefits
The NCCL telemetry that CoMMA collects helps identify performance bottlenecks, specifically stragglers, in GPU communication. CoMMA collects fine-grained data, such as latency histograms for collective communication operations. A diagnostic service can then process and use this data to pinpoint stragglers.
Using CoMMA to collect telemetry offers the following benefits:
- Low-overhead tracing: CoMMA uses minimal computational resources during active NCCL telemetry collection, making it ideal for performance-sensitive and long-running machine learning workloads like large language model (LLM) training.
- Broader NCCL telemetry scope: CoMMA uses the NCCL profiler plugin API, which collects a broader range of NCCL telemetry than transport-based plugins. Transport-based plugins primarily collect telemetry about the underlying network transport, including data transfers over network hardware and network protocols. The profiler plugin collects telemetry for NCCL's communication operations, including the timing of collective communications, proxy operations, and data transfers.
Understand how CoMMA works
During application runtime, NCCL automatically loads the CoMMA libraries
that are installed in the location specified by the LD_LIBRARY_PATH
environment variable. CoMMA then collects NCCL telemetry, which other Google
services can use. You can optionally export this data to your local
file system.
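As a rough illustration of that loading step, the following C sketch uses dlopen and dlsym the way a host process can resolve a plugin library found on LD_LIBRARY_PATH. The library name libexample-profiler.so and the symbol exampleProfiler are assumptions made for this example only; they are not the actual names that NCCL or CoMMA use.

```c
/*
 * Minimal sketch of runtime plugin loading. The library name
 * "libexample-profiler.so" and the symbol "exampleProfiler" are illustrative
 * assumptions; NCCL performs an equivalent lookup internally for libraries it
 * finds through LD_LIBRARY_PATH.
 */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  /* dlopen searches LD_LIBRARY_PATH when given a bare library name. */
  void* handle = dlopen("libexample-profiler.so", RTLD_NOW | RTLD_LOCAL);
  if (handle == NULL) {
    fprintf(stderr, "plugin not found: %s\n", dlerror());
    return 1;
  }

  /* Resolve the exported callback table by its symbol name. */
  void* profiler = dlsym(handle, "exampleProfiler");
  if (profiler == NULL) {
    fprintf(stderr, "symbol not found: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }

  printf("profiler plugin loaded\n");
  dlclose(handle);
  return 0;
}
```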
What's next
- Learn how to enable, disable, and configure CoMMA.
- Learn how to troubleshoot issues with CoMMA.