This guide describes how to enable, disable, and manage the Collective Communication Analyzer (CoMMA) library. CoMMA collects NCCL telemetry for Google Cloud services. For more information about CoMMA, see Collective Communication Analyzer (CoMMA).
Enable CoMMA
CoMMA is pre-installed and enabled if you use images that contain the NCCL gIB plugin. For a list of these images, see Images that have CoMMA enabled.
Installation options
If you don't use any of these images and want to install CoMMA, use one of the following methods.
Installation method | Supported machine types |
---|---|
NCCL Google Infrastructure Bundle (gIB) image (Recommended for newer machine types) | A4X, A4 High, and A3 Ultra |
CoMMA installer image | A4X, A4 High, and A3 Ultra |
Build from source (Required for older machine types) | A3 Mega, A3 High, A3 Edge, A2 Ultra, A2 Standard, and N1 with attached GPUs |
Install CoMMA
To install CoMMA, select one of the following options:
NCCL gIB image
To install CoMMA by using the NCCL gIB image, run the following command.
docker run --rm --name nccl-gib-installer --volume /usr/local/gib:/var/lib/gib \
  us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib install \
  --install-nccl
CoMMA installer image
You can get CoMMA binaries in a standalone Docker image. You can use the CoMMA Docker image, us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/comma-installer, as initContainers to install CoMMA binaries into your workload container. The container stores the binaries in the /artifacts directory.
To use the CoMMA installer image, complete the following steps:
Install CoMMA into your workload by adding the following snippet to your initContainers:
- name: profiler-plugin-installer
  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/comma-installer:latest
  imagePullPolicy: Always
  volumeMounts:
  - name: nccl-plugin-volume
    mountPath: /usr/local/nccl-plugin
  resources:
    requests:
      cpu: 150m
  command:
  - /bin/sh
  - -c
  - |
    set -ex
    rm -rf /usr/local/nccl-plugin/lib64/libnccl-profiler.so
The YAML configuration snippet specifies a container for installing a profiler plugin. The snippet specifies the Docker image, its pull policy, and a volume mount for the plugin. The container requests a small amount of CPU (150m).
The command section is central to the configuration. The section runs a shell script to remove any existing profiler library. It then copies the new profiler library into the designated plugin directory. This ensures that the correct version of the profiler plugin is installed and ready for use.
Build from source
To build the CoMMA library from source, install the following software:
- Rust programming language, including the compiler and Cargo.
- libclang-dev, which bindgen requires.
- CMake version 3.10 or later.
To build from source, complete the following steps:
Clone the repository and its submodules.
git clone --recurse-submodules https://github.com/google/CoMMA
Compile the binaries by using Cargo.
cargo build --release
Cargo saves the binary in target/release/libnccl_profiler.so.
Enable NCCL to load the CoMMA libraries by using one of the following methods:
- Copy the compiled libnccl_profiler.so to a directory in your LD_LIBRARY_PATH. Rename it to libnccl-profiler.so (use a hyphen instead of an underscore).
- Alternatively, set the NCCL_PROFILER_PLUGIN environment variable to specify the path of the .so file.
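The copy-and-rename method can be sketched as a small shell function. This is a minimal sketch, not part of CoMMA: the function name install_comma_plugin and its two arguments are hypothetical, and the destination directory is whatever location you choose to put on LD_LIBRARY_PATH.

```shell
# Sketch: copy the built plugin under the file name that NCCL searches for.
# install_comma_plugin and its arguments are hypothetical names.
install_comma_plugin() {
  src="$1"        # for example, target/release/libnccl_profiler.so
  dest_dir="$2"   # a directory that is (or will be) on LD_LIBRARY_PATH
  mkdir -p "$dest_dir" || return 1
  # Rename on copy: underscore in the build output, hyphen in the loaded name.
  cp "$src" "$dest_dir/libnccl-profiler.so" || return 1
  # Prepend the directory to LD_LIBRARY_PATH for the current shell session.
  export LD_LIBRARY_PATH="$dest_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, after the build you might run install_comma_plugin target/release/libnccl_profiler.so "$HOME/nccl-plugins", where the destination directory is an assumption. To use the second method instead, export NCCL_PROFILER_PLUGIN with the full path of the built .so file.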
Verify installation or enablement
To verify that NCCL loads the CoMMA libraries, review the NCCL logs:
- Enable NCCL debug logging. Enable logging by setting the NCCL_DEBUG=INFO environment variable. You can also specify a more detailed debug level. For more debug options, see the NCCL_DEBUG section in the NVIDIA documentation.
- Specify the INIT subsystem for debugging. Specify INIT by setting the NCCL_DEBUG_SUBSYS=INIT environment variable. You can also specify other subsystems. For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
- Find a line in the NCCL log that is similar to the following:
NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN
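The verification steps can be scripted roughly as follows. This is a sketch under assumptions: comma_plugin_logged is a hypothetical helper, and it assumes that you redirect your workload's NCCL output to a log file of your choosing.

```shell
# Settings from the verification steps; export them before launching the workload.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT

# comma_plugin_logged is a hypothetical helper: it succeeds when the given
# log file contains a PROFILER/Plugin line like the one shown in this section.
comma_plugin_logged() {
  grep -q 'PROFILER/Plugin' "$1"
}
```

After the workload runs with its output captured in a file such as nccl.log (a hypothetical name), comma_plugin_logged nccl.log returns success if a matching line is present.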
Disable CoMMA
If CoMMA is already installed, prevent it from collecting NCCL telemetry by setting the CoMMA environment variable NCCL_TELEMETRY_MODE=0 before running your workloads. To set CoMMA environment variables, see Set environment variables.
To re-enable CoMMA after disabling it, follow these steps:
Set the NCCL_TELEMETRY_MODE environment variable to a non-zero value; for example, to use the default mode, specify NCCL_TELEMETRY_MODE=3.
To review the full list of options, see NCCL_TELEMETRY_MODE in the Configuration options table.
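As an illustration, the disable and re-enable steps might be wrapped in two shell helpers. The names comma_disable and comma_enable are hypothetical; only the NCCL_TELEMETRY_MODE variable and the values 0 and 3 come from this guide.

```shell
# Hypothetical helpers; call them in the shell that launches the workload.
comma_disable() {
  export NCCL_TELEMETRY_MODE=0       # 0 stops telemetry collection
}

comma_enable() {
  # Defaults to 3, the default mode; pass another non-zero mode to override.
  export NCCL_TELEMETRY_MODE="${1:-3}"
}
```

Because the workload reads the variable during NCCL initialization, call the helper before the workload starts.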
Configure and view CoMMA NCCL telemetry
If CoMMA is enabled in your environment, you can configure the type of telemetry data that it collects by setting the level of data granularity. This section explains how to set data granularity and the available options.
You can also review the data that CoMMA collects to verify that it aligns with your organization's security policies or to analyze it with your own NCCL telemetry analysis tools. To do so, export the raw data to a local file.
Set data granularity
CoMMA collects NCCL telemetry at different granularity levels. Configure the granularity level by using environment variables. To set CoMMA environment variables, see Set environment variables.
- Default behavior: By default, CoMMA tracks NCCL operations, including both collective and peer-to-peer, the metadata of those operations, and completion times. It uses the following environment variables:
NCCL_PROFILER_TRACK_NCCLOP=true
NCCL_PROFILER_AGGREGATE_STEPS=true
NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP=true
- To enable more granular levels of data collection, set the following environment variables:
  - Track completion time for proxy operations by setting NCCL_PROFILER_TRACK_PROXYOP=true.
  - Track the time spent on each networking I/O operation by setting NCCL_PROFILER_TRACK_STEPS=true. This setting provides the highest level of granularity.
To review the full list of environment variables, see Configuration options.
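For example, a launch script might opt in to the most granular collection as sketched here. The first three exports restate the defaults and are shown only for completeness; the last two are the opt-in settings described in this section.

```shell
# Defaults listed in this section (already in effect when unset).
export NCCL_PROFILER_TRACK_NCCLOP=true
export NCCL_PROFILER_AGGREGATE_STEPS=true
export NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP=true

# Opt in to finer-grained data.
export NCCL_PROFILER_TRACK_PROXYOP=true   # completion time for proxy operations
export NCCL_PROFILER_TRACK_STEPS=true     # time spent on each networking I/O operation
```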
Export data to a local file
Export the raw data to a local file to view it. To export the data to a local file and view the output, follow these steps:
- Set the NCCL_TELEMETRY_MODE environment variable to either 1 or 4. To learn about the NCCL_TELEMETRY_MODE environment variable, see Configuration options.
- Set one of the following export paths:
  - Set NCCL_PROFILER_LATENCY_FILE=PATH to export detailed event traces to a local file. Replace PATH with a path such as /tmp/latency-%p.txt.
  - Set NCCL_PROFILER_SUMMARY_FILE=PATH to export aggregated summary statistics. Replace PATH with a path such as /tmp/summary-%p.txt.
  The system replaces %p with the process ID.
- Review the output. The raw output is a JSON file.
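The export steps might look like the following in a launch script. This is a sketch: it picks mode 1 and the latency trace path from the examples in this section, and expand_pid_template is a hypothetical helper that mimics the %p substitution so that you can work out which file a given process wrote.

```shell
# Mode 1 or 4 is required for local export, per the steps in this section.
export NCCL_TELEMETRY_MODE=1
# Export detailed event traces; single quotes keep %p literal.
export NCCL_PROFILER_LATENCY_FILE='/tmp/latency-%p.txt'
# Alternative: export aggregated summary statistics instead.
# export NCCL_PROFILER_SUMMARY_FILE='/tmp/summary-%p.txt'

# expand_pid_template TEMPLATE PID
# Hypothetical helper: prints TEMPLATE with every %p replaced by PID.
expand_pid_template() {
  printf '%s\n' "$1" | sed "s/%p/$2/g"
}
```

For a workload whose process ID is known, expand_pid_template "$NCCL_PROFILER_LATENCY_FILE" PID prints the expected file path, which you can then open with any JSON viewer.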
Configuration options
The following sections summarize all the environment variables that you can configure for CoMMA. They also explain how to set any environment variable.
Set CoMMA environment variables
To change a CoMMA setting from its default value, set the corresponding environment variable. You can set environment variables on the command line for the instance or add them to a startup script. If you set the environment variables at the command line, the values persist only for the current session. To make the environment variables permanent, place them into the ~/.bashrc file, the ~/.profile file, or whichever startup file your operating system uses. For more information, review your operating system's documentation.
You need to set CoMMA environment variables before your workload starts as the workload reads the variables during NCCL initialization. You can set environment variables as follows:
export ENVIRONMENT_VARIABLE=VALUE
Replace the following:
- ENVIRONMENT_VARIABLE: the environment variable you want to set; for example, NCCL_TELEMETRY_MODE.
- VALUE: the value for the environment variable; for example, 0.
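To make a setting permanent, one option is to append the export line to a startup file. The following is a minimal sketch: persist_env_var is a hypothetical helper, and the default target of ~/.bashrc is an assumption that depends on your shell and operating system.

```shell
# persist_env_var NAME VALUE [FILE]
# Hypothetical helper: appends "export NAME=VALUE" to FILE (default ~/.bashrc)
# unless that exact line is already present.
persist_env_var() {
  name="$1"
  value="$2"
  file="${3:-$HOME/.bashrc}"
  line="export $name=$value"
  touch "$file" || return 1
  # -x matches whole lines only; -F treats the line as a fixed string.
  grep -qxF "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, persist_env_var NCCL_TELEMETRY_MODE 0 adds the line once, and repeated calls leave the file unchanged. New sessions pick up the value; the current session still needs its own export.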
CoMMA environment variables
This section lists the environment variables that you can set for CoMMA and their default values.
Name | Description | Default |
---|---|---|
NCCL_PROFILER_AGGREGATE_STEPS | Enables (true) or disables (false) aggregating network chunk operations. | true |
NCCL_PROFILER_GPUVIZ_LIB | Specifies the path to libGPUViz.so, a library that uploads NCCL telemetry to Google Cloud services. This library wraps the agent communication API. The agent communication API is the interface that agents, such as processes running within your guest operating system, use to initiate secure and reliable connections with Google Cloud services. If you use a NCCL gIB image as an installer or use any of the images that bundle the NCCL gIB plugin, you don't need to set this environment variable. | |
NCCL_PROFILER_LATENCY_FILE | Specifies the path template for the latency trace file. For example, /tmp/latency-%p.txt. The system replaces %p in the name with the process ID (pid). To disable file-based export, unset this variable. | |
NCCL_PROFILER_PLUGIN | Specifies the path to the profiler plugin binary. If you don't specify this setting, NCCL looks for libnccl-profiler.so in the LD_LIBRARY_PATH. | |
NCCL_PROFILER_SUMMARY_FILE | Specifies the path for the aggregated summary file. For example, /tmp/summary-%p.txt. The system replaces %p in the name with the process ID (pid). To disable file-based export, unset this variable. | |
NCCL_PROFILER_SUMMARY_INTERVAL | Specifies the interval for summary reporting. For example, 10s, 1m. Supports d, h, m, s, ms, us, ns. | 1m |
NCCL_PROFILER_TRACK_INTERPROCESS_PROXYOP | Enables (true) or disables (false) monitoring inter-process NCCL proxy operations. | true |
NCCL_PROFILER_TRACK_NCCLOP | Enables (true) or disables (false) tracking and reporting for NCCL operations, including both collective and point-to-point communications. | true |
NCCL_PROFILER_TRACK_PROXYOP | Enables (true) or disables (false) proxy operation tracking and reporting. | false |
NCCL_PROFILER_TRACK_STEPS | Enables (true) or disables (false) tracking and reporting network chunk operations. | false |
NCCL_TELEMETRY_MODE | Controls the export location of the NCCL telemetry data. The options include the following: | 3 |
What's next
- Learn how to troubleshoot issues with CoMMA.