This page shows you how to resolve common issues that you might encounter when using the Collective Communication Analyzer (CoMMA). CoMMA is a library that collects telemetry data for Google Cloud services. For more information, see Collective Communication Analyzer (CoMMA).
Troubleshoot CoMMA loading issues
CoMMA might not load correctly. To verify that the binaries load correctly, complete these steps:
- Enable NCCL debug logging. To enable logging, set the environment variable NCCL_DEBUG=INFO. You might also use a more detailed debug level. For options, see the NCCL_DEBUG section in the NVIDIA documentation.
- Specify the INIT subsystem for debugging. To specify INIT, set NCCL_DEBUG_SUBSYS=INIT. You might also add other subsystems. For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
Look for a line in the NCCL log that is similar to the following:
NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN
If the NCCL_PROFILER_PLUGIN environment variable is unset, NCCL might attempt to load the libnccl-profiler.so binary from the path specified in the LD_LIBRARY_PATH environment variable.
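The verification steps above can be sketched in Python. This is a minimal illustration, not part of CoMMA: the helper names debug_env and find_plugin_path are hypothetical, and the regular expression is derived only from the example log line shown above, so real log lines might differ in prefix or wording.

```python
import os
import re

# Pattern taken from the example log line above. Real NCCL log lines may
# carry extra prefixes (host, PID, rank), so the pattern is searched for
# anywhere in the line rather than anchored at the start.
PLUGIN_LINE = re.compile(
    r"NCCL INFO PROFILER/Plugin: Plugin name set by env to (\S+)"
)


def debug_env(base_env=None):
    """Return a copy of the environment with NCCL INIT debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT"
    return env


def find_plugin_path(log_text):
    """Return the plugin path reported in the log, or None if no line matches."""
    for line in log_text.splitlines():
        match = PLUGIN_LINE.search(line)
        if match:
            return match.group(1)
    return None
```

One way to use this is to pass debug_env() as the env argument of subprocess.run when launching your workload, capture its output, and then call find_plugin_path on the captured text. A return value of None suggests that no matching line was logged, which is a prompt to check the settings described in the rest of this section.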
To resolve this issue, consider the following solutions:
- Verify that the plugin shared library (libnccl-profiler.so) is correctly named.
- Check that it is located in a directory specified in the LD_LIBRARY_PATH environment variable. Alternatively, check that the NCCL_PROFILER_PLUGIN environment variable points directly to the location of the libnccl-profiler.so binary.
- Check that your NCCL version is 2.23 or later, as the NCCL profiler API requires this version.
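The checks above can be approximated with a short Python sketch. The function names locate_plugin and meets_minimum_version are hypothetical, and the sketch makes two simplifying assumptions: it only inspects NCCL_PROFILER_PLUGIN when that variable holds a file path, and it only searches the directories listed in LD_LIBRARY_PATH, whereas the dynamic loader can also consult other locations. How you obtain the NCCL version string is left to you.

```python
import os

PLUGIN_NAME = "libnccl-profiler.so"


def locate_plugin(env):
    """Return the plugin files that exist on disk for the given environment."""
    found = []
    # Case 1: NCCL_PROFILER_PLUGIN points directly at the binary.
    explicit = env.get("NCCL_PROFILER_PLUGIN")
    if explicit and os.path.isfile(explicit):
        found.append(explicit)
    # Case 2: the binary sits in a directory listed in LD_LIBRARY_PATH.
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, PLUGIN_NAME)
        if directory and os.path.isfile(candidate):
            found.append(candidate)
    return found


def meets_minimum_version(version, minimum=(2, 23)):
    """Compare the major.minor part of a dotted version string to the minimum."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= minimum
```

For example, locate_plugin(os.environ) returning an empty list suggests that the library is misnamed or is not in any of the searched locations.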
Troubleshoot missing output files
If you configured your environment to send data collected by CoMMA to a local file, but the output file is missing, check the NCCL logs or application logs for messages that are similar to the following:
Failed to open file
Failed to log <telemetry type> to file
These errors indicate an underlying file system issue, such as a missing directory or insufficient free space. CoMMA ceases to export telemetry to files after these errors occur.
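If the logs are large, a small filter can surface these messages. The following Python sketch is illustrative only: find_export_failures is a hypothetical helper, and its patterns are based on the two example messages above, so the actual wording in your logs might differ.

```python
import re

# Patterns derived from the example messages above; adjust them if your
# logs use different wording.
FAILURE_PATTERNS = (
    re.compile(r"Failed to open file"),
    re.compile(r"Failed to log .+ to file"),
)


def find_export_failures(log_text):
    """Return the log lines that look like file export failures."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]
```

Because CoMMA stops exporting to files after these errors occur, the first matching line is usually the most useful starting point.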
To resolve this issue, consider these solutions:
- Check that the NCCL_PROFILER_LATENCY_FILE or NCCL_PROFILER_SUMMARY_FILE environment variables are set correctly. Provide a valid path and filename template, such as /tmp/latency-%p.txt.
- Check that the process has write permissions to the specified output directory.
- If you modified the NCCL_TELEMETRY_MODE environment variable, check that you set it to a value that enables local file output (for example, 1 or 4).
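The first two checks above can be sketched as a preflight script in Python. The helper check_output_targets and its min_free_bytes parameter are hypothetical. The sketch treats everything before the last path separator as the output directory and does not interpret placeholders such as %p in the filename template.

```python
import os
import shutil

FILE_VARS = ("NCCL_PROFILER_LATENCY_FILE", "NCCL_PROFILER_SUMMARY_FILE")


def check_output_targets(env, min_free_bytes=1):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in FILE_VARS:
        template = env.get(name)
        if not template:
            continue  # variable not set, so there is nothing to check
        # Directory part of the template; a bare filename means the
        # current working directory.
        directory = os.path.dirname(template) or "."
        if not os.path.isdir(directory):
            problems.append(f"{name}: directory {directory} does not exist")
            continue
        if not os.access(directory, os.W_OK):
            problems.append(f"{name}: no write permission on {directory}")
        if shutil.disk_usage(directory).free < min_free_bytes:
            problems.append(
                f"{name}: less than {min_free_bytes} bytes free in {directory}"
            )
    return problems
```

Run check_output_targets(os.environ) as the same user and on the same host as the workload, because permissions and free space can differ between environments.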
Troubleshoot unexpected data or missing events
CoMMA might capture unexpected data or miss expected events.
To resolve this issue, check that the required level of granularity is set.