Version #391

Open
sean-smith opened this issue Apr 15, 2024 · 3 comments

sean-smith commented Apr 15, 2024

Need some advice here on how to get the version of aws-ofi-nccl across different runtime environments.

Typically I run the following to get the version:

$ strings /opt/aws-ofi-nccl/lib/libnccl-net.so | grep 'NET/OFI Initializing aws-ofi-nccl'
NET/OFI Initializing aws-ofi-nccl 1.7.3-aws

However, the same command, run inside a container, needs a slightly different path:

strings /opt/aws-ofi-nccl/install/lib/libnccl-net.so | grep 'NET/OFI Initializing aws-ofi-nccl'

i.e.

$ docker run --gpus all megatron-training strings /opt/aws-ofi-nccl/install/lib/libnccl-net.so | grep 'NET/OFI Initializing aws-ofi-nccl'
NET/OFI Initializing aws-ofi-nccl 1.7.4-aws

Is there any easier way to get the version?
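
For scripting, the `strings | grep` approach above can be sketched in Python so that it tries several install locations in turn. This is only an illustration: the two paths are the ones quoted in this issue, the helper names are hypothetical, and the regex assumes the banner is embedded as a literal string, as the `strings` output suggests.

```python
import re
from pathlib import Path
from typing import Optional, Sequence, Tuple

# Banner as it appears in the `strings` output quoted in this issue, e.g.
# "NET/OFI Initializing aws-ofi-nccl 1.7.3-aws". Assumption: other builds
# embed the banner in the same literal form.
BANNER = re.compile(rb"NET/OFI Initializing aws-ofi-nccl ([0-9A-Za-z.+_-]+)")

# The two locations mentioned in this issue; other prefixes are possible.
CANDIDATES = (
    "/opt/aws-ofi-nccl/lib/libnccl-net.so",
    "/opt/aws-ofi-nccl/install/lib/libnccl-net.so",
)


def plugin_version_from_bytes(data: bytes) -> Optional[str]:
    """Return the version embedded in the banner string, or None."""
    match = BANNER.search(data)
    return match.group(1).decode("ascii") if match else None


def plugin_version(path: str) -> Optional[str]:
    """Read a shared object from disk and extract the version, if any."""
    lib = Path(path)
    if not lib.is_file():
        return None
    return plugin_version_from_bytes(lib.read_bytes())


def first_plugin_version(
    paths: Sequence[str] = CANDIDATES,
) -> Optional[Tuple[str, str]]:
    """Return (path, version) for the first candidate that yields a version."""
    for path in paths:
        version = plugin_version(path)
        if version is not None:
            return path, version
    return None


if __name__ == "__main__":
    print(first_plugin_version())
```

Like the shell command, this reports the version of whichever file is found on disk, not necessarily the one an application loads.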

@rajachan
Member

The path is a function of the installation prefix chosen when building the aws-ofi-nccl plugin (--prefix passed to the configure script), so it depends on how the container you use was built. The plugin is a library only, and there is no binary or utility that will report version numbers for you. Looking at the shared-object strings as you do tells you the version of that specific build; a deterministic way to tell which version of the plugin an application actually loaded at runtime is to look for that string in the NCCL_DEBUG=INFO logs.
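
A minimal sketch of the log-based approach, assuming the application was run with NCCL_DEBUG=INFO and its output was captured to a file. The function name and the file name `nccl.log` are hypothetical; only the banner text comes from this issue, and the prefix NCCL puts in front of it is not relied upon.

```python
import re
from typing import Iterable, Set

# Same banner text as in the `strings` output quoted in this issue.
BANNER = re.compile(r"NET/OFI Initializing aws-ofi-nccl (\S+)")


def loaded_plugin_versions(log_lines: Iterable[str]) -> Set[str]:
    """Collect every plugin version announced in NCCL_DEBUG=INFO output."""
    versions = set()
    for line in log_lines:
        match = BANNER.search(line)
        if match:
            versions.add(match.group(1))
    return versions


if __name__ == "__main__":
    # Hypothetical capture file, e.g. produced by redirecting the
    # application's output while NCCL_DEBUG=INFO is set.
    with open("nccl.log", encoding="utf-8", errors="replace") as log:
        print(loaded_plugin_versions(log))
```

More than one version in the result would suggest that different ranks or nodes loaded different builds.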

@sean-smith
Author

sean-smith commented Apr 16, 2024

Thanks Raghu. Perhaps an example will make it clearer: I'm trying to write a script to help debug EFA/NCCL performance issues. That script grabs versions like:

$ srun python3 efa-versions.py
+--------------------------+--------------+
|  Package                 |  Version     |
+--------------------------+--------------+
|  EFA installer version:  |  1.26.1      |
+--------------------------+--------------+
|  NCCL Version            |  2.18.5      |
+--------------------------+--------------+
|  Libfabric Version       |  1.18.2      |
+--------------------------+--------------+
|  AWS OFI NCCL version:   |  1.7.3-aws   |
+--------------------------+--------------+
|  Nvidia Driver           |  535.104.12  |
+--------------------------+--------------+
|  CUDA Version:           |  12.1.105    |
+--------------------------+--------------+

I'm looking for a way to grab these versions regardless of how the libraries are installed, as we often don't get to set the install --prefix flag. A utility similar to fi_info --version for this library would be extremely helpful.
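
Until such a utility exists, one prefix-independent approximation is to look for libnccl-net.so in the directories listed in LD_LIBRARY_PATH and then fall back to known prefixes. The sketch below does only that: the function name is hypothetical, the fallback directories are the two from this issue, and it does not replicate the full dynamic-loader search (ld.so cache, RPATH, and so on).

```python
import os
from pathlib import Path
from typing import List, Mapping, Sequence

LIB_NAME = "libnccl-net.so"

# Fallback directories taken from the two paths quoted in this issue.
KNOWN_LIB_DIRS = ("/opt/aws-ofi-nccl/lib", "/opt/aws-ofi-nccl/install/lib")


def candidate_plugin_paths(
    env: Mapping[str, str] = os.environ,
    extra_dirs: Sequence[str] = KNOWN_LIB_DIRS,
) -> List[str]:
    """List existing libnccl-net.so files, LD_LIBRARY_PATH entries first."""
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    dirs.extend(extra_dirs)

    found: List[str] = []
    seen = set()
    for directory in dirs:
        candidate = Path(directory) / LIB_NAME
        if candidate.is_file():
            # Resolve symlinks so the same file is not reported twice.
            real = str(candidate.resolve())
            if real not in seen:
                seen.add(real)
                found.append(real)
    return found


if __name__ == "__main__":
    for path in candidate_plugin_paths():
        print(path)
```

Each returned path can then be scanned for the version banner, as with the `strings` command earlier in the thread.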

@aws-nslick
Contributor

I'd really like to see us package this as a manylinux wheel, and potentially we could ship a small bit of python bindings that loads the library and is able to provide a few helpers like:

  • nccl_net_ofi.version()
  • nccl_net_ofi.libfabric.version()
  • nccl_net_ofi.cuda.version()
  • nccl_net_ofi.enable_tuner()/nccl_net_ofi.disable_tuner()
  • nccl_net_ofi.set_recommended_nccl_params()
  • nccl_net_ofi.update_check()

A few things standing in the way of this:

  1. Ideally, we would statically link hwloc, and maybe also libfabric. I'm not even sure hwloc can be statically linked. (It should be feasible for us to just drop hwloc as a dependency and parse the topology out of /sys ourselves, especially if we port to C++ and can use std::filesystem.)
  2. We would need to take care to detect whether torch/etc. has already initialized NCCL and warn if it has, so that we have an opportunity to change our environment parameters and/or NCCL parameters before NCCL initializes. This is less of a problem for any specific tunables that we maintain (we can just write bindings that directly modify the parameters instead of channeling them through os.environ), but definitely a problem for any of the NCCL envs that we would want to configure through this.
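
As a rough illustration of the detection problem in item 2, a binding could at least check whether any libnccl* shared object is already mapped into the current process. This is a Linux-only sketch with hypothetical function names, and it is a coarse proxy: a mapped library is not the same as an initialized communicator, and it would miss NCCL linked statically into another library.

```python
from typing import Iterable, List


def mapped_nccl_libraries(maps_lines: Iterable[str]) -> List[str]:
    """Return paths of mapped files whose name starts with 'libnccl'."""
    libs: List[str] = []
    for line in maps_lines:
        # /proc/<pid>/maps: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if name.startswith("libnccl") and path not in libs:
            libs.append(path)
    return libs


def nccl_already_loaded(maps_path: str = "/proc/self/maps") -> bool:
    """Best-effort check; returns False where /proc is unavailable."""
    try:
        with open(maps_path, encoding="utf-8", errors="replace") as maps:
            return bool(mapped_nccl_libraries(maps))
    except OSError:
        return False


if __name__ == "__main__":
    if nccl_already_loaded():
        print("warning: an NCCL library is already loaded in this process")
```

Note that the prefix match also catches libnccl-net.so itself, which would indicate that the plugin has already been loaded.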

Thoughts?
