Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assistance to broader Tag releases #395

Open
caio-davi opened this issue Apr 19, 2024 · 6 comments
Open

Assistance to broader Tag releases #395

caio-davi opened this issue Apr 19, 2024 · 6 comments

Comments

@caio-davi
Copy link

Hello team,

I noticed you have been updating most of the last tagged releases with the -aws suffix. I think I can assist if you need help testing it on other libfabric providers.
We have some nightly tests running on slingshot fabric, and I could add your test bed to it. This way the releases would have been tested in at least two providers.
Please, let me know if it is something you are interested!

@rajachan
Copy link
Member

Thanks for reaching out, we absolutely want to get back to general releases. As you rightly call out, test coverage across providers is one reason why we have not done that yet and any additional testing is welcome. Are you saying you have nightly tests running with the OFI NCCL plugin and the CXI provider already? If so, are you testing with the master branch?

FWIW, we currently track unit tests, functional tests using just the plugin interfaces, NCCL-level tests using the nccl-tests benchmark suite, and application tests to gain confidence in a release. This wiki page has some more information about what gets tested with each Pull Request: https://github.com/aws/aws-ofi-nccl/wiki/Testing-and-Benchmarking. In addition, we have a more exhaustive nightly test suite that tests changes with real applications.

@caio-davi
Copy link
Author

caio-davi commented Apr 20, 2024

Actually, these nightly tests I mentioned are tests for our internal software stack, and we are using nccl-tests as part of these tests. I apologize for the bad wording.
And no, we are not testing on the master branch. Since they are sanity tests, we use version 1.6.0, as it was the latest non-specific release (although I have been using a few -aws versions for other purposes with no major issues).
Please let me know how I can help with the testing.

@caio-davi
Copy link
Author

Hey @rajachan, just checking in...
Do you think we can move forward with this?

@rajachan
Copy link
Member

@caio-davi Yes! One thing we can do in the near term is extending the PR presubmit checks to include your tests with the cxi provider, to ensure commits that go into the master branch build and work as expected . We can setup any Github hooks needed for your automation to see new PRs, run the tests on your systems, and report back test results. Feel free to reach out to us at [email protected] and we can hash out specifics.

@aws-nslick
Copy link
Contributor

Is there any update on this? One unfortunate aspect of the -aws releases is that there are a handful of tools that fail to parse our versioning correctly because it's no longer semver compatible. It's overall just confusing to everyone to have these vendored releases. (see recent discussion on #597 for an example of someone using 1.4.0, presumably because it's the most recent semver compatible release visible from our master branch)

I'd go as far as to suggest that even if we can't validate nccl-net-ofi with a provider other than efa in CI, we should just put a disclaimer in the readme and still go back to doing general releases.

@rajachan
Copy link
Member

@caio-davi and I talked and we are going to meet next week to find a path forward here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants