Plugin support #263

Open
wants to merge 40 commits into
base: main

TaekyungHeo
Member

@TaekyungHeo TaekyungHeo commented Oct 14, 2024

Summary

This PR introduces plugins to CloudAI. Plugins are tests that run before or after each test in a test scenario. They are defined once, globally, within a test scenario and are executed automatically for each test. There are two types of plugins: prologues, which run before each test, and epilogues, which run after it. Multiple prologues and epilogues can be specified in each scenario.

An example of how plugins are defined within a test scenario:

name = "nccl-test"

prologue = "nccl_test_prologue"
epilogue = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
num_nodes = "2"
time_limit = "00:20:00"

[[Tests]]
id = "Tests.2"
test_name = "nccl_test_all_gather"
num_nodes = "2"
time_limit = "00:20:00"
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.1"

The prologue and epilogue fields each name a plugin, and that name is used to look up the corresponding plugin file. A plugin file is a separate test scenario file, as shown below:

name = "nccl_test_prologue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
time_limit = "00:20:00"
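
As a rough illustration of the lookup described above, the sketch below resolves the `prologue` and `epilogue` fields of a scenario against a set of plugin scenarios keyed by name. It is not CloudAI's implementation: `resolve_plugin`, the in-memory dictionaries, and the error behavior are all assumptions made for the example, and the epilogue's contents are inferred from the test-plan output.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for parsed scenario TOML files.
main_scenario = {
    "name": "nccl-test",
    "prologue": "nccl_test_prologue",
    "epilogue": "nccl_test_epilogue",
    "Tests": [{"id": "Tests.1", "test_name": "nccl_test_all_reduce"}],
}

# Plugin scenarios keyed by their "name" field (assumed lookup key).
plugin_scenarios = {
    "nccl_test_prologue": {
        "name": "nccl_test_prologue",
        "Tests": [{"id": "Tests.1", "test_name": "nccl_test_all_reduce"}],
    },
    "nccl_test_epilogue": {
        "name": "nccl_test_epilogue",
        "Tests": [{"id": "Tests.1", "test_name": "nccl_test_all_gather"}],
    },
}


def resolve_plugin(scenario: dict, key: str, plugins: dict) -> Optional[dict]:
    """Return the plugin scenario named by scenario[key], or None if the field is absent."""
    name = scenario.get(key)
    if name is None:
        return None  # the scenario does not define this plugin
    if name not in plugins:
        # Assumed behavior: a dangling plugin name is treated as a configuration error.
        raise KeyError(f"{key} '{name}' not found among plugin scenarios")
    return plugins[name]


prologue = resolve_plugin(main_scenario, "prologue", plugin_scenarios)
epilogue = resolve_plugin(main_scenario, "epilogue", plugin_scenarios)
```

A scenario without a `prologue` or `epilogue` field simply resolves to `None` here, so a runner built on this sketch could skip the corresponding stage.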

If any test in the prologue fails, neither the main test nor the epilogue tests run. In other words, the main test and the epilogue run only when the prologue succeeds. Tests in plugins have time limits, just as tests in the main scenario do. Output files should be stored in the test's output directory, in a subdirectory called "prologue" or "epilogue", with one directory per plugin test (for example, prologue/nccl_test_all_reduce/). Plugins are not supported for NeMo 1.0 (NeMo launcher).
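
The conditional ordering can be sketched in a few lines. This is an illustration under stated assumptions, not the runner added by this PR: `run_with_plugins` and the `run` callback are hypothetical names, and the sketch runs the epilogue regardless of the main test's result because the description does not say what happens in that case.

```python
from typing import Callable, List


def run_with_plugins(
    prologue: List[str],
    main_test: str,
    epilogue: List[str],
    run: Callable[[str, str], bool],
) -> List[str]:
    """Run prologue tests, the main test, then epilogue tests.

    `run(stage, test_name)` is a hypothetical callback returning True on success.
    Returns the list of executed steps as "stage:test_name" strings.
    """
    executed = []
    for name in prologue:
        executed.append(f"prologue:{name}")
        if not run("prologue", name):
            # A failed prologue test skips the main test and the epilogue.
            return executed
    executed.append(f"main:{main_test}")
    run("main", main_test)
    # Assumption: the epilogue runs whether or not the main test succeeded.
    for name in epilogue:
        executed.append(f"epilogue:{name}")
        run("epilogue", name)
    return executed


# Usage: every step succeeds, so all three stages execute in order.
steps = run_with_plugins(
    ["nccl_test_all_reduce"],
    "nccl_test_all_reduce",
    ["nccl_test_all_gather"],
    lambda stage, name: True,
)
```

With a callback that returns False for a prologue test, the returned list stops at that prologue step, which mirrors the rule that the main test and epilogue are skipped.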

Note

Test Plan

  1. CI passes
  2. Manual run
    2.1 Success
$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1                                                                            
[INFO] Scheduler: slurm                                                                                 
[INFO] Test Scenario Name: nccl-test                                                                    
[INFO] Checking if test templates are installed.                                                        
[INFO] Test Scenario: nccl-test                                                                         

Section Name: Tests.1                                                                                   
  Test Name: nccl_test_all_reduce                                                                       
  Description: all_reduce                                                                               
  No dependencies                                                                                       
[INFO] Initializing Runner [RUN] mode                                                                   
[INFO] Creating SlurmRunner                                                                             
[INFO] Starting test scenario execution.                                                                
[INFO] Starting test: Tests.1                                                                           
[INFO] Running test: Tests.1                                                                            
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
$ cd /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0
$ ls
cloudai_sbatch_script.sh  epilogue  prologue  stderr.txt  stdout.txt

$ ls prologue/nccl_test_all_reduce/
stderr.txt  stdout.txt

$ ls epilogue/nccl_test_all_gather/
stderr.txt  stdout.txt
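
The listing above suggests a layout of `<test output dir>/<stage>/<plugin test name>/`. A minimal sketch of building such a path follows; `plugin_output_dir` is a hypothetical helper introduced for illustration, and the layout is inferred from the `ls` output rather than taken from the CloudAI code.

```python
from pathlib import PurePosixPath


def plugin_output_dir(test_output_dir: str, stage: str, test_name: str) -> PurePosixPath:
    """Build the output directory for one plugin test under a main test's output directory."""
    if stage not in ("prologue", "epilogue"):
        raise ValueError(f"unknown plugin stage: {stage!r}")
    # Layout inferred from the listing: <test output dir>/<stage>/<test_name>
    return PurePosixPath(test_output_dir) / stage / test_name


# Usage with a shortened, made-up base directory.
path = plugin_output_dir("/results/Tests.1/0", "prologue", "nccl_test_all_reduce")
```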

2.2 Failure

$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/cloudai_sbatch_script.sh
[ERROR] Job 383928 for test Tests.1 failed: Missing success indicators in /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt: '# Out of bounds values', '# Avg bus bandwidth'. These keywords are expected to be present in stdout.txt, usually towards the end of the file. Please review the NCCL test output and errors in the file. Ensure the NCCL test ran to completion. You can run the generated sbatch script manually and check if /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt is created and contains the expected keywords. If the issue persists, contact the system administrator.
[INFO] Terminating all jobs...
[INFO] All jobs have been killed.
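
The error message above names two keywords that must appear in stdout.txt for an NCCL test to count as successful. The check can be approximated as below; this is a sketch only, with `missing_indicators` as a hypothetical helper and the sample output lines invented for the example. The two keyword strings are copied from the log.

```python
from typing import List, Sequence

# Success indicators quoted in the error message above.
NCCL_SUCCESS_INDICATORS = ("# Out of bounds values", "# Avg bus bandwidth")


def missing_indicators(
    stdout_text: str, indicators: Sequence[str] = NCCL_SUCCESS_INDICATORS
) -> List[str]:
    """Return the indicators that do not appear anywhere in the captured stdout."""
    return [keyword for keyword in indicators if keyword not in stdout_text]


# Usage with made-up output: an empty result means every indicator was found.
sample_ok = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 1.0\n"
sample_truncated = "# nThread 1 nGpus 1\n"
ok_missing = missing_indicators(sample_ok)
truncated_missing = missing_indicators(sample_truncated)
```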

Contributor

@amaslenn amaslenn left a comment


For the existing prologue we use a real NCCL run. In your examples it seems that we are switching to some predefined commands.

  1. How are we going to generate it?
  2. Will that cover our needs? cc @srivatsankrishnan

I do have some code-related notes, but let's leave them for a later discussion.

@TaekyungHeo
Member Author

@amaslenn

How are we going to generate it?

Yes, it is one of the main design choices that we need to make.

@amaslenn
Contributor

Yes, it is one of the main design choices that we need to make.

Can we rely on existing mechanisms? Each plugin will be defined as a regular Test TOML, meaning we can generate a CLI for it for a particular system. This is what we do now and it seems to cover all our needs for this feature.

@TaekyungHeo TaekyungHeo force-pushed the plugin-jan branch 15 times, most recently from 7594c19 to 852fee8 October 24, 2024 19:54
@TaekyungHeo TaekyungHeo mentioned this pull request Oct 29, 2024
@TaekyungHeo TaekyungHeo reopened this Oct 29, 2024
Labels
feature Jan25 Jan'25 release feature

2 participants