Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plugin support #263

Open
wants to merge 40 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 13 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
bf1d9fb
Reorder SlurmCommandGenStrategy methods
TaekyungHeo Oct 22, 2024
38bb8a7
Rename generate_srun_command to _gen_srun_command
TaekyungHeo Oct 22, 2024
5719230
Remove pre-test implementation from JaxToolbox
TaekyungHeo Oct 23, 2024
8e8ee3e
Add prologue and epilogue to _TestScenarioTOML
TaekyungHeo Oct 23, 2024
32d7d93
Add example plugin files
TaekyungHeo Oct 22, 2024
28a38b8
Add plugin option to CLI
TaekyungHeo Oct 23, 2024
bb3275f
Parse plugins and pass them to TestRun
TaekyungHeo Oct 23, 2024
3bc3822
Generate plugin commands
TaekyungHeo Oct 25, 2024
06d4b7d
Remove plugin option from CLI
TaekyungHeo Oct 25, 2024
9174f00
Make plugin directory self-contained
TaekyungHeo Oct 25, 2024
7afa73f
Update Parser to support self-contained plugin directory
TaekyungHeo Oct 25, 2024
3e45b25
Refactor plugin path handling in parse to use a single plugin_path param
TaekyungHeo Oct 28, 2024
5634f74
Remove test_scenario directory from conf/common/plugin/
TaekyungHeo Oct 28, 2024
b798981
Restore comments in src/cloudai/parser.py
TaekyungHeo Oct 29, 2024
da9abbf
Remove unused tmp_path from unit tests
TaekyungHeo Oct 29, 2024
6b11f54
Set prologue and epilogue to None by default
TaekyungHeo Oct 29, 2024
b84c16f
Add validation to ensure 'prologue' and 'epilogue' are not empty strings
TaekyungHeo Oct 29, 2024
8d840cf
Reorder SlurmCommandGenStrategy methods
TaekyungHeo Oct 22, 2024
9725b36
Rename generate_srun_command to _gen_srun_command
TaekyungHeo Oct 22, 2024
5a658c3
Remove pre-test implementation from JaxToolbox
TaekyungHeo Oct 23, 2024
cac5484
Add prologue and epilogue to _TestScenarioTOML
TaekyungHeo Oct 23, 2024
aab165b
Add example plugin files
TaekyungHeo Oct 22, 2024
265e42e
Add plugin option to CLI
TaekyungHeo Oct 23, 2024
d9b2e83
Parse plugins and pass them to TestRun
TaekyungHeo Oct 23, 2024
bfb653f
Generate plugin commands
TaekyungHeo Oct 25, 2024
9f83cd5
Remove plugin option from CLI
TaekyungHeo Oct 25, 2024
f656eee
Make plugin directory self-contained
TaekyungHeo Oct 25, 2024
5af7113
Update Parser to support self-contained plugin directory
TaekyungHeo Oct 25, 2024
b22c2f2
Refactor plugin path handling in parse to use a single plugin_path param
TaekyungHeo Oct 28, 2024
c88fe2e
Remove test_scenario directory from conf/common/plugin/
TaekyungHeo Oct 28, 2024
d66e0fe
Merge branch 'main'
TaekyungHeo Oct 29, 2024
c814ccb
Use Pydantic model to load prologue and epilogue
TaekyungHeo Oct 29, 2024
a6d3efc
Recover acceptance tests with plugin
TaekyungHeo Oct 29, 2024
46cabe9
Clean up unit tests
TaekyungHeo Oct 29, 2024
764e181
Refactor parser to remove explicit plugin_path argument, use default …
TaekyungHeo Oct 29, 2024
d9e8c1f
Refactor gen_exec_command to simplify indentation logic for readability
TaekyungHeo Oct 29, 2024
897a7da
Make prologue and epilogue fields optional
TaekyungHeo Oct 29, 2024
d44023b
Set prologue and epilogue to None by default
TaekyungHeo Oct 29, 2024
12022de
Recover comments
TaekyungHeo Oct 29, 2024
3cf27df
Remove unused tmp_path from unit tests
TaekyungHeo Oct 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions conf/common/plugin/nccl_test_epilogue.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_gather"
time_limit = "00:20:00"
22 changes: 22 additions & 0 deletions conf/common/plugin/nccl_test_prologue.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_prologue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
time_limit = "00:20:00"
33 changes: 33 additions & 0 deletions conf/common/plugin/test/nccl_test_all_gather.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_all_gather"
description = "all_gather"
test_template_name = "NcclTest"

[cmd_args]
"subtest_name" = "all_gather_perf_mpi"
"ngpus" = "1"
"minbytes" = "128"
"maxbytes" = "4G"
"iters" = "100"
"warmup_iters" = "50"

[extra_cmd_args]
"--stepfactor" = "2"

[extra_env_vars]
"NCCL_TEST_SPLIT_MASK" = "0x7"
30 changes: 30 additions & 0 deletions conf/common/plugin/test/nccl_test_all_reduce.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_all_reduce"
description = "all_reduce"
test_template_name = "NcclTest"

[cmd_args]
"subtest_name" = "all_reduce_perf_mpi"
"ngpus" = "1"
"minbytes" = "128"
"maxbytes" = "16G"
"iters" = "100"
"warmup_iters" = "50"

[extra_cmd_args]
"--stepfactor" = "2"
4 changes: 4 additions & 0 deletions conf/common/test_scenario/nccl_test.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@
# limitations under the License.

name = "nccl-test"

prologue = "nccl_test_prologue"
epilogue = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
Expand Down
26 changes: 26 additions & 0 deletions src/cloudai/_core/command_gen_strategy.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,29 @@ def gen_exec_command(self, tr: TestRun) -> str:
str: The generated execution command.
"""
pass

@abstractmethod
def gen_srun_command(self, tr: TestRun) -> str:
"""
Generate the Slurm srun command for a test based on the given parameters.
Args:
tr (TestRun): Contains the test and its run-specific configurations.
Returns:
str: The generated Slurm srun command.
"""
pass

@abstractmethod
def gen_srun_success_check(self, tr: TestRun) -> str:
"""
Generate the Slurm success check command to verify if a test run was successful.
Args:
tr (TestRun): Contains the test and its run-specific configurations.
Returns:
str: The generated command to check the success of the test run.
"""
pass
2 changes: 2 additions & 0 deletions src/cloudai/_core/test_scenario.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@ class TestRun:
weight: float = 0.0
ideal_perf: float = 1.0
dependencies: dict[str, TestDependency] = field(default_factory=dict)
prologue: Optional["TestScenario"] = None
epilogue: Optional["TestScenario"] = None

def __hash__(self) -> int:
return hash(self.name + self.test.name + str(self.iterations) + str(self.current_iteration))
Expand Down
25 changes: 22 additions & 3 deletions src/cloudai/_core/test_scenario_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ class _TestScenarioTOML(BaseModel):
name: str
job_status_check: bool = True
tests: list[_TestRunTOML] = Field(alias="Tests", min_length=1)
prologue: str = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be optional and I suggest you add an automatic check that this field is not an empty string, see how it is done for tests field.

epilogue: str = ""

@model_validator(mode="after")
def check_no_self_dependency(self):
Expand Down Expand Up @@ -99,9 +101,10 @@ class TestScenarioParser:

__test__ = False

def __init__(self, file_path: Path, test_mapping: Dict[str, Test]) -> None:
def __init__(self, file_path: Path, test_mapping: Dict[str, Test], plugin_mapping: Dict[str, TestScenario]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, I do not agree that a Plugin is equal to TestScenario: it doesn't support dependencies and num_nodes, there could be other differences as well.

To make it right, we should have another Pydantic model to hold Plugin definition and TestScenario would inherit from it. This will provide automatic verification and keep objects typed.

This is not critical for this PR, but please keep that in mind in case of further development of this functionality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss.

cc: @srinivas212 , @srivatsankrishnan .

self.file_path = file_path
self.test_mapping = test_mapping
self.plugin_mapping = plugin_mapping

def parse(self) -> TestScenario:
"""
Expand Down Expand Up @@ -136,8 +139,14 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario:
total_weight = sum(tr.weight for tr in ts_model.tests)
normalized_weight = 0 if total_weight == 0 else 100 / total_weight

prologue_name = data.get("prologue", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should use actual object:

        prologue, epilogue = None, None
        if ts_model.prologue:
            prologue = self.plugin_mapping.get(ts_model.prologue)
        if ts_model.epilogue:
            epilogue = self.plugin_mapping.get(ts_model.epilogue)

epilogue_name = data.get("epilogue", "")

prologue = self.plugin_mapping.get(prologue_name, None) if prologue_name else None
epilogue = self.plugin_mapping.get(epilogue_name, None) if epilogue_name else None

testruns_by_id: dict[str, TestRun] = {
tr.id: self._create_section_test_run(tr, normalized_weight) for tr in ts_model.tests
tr.id: self._create_section_test_run(tr, normalized_weight, prologue, epilogue) for tr in ts_model.tests
}

tests_data: dict[str, _TestRunTOML] = {tr.id: tr for tr in ts_model.tests}
Expand All @@ -153,13 +162,21 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario:
job_status_check=ts_model.job_status_check,
)

def _create_section_test_run(self, test_info: _TestRunTOML, normalized_weight: float) -> TestRun:
def _create_section_test_run(
self,
test_info: _TestRunTOML,
normalized_weight: float,
prologue: Optional[TestScenario],
TaekyungHeo marked this conversation as resolved.
Show resolved Hide resolved
epilogue: Optional[TestScenario],
) -> TestRun:
"""
Create a section-specific Test object by copying from the test mapping.

Args:
test_info (Dict[str, Any]): Information of the test.
normalized_weight (float): Normalized weight for the test.
prologue (Optional[TestScenario]): TestScenario object representing the prologue sequence.
epilogue (Optional[TestScenario]): TestScenario object representing the epilogue sequence.

Returns:
Test: Copied and updated Test object for the section.
Expand Down Expand Up @@ -192,5 +209,7 @@ def _create_section_test_run(self, test_info: _TestRunTOML, normalized_weight: f
sol=test_info.sol,
weight=test_info.weight * normalized_weight,
ideal_perf=test_info.ideal_perf,
prologue=prologue if prologue is not None else TestScenario(name="default_prologue", test_runs=[]),
epilogue=epilogue if epilogue is not None else TestScenario(name="default_epilogue", test_runs=[]),
)
return tr
34 changes: 34 additions & 0 deletions src/cloudai/_core/test_template.py
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,40 @@ def gen_exec_command(self, tr: TestRun) -> str:
)
return self.command_gen_strategy.gen_exec_command(tr)

def gen_srun_command(self, tr: TestRun) -> str:
"""
Generate an Slurm srun command for a test using the provided command generation strategy.
Args:
tr (TestRun): Contains the test and its run-specific configurations.
Returns:
str: The generated Slurm srun command.
"""
if self.command_gen_strategy is None:
raise ValueError(
"command_gen_strategy is missing. Ensure the strategy is registered in the Registry "
"by calling the appropriate registration function for the system type."
)
return self.command_gen_strategy.gen_srun_command(tr)

def gen_srun_success_check(self, tr: TestRun) -> str:
"""
Generate a Slurm success check command for a test using the provided command generation strategy.
Args:
tr (TestRun): Contains the test and its run-specific configurations.
Returns:
str: The generated command to check the success of the test run.
"""
if self.command_gen_strategy is None:
raise ValueError(
"command_gen_strategy is missing. Ensure the strategy is registered in the Registry "
"by calling the appropriate registration function for the system type."
)
return self.command_gen_strategy.gen_srun_success_check(tr)

def gen_json(self, tr: TestRun) -> Dict[Any, Any]:
"""
Generate a JSON string representing the Kubernetes job specification for this test using this template.
Expand Down
2 changes: 1 addition & 1 deletion src/cloudai/cli/handlers.py
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ def handle_dry_run_and_run(args: argparse.Namespace) -> int:
args (argparse.Namespace): The parsed command-line arguments.
"""
parser = Parser(args.system_config)
system, tests, test_scenario = parser.parse(args.tests_dir, args.test_scenario)
system, tests, test_scenario = parser.parse(args.tests_dir, args.test_scenario, Path("conf/common/plugin"))
assert test_scenario is not None

if args.output_dir:
Expand Down
Loading