Dev -> main #761

Merged 21 commits into main on Apr 24, 2024
1 change: 0 additions & 1 deletion CALL_FOR_SUBMISSIONS.md
@@ -17,7 +17,6 @@ Submissions can compete under two hyperparameter tuning rulesets (with separate
- **Registration deadline to express non-binding intent to submit: February 28th, 2024**.\
Please fill out the (mandatory but non-binding) [**registration form**](https://forms.gle/K7ty8MaYdi2AxJ4N8).
- **Submission deadline: April 04th, 2024** *(moved by a week from the initial March 28th, 2024)*
- **Deadline for self-reporting preliminary results: May 28th, 2024**
- [tentative] Announcement of all results: July 15th, 2024

For a detailed and up-to-date timeline see the [Competition Rules](/COMPETITION_RULES.md).
3 changes: 0 additions & 3 deletions COMPETITION_RULES.md
@@ -43,7 +43,6 @@ The Competition begins at 12:01am (ET) on November 28, 2023 and ends at 11:59pm

- **Intention to Submit.** You must register your Intention to Submit no later than 11:59pm ET on February 28, 2024.
- **Submission Period.** You must complete your Submission and enter it after the Intention to Submit deadline, but no later than 11:59pm ET on April 04, 2024.
- **Deadline for self-reporting results.** 11:59pm ET on May 28, 2024.

## Agreement to Official Rules

@@ -65,8 +64,6 @@ There are four (4) steps to a successful submission ("Submission").

The form is sent to the working group chairs, who will process your Submission. Failure to complete the proper Submission Forms will result in disqualification of your Submission. At the close of the Submission Period, your GitHub repository must be public.

4. **Report Results.** Prior to the Deadline for self-reporting results, run your Submission on either the qualification set or the full benchmark set and report the results. You must report your scores by uploading all unmodified logs that the benchmarking codebase automatically generates in a separate `/results` directory within the `/submission` folder of your Submission's GitHub repository.

## Submission Conditions

All Submissions must meet the requirements of the terms contained in these rules, including reliance on new algorithmic or mathematical ideas and concepts, and must not use software engineering approaches in order to increase primitive operations in PyTorch, JAX, their dependencies, the operating systems, or the hardware. By entering, all Team members warrant that their Submission does not infringe any third party's rights, and that Team members have obtained all necessary permissions from all relevant third parties to submit the Submission. If, in the sole discretion of Sponsor, any Submission constitutes copyright or other intellectual property infringement, the Submission will be disqualified. Team must hold all rights through license or ownership to the entire Submission. Team members agree to indemnify Sponsor against any and all claims of infringement from any third party for any use by Sponsor of a Submission. Team members may not be: 1) represented under a contract that would limit or impair Sponsor's ability to use the Submission; or 2) under any other contractual relationship, including but not limited to guild and/or union memberships, that may prohibit them from participating fully in this Competition, or from allowing Sponsor to use, royalty-free, the Submission worldwide in all media in perpetuity.
10 changes: 10 additions & 0 deletions DOCUMENTATION.md
@@ -400,6 +400,8 @@ Submissions will be scored based on their performance on the [fixed workload](#f

Furthermore, a less computationally expensive subset of the fixed workloads is collected with the [qualification set](#qualification-set). Submitters without enough compute resources to self-report on the full set of fixed and held-out workloads can instead self-report on this smaller qualification set. Well-performing submissions can thereby qualify for computational resources provided by sponsors of the benchmark to be scored on the full benchmark set.

NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.

#### Fixed workloads

The fixed workloads are fully specified with the call for submissions. They contain a diverse set of tasks such as image classification, machine translation, speech recognition, or other typical machine learning tasks. For a single task there might be multiple models and therefore multiple fixed workloads. The entire set of fixed workloads should have a combined runtime of roughly 100 hours on the [benchmarking hardware](#benchmarking-hardware).
@@ -429,6 +431,8 @@ Our scoring procedure uses the held-out workloads only to penalize submissions t

#### Qualification set

NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.

The qualification set is designed for submitters that may not have the compute resources to self-report on the full set of [fixed](#fixed-workloads) and [held-out workloads](#randomized-workloads). They may instead self-report numbers on this smaller qualification set. The best-performing submissions may then qualify for compute sponsorship offering a free evaluation on the full benchmark set and therefore the possibility to win [awards and prizes](/COMPETITION_RULES.md#prizes).

The qualification set consists of the same [fixed workloads](#fixed-workloads) as mentioned above, except for both workloads on *ImageNet*, both workloads on *LibriSpeech*, and the *fastMRI* workload. The remaining three workloads (*WMT*, *Criteo 1TB*, and *OGBG*) form the qualification set. There are no [randomized workloads](#randomized-workloads) in the qualification set. The qualification set of workloads aims to have a combined runtime of roughly 24 hours on the [benchmarking hardware](#benchmarking-hardware).
@@ -449,6 +453,8 @@ All scored runs have to be performed on the benchmarking hardware to allow for a
- 240 GB in RAM
- 2 TB in storage (for datasets).

NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.

For self-reported results, it is acceptable to perform the tuning trials on hardware different from the benchmarking hardware, as long as the same hardware is used for all tuning trials. Once the best trial, i.e. the one that reached the *validation* target the fastest, has been determined, this run has to be repeated on the competition hardware. For example, submitters can tune using their locally available hardware but have to use the benchmarking hardware, e.g. via cloud providers, for the $5$ scored runs. This allows for a fair comparison to the reported results of other submitters while allowing some flexibility in the hardware.

#### Defining target performance
@@ -571,10 +577,14 @@ on the benchmarking hardware. We also recommend to do a dry run using a cloud in

#### Are we allowed to use our own hardware to self-report the results?

NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.

You only have to use the benchmarking hardware for runs that are directly involved in the scoring procedure. This includes all runs for the self-tuning ruleset, but only the runs of the best hyperparameter configuration in each study for the external tuning ruleset. For example, you could use your own (different) hardware to tune your submission and identify the best hyperparameter configuration (in each study) and then only run this configuration (i.e. 5 runs, one for each study) on the benchmarking hardware.

#### What can I do if running the benchmark is too expensive for me?

NOTE: Submitters are no longer required to self-report results for AlgoPerf competition v0.5.

Submitters unable to self-fund scoring costs can instead self-report only on the [qualification set of workloads](/COMPETITION_RULES.md#qualification-set) that excludes some of the most expensive workloads. Based on this performance on the qualification set, the working group will provide - as funding allows - compute to evaluate and score the most promising submissions. Additionally, we encourage researchers to reach out to the [working group](mailto:[email protected]) to find potential collaborators with the resources to run larger, more comprehensive experiments for both developing and scoring submissions.

#### Can I submit previously published training algorithms as submissions?
6 changes: 3 additions & 3 deletions README.md
@@ -27,9 +27,9 @@
---

> [!IMPORTANT]
> Upcoming Deadline:
> Submission deadline: **April 04th, 2024** (*moved by a week*). \
> For submission instructions please see [Packaging your Submission Code](/GETTING_STARTED.md#package-your-submission-code) section in the Getting Started document.\
> Submitters are no longer required to self-report results.
> We are currently in the process of evaluating and scoring received submissions.
> We are aiming to release results by July 15th, 2024.
> For other key dates please see [Call for Submissions](CALL_FOR_SUBMISSIONS.md).

## Table of Contents <!-- omit from toc -->
8 changes: 6 additions & 2 deletions algorithmic_efficiency/checkpoint_utils.py
@@ -119,7 +119,9 @@ def maybe_restore_checkpoint(framework: str,

else:
checkpoint_state = latest_ckpt
if isinstance(model_params, torch.nn.DataParallel):
if isinstance(
model_params,
(torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
model_params = model_params.module
model_params.load_state_dict(checkpoint_state['model_params'])
checkpoint_state['model_params'] = model_params
@@ -196,7 +198,9 @@ def save_checkpoint(framework: str,
opt_state = jax.device_get(jax_utils.unreplicate(opt_state))
model_state = jax.device_get(jax_utils.unreplicate(model_state))
else:
if isinstance(model_params, torch.nn.DataParallel):
if isinstance(
model_params,
(torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
model_params = model_params.module
model_params = model_params.state_dict()
optimizer_state_dict = {}
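
For readers skimming the diff: `torch.nn.DataParallel` and `torch.nn.parallel.DistributedDataParallel` both keep the wrapped network under a `.module` attribute, and the two hunks above unwrap it before calling `load_state_dict` / `state_dict` so checkpoint keys stay free of the `module.` prefix. A minimal sketch of that pattern; the `unwrap_model` helper name is purely illustrative and not part of the codebase:

```python
import torch


def unwrap_model(model_params: torch.nn.Module) -> torch.nn.Module:
  """Illustrative helper: DataParallel and DistributedDataParallel both
  expose the real network as `.module`, so state dicts should be read and
  written on the unwrapped module."""
  if isinstance(
      model_params,
      (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
    return model_params.module
  return model_params


# Usage sketch:
#   unwrap_model(model).load_state_dict(saved_state)  # restoring
#   saved_state = unwrap_model(model).state_dict()    # saving
```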
25 changes: 15 additions & 10 deletions algorithmic_efficiency/logger_utils.py
@@ -16,6 +16,7 @@
import GPUtil
import pandas as pd
import psutil
import torch.distributed as dist

from algorithmic_efficiency import spec
from algorithmic_efficiency.pytorch_utils import pytorch_setup
@@ -43,9 +44,6 @@ def get_log_dir(
resume_last_run: bool,
overwrite: bool,
) -> Optional[str]:
if RANK != 0:
return

# Construct path to experiment workload directory.
experiment_dir = os.path.expanduser(experiment_dir)
workload_dir_name = f'{workload}_{framework}'
@@ -61,18 +59,25 @@
logging.info(
f'Removing existing experiment directory {experiment_path} because '
'--overwrite was set.')
shutil.rmtree(experiment_path)
if RANK == 0:
shutil.rmtree(experiment_path)
elif resume_last_run:
logging.info(
f'Resuming from experiment directory {experiment_path} because '
'--resume_last_run was set.')
else:
resume = input(
'Found existing experiment dir with the same name: {}. Do you wish '
'to resume training from this dir? [y/N]:'.format(experiment_path))
if resume.lower() != 'y':
sys.exit()

if RANK == 0:
resume = input(
'Found existing experiment dir with the same name: {}. Do you wish '
'to resume training from this dir? [y/N]:'.format(experiment_path))
if resume.lower() != 'y':
sys.exit()

if USE_PYTORCH_DDP:
try:
dist.barrier()
except RuntimeError:
sys.exit()
logging.info(f'Creating experiment directory at {experiment_path}.')
makedir(experiment_path)
return experiment_path
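
The change above lets every rank call `get_log_dir`, while only rank 0 touches the filesystem or answers the resume prompt, and a `dist.barrier()` keeps the remaining ranks in step (exiting if rank 0 has already bailed out). A small, self-contained sketch of that rank-0-plus-barrier pattern; the `prepare_experiment_dir` helper and the environment-variable DDP detection are illustrative assumptions, not the repo's code:

```python
import os
import sys

import torch.distributed as dist

RANK = int(os.environ.get('RANK', 0))
USE_PYTORCH_DDP = 'LOCAL_RANK' in os.environ  # simplistic detection, for illustration only


def prepare_experiment_dir(path: str) -> str:
  """Sketch of the rank-0-plus-barrier pattern used in get_log_dir above.

  Assumes the default process group is already initialized whenever
  USE_PYTORCH_DDP is True.
  """
  if RANK == 0:
    # Only rank 0 touches the filesystem (and, in the real code, answers the
    # resume prompt), so ranks never race on the directory or read stdin twice.
    os.makedirs(path, exist_ok=True)
  if USE_PYTORCH_DDP:
    try:
      # All ranks wait here until rank 0 is done; if rank 0 already exited,
      # the barrier raises and the remaining ranks shut down as well.
      dist.barrier()
    except RuntimeError:
      sys.exit()
  return path
```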
14 changes: 8 additions & 6 deletions submission_runner.py
@@ -316,10 +316,12 @@ def train_once(
flag_file_name = os.path.join(log_dir, f'flags_{preemption_count}.json')
logging.info(f'Saving flags to {flag_file_name}.')
logger_utils.write_json(flag_file_name, flags.FLAGS.flag_values_dict())
metrics_logger = logger_utils.set_up_loggers(log_dir,
flags.FLAGS,
hyperparameters)
workload.attach_metrics_logger(metrics_logger)
metrics_logger = None
if RANK == 0:
metrics_logger = logger_utils.set_up_loggers(log_dir,
flags.FLAGS,
hyperparameters)
workload.attach_metrics_logger(metrics_logger)

global_start_time = get_time()
train_state['last_step_end_time'] = global_start_time
@@ -429,7 +431,7 @@ def train_once(

logging_start_time = get_time()

if log_dir is not None:
if log_dir is not None and RANK == 0:
metrics_logger.append_scalar_metrics(
latest_eval_result,
global_step=global_step,
@@ -467,7 +469,7 @@ def train_once(

metrics = {'eval_results': eval_results, 'global_step': global_step}

if log_dir is not None:
if log_dir is not None and RANK == 0:
metrics_logger.append_scalar_metrics(
{'score': train_state['accumulated_submission_time']},
global_step=global_step,
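
The runner now builds the metrics logger only on rank 0 and guards each `append_scalar_metrics` call with `RANK == 0`, so non-zero ranks keep `metrics_logger = None` and never dereference it. A short sketch of that guard pattern; the `MetricsLogger` stand-in and `maybe_log` helper are illustrative, not the actual object returned by `logger_utils.set_up_loggers`:

```python
import os

RANK = int(os.environ.get('RANK', 0))


class MetricsLogger:
  """Stand-in logger for illustration only."""

  def append_scalar_metrics(self, metrics, global_step):
    print(f'step {global_step}: {metrics}')


metrics_logger = None
if RANK == 0:
  # Only the primary process owns a logger; all other ranks keep None.
  metrics_logger = MetricsLogger()


def maybe_log(metrics, global_step, log_dir=None):
  # Guard each call on both the log directory and the rank, so non-zero
  # ranks never touch the logger they never created.
  if log_dir is not None and RANK == 0:
    metrics_logger.append_scalar_metrics(metrics, global_step=global_step)
```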