
Add the option to rerun failed tasks #62

Merged: 7 commits, Jan 31, 2024

Conversation

nweires
Collaborator

@nweires nweires commented Jan 24, 2024

Adds the option to use the --missingonly flag to only run the tasks that don't already have results present.

Notes:

  • This checks the output directory for results_job{TASK_ID}.json.gz files, and runs the tasks for which that file is missing. This means you can also trigger reruns by deleting those files.
  • This assumes that you're rerunning with the same project file. If you change it, the behavior is undefined. (Some types of changes would be ignored, others could cause wrong results.)
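The detection described in the first note could be sketched like this (assuming an fsspec-style filesystem object `fs`; the function name and signature are illustrative, not the actual buildstockbatch API):

```python
import re


def find_missing_tasks(fs, results_dir, expected):
    """Illustrative sketch: return task IDs whose results_job{ID}.json.gz
    file is absent from the simulation output directory.

    `fs` is assumed to be an fsspec-style filesystem with an `ls` method.
    """
    pattern = re.compile(r".*results_job(\d+)\.json\.gz$")
    # Collect IDs of tasks that already produced a results file.
    done = {
        int(m.group(1))
        for f in fs.ls(f"{results_dir}/simulation_output/")
        if (m := pattern.match(f))
    }
    # Everything else in [0, expected) still needs to run.
    return [task_id for task_id in range(expected) if task_id not in done]
```

Deleting a `results_job{TASK_ID}.json.gz` file would then make that task ID reappear in the returned list, which is why reruns can also be triggered manually.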

Testing:

  • I deleted some results files from a previous job, then ran with --missingonly and confirmed that only the missing tasks were rerun.
  • I also compared the new results file to the old one, to ensure that the correct set of simulations were run.


github-actions bot commented Jan 24, 2024

File Coverage
All files 86%
base.py 91%
exc.py 57%
hpc.py 78%
local.py 70%
postprocessing.py 84%
utils.py 91%
cloud/docker_base.py 79%
sampler/base.py 79%
sampler/downselect.py 33%
sampler/precomputed.py 93%
sampler/residential_quota.py 61%
test/shared_testing_stuff.py 85%
test/test_docker.py 33%
test/test_local.py 97%
test/test_validation.py 97%
workflow_generator/base.py 90%
workflow_generator/commercial.py 53%
workflow_generator/residential_hpxml.py 86%

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against a941448

@nweires nweires marked this pull request as ready for review January 24, 2024 21:41

@mfathollahzadeh mfathollahzadeh left a comment


Thanks, Natalie! This is great! Just added a few minor comments

buildstockbatch/cloud/docker_base.py
with fs.open(f"{self.results_dir}/missing_tasks.txt", "w") as f:
    for task_id in range(expected):
        if task_id not in done_tasks:
            f.write(f"{task_id}\n")


I think it would be useful to add print(f"Missing task ID: {task_id}") so that we can keep track of them
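Folding that suggestion into the hunk above might look like this (an illustrative sketch; the helper name and the use of `logging` rather than `print` are assumptions, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)


def write_missing_tasks(fs, results_dir, expected, done_tasks):
    """Illustrative sketch: write missing task IDs to a file and log each one."""
    missing = []
    with fs.open(f"{results_dir}/missing_tasks.txt", "w") as f:
        for task_id in range(expected):
            if task_id not in done_tasks:
                # Surface each rerun candidate in the log, per the review comment.
                logger.info("Missing task ID: %s", task_id)
                f.write(f"{task_id}\n")
                missing.append(task_id)
    return missing
```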


Also, if the job count becomes 0, should we print something like "all expected task results are present"?


Also, if the job_count becomes zero, we should skip postprocessing, right?

Collaborator Author


Added logging of the list of tasks. In gcp.py (where this is called), we raise an error and quit if there's nothing to retry.
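That guard in gcp.py could look roughly like this (a sketch; the exception class and message text here are assumptions, not the actual gcp.py code):

```python
class NoTasksToRetryError(RuntimeError):
    """Illustrative exception; the real code may use a different type."""


def check_retry_needed(missing_tasks):
    """Abort a --missingonly run when every expected result is already present."""
    if not missing_tasks:
        raise NoTasksToRetryError(
            "All expected task results are already present; nothing to retry."
        )
```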

done_tasks = []
for f in fs.ls(f"{self.results_dir}/simulation_output/"):
    if m := re.match(".*results_job(\\d*).json.gz$", f):
        done_tasks.append(int(m.group(1)))


Not sure how long this is taking right now (probably not that long), but what do you think about using a compiled pattern and a set comprehension for faster lookups? Something like this, or something similar:

fp = re.compile(".*results_job(\\d*).json.gz$")
done_tasks = {int(m.group(1))
              for f in fs.ls(f"{self.results_dir}/simulation_output/")
              if (m := fp.match(f))}

Collaborator Author


I think a loop is more readable than a comprehension, plus it avoids evaluating the regex twice (e.g. {int(fp.match(f).group(1)) for f in files if fp.match(f)}), but I will switch to a set instead of a list.
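The loop-plus-set version the author describes might look like this (a sketch reusing the regex from the diff, wrapped in an illustrative helper so it is self-contained):

```python
import re

RESULTS_PATTERN = re.compile(r".*results_job(\d+)\.json\.gz$")


def collect_done_tasks(paths):
    """Loop form of the matcher: one regex evaluation per path, set storage
    for O(1) membership checks."""
    done_tasks = set()
    for path in paths:
        if m := RESULTS_PATTERN.match(path):
            done_tasks.add(int(m.group(1)))
    return done_tasks
```

The explicit loop keeps the walrus assignment on its own line, so each path is matched exactly once and the control flow stays easy to scan.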


If this happens, you can rerun the same job with the ``--missingonly`` flag. This will rerun only the
tasks that didn't produce output files, then run postprocessing. Note: This flag assumes that your
project config file has not changed since the previous run, other than the job identifier.


This means the job identifier needs to be changed for the --missingonly flag, right?


@mfathollahzadeh mfathollahzadeh left a comment


Thanks, Natalie! Looks good to me!

@mfathollahzadeh mfathollahzadeh merged commit feb1607 into gcp Jan 31, 2024
6 checks passed