Task.run, wait for inputs #159

Closed
albertz opened this issue Nov 28, 2023 · 7 comments · Fixed by #162

Comments

albertz (Member) commented Nov 28, 2023

Some of my recog jobs fail because they don’t find the checkpoint:

[2023-10-22 14:05:41,761] INFO: Inputs:
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/tools/returnn
[2023-10-22 14:05:41,761] INFO: /work/tools/users/zeyer/linuxbrew/opt/[email protected]/bin/python3.11
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.NSdIHfk1iw2M/output/out.ogg.zip
[2023-10-22 14:05:41,761] INFO: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt
[2023-10-22 14:05:41,761] WARNING: Input path does not exist: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt
...
[2023-10-22 14:05:41,762] INFO: ------------------------------------------------------------
[2023-10-22 14:05:41,762] INFO: Starting subtask for arg id: 0 args: []
[2023-10-22 14:05:41,762] INFO: ------------------------------------------------------------
[2023-10-22 14:05:41,777] INFO: Run time: 0:00:00 CPU: 215.80% RSS: 70MB VMS: 213MB 
reformatted ...
[2023-10-22 14:05:41,950] ERROR: Job failed, traceback:
...
  File "/u/zeyer/setups/combined/2021-05-31/recipe/i6_core/returnn/forward.py", line 307, in ReturnnForwardJobV2.create_files
    line: assert os.path.exists( 
              _get_model_path(self.model_checkpoint).get_path()
          ), f"Provided model checkpoint does not exists: {self.model_checkpoint}"
...
AssertionError: Provided model checkpoint does not exists: /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.jYUaQ42BWrqA/output/models/epoch.240.pt

then it ends up in this state:

interrupted_not_resumable: Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.3UMtLSktznsC>

My initial suggestion was to add some wait time in ReturnnForwardJobV2 (see rwth-i6/i6_core#464), but @michelwi suggested implementing this more generically, so this issue is here to discuss it.

I also found this code:

sisyphus/task.py, lines 140 to 146 in cbf529c:

# each input must be at least X seconds old
# if an input file is too young it's may not synced in a network filesystem yet
try:
    input_age = time.time() - os.stat(i.get_path()).st_mtime
    time.sleep(max(0, gs.WAIT_PERIOD_MTIME_OF_INPUTS - input_age))
except FileNotFoundError:
    logging.warning("Input path does not exist: %s" % i.get_path())

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

I really don't like the WAIT_PERIOD_JOB_FS_SYNC. I want to get rid of any arbitrary sleeps before doing anything. See also #146. Ideally, if something is ready, the next task should execute directly without any delay.

Then the task itself can check whether its inputs are available and, if not, wait a bit. In the ideal case it runs immediately, and it only waits when an input is not yet available.

Originally posted by @albertz in rwth-i6/i6_core#464 (comment), after discussion with @michelwi
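
A minimal sketch of this check-and-wait idea (the helper name wait_for_inputs and the parameters wait_period and check_interval are hypothetical, not existing sisyphus API): in the common case where all inputs already exist it returns immediately without sleeping, otherwise it polls until a threshold is reached and then fails instead of only warning.

import os
import time


def wait_for_inputs(paths, wait_period=300.0, check_interval=5.0):
    # Return immediately if all inputs exist; otherwise poll until
    # wait_period seconds have passed, then raise instead of warning.
    start = time.monotonic()
    missing = [p for p in paths if not os.path.exists(p)]
    while missing:
        if time.monotonic() - start > wait_period:
            raise FileNotFoundError(f"Input paths still missing after {wait_period}s: {missing}")
        time.sleep(check_interval)
        missing = [p for p in missing if not os.path.exists(p)]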

albertz (Member, Author) commented Nov 28, 2023

Why do we actually have only a warning here for FileNotFoundError and not just fail? Are there really cases where it makes sense to still continue with the execution and only leave this as a warning?

albertz changed the title from "Task, wait for inputs" to "Task.run, wait for inputs" on Nov 28, 2023
michelwi (Contributor) commented

Why do we actually have only a warning here for FileNotFoundError and not just fail? Are there really cases where it makes sense to still continue with the execution and only leave this as a warning?

I don't see any reason and would also just fail.

Maybe we should actually change this code to not just print this warning ("Input path does not exist") but instead wait in this case?

Yes, waiting until a threshold is reached and then failing would be my suggestion.

critias (Contributor) commented Nov 30, 2023

The warning is there in case someone only passes the stem of some files, e.g. /path/to/somewhere/awesome.model, while the actual files are stored as {base}.def, {base}.hdf5, etc.

You could surely argue that this is bad practice and shouldn't be allowed, but this is the historical reason why it's there.

Making all inputs required would allow adding this simple check/wait period as you suggested.

albertz (Member, Author) commented Nov 30, 2023

in case someone only passes the stem of some files, e.g. /path/to/somewhere/awesome.model, while the actual files are stored as {base}.def, {base}.hdf5, etc.

I'm not sure, but I thought this also had other problems (or would not work), so people always used an existing file, e.g. the .index or the .meta file in case of a TF checkpoint? At least what we have in i6_core is done this way.

So now the question: should we make this potentially breaking change and require that all inputs are available? Or only via an option? What should the default for this option be?

Or, in any case, wait for some time if an input is not available, but then proceed with only a warning?

dthulke (Member) commented Nov 30, 2023

I'm not sure, but I thought this also had other problems (or would not work), so people always used an existing file, e.g. the .index or the .meta file in case of a TF checkpoint? At least what we have in i6_core is done this way.

I have some quite old configs where it's done without the suffix (i.e. I always get this warning and ignore it).

So now the question: should we make this potentially breaking change and require that all inputs are available? Or only via an option? What should the default for this option be?

In general, I agree that sisyphus should fail here by default. But I'd prefer to make this a config option so the old behaviour can be restored.

vieting (Contributor) commented Nov 30, 2023

I'm in favor of waiting until a threshold is reached and then failing. Adding an option to allow unavailable inputs seems alright if it's not too much effort, but I don't think I'd currently need it.

curufinwe (Collaborator) commented

Same as @vieting

albertz added a commit that referenced this issue Dec 15, 2023
New options:

WAIT_PERIOD_FOR_INPUTS_AVAILABLE
TASK_INPUTS_MUST_BE_AVAILABLE

Fix #159
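
A sketch of how these new options could be overridden in a setup's settings.py; the values shown are placeholders for illustration, not the actual defaults (see sisyphus/global_settings.py for those):

# Seconds to wait for a missing input before giving up (placeholder value).
WAIT_PERIOD_FOR_INPUTS_AVAILABLE = 30
# Fail the task if inputs are still missing after the wait,
# instead of only logging a warning (placeholder value).
TASK_INPUTS_MUST_BE_AVAILABLE = True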