Utility suggestion: dependency boundary #78

Open
albertz opened this issue Jul 11, 2022 · 16 comments

@albertz
Member

albertz commented Jul 11, 2022

I'm not sure whether this should be here for i6_core (although I think this would be useful for all users of i6_core), or even Sisyphus itself (it probably is useful for all Sisyphus users), or i6_experiments/common. If you think this should be somewhere else, we can move the issue.

The main reason is to have a workaround for the problem that Sisyphus is slow for big pipelines (rwth-i6/sisyphus#90). (If Sisyphus were not slow, we would probably not need it, and every single dependency could always be part of the graph.)

So, the basic idea is: when doing some neural model training experiments, some commonly reused parts of the pipeline (e.g. data preprocessing, feature extraction, alignment generation, CART, whatever) are not part of the Sisyphus graph; you have run them in a separate Sisyphus pipeline before, and now you directly use the generated outputs.

Here in this issue I want to propose a more systematic approach which makes this more seamless, especially considering the main intention of our recipes: that it is simple to reproduce results, both for ourselves and for outsiders who want to reproduce some result from our papers.

The idea is that it should still be possible to run the whole pipeline and that the user does not need to run separate parts of the pipeline separately.

But there are some open questions or details to be filled in, so this is now open for discussion.


So now to the high-level proposal; but as said, the exact API and other aspects are up for discussion.

Look at the hybrid NN-HMM ASR pipeline as an example, which depends on the GMM-HMM pipeline. In between, you would get objects like RasrInitArgs, ReturnnRasrDataInput, HybridArgs, etc. Let's say you collect all the dependencies you need to train the NN in some object, like:

hybrid_nn_deps = get_all_hybrid_nn_deps()

nn_training_result = nn_training(hybrid_nn_deps)

So, the dependency boundary could be defined at the hybrid_nn_deps object.

Technically, it means that for all tk.Path objects somewhere in hybrid_nn_deps, we would replace the creator by some dummy which keeps the same hash as before, or just use hash_overwrite.
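A minimal sketch of what such a replacement could look like (the recursive walk, the name detach_deps, and using sis_hash_helper to keep the old hash are just illustrative assumptions here, not the proposed implementation):

from sisyphus import tk
from sisyphus.hash import sis_hash_helper

def detach_deps(obj):
    # Recursively walk the collected deps object and cut every tk.Path loose
    # from its creator job, keeping the previous hash via hash_overwrite.
    # Whether hash_overwrite reproduces exactly the same hash is one of the
    # open details of this proposal.
    if isinstance(obj, tk.Path) and obj.creator is not None:
        return tk.Path(obj.get_path(), hash_overwrite=sis_hash_helper(obj))
    if isinstance(obj, dict):
        return {k: detach_deps(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(detach_deps(v) for v in obj)
    return obj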

What would the API look like? We want to avoid calling get_all_hybrid_nn_deps, because calling it would be slow. So, it would look something like:

hybrid_nn_deps = dependency_boundary(
  func=get_all_hybrid_nn_deps,
  ...
)

Now, the question is, what else should there be, and how should we implement it exactly. In principle, I think everything else could be optional and automatic. But let's go through it. First, on the technical questions:

  • How should we store the object? Just pickling? Or some Python code representation? This would also include the hashes.
  • Where should we store the object?
  • What name should the file have? This could be set explicitly by the user, or derived automatically from the function name (__qualname__ or so). For the automatic case, how exactly?

I assume the logic is then quite straightforward (see the sketch after the list):

  • Check if the cached object file exists, and if so, use it. Maybe make an extra check that all its dependencies (tk.Path) exist and error if something is missing.
  • If cached object file does not exist:
    • Run the passed func.
    • In case all dependencies are there, we can create the cached object file.
      (We should wait until we have this because otherwise jobs might update their dependencies and I think the hash might change then? In any case, it feels safer.)
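A rough sketch of that control flow (all helper functions here are hypothetical placeholders, not the eventual API):

import os

def dependency_boundary(func, *, cache_filename=None):
    # Hypothetical sketch of the logic described above.
    cache_filename = cache_filename or default_cache_filename(func)  # e.g. derived from func.__qualname__
    if os.path.exists(cache_filename):
        obj = load_cached_object(cache_filename)  # hypothetical loader (pickle or generated Python code)
        check_all_paths_exist(obj)  # error if some dependency output is missing on disk
        return obj
    obj = func()
    if all_paths_available(obj):  # only write the cache once every dependency output exists
        dump_cached_object(obj, cache_filename)
    return obj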

For the user, there are some potential actions we should implement:

  • Sanity check on the cached object. E.g. there might have been some changes in the meantime to the get_all_hybrid_nn_deps function and you want to check whether the hashes in the cached object are still correct. So basically you explicitly want to execute the whole pipeline code and any cached objects should only be used for double checking.

I'm not sure how this action or behavior would be controlled. It could be some global setting (related: rwth-i6/sisyphus#82) or maybe some OS environment variable.
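For example (the environment variable name is just an illustrative assumption):

import os

# If set, dependency_boundary would always run func() and compare the freshly
# computed hashes against the cached object, reporting any mismatch.
CHECK_DEPENDENCY_BOUNDARY = os.environ.get("CHECK_DEPENDENCY_BOUNDARY", "") == "1"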

The proposal is also compatible with tk.import_work_directory. When executing the pipeline config and the outputs do not exist (and neither do the cached objects), it would simply execute the whole pipeline.

@albertz
Member Author

albertz commented Jul 11, 2022

@JackTemaki @christophmluscher we had this discussion.
@critias @curufinwe @michelwi relevant for you maybe as well.

@critias

critias commented Jul 11, 2022

Two things:

  1. For our MT setup we already split the pipeline into two parts (data preparation and training), but don't try to conserve the hash. We simply use a strictly defined folder to pass everything between the two. We didn't do this to improve speed, but to make debugging easier and to have a fixed point that allows using existing data preparation with an updated workflow, which might change the hashes of the data preparation as well.
  2. Instead of loading the whole workflow every time, did you try to create it once, pickle it, and load it again when you need it? It should be faster to just unpickle a graph since this bypasses some Sisyphus code (like the copying of the inputs). Additionally, if the needed output already exists, Sisyphus should only check that last job, so the filesystem access should be fairly low as well. If it works well, we could see if we can introduce some command-line options to Sisyphus to make that simpler.
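A rough sketch of what I mean (the file name and helper are made up here; this is not an existing Sisyphus feature):

import gzip
import pickle

GRAPH_CACHE = "work/graph_cache.pkl.gz"  # hypothetical location

def load_or_build(build_func):
    # Build the (sub)graph once and pickle it; later runs unpickle it instead
    # of re-running the config code. Whether this avoids recomputing the
    # hashes is exactly the question discussed below.
    try:
        with gzip.open(GRAPH_CACHE, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = build_func()
        with gzip.open(GRAPH_CACHE, "wb") as f:
            pickle.dump(obj, f)
        return obj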

@albertz
Member Author

albertz commented Jul 11, 2022

For our MT setup we already split the pipeline into two parts (data preparation and training), but don't try to conserve the hash. We simply use a strictly defined folder to pass everything between the two. We didn't do this to improve speed, but to make debugging easier and to have a fixed point that allows using existing data preparation with an updated workflow, which might change the hashes of the data preparation as well.

Having such a fixed point is the same goal here. You basically just did it more manually.

The main purpose is always that it is fully reproducible. If you update the preparation workflow, somewhere you should still have the old workflow anyway such that it is reproducible, or not? Well, at least for doing research and publications, this is important for us.

But also, it should be simple, straightforward, and unambiguous to reproduce. If this is not the case, it's not really reproducible, as people could get something wrong. Ideally we could provide a hash for every single number in a table, and people could double-check whether they end up with the same hash.

Instead of loading the whole workflow every time, did you try to create it once, pickle it, and load it again when you need it? It should be faster to just unpickle a graph since this bypasses some Sisyphus code (like the copying of the inputs). Additionally, if the needed output already exists, Sisyphus should only check that last job, so the filesystem access should be fairly low as well. If it works well, we could see if we can introduce some command-line options to Sisyphus to make that simpler.

No, I did not try. Would this recalculate the hashes for every job or use them as they are? From the profiling I did (rwth-i6/sisyphus#90), this was the main bottleneck for me (with a fast filesystem) and way too slow.

If it does not recalculate the hashes, this is basically just the same as what I propose here, or not? The only difference is whether the pickled object contains the whole graph or only part of it, but this would not really matter anyway, right?

If it does recalculate the hashes, it would still be way too slow.

@critias

critias commented Jul 11, 2022

No, I did not try. Would this recalculate the hashes for every job or use them as they are? From the profiling I did (rwth-i6/sisyphus#90), this was the main bottleneck for me (with a fast filesystem) and way too slow.

All hashes are stored inside the pickled object and loaded again from there. So to my knowledge it should not recalculate anything, but I never profiled it to check whether this is really the case.

If it does not recalculate the hashes, this is basically just the same as what I propose here, or not? The only difference is whether the pickled object contains the whole graph or only part of it, but this would not really matter anyway, right?

Yes, I think this would achieve what you want to do.

@critias

critias commented Jul 12, 2022

After playing around with your setup I must admit that I didn't expect your graph to be that large... even zipped I end up with a nearly 10 MB file.
I don't know if your graph really has to be that big, but splitting it would probably be a way to get around it.

If you want to replace the creator with some dummy, you would also have to overwrite all functions Sisyphus might want to access. I think it would be easier to replace the Path object and overwrite the available function and the hash. Later, if the file exists you are done; otherwise you need to compute the function and map its outputs.
Also, I think you need some kind of id for the function call, since you somehow need to know where to store and look for the cached information. I guess this will just be the return value with the replaced dummy paths in a pickled file. This should be really small; I tested something similar for your case and got around 14 KB zipped.
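A minimal sketch of that direction (DummyPath and the overridden hooks are assumptions about the Sisyphus internals, not a worked-out solution; the actual method signatures may differ):

import os
from sisyphus import tk

class DummyPath(tk.Path):
    # Hypothetical Path replacement: no creator, availability comes from the
    # filesystem, and the hash is a fixed precomputed value from the original run.
    def __init__(self, path, fixed_hash):
        super().__init__(path)
        self.fixed_hash = fixed_hash

    def available(self):  # assumed hook Sisyphus uses to check whether the output exists
        return os.path.exists(self.get_path())

    def _sis_hash(self):  # assumed hook used by sis_hash_helper
        return self.fixed_hash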

Overall I think this can work, but might be tricky in some parts.


@albertz
Member Author

albertz commented Jul 12, 2022

Then we also had some internal discussions at i6 involving @christophmluscher, @michelwi, @Marvin84, @ZhouW321, @SimBe195 about the general concept. We had many arguments back and forth about many aspects here, like:

Do we need such a dependency_boundary function at all, or could we do the same logic "by hand" in some way? The conclusion was that it is still useful if it auto-generates Python code which lives inside the Git repository itself, so that it can easily be inspected.

How should the cached object file be named? As we want to have auto-generated Python code (and not just a pickle file), and it should be inside the Git repo, it would be somewhere next to the other code. The exact naming scheme still has to be decided but is probably not so important anymore.

What format should we use? As said, auto-generated Python code is preferred.

Where (in what repo) should the dependency_boundary be implemented? First the suggestion was Sisyphus itself, as this would be a core function useful for others as well. But then we decided it would be enough if it is in i6_experiments (i.e. in the common dir in there).

On configuration (e.g. enabling the sanity check), it was said that it should be an argument to dependency_boundary, and users should just edit the code to change some behavior temporarily.
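So a call could end up looking roughly like this (the argument names are made up here to illustrate the conclusions, not the final API):

hybrid_nn_deps = dependency_boundary(
  func=get_all_hybrid_nn_deps,
  # hypothetical arguments:
  py_cache_filename="hybrid_nn_deps_cache.py",  # auto-generated Python code, next to the other code in Git
  sanity_check=False,  # temporarily set to True to re-run func and verify the cached hashes
)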

@albertz albertz transferred this issue from rwth-i6/i6_core Jul 12, 2022
@albertz
Member Author

albertz commented Jul 14, 2022

Btw, just like common/setups/returnn_common, I think it's easiest to directly develop in the master branch and not do a PR / separate branch, as this is a new feature, as long as it is not used by anyone (similar argumentation as for returnn_common, see #63). Once people start using it, further changes would go via PRs as usual. (Btw, should we document this as a general guideline for how to handle new features in common, after the feature has been discussed in some issue?)

albertz added a commit that referenced this issue Jul 14, 2022
Python representation and dump logic is needed for
dependency boundary logic (#78).
Diff logic useful for that as well.
@albertz
Member Author

albertz commented Jul 14, 2022

There is some initial (untested) implementation now in common.helpers.dependency_boundary.

@JackTemaki
Contributor

@albertz please do not push into common without a PR the next time, otherwise people might assume this is tested and stable, especially as you added a commit specifically saying that this is stable.

@albertz
Member Author

albertz commented Jul 14, 2022

@albertz please do not push into common without a PR the next time, otherwise people might assume this is tested and stable, especially as you added a commit specifically saying that this is stable.

We already discussed this, and the conclusion was that directly pushing to common without PR is fine when this is new code, not used by anyone else. We do it this way for all such code.

Where does it say this is stable? I can mark it more explicitly as not stable / work-in-progress.

@albertz
Member Author

albertz commented Jul 14, 2022

Maybe you mean the parent helpers package readme? I can also move the dependency_boundary elsewhere, or change the parent helpers package readme, to say "stable unless explicitly said otherwise", if this is really necessary.

In any case, it should be in common so that people can easily test it already, and join the development.

@JackTemaki
Contributor

We already discussed this

No, this was never discussed.

Yes, we did discuss this personally for the returnn_common helpers, and I strongly disagreed with this method. So far only you have considered pushing directly into common. Even for our other work-in-progress code like the RASR helpers we did PRs and review; even with limited extent it is better this way, to keep discussions going and to make progress visible. When you are pushing directly into master, the progress is not visible at all (for me there are e.g. no notifications, because people should be able to push into their user folder freely).

If you directly want to push and use your code, then please use your user folder for this. This is what we all agreed on when starting i6_experiments.

In any case, it should be in common so that people can easily test it already, and join the development.

But what has this to do with bypassing PRs? I understand that in your view the whole process is too slow, because people will take time in looking at the PR, and during this time the code is not there. But if you directly push it, people will not look at it at all, and definitely not discuss it.

So please: push your code to your user folder to directly use it (and also other people can test it from there as well), but for common always do a PR.

@albertz
Member Author

albertz commented Jul 14, 2022

We already discussed this

No, this was never discussed.

Yes, we did discuss this personally for the returnn_common helpers

The discussion was more generic; the arguments we had were not specific to the returnn_common helpers but apply to anything else as well.

So, you want to have this further discussed? With more people?

and I strongly disagreed with this method.

I thought that we agreed to do it this way. It's also documented that way everywhere, e.g. see the readme in the root.

So far only you have considered pushing directly into common. Even for our other work-in-progress code like the RASR helpers we did PRs and review; even with limited extent it is better this way, to keep discussions going and to make progress visible. When you are pushing directly into master, the progress is not visible at all (for me there are e.g. no notifications, because people should be able to push into their user folder freely).

I don't really understand what is not visible. You can easily see all commits for common here. I'm also pretty sure there is some easy way that it can send you a mail for every single commit if you want that, or I can also do that by hand for you.

As long as you would not actively take part in the development, you can also just check the current state if you are interested. If there is some relevant new state, I will also post here in the issue, just as I did. So far it's completely untested, so if you are not interested in testing it or looking at untested code, then you can safely ignore it for now. Once there is any update on this, I will post it here, and you can get a notification for that. If you are not interested in the development, it's probably not worth it to look at it yet, until I report here that I tested it and it looks fine.

If you directly want to push and use your code, then please use your user folder for this. This is what we all agreed on when starting i6_experiments.

The dependency_boundary is intended to be used by everyone, so it does not belong into a user folder. We already discussed that it belongs to common. It would probably even be used by other code in common, e.g. by the full pipelines. That is the intention of this function.

In any case, it should be in common so that people can easily test it already, and join the development.

But what has this to do with bypassing PRs? I understand that in your view the whole process is too slow, because people will take time in looking at the PR, and during this time the code is not there. But if you directly push it, people will not look at it at all, and definitely not discuss it.

Why would people not look at it or not discuss it? If you want to look at it, you can look at it (does not matter whether PR or not). If you don't want to look at it, you don't look at it. So why is it relevant that it is a PR? To get notifications when there is activity? As said, I will just notify here in this issue.

So please: push your code to your user folder to directly use it (and also other people can test it from there as well), but for common always do a PR.

Again: Strong disagree on this.

Why do you even care if you don't actively join the development? Once it is ready (considered ready by those people who develop it), you will be notified, and then you can give feedback that something should be changed, and then we can simply change it. And once the review is over, and maybe some others have tested it as well, we can mark it as stable. And then further development would go via PRs as usual.

And if you want to join the development, then that is even more of an argument not to work in a separate branch (in a PR), because this has to be in the current master such that you can directly update other stuff like user code etc. without making it too complicated (merging or rebasing in the master changes etc.).

I think having PR-based development for a new feature which does not touch any other existing code, and which is not used by anyone except those who develop it initially, does not really offer any advantage but only lots of disadvantages (it makes everything much more complicated, which leads to much slower development, and development is already way too slow).

But if you really want to discuss this further, we should do a separate discussion on this, maybe also with other people, if anyone actually really cares about that. This discussion does not really belong here into this issue.

@albertz
Member Author

albertz commented Jul 16, 2022

The code works now for my test case on the GMM pipeline (collecting all args for the hybrid NN-HMM), specifically:

orig_obj = dependency_boundary(get_orig_chris_hybrid_system_init_args, hash="YePKvlzrybBk")

(In file i6_experiments/users/zeyer/experiments/chris_hybrid_2021.py.)

So I think people who are interested can start reviewing the changes now.

Specifically, there is common/helpers/dependency_boundary.py, and common/utils.

Maybe we want to move common/utils elsewhere, maybe to common/helpers. I was not sure. It's mostly generic but partly also specific to the dependency_boundary (only in small details; it can trivially be extended to be used for other purposes as well, e.g. the dump-Python-code logic).

@albertz
Member Author

albertz commented Jul 16, 2022

See this example of generated code.

Note there are many potential optimizations to improve the generated code further, but I'm not sure how much it is worth spending time on it. One thing I'm considering is to automatically apply black on it, especially when the generated code would be part of common/baselines or so.
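If black is applied programmatically, it could look roughly like this (the example input string is just a stand-in for the actual dump):

import black

generated_code = "hybrid_nn_deps = {'a': 1,'b': 2,}"  # stand-in for the generated Python dump
formatted_code = black.format_str(generated_code, mode=black.Mode())
print(formatted_code)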
