How to integrate returnn_common.nn into Sisyphus pipelines #63

Closed
JackTemaki opened this issue May 3, 2022 · 71 comments

@JackTemaki
Contributor

When trying to create Returnn networks with returnn_common.nn, the following issues or questions appeared:

If returnn_common.nn were used for construction within the manager:

  • A specific RETURNN and TensorFlow dependency is added to the manager environment
  • Loading TensorFlow slows down manager startup time significantly (especially when there is a slow filesystem somewhere)
  • The network construction logic is quite complex, so with a rising number of network constructions this will become slow (let's say 100 experiments with 20 pre-training stages each = 2000 construction calls for a manager/console startup).
  • The returnn_common version can be "pinned" before updating by using the git-hash-based Python import call from RETURNN, but the RETURNN version itself cannot be pinned. Nevertheless, the construction might be influenced by the RETURNN version, which makes handling network hashes correctly, the way we are used to, very difficult or even impossible.

As we will not solve the hashing issue without spending way too much time not doing experiments, I would propose a completely different approach from what we did before, which is:

-> Create a job which takes an executable and some pre-defined parameters and does the network construction as a task instead of in the manager.

This of course would mean that the construction result is not hashed, but only the constructor location and the parameters. To be able to find potentially unwanted changes and to be able to reconstruct the networks exactly, the job should create a local copy of the construction code that was used at runtime (RETURNN and returnn_common can of course be correctly identified by their git hash).

While this would require some manual management of the network construction to avoid logic-changing mistakes, I currently do not see any suitable alternative to this.
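For illustration, such a job could be a small Sisyphus Job along these lines (a hedged sketch only: the name ConstructNetworkJob, the construct() entry point, and the output files are hypothetical, not existing code):

import importlib
import shutil

from sisyphus import Job, Task


class ConstructNetworkJob(Job):
    """Runs the returnn_common network construction as a task instead of in the manager."""

    def __init__(self, construction_module: str, parameters: dict):
        self.construction_module = construction_module  # dotted module path of the user construction code
        self.parameters = parameters  # only this and the module path go into the Sisyphus hash
        self.out_network = self.output_path("network.py")
        self.out_code_copy = self.output_path("construction_code.py")

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        # RETURNN / TF / returnn_common are only imported here in the task,
        # so the manager never pays the import or construction cost
        module = importlib.import_module(self.construction_module)
        shutil.copy(module.__file__, self.out_code_copy.get_path())  # local copy for later reference
        net_code = module.construct(**self.parameters)  # assumed entry point returning serialized net code
        with open(self.out_network.get_path(), "w") as f:
            f.write(net_code)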

@Atticus1806
Contributor

I like this approach. A mistake in building the config that is noticed too late can then be fixed by calling tk.remove_job_and_descendants on the construction job.

@albertz
Member

albertz commented May 4, 2022

A specific RETURNN and TensorFlow dependency is added to the manager environment

I don't see how this is relevant.

The RETURNN version should not have any influence on returnn_common.nn at all (as long as it is new enough; otherwise it would simply crash with import errors, e.g. when Dim is not available).

Also the TensorFlow version should not have any influence on returnn_common.nn.

Loading TensorFlow slows down manager startup time significantly (especially when there is a slow filesystem somewhere)

We already discussed that here: rwth-i6/returnn_common#27

returnn_common.nn itself really uses the nesting logic from TF, which we could also replace.

returnn_common.nn also uses some data structures from RETURNN like Data and Dim. In principle, they could also be independent from TF. But this should be done on RETURNN side.

So, it's possible, although this would require some work.

But also, it sounds a bit like an artificial problem, which we could just directly solve (just make TF import fast), instead of putting lots of effort into it to work around it.

We also already discussed how to solve this directly, i.e. how to make TF import fast. There are many solutions to that.
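As an illustration of "many solutions" (this is just the generic lazy-loader pattern from the Python standard library, not necessarily what would be done here), the TensorFlow import can be deferred until the first attribute access:

import importlib.util
import sys


def lazy_import(name: str):
    """Return a module whose actual loading is deferred until the first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module


tf = lazy_import("tensorflow")  # cheap at manager startup; the real import happens on first use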

The network construction logic is quite complex, so with a rising number of network constructions this will become slow (let's say 100 experiments with 20 pre-training stages each = 2000 construction calls for a manager/console startup).

I don't think the net construction really adds any noticeable overhead in runtime. This should not be the case. It should be just as fast as before. If this is not the case, we can easily fix that. There is no reason why this should not be fast.

I also disagree that it is complex but this is maybe subjective. Anyway, complexity does not imply slowness.

Do you have any numbers on that? It really should not be slow. Not even 2000 calls, not even 1M calls. If you think this is slow, again, we can fix this. Please open an issue on this with some numbers.

The returnn_common version can be "pinned" before updating by using the git-hash-based Python import call from RETURNN, but the RETURNN version itself cannot be pinned. Nevertheless, the construction might be influenced by the RETURNN version,

No, it cannot be influenced by the RETURNN version.


Anyway, to your main proposal here: What I don't understand is: When you now have the construction code somewhere else (where exactly?), why do you need it at all as a separate job or task? Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

To be able to see the net dict more directly? So you make this single thing a bit easier (it would also be easy to dump it on-the-fly if you want to see it) while making the whole setup overall much more complicated?

@JackTemaki
Contributor Author

JackTemaki commented May 4, 2022

Setting the other comments aside (because we do not need to fix this if we can not work like this anyway).

I don't think the net construction really adds any noticeable overhead in runtime. This should not be the case. It should be just as fast as before. If this is not the case, we can easily fix that. There is no reason why this should not be fast.

I also disagree that it is complex but this is maybe subjective. Anyway, complexity does not imply slowness.

What "before"? There is no "before". We constructed the network dicts by merging some pre-defined dictionaries together, which is of course much faster than the new construction logic.

I just did the timing on one Transformer construction and it is definitely too slow (not in general, but to do it in the manager). Just one single construction was 2.3 seconds.

When you now have the construction code somewhere else

I am not sure how you read this out of my proposal, the location does not change at all.

Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

This is one way of doing it, using it with the `ReturnnTrainFromFileJob`, which of course would make all of the mentioned issues nonexistent. But this means you have limited interaction with Sisyphus, and your own defined modules always have to be copy&pasted in there (unless they are part of returnn_common) and you can have no additional helpers. But this is in a way saying "do not use Sisyphus for anything Returnn config related". Which is a valid approach, but not what should be discussed. So referring to the title of this issue your answer is "don't do it".

@albertz
Member

albertz commented May 4, 2022

What "before"? There is no "before". We constructed the network dicts by merging some pre-defined dictionaries together, which is of course much faster than the new construction logic.

Before = before you used returnn_common.nn. You have used Sisyphus before or not? You have created dictionaries before or not?

No, this is what I'm saying: It should not make any noticeable difference. Why should it? If there is really any noticeable slowdown, we can fix this. There is really no reason why there should be any noticeable slowdown due to the new construction. Even if you think it is complex.

I just did the timing on one Transformer construction and it is definitely too slow (not in general, but to do it in the manager). Just one single construction was 2.3 seconds.

What exactly did you measure? Did you maybe count the TensorFlow import? The construction itself really should not be slow.

When you now have the construction code somewhere else

I am not sure how you read this out of my proposal, the location does not change at all.

Then I don't understand your proposal. You proposed to do the construction in a job? So that means the location is now somewhere different, i.e. in the job. If you don't mean that, what do you mean then?

Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

This is one way of doing it, using it with the `ReturnnTrainFromFileJob`, which of course would make all of the mentioned issues nonexistent. But this means you have limited interaction with Sisyphus,

I don't see the difference to having the construction in a job, what you proposed here. What exactly is the difference?

and your own defined modules always have to be copy&pasted in there (unless they are part of returnn_common)

No, of course you don't need to copy&paste anything. Why? I don't understand.

Also, again, I don't understand how this is different to your proposal here.

and you can have no additional helpers.

Sure you can have. Why not?

And again I don't understand the difference to your proposal here.

But this is in a way saying "do not use Sisyphus for anything Returnn config related". Which is a valid approach, but not what should be discussed. So referring to the title of this issue your answer is "don't do it".

No. At the moment I just try to understand how your proposal really has any benefit, or what you are actually proposing, and how that is different from what I'm saying.

@JackTemaki
Contributor Author

It seems this topic is too complicated to be discussed in an issue here, as we are lacking some common ground, we should take this offline.

In short:

Why should it?

Because executing hundreds of lines of code is certainly slower than merging dicts.

What exactly did you measure?

    with nn.NameCtx.new_root() as name_ctx_network:
        net = BLSTMDownsamplingTransformerASR(...)
        out = net(...)
        out.mark_as_default_output()
        for param in net.parameters():
            param.weight_decay = 0.1
        serializer = nn.ReturnnConfigSerializer(name_ctx_network)
        base_string = serializer.get_base_extern_data_py_code_str()
        network_string = serializer.get_ext_net_dict_py_code_str(net, ref_base_dims_via_global_config=True)

and you can have no additional helpers.

Sure you can have. Why not?

You can not import from the recipes in a returnn config file. A job (as in: a generated job folder in work dir) has to be independent of any outside influence that is not part of the inputs.

So that means the location is now somewhere different, i.e. in the job

Just because the construction is called in a job does not mean the code that does the construction is in a different location (i.e. physically, as a file).

@albertz
Member

albertz commented May 4, 2022

It seems this topic is too complicated to be discussed in an issue here,

I don't see why this issue here is complicated.

as we are lacking some common ground, we should take this offline.

Sure, we can do that, but it would be good to have it in written form anyway here such that others can also join the discussion or understand decisions later on. So it would be anyway good if you could answer my questions.

I probably will not be able to come to the office today in person. But probably tomorrow.

In short:

Why should it?

Because executing hundreds of lines of code is certainly slower than merging dicts.

But this is what I mean about noticeable. Hundreds of lines of code should still execute in the nano-seconds range. This is still very fast. Unless sth is wrong.

What exactly did you measure?

    with nn.NameCtx.new_root() as name_ctx_network:
        net = BLSTMDownsamplingTransformerASR(...)
        out = net(...)
        out.mark_as_default_output()
        for param in net.parameters():
            param.weight_decay = 0.1
        serializer = nn.ReturnnConfigSerializer(name_ctx_network)
        base_string = serializer.get_base_extern_data_py_code_str()
        network_string = serializer.get_ext_net_dict_py_code_str(net, ref_base_dims_via_global_config=True)

This might involve imports.

Otherwise, I don't see why this should not be fast. If there is anything slow, we can certainly fix this. Some profiling would be good.
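For example, a minimal sketch of how to isolate the construction time itself (all imports done beforehand, so they are not counted):

import cProfile
import pstats


def profile_construction(construct_fn):
    """Profile a single network construction call, excluding module import time."""
    profiler = cProfile.Profile()
    profiler.enable()
    construct_fn()  # e.g. the nn.NameCtx.new_root() block quoted above
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)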

and you can have no additional helpers.

Sure you can have. Why not?

You can not import from the recipes in a returnn config file. A job (as in: a generated job folder in work dir) has to be independent of any outside influence that is not part of the inputs.

I don't understand. Sure you can import from the recipes, or from anywhere else. What stops you from doing that?

Sure a job is influenced by many other things, like the content of the filesystem, Sisyphus, Python, the env, the Python env, etc. The recipes can just be part of the Python env.

I don't really understand your argument here. You say you cannot have helpers because of some artificial constraint you are making up that there should be no shared common code, i.e. there should be no helpers? So you say you cannot have helpers because there must not be helpers. But why? This does not make sense to me.

I also still don't understand how the situation is different to what you propose. You propose to do the net construction inside a job, or not? So where does the code for the net construction come from?

So that means the location is now somewhere different, i.e. in the job

Just because the construction is called in a job does not mean the code that does the construction is in a different location (i.e. physically, as a file).

Yes, exactly as I'm saying. You can do it as you want. It does not matter. Although I'm not really sure if you propose to change that or not, or where exactly you would want the code to be. What do you actually mean when you write "constructor location"?

What changes is where the code is executed, e.g. inside a job or in Sisyphus. E.g. inside the net construction job or in the ReturnnTrainJob.

And again, how does the situation change when this is inside a new custom net construction job, or just directly in the ReturnnTrainJob? I don't see how there is any difference.

@JackTemaki
Contributor Author

This might involve imports.

I called this repeatedly to make sure this is not the case.

Sure a job is influenced by many other things, like the content of the filesystem, Sisyphus, Python, the env, the Python env, etc.

All accessed content should be local in the job folder or come from the input folders. Sure, if you manually edit those it will break, but this is not something commonly done, so the first thing can be excluded. Functions from Sisyphus are usually not (and should not be) used inside job tasks, except for path resolving. Python: yes, Python can have an influence, but people usually do not change their major Python version while working with the same setup folder. The env: yes, the env can change, but optimally (this is unfortunately not the case yet) you have a job for the env, and then this is not an issue.

So all the things you listed are not something you alter on a daily basis. Your personal recipe code is edited on a daily basis, as this is what you work with. Importing that externally is a very bad idea.

that there should be no shared common code,

I did not say that. This is what returnn_common is for, or not? And this can be safely imported, as it would come from a git-clone job. But you should not access your own recipe code. Of course you could include that in your Python path; I am just saying this is not a good idea, because if you change something, running the same job again will result in something different.

I don't see how there is any difference.

The difference is that this job would need a different kind of parameter passing: it would need fixed parameters for returnn and returnn_common, and it would need to make a local copy of the files that were used to build the network.

@albertz
Member

albertz commented May 4, 2022

I think I don't really understand what code you are talking about when you say net construction code, and where you would have this code. I was not talking about returnn_common itself. I was referring to the code where you actually put it together to construct the network. Or other helper code you use. Anything basically by the user. And when you do research on modeling aspects, this is what you would constantly change.

In any case, wherever this code is located (it does not matter), what difference does it make whether it is executed in NetConstructionJob or directly in ReturnnTrainJob? I don't see how there is any difference. What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

@JackTemaki
Contributor Author

JackTemaki commented May 4, 2022

What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

I do not want to alter the ReturnnTrainingJob that much (better not at all). This job can not:

  • specify the import paths for where the construction code is located
  • manage the returnn_common dependency
  • create a local copy of the construction code for later reference

But also I do not fully know how a NetConstructionJob would do that. Otherwise I would not have put this up for discussion but just implemented it right away to have something to start experiments with. Anyway, this is less about if this is a separate NetConstructionJob or somehow integrated in ReturnnTrainingJob or ReturnnTrainingFromFileJob, as the problems that need to be solved are similar. But more about if we go for "within" Sisyphus net construction (which we originally wanted to have) or external net construction. And from what I see a complete "within" Sisyphus net construction is not realistically feasible, so the question is how to get as close as possible to that.

@albertz
Member

albertz commented May 4, 2022

What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

I do not want to alter the ReturnnTrainingJob that much (better not at all).

Why not? This could simply be an extension (if any change/extension is really needed, but see below).

So, just because you don't want to alter ReturnnTrainingJob, you want to have a new separate job? This is the reason here? This does not make any sense to me. Why?

But despite, I don't think any change or extension is needed (see below).

This job can not:

  • specify the import paths for where the construction code is located
  • manage the returnn_common dependency
  • create a local copy of the construction code for later reference

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.
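For illustration only (the file names and config values here are placeholders meant to show one possible wiring, not the solution that was adopted later in this thread), putting the construction code into a ReturnnConfig could look roughly like this:

from i6_core.returnn import ReturnnConfig

# hypothetical input: the user's construction code, read in as a plain string
with open("my_construction.py") as f:
    construction_code = f.read()

config = ReturnnConfig(
    config={"use_tensorflow": True},  # placeholder config entries
    python_epilog=[
        "import sys",
        "sys.path.insert(0, '/path/to/returnn_common')",  # e.g. a checked-out returnn_common
        construction_code,
    ],
)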

Also, I still don't know where you actually would have your construction code located. Inside recipes? Outside recipes? Where exactly?

But also I do not fully know how a NetConstructionJob would do that.

I don't understand. In this issue here, you wrote:

Create a job which takes an executable and some pre-defined parameters and does the network construction as a task instead of in the manager.

So, I just gave this job a name, and called it NetConstructionJob.

So, what do you mean? Do you propose to add such job or not?

Or what is the problem? Or you say you don't know yet how to implement it?

Anyway, this is less about if this is a separate NetConstructionJob or somehow integrated in ReturnnTrainingJob or ReturnnTrainingFromFileJob, as the problems that need to be solved are similar.

Right. That is what I have been saying all the time. Actually not just similar but exactly the same. That is why I don't understand why you want to have a separate job. I don't understand the advantage. I keep asking about this but so far you have not answered why it should be or must be a separate job, or why it can't be together in the ReturnnTrainingJob.

But more about if we go for "within" Sisyphus net construction (which we originally wanted to have) or external net construction. And from what I see a complete "within" Sisyphus net construction is not realistically feasible, so the question is how to get as close as possible to that.

Ah ok. I thought this issue is not about discussing that, but about proposing actually to have it external.

On the aspect whether the complete within Sisyphus net construction is feasible: Again, I don't see why not. Whatever you think is slow can (and should) be fixed. There is no reason why it should be slow.

On the question of how to do it externally (no matter if inside ReturnnTrainingJob or in a separate job): I still don't understand how you actually propose it. Where exactly would you have the code of the net construction? How does this code get into the job?

@JackTemaki
Contributor Author

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.

So you suggest to extend the ReturnnConfig code to cover that logic? This is also a possibility, but then do not say wrong as in "The job can not do that". It can not do that right now in a somewhat understandable user-friendly way. Of course we can change any code to perform any logic we want. If you are talking about just passing this in prolog or epilog, this surely is not a good idea because then you need to work with DelayedFormat which adds unnecessary complexity, and also it is not obvious from the interface how you should work with it.

I keep asking about this but so far you have not answered why it should be or must be a separate job, or why it can't be together in the ReturnnTrainingJob

It is adding code and parameters to the job that from my point of view do not belong in there. If there is an extra job the interface for the Job and the ReturnnConfig needs nearly no changes (only to accept one more type, which is Path).

So, what do you mean? Do you propose to add such job or not?
Or what is the problem? Or you say you don't know yet how to implement it?

I am raising the possibility this could be a solution (I am not sure myself). There is no problem, I just do not see any reason yet to start implementing this as long as there is no full concept to have in mind.

There is no reason why it should be slow.

I think there is, but let's exclude that here. This is also only one aspect.

Where exactly would you have the code of the net construction?

Somewhere under i6_experiments.users.rossenbach

How does this code get into the job?

One possibility of many would be providing a Path object to the package of all construction modules and another Path object to point to a python executable which accepts all dynamic parameters and does contain the actual construction.

@albertz
Member

albertz commented May 4, 2022

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.

So you suggest to extend the ReturnnConfig code to cover that logic?

No, you also don't need an extension to ReturnnConfig. You can put in what you want.

This is also a possibility, but then do not say wrong as in "The job can not do that".

I don't understand. I did not say the job can not do that. You said that, and I said this is wrong, because the job can do that, as I described.

It can not do that right now in a somewhat understandable user-friendly way. Of course we can change any code to perform any logic we want. If you are talking about just passing this in prolog or epilog, this surely is not a good idea because then you need to work with DelayedFormat which adds unnecessary complexity, and also it is not obvious from the interface how you should work with it.

About understandable/user-friendly: Sure, but this can easily be solved.

It is adding code and parameters to the job that from my point of view do not belong in there. If there is an extra job the interface for the Job and the ReturnnConfig needs nearly no changes (only to accept one more type, which is Path).

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

So, what do you mean? Do you propose to add such job or not?
Or what is the problem? Or you say you don't know yet how to implement it?

I am raising the possibility this could be a solution (I am not sure myself). There is no problem, I just do not see any reason yet to start implementing this as long as there is no full concept to have in mind.

Ok. As said, this was not clear from the issue description to me. It sounds like you were proposing this specific solution.

There is no reason why it should be slow.

I think there is, but let's exclude that here. This is also only one aspect.

I don't understand. I thought this is the one and only reason why you need this at all? Assume that this can be fixed. Then the whole discussion here in this issue is obsolete or not? As I understand you, you would anyway even prefer this?

Where exactly would you have the code of the net construction?

Somewhere under i6_experiments.users.rossenbach

So, inside the recipes. So it means the job accesses the recipes. But this contradicts what you wrote earlier that you don't want that?

How does this code get into the job?

One possibility of many would be providing a Path object to the package of all construction modules and another Path object to point to a python executable which accepts all dynamic parameters and does contain the actual construction.

I don't exactly understand. Can you be more specific? Maybe give an example? Python executable, you mean /usr/bin/python3? I don't really understand how the net construction code gets in there. A Path to i6_experiments.users.rossenbach?

@JackTemaki
Contributor Author

We can leave the rest for now, it does not make sense to discuss, but:

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

If there is a solution that would need no code changes at all this would be a good starting point to do any experiments. So please tell me how this should work.

@albertz
Member

albertz commented May 4, 2022

We can leave the rest for now, it does not make sense to discuss

I don't understand. Surely all the rest is very relevant here as well, and very important to discuss? Especially, and most importantly, maybe this whole issue here is actually obsolete, as I explained? And I thought this is the main question this issue is actually about, as you explained?

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

If there is a solution that would need no code changes at all this would be a good starting point to do any experiments. So please tell me how this should work.

But I thought the first question is whether this is needed at all, i.e. whether a within-Sisyphus construction is not possible? Now you seem to assume again that it is needed.

There are many trivial ways to get whatever data/code with whatever dependencies into a ReturnnConfig. You should know that. I wonder a bit what the problem is.

But the more relevant question, which I keep asking and which is still not clear to me, is: Where exactly is the net construction code? And what exactly do you want to pass to the job? The code itself (a copy of it or so?), or a Path to it?

Maybe you can just give an example of that? Ignore the part about how exactly it ends up in the ReturnnTrainingJob. Just show the part where you have the net construction code, and what exactly would be the input of the job (the code itself, or a Path, or whatever).

@tbscode
Contributor

tbscode commented May 10, 2022

Haven't read the whole discussion, but I did try to use returnn_common.nn with my setup and experienced similar problems to those @JackTemaki mentioned here.

What I did was:

1 ) Test the network construction via tests.returnn_helpers.config_net_dict_via_serialized and engine.init_train_from_config
2 ) Then obtain the construction code via inspect.getsource(...)
3 ) Add the code to a ReturnnConfig and use that with a ReturnnTrainingJob

Steps 1 & 2 could be done by some sort of net construction job.
This could output a ReturnnConfig that contains the construction code (which is already tested).
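A rough illustration of steps 2 and 3 (the function name and the config values are placeholders, not the actual setup):

import inspect

from i6_core.returnn import ReturnnConfig


def construct_network():
    """Step 1: the already tested construction code (the body is a placeholder here)."""
    from returnn_common import nn  # imported inside the RETURNN config, not in the manager
    # ... build the network with nn and register outputs/losses ...


# step 2: obtain the construction code as a string
construction_code = inspect.getsource(construct_network)

# step 3: embed it in a ReturnnConfig, which a ReturnnTrainingJob can then consume
returnn_config = ReturnnConfig(
    config={"use_tensorflow": True},  # placeholder config entries
    python_epilog=[construction_code, "construct_network()"],
)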

I like the idea of a net construction job; in my view this would solve the following problems:

  • startup time for the sis manager (the construction job would be a local job)
  • no hash changes when returnn_common changes (because the construction code is used in the config)
  • No training is ever submitted with a faulty train network.
    (the net construction job could even already provide some metrics on the network: num params, shapes, axes)

In this case the output of the net construction job could either be a returnn config file (so a Path) or a ReturnnConfig (so a Variable, which might be preferable to use with ReturnnTrainingJob). The input would then be the construction code.

I am not using returnn_common in my setup currently (due to some time constraints and not being too familiar with it).
Just leaving this here as a comment.

@albertz
Member

albertz commented May 11, 2022

Note from personal discussion with @JackTemaki:

We came to the conclusion that such net construction job is indeed not needed and we can directly do it in the ReturnnTrainingJob. Running the returnn-common (RC) code would not happen in the Sisyphus manager but via Delayed directly in the ReturnnTrainingJob. (I still don't fully understand why it was not clear that this is possible, or can be made possible, so why we had the suggestion to make a separate job for this.)

The RC code with the model definition would be inside the users recipe dir, i.e. somewhere in i6_experiments.users.

There was the suggestion to automatically copy the model definition file into the job work dir locally such that any modifications to it afterwards do not interfere with the job. But it's not clear whether this is a good idea, or whether this is even so simple, because the model definition might not be a single file but also access other files from the recipes.

What remains are many small details. E.g. as said, we could pass a Delayed to the ReturnnTrainingJob. More specifically, to the ReturnnConfig, to python_epilog. The Delayed would contain another object, sth like ReturnnCommonModelSerializer or so. This could define its own custom hash, or use the default Sisyphus hashing. The ReturnnTrainingJob hash depends also on this object. It could roughly look like:

from typing import Any, Dict, Optional


class ReturnnCommonModelSerializer:
  def __init__(self, module_name: str, *, func_name: str = "main", func_kwargs: Optional[Dict[str, Any]] = None):
    self.module_name = module_name
    self.func_name = func_name
    self.func_kwargs = func_kwargs or {}

  def get(self) -> str:
    """
    This is called by Delayed.get.
    """
    from returnn_common import nn
    import importlib
    module = importlib.import_module(self.module_name)
    func = getattr(module, self.func_name)
    root_module = func(**self.func_kwargs)
    return nn.get_returnn_config().get_complete_py_code_str(root_module)

  def py_code_direct(self) -> str:
    return "\n".join([
      "from returnn_common import nn",
      f"import {self.module_name}",
      f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})",
      "py_code = nn.get_returnn_config().get_complete_py_code_str(root_module)",
      "exec(py_code, globals())",
      ""
    ])

This is a very simplistic draft. This ignores pretraining now. Also, maybe we should separate extern data from model definition logic more anyway? Extern data might also partially be auto-generated from dataset?

In this example, the hash would be defined via module_name, func_name, func_kwargs.

The usage in Sis would maybe look like this:

model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[Delayed(model_def), ...])
train_job = ReturnnTrainingJob(config, ...)

Or if we don't want Delayed but really on-the-fly execution, it could look like:

model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[model_def.py_code_direct(), ...])
train_job = ReturnnTrainingJob(config, ...)

This example would not copy the file. In the first case with Delayed, it would still explicitly create and write the net dict to the config. This is run in the create_files task, before the run task. In the second case with py_code_direct, it would directly run the code on-the-fly in the run task.

Again, as said, many details. It also depends on how your workflow should look, e.g. during debugging. Currently (before RC) you could just go to the work dir and run rnn.sh manually, after maybe editing some of the files. I think we still want this easy workflow. But with RC, you likely would not edit the net dict. So having this separate task (or even a separate job) to create the net dict makes this workflow more complicated. In the second case with py_code_direct, you could directly edit the RC model def code.

@tbscode:

2 ) Then obtain the construction code via inspect.getsource(...)

I don't understand what you need this for. You already constructed the code, so you already have it, so why do you need to obtain it so indirectly again?

@JackTemaki
Contributor Author

@tbscode As I currently have other deadlines I postponed working on this, but the returnn_common integration will be available in 3-4 weeks I guess (after some more extensive testing by @Atticus1806 maybe).

I already have partial code / a concept for extending the above-mentioned idea by @albertz to also support pre-training and using tk.Variables or tk.Paths as flexible parameters. Also note that Delayed is not fully supported as input to python_epilog yet, see rwth-i6/i6_core#264

@tbscode
Contributor

tbscode commented May 11, 2022

Ok I see. Yes, ReturnnCommonModelSerializer also sounds like a good solution; I wasn't aware that Delayed existed (or anything of that sort).

I don't understand what you need this for. You already constructed the code, so you already have it, so why do you need to obtain it so indirectly again?

It was convenient to use getsource since I could have the whole network definition (with dim tags and co) in one function that I would pass to a test_net_create_config function, which would construct the model and output the ReturnnConfig. (Also, I prefer this method over having the code in a string or external file, so I get syntax checking in the same script.)

As I currently have other deadlines I postponed working on this, but the returnn_common integration will be available in 3-4 weeks I guess (after some more extensive testing by @Atticus1806 maybe).

I will also not use returnn_common in my current experiments, mainly due to time constraints. Surely looking forward to using it in a future setup though.

@albertz
Member

albertz commented May 11, 2022

I thought a bit further. Some loose thoughts:

I want to decouple things more, like the extern_data, the model def and the loss, so that I can combine them individually. This is to reduce the number of experiment files. Each of these would be specified via sth like ReturnnCommonModelSerializer, as some module_name: str, which is usually sth like i6_experiments.users.zeyer.exp.... And then maybe additionally some function or class or so, e.g. func_name: str, func_kwargs: Dict[str, Any]. The tuple (module_name, func_name, func_kwargs) would define the hash.

More specifically:

I definitely want to decouple the extern_data definition (and its related dim tags) from the rest.
Maybe also keep the input and the target separate.

The model definition should also be decoupled from the extern_data and also from the loss definition (partly, optionally, maybe).

I want to have a common model API for hybrid HMM and CTC-like setups, sth like:

class Model(nn.Module):
  def __init__(self, out_dim: nn.Dim, in_dim: nn.Dim, **kwargs):
    ...

  @nn.scoped
  def __call__(self, x: nn.Tensor, *, in_spatial_dim: nn.Dim) -> Tuple[nn.Tensor, nn.Dim]:
    ...
    return y, out_spatial_dim

Or actually maybe just ISeqFramewiseEncoder and ISeqDownsamplingEncoder.

I would expect that logits come out of this, and out_dim would be the number of labels (maybe including blank or not).

Then separately I would have the loss definition. This could look like:

# this is elsewhere:
model = Model(out_dim=output_dim + 1, in_dim=input_dim)  # +1 for blank
inputs = nn.get_extern_data(nn.Data("data", dim_tags=[nn.batch_dim, time_dim, input_dim]))
logits, out_spatial_dim = model(inputs, in_spatial_dim=time_dim)
targets = nn.get_extern_data(nn.Data("classes", dim_tags=[nn.batch_dim, targets_time_dim], sparse_dim=output_dim))

# loss:
loss = nn.ctc_loss(logits=logits, targets=targets)
loss.mark_as_loss()
decoded, decoded_spatial_dim = nn.ctc_greedy_decode(logits, in_spatial_dim=out_spatial_dim)
error = nn.edit_distance(a=decoded, a_spatial_dim=decoded_spatial_dim, b=targets, b_spatial_dim=targets_time_dim)
error.mark_as_loss(as_error=True, custom_inv_norm_factor=nn.length(targets, axis=targets_time_dim))

The model (network) def maybe could be split further, e.g. by having some preprocessing like SpecAugment.

All that would go into the python_epilog. Additionally with some boilerplate code in between, which is not supposed to be hashed. And I would explicitly set python_epilog_hash.

Maybe I would introduce specifically:

class NonhashedCode:
  def __init__(self, code: str):
    self.code = code

And then it could look like:

# Define some training exp:
extern_data = ReturnnCommonModelSerializer(...)
model = ReturnnCommonModelSerializer(...)
boilerplate1 = NonhashedCode(...)
loss = ReturnnCommonModelSerializer(...)
boilerplate2 = NonhashedCode(...)
epilog = [extern_data, model, boilerplate1, loss, boilerplate2]

# common function:
def train(epilog, version=1):
  epilog_hash = (version,) + tuple(obj for obj in epilog if not isinstance(obj, NonhashedCode))
  epilog = [obj.py_code_direct() if isinstance(obj, ReturnnCommonModelSerializer) else obj for obj in epilog]
  epilog = [obj.code if isinstance(obj, NonhashedCode) else obj for obj in epilog]
  
  config = ReturnnConfig(
    ...,
    python_epilog=epilog, python_epilog_hash=epilog_hash,
    ...)
  job = ReturnnTrainingJob(config, ...)
  ...

There are still some further details to be clarified, or maybe slightly fixed in the suggestions above.

@JackTemaki
Contributor Author

This is not too far away from my current code, just some comments:

The data handling obviously needs to be separated from the model construction, but be aware that calling py_code_direct() is not possible in this case, as this breaks data and e.g. label size dependencies. ReturnnCommonModelSerializer should correctly resolve variables in the task, not outside (resolving them outside would invalidate the fundamental concept of Sisyphus that all connections between jobs are passed as-is). The problem is that the current Sisyphus still allows you to do that; you will just get broken graphs (unfortunately many old setups rely on that because no one ever checked this). There is DELAYED_CHECK_FOR_WORKER which can be set to enforce correct graphs, but this is unfortunately not fully supported yet.

The manual handling of epilog/epilog_hash is also not needed, this should be done automatically by making ReturnnCommonModelSerializer and NonhashedCode define their own hashes correctly.

So basically the code can be similar to what you wrote, just the three lines after def train are not required. The "version" can (or maybe better: should, so that it is visible in the job) be written into the dict of the ReturnnConfig.
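A minimal sketch of what "define their own hashes" could mean, assuming Sisyphus' _sis_hash hook and sis_hash_helper (the class bodies are illustrative, not the actual helper code):

from sisyphus.hash import sis_hash_helper


class NonhashedCode:
    """Boilerplate code that should not influence the job hash."""

    def __init__(self, code: str):
        self.code = code

    def _sis_hash(self):
        # constant hash: changing the boilerplate content never changes the job hash
        # (adding or removing such objects would still change it, see the discussion below)
        return b"NonhashedCode"


class ReturnnCommonModelSerializer:
    def __init__(self, module_name: str, func_name: str = "main", func_kwargs=None):
        self.module_name = module_name
        self.func_name = func_name
        self.func_kwargs = func_kwargs or {}

    def _sis_hash(self):
        # only the explicit (module_name, func_name, func_kwargs) tuple defines the hash
        return sis_hash_helper((self.module_name, self.func_name, self.func_kwargs))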

@albertz
Member

albertz commented May 12, 2022

The data handling obviously needs to be separated from the model construction, but be aware that calling py_code_direct() is not possible in this case, as this breaks data and e.g. label size dependencies. ReturnnCommonModelSerializer should correctly resolve variables in the task, not outside (resolving them outside would invalidate the fundamental concept of Sisyphus that all connections between jobs are passed as-is).

Right. But this would only be for the extern data part.

The manual handling of epilog/epilog_hash is also not needed, this should be done automatically by making ReturnnCommonModelSerializer and NonhashedCode define their own hashes correctly.
So basically the code can be similar to what you wrote, just the three lines after def train are not required.

It depends on how you pass it. When you pass it as Delayed, then yes, you could just define it. When you use py_code_direct, then you need a custom hash. In my case, I would not use Delayed as I described above because I want the simple debugging workflow as I explained. I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns. But this is then two times an indirection, and I'm not sure if this is so straightforward.

The "version" can (or maybe better: should, so that it is visible in the job) be written into the dict of the ReturnnConfig.

This was intended as a convention to be used specifically for the boilerplate code in epilog. Whenever I would change the boilerplate code in this example, which is intentionally not hashed, in some way which potentially could change the behavior, I would increase the version, to get a new hash.

Btw, according to @michelwi in rwth-i6/i6_core#266 (comment), I anyway could not set python_epilog_hash this way because it must be a str?

@JackTemaki
Contributor Author

I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns

I think this is the only valid solution; otherwise this will be a little bit messy, not treating all "objects" passed to epilog the same. At least I would not understand why some things have to be converted to a string outside while others can be passed as-is.

I anyway could not set python_epilog_hash this way because it must be a str?

While this is certainly not the way it was intended, this is fine to change.

I will spend Monday with @Atticus1806 fully building and testing a pipeline that covers all your and our needs, which are:

  • returnn_common code is placed as-is in the config to have it directly editable for debugging
  • separate the construction of extern_data for more flexibility and to allow syncing it directly with the code for constructing the datasets
  • allow for the correct resolution of Sisyphus variables (tk.Path / tk.Variable)
  • allow to define the pre-training stages via get_network either within the model code or externally via Sisyphus (not sure if the latter makes sense for any experiment, but then you could store different pre-training schemes independent of the model definition).
  • Simplistic helper objects that are based on Delayed and passed to prolog/epilog, which can be extended and control both the hashing and the correct placement of code in the final returnn.config (boilerplate, extra specaug code, random stuff).
    These helpers are responsible for adding the surrounding code to correctly construct extern_data and the get_network function and to place all external parameters where they should be (basically what my old construction job does except for the net to dict conversion, just not as standalone job).

Optionally we can look at:

  • placing code in the returnn.config that automatically dumps the full network dict to some file at each execution for debugging purposes (or not at each execution but just during config creation, but then directly for all dynamic stages); see the sketch below
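A possible shape for that (purely a sketch; it assumes the get_network convention from the template shown further below, with nn, construct_network and network_parameters coming from the generated config):

def get_network(epoch, **kwargs):
    nn.reset_default_root_name_ctx()
    net = construct_network(epoch, **network_parameters)
    net_dict = nn.get_returnn_config().get_net_dict_raw_dict(net)
    # dump the constructed dict next to the config for debugging
    with open("network_dump.epoch_%03d.py" % epoch, "w") as f:
        f.write(repr(net_dict))
    return net_dict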

So far I do not see any conflict in having a solution that works for all. Well, except maybe for the network hashing part, but I think this is something we can live with not having for now; everyone just has to be careful with their version parameter or rename the model file to some ..._v2 name in case everything should re-run.

@albertz please intervene here if something is unclear or wrong in your eyes.

The only thing that is not fully clear is how some of the more independent dataset/datastream parts are going to be integrated, but this is only about personal structure/style preference and not about any technical limitation.

@albertz
Member

albertz commented May 13, 2022

I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns

I think this is the only valid solution; otherwise this will be a little bit messy, not treating all "objects" passed to epilog the same. At least I would not understand why some things have to be converted to a string outside while others can be passed as-is.

Well, there is at least the difference that some of these are "boilerplate" which should not influence the hash in any way, by intention, and then the others which influence the hash, but only in the explicitly defined way, e.g. via an explicit module_name etc.

  • separate the construction of extern_data for more flexibility and to allow syncing it directly with the code for constructing the datasets

It's still not totally clear to me how you map this to the model inputs and the loss.

Maybe we would have the standard case of just data and classes and anything more custom would then also need more custom treatment.

  • Simplistic helper objects that are based on Delayed and passed to prolog/epilog, which can be extended and control both the hashing and the correct placement of code in the final returnn.config (boilerplate, extra specaug code, random stuff).

I want the boilerplate behavior to be just the same as if it were not there. I want to be able to add further such objects into it without any change in the hash. So this means this is different from a boilerplate object where _sis_hash returns None or so.

So far I do not see any conflict in having a solution that works for all. Well, except maybe for the network hashing part, but I think this is something we can live with not having for now; everyone just has to be careful with their version parameter or rename the model file to some ..._v2 name in case everything should re-run.

Sure. We should put any such relevant helpers then maybe to i6_experiments/common.

@JackTemaki
Contributor Author

JackTemaki commented May 16, 2022

We now have something (partially) working, but I am quite sure that in some details this does not meet exactly what you imagined, so we should continue the discussion. My current get_config looks like this:

def get_config(
        returnn_common_root,
        training_datasets,
        **kwargs):
    """
    :param tk.Path returnn_common_root:
    :param TrainingDatasets training_datasets:
    """

    # changing these does not change the hash
    post_config = {
        [....],
        'debug_print_layer_output_template': True,
    }
    config = {
        [...]
        'optimizer': {'class': 'Adam', 'epsilon': 1e-8},
        'accum_grad_multiple_step': 2,
        'batch_size': 10000,
        'max_seqs': 200,
        [...]
    }

    from i6_experiments.users.rossenbach.returnn.nnet_constructor import ReturnnCommonSerializer,\
        ReturnnCommonExternData, ReturnnCommonDynamicNetwork, NonhashedCode

    network_file = Path("get_transformer_network.py")

    extern_data = [
        datastream.as_nnet_constructor_data(key) for key, datastream in training_datasets.datastreams.items()]

    config["train"] = training_datasets.train.as_returnn_opts()
    config["dev"] = training_datasets.cv.as_returnn_opts()
    config["eval_datasets"] =  {'devtrain': training_datasets.devtrain.as_returnn_opts()}

    recursionlimit = NonhashedCode(code=RECURSION_LIMIT_CODE)
    rc_extern_data = ReturnnCommonExternData(
        extern_data=extern_data
    )
    rc_network = ReturnnCommonDynamicNetwork(
        network_file=network_file,
        data_map={"source_data": "audio_features",
                  "target_data": "bpe_labels"},
        parameter_dict={}
    )
    serializer = ReturnnCommonSerializer(
        delayed_objects=[recursionlimit, rc_extern_data, rc_network],
        returnn_common_root=returnn_common_root,
    
    returnn_config = ReturnnConfig(
        config=config,
        post_config=post_config,
        python_epilog=[serializer],
    )
    return returnn_config

So this is close to what you imagined, except that we now have one Serializer which takes a list of custom objects/code etc. that do the returnn_common serialization for us. This allows users to use quite strict modules like the ones I presented here (which do exactly what I would like to have), but you can also write just any custom code you want and have fewer things managed by helpers.

This code still uses the data helpers I used before, which we might want to replace. I do like the idea though of having the data helpers more Sisyphus specific, as this is where the data comes from. This of course does not prohibit us from moving the "Dataset" (not Datastream) helpers to returnn_common, as those are definitely Sisyphus independent.

Another thing is that here the extern_data generation does not make use of returnn_common code, because it was just easier now to directly let the helper write the "code" definition instead of first creating "real" nn.Dim + nn.Data objects, and then calling the get_base_extern_data_py_code_str to transform it back into a string.

Maybe we would have the standard case of just data and classes

I strongly oppose this idea. You should always give understandable names to your inputs, and when using custom names you never trigger any custom behavior that was specifically written for those two entries. (Yes, we could remove this custom behavior with a new behavior version, but still it is better to just give good names and not ones which only make sense for basic ASR tasks).

@JackTemaki
Contributor Author

JackTemaki commented May 16, 2022

As additional information, the ReturnnCommonDynamicNetwork only fills in this template (so you could just replace it by some template code for now):

        template = textwrap.dedent("""\
        
        network_parameters = ${NETWORK_PARAMETERS}
        
        ${NETWORK_CODE}
        
        def get_network(epoch, **kwargs):
            nn.reset_default_root_name_ctx()
            net = construct_network(epoch, **network_parameters)
            return nn.get_returnn_config().get_net_dict_raw_dict(net)

        """)

@albertz maybe you have a good idea how to make this more flexible, because otherwise things like a custom add_loss(net) call or something would need to be added as additional parameters to ReturnnCommonDynamicNetwork...

I also do not like the "conventions", that are not obvious to the user, meaning that you need to have a construct_network function in your model code file. But maybe this is less problematic than I think it is.
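For what it's worth, one conceivable direction (purely a sketch, assuming the template is filled via string.Template-style placeholders; the extra placeholder names are made up) would be to make the entry point and an optional post-construction hook template parameters themselves, so no fixed convention is hard-coded:

import textwrap

template = textwrap.dedent("""\

network_parameters = ${NETWORK_PARAMETERS}

${NETWORK_CODE}

def get_network(epoch, **kwargs):
    nn.reset_default_root_name_ctx()
    net = ${CONSTRUCT_FUNC}(epoch, **network_parameters)
    ${POST_CONSTRUCTION_HOOK}
    return nn.get_returnn_config().get_net_dict_raw_dict(net)
""")
# CONSTRUCT_FUNC could default to "construct_network"; POST_CONSTRUCTION_HOOK could be
# e.g. "add_loss(net)" or simply "pass" if nothing extra is needed.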

@albertz
Member

albertz commented May 16, 2022

Where can I see your code? Why is it not pushed?

How is ReturnnCommonDynamicNetwork defined?

What is NETWORK_CODE? Why do you need that?

What is NETWORK_PARAMETERS and parameter_dict? Are these really params or rather kwargs? You should use the standard Python and ML terminology. Parameters are usually model parameters. Use kwargs if you mean kwargs.

I also do not like the "conventions", that are not obvious to the user, meaning that you need to have a construct_network function in your model code file. But maybe this is less problematic than I think it is.

Why would you hardcode that? As I wrote before in my example, next to the module name (module_name = __package__ + ".get_transformer_network" in your example), you would also pass func_name. So func_name = "construct_network" in your example.

Where is the code which does the import {module_name}?
Then later I would expect sth like (as you see in my example above):

f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})"

How does the data_map work? How is the extern data connected with the model? This is unclear.

Maybe we would have the standard case of just data and classes

I strongly oppose this idea. You should always give understandable names to your inputs, and when using custom names you never trigger any custom behavior that was specifically written for those two entries.

In your example, you also did just the same but now it is called source_data and target_data. I don't see how that is different.

Where and how do you define the loss, and connect the model output and extern data with the loss?

How is NonhashedCode defined?

How is ReturnnCommonSerializer defined?

How is ReturnnCommonExternData defined?

The ReturnnCommonSerializer call misses a ) at the end?

This code still uses the data helpers I used before, which we might want to replace. I do like the idea though of having the data helpers more Sisyphus specific, as this is where the data comes from. This of course does not prohibit us from moving the "Dataset" (not Datastream) helpers to returnn_common, as those are definitely Sisyphus independent.

You probably refer to #55. I still think this should all be in returnn_common, esp also Datastream. I mean specifically all the data structures. Not the pipeline logic. The pipeline logic (which covers where the data comes from) should be here in i6_experiments/common.

Another thing is that here the extern_data generation does not make use of returnn_common code, because it was just easier now to directly let the helper write the "code" definition instead of first creating "real" nn.Dim + nn.Data objects, and then calling the get_base_extern_data_py_code_str to transform it back into a string.

I don't really understand. How do you get the nn.Datas and nn.Dims then, and how do you make the nn.get_extern_data(...) calls?

@JackTemaki
Contributor Author

I can not just push it in common without review or making an extra branch, so I first put it under my user. Otherwise my Hiwis (student assistants) cannot use it easily.

You can of course copy it there, make your own adjustments and then start a PR for discussion. I just did not have any capacity for that yet.

@albertz
Member

albertz commented Jun 9, 2022

I can not just push it in common without review

Why not? I thought we discussed this already? We would put a readme saying "this is work in progress" or so. I think this would be the easiest workflow.

Sure, we could also have it in a separate extra branch. But not sure if there is really any benefit.

@albertz
Member

albertz commented Jun 13, 2022

Some initial code was merged as part of PR #66. The code is now in i6_experiments.common.setups.returnn_common, here: https://github.com/rwth-i6/i6_experiments/tree/main/common/setups/returnn_common

I added some summary from the discussion here to the README.

@albertz
Member

albertz commented Jun 13, 2022

The dataset stuff is still somewhat an open question. We had some initial discussion in the (unmerged, closed) PR #55. My opinion was that such dataset helpers belong to returnn_common as well, at least the definition of the dataset config dicts for RETURNN. Only recipe pipeline related dataset logic should be here in i6_experiments/common.

@albertz
Member

albertz commented Jun 13, 2022

How should we continue with discussions on the helper code? Here in this issue? In separate issues? Just directly (Slack, in person) (but then others cannot read it afterwards or join the discussion)?

@albertz
Member

albertz commented Jun 14, 2022

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

@albertz
Member

albertz commented Jun 14, 2022

DataInitArgs is problematic in a similar way. Why does it have has_batch_dim? Why do we need it at all?

@albertz
Member

albertz commented Jun 14, 2022

The logic in ReturnnCommonSerializer (SerializerObjectCollection) to extract packages from the SerializerObjects is very problematic, together with make_local_package_copy. This will miss any other imports inside those files, and there is also no real error checking for such cases. I don't think we should have such broken and dangerous code. So this either should be completely rewritten (which will be quite complex and complicated though) or removed.

@albertz

This comment was marked as resolved.

@albertz

This comment was marked as resolved.

@albertz
Copy link
Member

albertz commented Jun 14, 2022

You should never use os.getcwd().

If the dir is somehow relative to __file__, you can use this (and it should be at module-global scope, not inside a function):

_module_dir = os.path.dirname(os.path.abspath(__file__))

Or if this is the Sisyphus base dir, we can use gs.BASE_DIR.
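As a concrete sketch of both variants (the joined sub-paths are only illustrative):

import os

from sisyphus import gs

# Variant 1: resolve paths relative to this module file, at module-global scope.
_module_dir = os.path.dirname(os.path.abspath(__file__))
_helper_file = os.path.join(_module_dir, "helpers.py")  # illustrative sub-path

# Variant 2: resolve paths relative to the Sisyphus base dir.
_recipe_dir = os.path.join(gs.BASE_DIR, "recipe")  # illustrative sub-path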

@JackTemaki
Copy link
Contributor Author

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

For a dict, it is not well defined which parameters should be set.

@JackTemaki
Copy link
Contributor Author

You should never use os.getcwd() but __file__ instead, and that should be in the global scope of the module (not inside a function), something like:

_module_dir = os.path.dirname(os.path.abspath(__file__))

But this returns something completely different. I do not see how those two are related.

@JackTemaki

This comment was marked as resolved.

@albertz
Copy link
Member

albertz commented Jun 14, 2022

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

For a dict, it is not well defined which parameters should be set.

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here. You could also just directly use nn.Dim instead.

@albertz
Copy link
Member

albertz commented Jun 14, 2022

You should never use os.getcwd() but __file__ instead, and that should be in the global scope of the module (not inside a function), something like:

_module_dir = os.path.dirname(os.path.abspath(__file__))

But this returns something completely different. I do not see how those two are related.

First of all, again, you should never use os.getcwd(). I assumed that you actually wanted to get the module dir; I'm not sure which dir you want otherwise. In any case, do not use os.getcwd(): the current dir can be totally arbitrary and is not well defined. The module dir is well defined, so you can always express a path relative to it, or use some other global config information.

@albertz
Copy link
Member

albertz commented Jun 14, 2022

For the rest of the comments (besides the naming): I do not understand why you are even mentioning it here?

Where else? I asked you about that, see above. Or do you mean that you don't want to discuss it at all and I should just change it? I don't understand your statement.

This is a Python code violation. Or you don't care about proper Python code?

If you are giving comments like this then I will refrain from pushing any work-in-progress code in the future.

I don't understand?

I just want to know which Python code standards you want to follow here. I'm not used to the code style in i6_core and i6_experiments/common; I see a lot of Python code-style violations in there. So I wondered whether you just don't care about that and want it to be this way, or not.

Why do you want me to not give comments on the code?

Why is that related to pushing work-in-progress code? Why do you want to refrain from pushing WIP code? This does not make sense.

Also it does not make sense to comment on things like this in the issue here.

Again, I asked you where to post these things and you did not answer; we already said earlier that we might just discuss it here in the issue, and since I did not have a better idea where to post them, I just posted them here.

Where exactly is the problem now?

Where else should I post them?

Or do you just want me to not give any comments on this, so that we do not discuss it further?

Also, I just wanted to write them down so as not to forget about them. So you don't want to see my comments and I should rather keep them to myself?

You specifically told me you do NOT want a PR-style review. So why are you doing this now?

No, I did not say that. I said I want us to work directly in the master branch, because I also want to work directly with the code and not just look at it in a PR or work in another branch.

I did not say that we should stop discussing the code. I wonder how you want us to work on the code together if you don't want us to discuss it.

@albertz

This comment was marked as resolved.

@JackTemaki
Copy link
Contributor Author

Also, I just wanted to write them down so as not to forget about them. So you don't want to see my comments and I should rather keep them to myself?

But then please at least refrain from adding unnecessary statements to code comments that could be read as a personal attack. Nobody writes perfect code right away, and of course you can comment on that, but then do so in a proper, professional way without personal accusations, especially when you know that this was just some arbitrary WIP code that was pushed here.

So let's first keep this discussion about the relevant high-level things:

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here.

But this is exactly what I do not want when working with RETURNN: that I have to look up in another code base or in the documentation which parameters are valid. The code itself should clearly define what is valid. This was the fundamental problem when working with dict structures for layer definitions, and it is also something that should not happen here.

You could also just directly use nn.Dim instead.

This would bring back the Tensorflow and RETURNN dependency to the Sisyphus manager thread, which I still strongly disagree with.

Then, regarding the comments that I consider rather off-topic at the moment:

Regarding the os.getcwd() issue (which I still consider not really relevant at this point): I searched through Sisyphus, and what we could use is gs.BASE_DIR, which is the path that I want.

In almost all other code, when we have the type str|tk.Path.

In most cases this is just because we cannot break with the old behavior (mostly importing locations from global_settings); otherwise this would have been removed already. Specifying file locations via str in code run by the Sisyphus manager thread should be considered wrong, or at least dangerous, in any case.

@albertz
Copy link
Member

albertz commented Jun 14, 2022

Also, I just wanted to write them down so as not to forget about them. So you don't want to see my comments and I should rather keep them to myself?

But then please at least refrain from adding unnecessary statements to code comments that could be read as a personal attack.

Where are you reading anything like that? Why do you read things like that into what I wrote? I'm totally confused. Are you referring to my question about what code style you want to follow here? So I should not ask about which code style we want to follow?

@albertz
Copy link
Member

albertz commented Jun 14, 2022

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here.

But this is exactly what I do not want when working with RETURNN: that I have to look up in another code base or in the documentation which parameters are valid. The code itself should clearly define what is valid. This was the fundamental problem when working with dict structures for layer definitions, and it is also something that should not happen here.

My suggestion was actually to just use Data and Dim from RETURNN directly. I don't really see the point in adding yet another wrapper around them. returnn-common also does the same.

Note that for layers, returnn-common obviously does create a wrapper around them, but this is for other technical reasons, and I'm also not super happy with it. However, to really avoid the problem that the wrapper and the layer interface diverge (this is one of the big problems with wrappers, when the wrapper does not fully match the real interface even at the beginning), we generate those wrappers automatically. This might of course be a bit of overkill for Data and Dim.

This would bring back the Tensorflow and RETURNN dependency to the Sisyphus manager thread, which I still strongly disagree with.

I still see this as very arbitrary. We also have e.g. Numpy and other libs. And why not? Your only argument here is that the file system is slow and TF import is slow?

Such technical administrative issues should really not be a reason to make the code more complicated and introduce unnecessary wrappers and layers of abstraction.

In any case, this is why I bring this up here. Such things need to be discussed. Some opinion from others would also be helpful.

@JackTemaki
Copy link
Contributor Author

Where are you reading anything like that? Why do you read things like that into what I wrote? I'm totally confused. Are you referring to my question about what code style you want to follow here? So I should not ask about which code style we want to follow?

Let's talk about this offline; it is definitely not about a question regarding code style (or, in case you only wanted to ask that, you did not formulate it that way).

In any case, this is why I bring this up here. Such things need to be discussed. Some opinion from others would also be helpful.

About this I do not argue; this is exactly what we can and should discuss.

I still see this as very arbitrary. We also have e.g. Numpy and other libs. And why not? Your only argument here is that the file system is slow and TF import is slow?

And I do not see this as arbitrary. Yes, some modules unfortunately import numpy at the global level, but in most cases it is only used in tasks. But here you could argue that most people consider numpy an elementary part of Python anyway.

Again, this is not (only) about the TF import. This is about having a fixed TF and RETURNN dependency in my Sisyphus manager in the first place; the manager should not have that. We should rather argue about removing numpy from the global imports than about adding Tensorflow and RETURNN. (E.g., you can run all jobs with a specified Singularity/Apptainer image, which explicitly defines which numpy to use for the job tasks, independent of what happens in the manager.)

Also, the wrapper has the advantage that it does not expose many parameters that are completely irrelevant here. If you expose nn.Dim and nn.Data directly, you get many options that are not relevant at all and are more confusing than helpful to the user.

The additional code here is really minimal; it is not about re-implementing anything. It is really just a dataclass holding some parameters, that's it.

With your argument, you could also say we should import modules directly from RASR via C/C++-to-Python bindings instead of re-implementing some interface helpers (like the Bliss format implementations or the RASR flow definitions).
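To make this concrete, the kind of dataclass meant here is roughly the sketch below; apart from is_feature and has_batch_dim, which were mentioned above, the field names are assumptions for illustration, not the actual definitions in the repo.

from dataclasses import dataclass
from typing import Optional, Tuple


# Sketch only: plain parameter holders, no RETURNN/TF import needed in the manager.
@dataclass(frozen=True)
class DimInitArgs:
    name: str
    dim: Optional[int]  # None for dynamic (spatial) dims; assumed field
    is_feature: bool = False


@dataclass(frozen=True)
class DataInitArgs:
    name: str
    dim_tags: Tuple[DimInitArgs, ...]  # assumed field
    has_batch_dim: bool = True

The actual nn.Dim / nn.Data objects would then only be created from these plain parameters inside the job task, so the manager never has to import RETURNN or TensorFlow.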

@albertz
Copy link
Member

albertz commented Dec 2, 2022

@JackTemaki
Copy link
Contributor Author

Regarding its main points, this issue has become obsolete, as we are no longer passing constructed network dicts for training. For both RETURNN-frontend and RETURNN-pure-PyTorch setups, we are linking code directly from separate files into the RETURNN configs, using e.g. https://github.com/rwth-i6/i6_experiments/blob/main/common/setups/serialization.py

As importing those separate files containing the model code can be avoided in the manager if necessary, importing PyTorch/TensorFlow in the manager is also no longer an issue.
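As a rough illustration of the idea (the module path and names below are placeholders, not an actual recipe layout):

# Sketch only: the training config is assembled from text fragments that import
# the model code from separate files, instead of embedding a constructed
# network dict. The module path is a placeholder.
model_import = (
    "from i6_experiments.users.someone.my_setup.pytorch_model import Model, train_step\n"
)

config_code = "\n".join([
    "# --- generated RETURNN config (sketch) ---",
    model_import,
])
print(config_code)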
