Serialization of big Numpy arrays #150

Open
albertz opened this issue May 6, 2022 · 8 comments

@albertz
Member

albertz commented May 6, 2022

There are cases where a bigger Numpy array is part of your net dict, e.g. when you have a custom init for some parameter, as in the case of GammatoneV2.

When a bigger Numpy array is part of the net dict, it is currently serialized as is, via __repr__. This makes the produced net dict very difficult to read when 99% of it is just the Numpy array.

So, should we do something about it?

What are possible things we could do? Here are some ideas:

We could at least move the definition to the top, similar to what we do for dim tags. Then the net dict itself stays readable. But 99% of the resulting RETURNN config would still be just the Numpy array.
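As a rough illustration of the "move the definition to the top" idea, here is a minimal sketch. The function name `serialize_with_hoisted_arrays`, the variable naming scheme `_array_N`, and the example layer are all hypothetical and not part of any existing serialization code; it only handles plain dicts and Numpy arrays.

```python
import numpy as np


def serialize_with_hoisted_arrays(net_dict):
    """Return config source text where every array is defined before the net dict.

    Hypothetical helper for illustration only.
    """
    defs = []  # one "_array_N = numpy.array(...)" line per array found

    def _ser(obj):
        if isinstance(obj, np.ndarray):
            name = f"_array_{len(defs)}"
            # tolist() + dtype name round-trips the values without relying on repr(array).
            defs.append(f"{name} = numpy.array({obj.tolist()!r}, dtype={obj.dtype.name!r})")
            return name
        if isinstance(obj, dict):
            return "{" + ", ".join(f"{k!r}: {_ser(v)}" for k, v in obj.items()) + "}"
        return repr(obj)

    net_src = _ser(net_dict)  # fills `defs` as a side effect
    return "\n".join(["import numpy", *defs, f"network = {net_src}"]) + "\n"


# Made-up example layer, only to show the shape of the output.
example_net = {"layer": {"class": "variable", "init": np.arange(6, dtype="float32").reshape(2, 3)}}
print(serialize_with_hoisted_arrays(example_net))
```

The net dict line then only contains the short name, while the array data still sits in the same file.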

We could move them outside, either as Numpy txt files loaded via numpy.loadtxt, or as Python files that are imported. However, this means that any config serialization logic now needs extra logic to handle these cases. Then again, we would probably write this logic only once and not need to care about it afterwards.

Such external file handling of the serialization could also be done in a generic way, and maybe it becomes useful for other purposes as well.
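A minimal sketch of the external txt file variant, assuming a hypothetical helper `dump_array_as_txt` (not an existing API). Since `numpy.savetxt` only accepts 1D or 2D arrays and `numpy.loadtxt` returns float64 by default, the sketch stores the flattened array and puts dtype and shape into the generated expression.

```python
import os

import numpy as np


def dump_array_as_txt(array, out_dir, name):
    """Write `array` to <out_dir>/<name>.txt and return the config expression that reloads it.

    Hypothetical helper for illustration only.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".txt")
    # Flatten first, because savetxt rejects arrays with more than 2 dimensions.
    np.savetxt(path, array.reshape(-1))
    # Restore dtype and shape explicitly when loading.
    return f"numpy.loadtxt({path!r}, dtype={array.dtype.name!r}).reshape({array.shape!r})"
```

The returned string would replace the array in the serialized net dict, so the config only contains a single short `numpy.loadtxt(...)` call per array.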

@albertz
Member Author

albertz commented May 6, 2022

@JackTemaki @Atticus1806 opinions?

@Atticus1806
Contributor

I would really prefer external handling. The config is already usually a lot longer than "old" configs due to explicitly setting everything, but it is still readable. I feel like dumping big arrays into it would probably make it unreadable or at least very annoying (slows down the text editor, etc.).

@albertz
Member Author

albertz commented May 9, 2022

Yes, me too. But then the next question is, how exactly?

I mean, probably numpy.loadtxt should be fine.

Should the path be relative? Relative to what?

Where should those files be stored?

What should the API look like? get_returnn_config would get some extra param like extra_out_dir? Is there any reasonable default? Probably not...

@JackTemaki
Contributor

For Sisyphus usage the extra_out_dir would be fine, because then we can even place it with an absolute path if wanted. So in the end this will probably be an extra_data (or similarly named) directory next to the config file.

@JackTemaki
Contributor

I would prefer if it works with relative paths though, because then you can move both the config and the extra dirs around.
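One way relative paths could work is sketched below. It assumes that the config is executed in a way that defines `__file__` (here via `runpy.run_path`; whether this holds for the actual config loading is not confirmed in this thread). The function names, the `extra_data` directory name, and the `init.txt` file name are hypothetical.

```python
import os
import runpy

import numpy as np


def write_config_with_relative_array(config_dir, array, extra_dir_name="extra_data"):
    """Write config.py plus <extra_dir_name>/init.txt into config_dir (2D array assumed)."""
    extra_dir = os.path.join(config_dir, extra_dir_name)
    os.makedirs(extra_dir, exist_ok=True)
    np.savetxt(os.path.join(extra_dir, "init.txt"), array)
    rel_path = os.path.join(extra_dir_name, "init.txt")
    config_path = os.path.join(config_dir, "config.py")
    with open(config_path, "w") as f:
        f.write(
            "import os\n"
            "import numpy\n"
            # Resolve the data file relative to the config file, not the working dir.
            "_base_dir = os.path.dirname(os.path.abspath(__file__))\n"
            f"init = numpy.loadtxt(os.path.join(_base_dir, {rel_path!r}))\n"
        )
    return config_path


def load_config(config_path):
    """Execute the config file with __file__ set and return its globals."""
    return runpy.run_path(config_path)
```

With this layout, the config and its extra directory can be moved together to another location and the array still loads.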

@albertz
Member Author

albertz commented May 9, 2022

Ok, extra_out_dir.

Where do we expect the config to be? So how should we generate relative paths to extra_out_dir? Should this be configurable? config_extra_out_dir_prefix or whatever?

Should there be a reasonable default for extra_out_dir? Maybe allow None in which case this is not used? I think for many simple test cases, this might make it simpler. But for Sisyphus usage or any setup pipeline, you would set this.
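The `None` default could behave roughly like the following sketch: without `extra_out_dir`, the array is written inline into the config, otherwise it goes into an external txt file. The function `array_to_config_expr` and its signature are hypothetical, not the actual get_returnn_config API.

```python
import os
from typing import Optional

import numpy as np


def array_to_config_expr(array, name, extra_out_dir: Optional[str] = None) -> str:
    """Return a Python expression (as source text) that recreates `array`.

    Hypothetical helper for illustration only.
    """
    if extra_out_dir is None:
        # Simple test cases: keep everything inside the config itself.
        return f"numpy.array({array.tolist()!r}, dtype={array.dtype.name!r})"
    # Pipeline usage (e.g. Sisyphus): write the data next to the config instead.
    os.makedirs(extra_out_dir, exist_ok=True)
    path = os.path.join(extra_out_dir, f"{name}.txt")
    np.savetxt(path, array.reshape(-1))  # flattened, shape restored on load
    return f"numpy.loadtxt({path!r}, dtype={array.dtype.name!r}).reshape({array.shape!r})"
```

Both branches return an expression that evaluates to the same array, so the rest of the serialization would not need to know which one was used.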

@JackTemaki
Contributor

> Should there be a reasonable default for extra_out_dir? Maybe allow None in which case this is not used? I think for many simple test cases, this might make it simpler. But for Sisyphus usage or any setup pipeline, you would set this.

Yes, why not do it this way. With Sisyphus we always know where the file should be, and for the tests it can be within the config.

@albertz
Member Author

albertz commented May 9, 2022

So, where do we expect the config to be? So how should we generate relative paths to extra_out_dir? Should this be configurable? config_extra_out_dir_prefix or whatever?
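To make the question concrete, here is one possible reading of such a prefix option, as a sketch only: files are written to `extra_out_dir`, while the path that appears in the config is built from `config_extra_out_dir_prefix`. The function `dump_and_reference` is hypothetical, and nothing in this thread has settled on these semantics.

```python
import os

import numpy as np


def dump_and_reference(array, name, extra_out_dir, config_extra_out_dir_prefix=None):
    """Write a 1D/2D array to extra_out_dir and return the loadtxt expression for the config.

    Hypothetical helper for illustration only.
    """
    os.makedirs(extra_out_dir, exist_ok=True)
    filename = name + ".txt"
    np.savetxt(os.path.join(extra_out_dir, filename), array)
    # Without a prefix, the config refers to the directory exactly as it was given.
    prefix = extra_out_dir if config_extra_out_dir_prefix is None else config_extra_out_dir_prefix
    return f"numpy.loadtxt({os.path.join(prefix, filename)!r})"
```

For example, `extra_out_dir` could be an absolute job output path while the prefix is just `"extra_data"`, which would then have to be resolved relative to wherever the config is expected to be.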


3 participants