add 'chunk' option for getData #292

Open
RJDan opened this issue Jul 14, 2023 · 2 comments

RJDan commented Jul 14, 2023

Hello
I have had many issues trying to do custom analyses with the output of getData when the data sets are large.

I get a segmentation fault that does not necessarily go away with more RAM. For instance, my LSF job scheduler reports that I only use 20 GB of RAM when the segmentation fault occurs, but I requested a total of 40 GB, so there is another 20 GB available and unused.

I would like to request that the getData function gets an option to read data in chunks, as the *DB functions currently have.

I know that I could change my code to read the file in chunks (i.e., not use getData) or split my files into chunks outside of R, but I think this would be super useful for the many people who don't know how to do that kind of manual file manipulation.

alexg9010 (Collaborator) commented

Hi @RJDan,

Thank you for bringing this back to our attention.

We already had a discussion about reading data in chunks, but back then we switched to reading the whole file because it was faster (see #141). Now, with even more samples and larger tabix files, that approach seems less stable and possibly less reasonable.

Chunk-wise reading might lower the memory pressure and prevent segmentation faults, but the whole purpose of getData is to read the whole dataset into memory, so it is questionable whether reading data in chunks would solve memory issues in the long run. Probably the best we can do here is to approximate the expected memory consumption and warn/stop if it exceeds the available memory.
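
As a rough illustration of such a check (a sketch only, not methylKit code): sample a few thousand rows to measure the bytes needed per record, count the remaining records, and stop if the extrapolated size exceeds a user-supplied budget. The tab-separated file layout, the `budget_gb` argument, and the assumption that the underlying flat file can be opened with `gzfile()` are all assumptions for illustration.

```r
## Sketch only: estimate how much memory reading the whole table would need
## before doing it, and stop if it exceeds a given budget.
## Assumes a headerless tab-separated call file readable via gzfile();
## 'budget_gb' and 'sample_rows' are made-up parameters.
warn_if_too_big <- function(path, budget_gb = 16, sample_rows = 10000L) {
  con <- gzfile(path, open = "r")
  on.exit(close(con))
  ## measure bytes per record on a small sample of parsed rows
  sample_df <- read.table(con, nrows = sample_rows, sep = "\t")
  bytes_per_row <- as.numeric(utils::object.size(sample_df)) / nrow(sample_df)
  ## count the remaining records without parsing them
  n <- nrow(sample_df)
  repeat {
    more <- readLines(con, n = 1000000L)
    if (length(more) == 0L) break
    n <- n + length(more)
  }
  est_gb <- n * bytes_per_row / 1e9
  if (est_gb > budget_gb)
    stop(sprintf("reading everything would need roughly %.1f GB (budget: %.1f GB)",
                 est_gb, budget_gb))
  invisible(est_gb)
}
```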

I guess what you are looking for is some kind of streaming function that reads a new chunk of data on each invocation, does some calculations, and repeats until the whole file is processed. We actually have this kind of functionality available internally to process tabix files in chunks, per chromosome, or only for specific ranges. These functions are not exported and are only available via the methylKit::: notation, but maybe they should be exposed and/or promoted more to allow for more customized analyses.
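
Until those internal helpers are exported, a similar streaming loop can be sketched with Rsamtools directly (this is not the methylKit API; the column names below and the yieldSize value are assumptions, mirroring the usual methylRaw fields):

```r
library(Rsamtools)

## Stream a bgzipped, tabix-indexed methylation call file in chunks of
## 'yieldSize' records, applying FUN to each parsed chunk.
stream_tabix <- function(path, FUN, yieldSize = 1e6) {
  tbx <- TabixFile(path, yieldSize = yieldSize)
  open(tbx)
  on.exit(close(tbx))
  out <- list()
  repeat {
    lines <- scanTabix(tbx)[[1]]   # next batch of records, as text
    if (length(lines) == 0L) break
    chunk <- read.table(text = lines, sep = "\t",
                        col.names = c("chr", "start", "end", "strand",
                                      "coverage", "numCs", "numTs"))
    out[[length(out) + 1L]] <- FUN(chunk)
  }
  out
}

## e.g. per-chunk methylation level, to be combined afterwards:
## res <- stream_tabix("sample1.txt.bgz",
##                     function(d) sum(d$numCs) / sum(d$coverage))
```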

We are already using this functionality for memory-efficient extraction of the percent methylation matrix and to perform coverage calculations prior to bigwig export.


RJDan commented Aug 23, 2023

Thanks for the reply.

I would suggest making a more prominent statement on the website and in the manual about what does and does not work with 'chunks', for those who don't read things properly (such as myself).

Perhaps you could also mention that if we encounter segfault issues, the first thing to check is whether we are using a function that reads everything into memory. I had a line in my script that created a non-DB object as an intermediate step, and it took me a while to figure out what was causing the segfault.
Maybe you could also print to the screen whether a function is reading in chunks or reading the whole file when it is called, to make the issue easier to spot.
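
In the meantime, one way to spot such a step in one's own script is to check whether each intermediate object is one of the disk-backed *DB classes or a plain in-memory object (a sketch, not a methylKit feature):

```r
## TRUE if the object is a tabix/disk-backed *DB class,
## FALSE if it is held fully in memory
is_disk_backed <- function(obj) {
  db_classes <- c("methylRawDB", "methylRawListDB", "methylBaseDB", "methylDiffDB")
  any(vapply(db_classes, function(cl) methods::is(obj, cl), logical(1)))
}

## e.g. after each step of the pipeline:
## if (!is_disk_backed(meth)) message("'meth' is held fully in memory")
```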

Thanks for the pointers about the non-exposed functions.

RJDan closed this as completed Aug 23, 2023
alexg9010 self-assigned this Aug 23, 2023
alexg9010 reopened this Aug 23, 2023