-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Architecting signal and fast5 data organization #86
Conversation
) | ||
self.validate(self.f5, logger) | ||
|
||
def validate(self, bulk_filepath: str, log: Logger = getLogger()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responding to Katie Q in L109:
We shouldn't need too much validation. I think the main things we should focus on are (1) file existence and (2) distinguishing factors between "other" fast5 files and bulk fast5 files.
Suggestions for validation:
- File exists
- At least one hdf5 group following the pattern /Raw/Channel_[\d]+ exists
- Maybe: No groups following the pattern /read_.*
raise ValueError(error_message) | ||
|
||
try: | ||
# TODO: Katie Q: Why the [2:-1] at the end? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing down what we talked about in our standup:
str casting the "run_id" attribute leaves the byte string characters, e.g. b"<run_id_here>"
. char.decode() would be preferred but was never implemented.
self, capture_f5_filepath: str, raw_signal: RawSignal, metadata: CaptureMetadata | ||
): | ||
"""Write a single capture to the specified capture fast5 file (which has | ||
already been created via create_capture_fast5()). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: this note in the docstring that was carried over,create_capture_fast5()
was not being called (bug) until my last PR, so we should include that function here now. Would that go in init or should it be called separately? Documentation should be tweaked to match
) | ||
return cap | ||
|
||
def write_capture( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is handled in the segmenter logic, but want to verify: We set a maximum number of captures per capture fast5, is that still being honored?
In your opinion, does it make sense for segment.py to continue handling the cap on captures per file? I had not considered what managing that cap might look like with this kind of architecture. Right now, the segmenter code is just keeping track of how many reads have been written to a file, and opening a new file if it's full. That approach might be a bottleneck for parallelization.
Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great question. The uncommited code I wrote still has the caps-per-file being honored, it's just handled in the Segmenter. My initial inclination, and please let me know if I'm mistaken, is that the Fast5 module is responsible for reading/writing/handling things inherent to the file process, and the capture file caps is more a decision of the segmenter, rather than an inherent aspect of the file format.
Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?
Oooh, that's an awesome observation! Perhaps we could consider a similar optimization? I've created a Github issue for us to brainstorm our potential optimizations, would you mind adding more details there? #85
poretitioner/fast5s.py
Outdated
""" | ||
super().__init__(capture_filepath, logger=logger) | ||
if not self.filepath.exists(): | ||
error_msg = f"Capture fast5 file does not exist at path: {self.filepath}. Make sure the bulk file is in this location." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible typo with "bulk file" here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof good catch! Thank you, let me fix that.
return channel_path | ||
|
||
|
||
def signal_path_for_read_id(read_id: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love these methods for backwards compatibility!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! You're the author of the template (writing the channel_for_read_id), so thank you for setting a good example :)
tests/test_fast5.py
Outdated
|
||
|
||
@patch("h5py.File") | ||
def bulk_fast5_expands_homedir_test(Mockf5File): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need Mockf5File
in my life, can we talk about this more on Wednesday?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Or any time before then :), whenever is best for you!
tests/test_fast5.py
Outdated
signal_path_for_read_id, | ||
) | ||
|
||
test_bulk_fas5_filepath = "tests/data/bulk_fast5_dummy.fast5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in test_bulk_fas5_filepath
--> test_bulk_fast5_filepath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
def __exit__(self): | ||
self.f5.__exit__() | ||
|
||
def get_channel_calibration_for_path(self, path: str) -> ChannelCalibration: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!!! Did not know this could be characterized as calibration, that is so much nicer than whatever I had been calling it! Awesome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! And it's totally not your fault for characterizing it differently-- ONT only released the documentation explaining what they were recently. Up until then, they were just magic numbers
class TestRawSignal: | ||
def signal_converts_to_picoamperes_test(self): | ||
signal = [1, 2, 3] | ||
raw = RawSignal(signal, CHANNEL_NUMBER, CALIBRATION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected that the channel number will always be a constant? Or is that just for the purposes of testing in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! Thanks for pointing this out. So my original thought/understanding of the actual device was that each channel represents a nanopore and can only produce one signal (i.e. there's a bijection between channels and signals).
However, the latest docs from ONT seem to indicate that I'm wrong (or at least it's changed since the MinknowCore 3.X releases)! According to their documentation, 'Each channel can contain one or more wells, where each well "normally" contains 1 nanopore' (I feel like the word "normally" is doing a lot of work here, what exactly defines "normal" in this context? lol)
// A well is a discrete location on the device where sensing can take place. Normally, each well
// should have a single nanopore in it.
So, my question for you, if it's possible that each channel contains multiple wells, is this still an acceptable abstraction (i.e. conflating "signal" with the channel where it was generated?)
Data structures for Signals and File I/O.
|
Signals! Fast5!