Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Architecting signal and fast5 data organization #86

Merged
merged 8 commits into from
Feb 10, 2021
Merged

Conversation

thequicksort
Copy link
Contributor

Signals! Fast5!

)
self.validate(self.f5, logger)

def validate(self, bulk_filepath: str, log: Logger = getLogger()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to Katie Q in L109:

We shouldn't need too much validation. I think the main things we should focus on are (1) file existence and (2) distinguishing factors between "other" fast5 files and bulk fast5 files.

Suggestions for validation:

  • File exists
  • At least one hdf5 group following the pattern /Raw/Channel_[\d]+ exists
  • Maybe: No groups following the pattern /read_.*

raise ValueError(error_message)

try:
# TODO: Katie Q: Why the [2:-1] at the end?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing down what we talked about in our standup:

str casting the "run_id" attribute leaves the byte string characters, e.g. b"<run_id_here>". char.decode() would be preferred but was never implemented.

self, capture_f5_filepath: str, raw_signal: RawSignal, metadata: CaptureMetadata
):
"""Write a single capture to the specified capture fast5 file (which has
already been created via create_capture_fast5()).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: this note in the docstring that was carried over,create_capture_fast5() was not being called (bug) until my last PR, so we should include that function here now. Would that go in init or should it be called separately? Documentation should be tweaked to match

)
return cap

def write_capture(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is handled in the segmenter logic, but want to verify: We set a maximum number of captures per capture fast5, is that still being honored?

In your opinion, does it make sense for segment.py to continue handling the cap on captures per file? I had not considered what managing that cap might look like with this kind of architecture. Right now, the segmenter code is just keeping track of how many reads have been written to a file, and opening a new file if it's full. That approach might be a bottleneck for parallelization.

Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great question. The uncommited code I wrote still has the caps-per-file being honored, it's just handled in the Segmenter. My initial inclination, and please let me know if I'm mistaken, is that the Fast5 module is responsible for reading/writing/handling things inherent to the file process, and the capture file caps is more a decision of the segmenter, rather than an inherent aspect of the file format.

Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?

Oooh, that's an awesome observation! Perhaps we could consider a similar optimization? I've created a Github issue for us to brainstorm our potential optimizations, would you mind adding more details there? #85

"""
super().__init__(capture_filepath, logger=logger)
if not self.filepath.exists():
error_msg = f"Capture fast5 file does not exist at path: {self.filepath}. Make sure the bulk file is in this location."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible typo with "bulk file" here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof good catch! Thank you, let me fix that.

return channel_path


def signal_path_for_read_id(read_id: str) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love these methods for backwards compatibility!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! You're the author of the template (writing the channel_for_read_id), so thank you for setting a good example :)



@patch("h5py.File")
def bulk_fast5_expands_homedir_test(Mockf5File):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need Mockf5File in my life, can we talk about this more on Wednesday?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Or any time before then :), whenever is best for you!

signal_path_for_read_id,
)

test_bulk_fas5_filepath = "tests/data/bulk_fast5_dummy.fast5"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in test_bulk_fas5_filepath --> test_bulk_fast5_filepath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

def __exit__(self):
self.f5.__exit__()

def get_channel_calibration_for_path(self, path: str) -> ChannelCalibration:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!!! Did not know this could be characterized as calibration, that is so much nicer than whatever I had been calling it! Awesome.

Copy link
Contributor Author

@thequicksort thequicksort Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! And it's totally not your fault for characterizing it differently-- ONT only released the documentation explaining what they were recently. Up until then, they were just magic numbers ☺️

class TestRawSignal:
def signal_converts_to_picoamperes_test(self):
signal = [1, 2, 3]
raw = RawSignal(signal, CHANNEL_NUMBER, CALIBRATION)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that the channel number will always be a constant? Or is that just for the purposes of testing in this context?

Copy link
Contributor Author

@thequicksort thequicksort Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! Thanks for pointing this out. So my original thought/understanding of the actual device was that each channel represents a nanopore and can only produce one signal (i.e. there's a bijection between channels and signals).

However, the latest docs from ONT seem to indicate that I'm wrong (or at least it's changed since the MinknowCore 3.X releases)! According to their documentation, 'Each channel can contain one or more wells, where each well "normally" contains 1 nanopore' (I feel like the word "normally" is doing a lot of work here, what exactly defines "normal" in this context? lol)

    // A well is a discrete location on the device where sensing can take place. Normally, each well
    // should have a single nanopore in it.

So, my question for you, if it's possible that each channel contains multiple wells, is this still an acceptable abstraction (i.e. conflating "signal" with the channel where it was generated?)

@thequicksort
Copy link
Contributor Author

Data structures for Signals and File I/O.

  • Created classes for dealing with signals: raw from the device, converted to picoamperes, and fractionalized.
  • Created classes for reading/writing Fast5 files.
  • Updated build script to only build Docker images for non-Darwin systems
  • Updated pytest to calculate code coverage

@thequicksort thequicksort merged commit 2d392b4 into master Feb 10, 2021
@thequicksort thequicksort deleted the signals2 branch February 10, 2021 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants