Architecting signal and fast5 data organization #86

thequicksort · 2021-02-08T19:14:39Z

Signals! Fast5!

kdoroschak · 2021-02-08T20:35:06Z

poretitioner/fast5s.py

+            )
+        self.validate(self.f5, logger)
+
+    def validate(self, bulk_filepath: str, log: Logger = getLogger()):


Responding to Katie Q in L109:

We shouldn't need too much validation. I think the main things we should focus on are (1) file existence and (2) distinguishing factors between "other" fast5 files and bulk fast5 files.

Suggestions for validation:

File exists

At least one hdf5 group following the pattern /Raw/Channel_[\d]+ exists

Maybe: No groups following the pattern /read_.*

kdoroschak · 2021-02-08T20:37:46Z

poretitioner/fast5s.py

+            raise ValueError(error_message)
+
+        try:
+            # TODO: Katie Q: Why the [2:-1] at the end?


Writing down what we talked about in our standup:

str casting the "run_id" attribute leaves the byte string characters, e.g. b"<run_id_here>". char.decode() would be preferred but was never implemented.

kdoroschak · 2021-02-08T20:44:23Z

poretitioner/fast5s.py

+        self, capture_f5_filepath: str, raw_signal: RawSignal, metadata: CaptureMetadata
+    ):
+        """Write a single capture to the specified capture fast5 file (which has
+        already been created via create_capture_fast5()).


Re: this note in the docstring that was carried over,create_capture_fast5() was not being called (bug) until my last PR, so we should include that function here now. Would that go in init or should it be called separately? Documentation should be tweaked to match

kdoroschak · 2021-02-08T20:52:33Z

poretitioner/fast5s.py

+        )
+        return cap
+
+    def write_capture(


I believe this is handled in the segmenter logic, but want to verify: We set a maximum number of captures per capture fast5, is that still being honored?

In your opinion, does it make sense for segment.py to continue handling the cap on captures per file? I had not considered what managing that cap might look like with this kind of architecture. Right now, the segmenter code is just keeping track of how many reads have been written to a file, and opening a new file if it's full. That approach might be a bottleneck for parallelization.

Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?

Really great question. The uncommited code I wrote still has the caps-per-file being honored, it's just handled in the Segmenter. My initial inclination, and please let me know if I'm mistaken, is that the Fast5 module is responsible for reading/writing/handling things inherent to the file process, and the capture file caps is more a decision of the segmenter, rather than an inherent aspect of the file format.

Side observation: I've noticed that when performing basecalling for DNA sequencing, there are often lots of files that contain anywhere from a few reads to max_reads, maybe that is due to their file handling for parallelization?

Oooh, that's an awesome observation! Perhaps we could consider a similar optimization? I've created a Github issue for us to brainstorm our potential optimizations, would you mind adding more details there? #85

kdoroschak · 2021-02-08T20:52:57Z

poretitioner/fast5s.py

+        """
+        super().__init__(capture_filepath, logger=logger)
+        if not self.filepath.exists():
+            error_msg = f"Capture fast5 file does not exist at path: {self.filepath}. Make sure the bulk file is in this location."


possible typo with "bulk file" here

Oof good catch! Thank you, let me fix that.

kdoroschak · 2021-02-08T20:54:11Z

poretitioner/fast5s.py

+    return channel_path
+
+
+def signal_path_for_read_id(read_id: str) -> str:


love these methods for backwards compatibility!

Thank you! You're the author of the template (writing the channel_for_read_id), so thank you for setting a good example :)

kdoroschak · 2021-02-08T20:56:09Z

tests/test_fast5.py

+
+
+@patch("h5py.File")
+def bulk_fast5_expands_homedir_test(Mockf5File):


I think I need Mockf5File in my life, can we talk about this more on Wednesday?

Sure! Or any time before then :), whenever is best for you!

kdoroschak · 2021-02-08T20:56:30Z

tests/test_fast5.py

+    signal_path_for_read_id,
+)
+
+test_bulk_fas5_filepath = "tests/data/bulk_fast5_dummy.fast5"


typo in test_bulk_fas5_filepath --> test_bulk_fast5_filepath

kdoroschak · 2021-02-08T20:59:15Z

poretitioner/fast5s.py

+    def __exit__(self):
+        self.f5.__exit__()
+
+    def get_channel_calibration_for_path(self, path: str) -> ChannelCalibration:


!!! Did not know this could be characterized as calibration, that is so much nicer than whatever I had been calling it! Awesome.

Thank you! And it's totally not your fault for characterizing it differently-- ONT only released the documentation explaining what they were recently. Up until then, they were just magic numbers ☺️

kdoroschak · 2021-02-08T21:02:11Z

tests/test_signals.py

+class TestRawSignal:
+    def signal_converts_to_picoamperes_test(self):
+        signal = [1, 2, 3]
+        raw = RawSignal(signal, CHANNEL_NUMBER, CALIBRATION)


Is it expected that the channel number will always be a constant? Or is that just for the purposes of testing in this context?

Great point! Thanks for pointing this out. So my original thought/understanding of the actual device was that each channel represents a nanopore and can only produce one signal (i.e. there's a bijection between channels and signals).

However, the latest docs from ONT seem to indicate that I'm wrong (or at least it's changed since the MinknowCore 3.X releases)! According to their documentation, 'Each channel can contain one or more wells, where each well "normally" contains 1 nanopore' (I feel like the word "normally" is doing a lot of work here, what exactly defines "normal" in this context? lol)

// A well is a discrete location on the device where sensing can take place. Normally, each well // should have a single nanopore in it.

So, my question for you, if it's possible that each channel contains multiple wells, is this still an acceptable abstraction (i.e. conflating "signal" with the channel where it was generated?)

thequicksort · 2021-02-10T21:58:55Z

Data structures for Signals and File I/O.

Created classes for dealing with signals: raw from the device, converted to picoamperes, and fractionalized.
Created classes for reading/writing Fast5 files.
Updated build script to only build Docker images for non-Darwin systems
Updated pytest to calculate code coverage

Jessica Dunstan added 2 commits February 8, 2021 11:13

Architecting signal and fast5 data organization

32726b0

Architecting signal and fast5 data organization

1070cde

kdoroschak reviewed Feb 8, 2021

View reviewed changes

Jessica Dunstan added 6 commits February 8, 2021 14:51

Updated documentation and fixed typos

f76c64a

Added back core utilities

3291c54

Added more unit tests around signals and fast5 files.

c0f95af

Corrected 'bulkfast5' typo

bbfc70b

Formatting nix files

911d6b7

Only build Docker images for non-Darwin systems

7cf0a38

thequicksort merged commit 2d392b4 into master Feb 10, 2021

thequicksort deleted the signals2 branch February 10, 2021 22:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecting signal and fast5 data organization #86

Architecting signal and fast5 data organization #86

thequicksort commented Feb 8, 2021

kdoroschak Feb 8, 2021

kdoroschak Feb 8, 2021

kdoroschak Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021 •

edited

Loading

kdoroschak Feb 8, 2021

thequicksort Feb 8, 2021 •

edited

Loading

thequicksort commented Feb 10, 2021

		return channel_path


		def signal_path_for_read_id(read_id: str) -> str:



		@patch("h5py.File")
		def bulk_fast5_expands_homedir_test(Mockf5File):

Architecting signal and fast5 data organization #86

Architecting signal and fast5 data organization #86

Conversation

thequicksort commented Feb 8, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thequicksort Feb 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thequicksort Feb 8, 2021 • edited Loading

Choose a reason for hiding this comment

thequicksort commented Feb 10, 2021

thequicksort Feb 8, 2021 •

edited

Loading

thequicksort Feb 8, 2021 •

edited

Loading