Apologies for the extensive note. This is kinda noodly stuff so I wanted to pull together the results of my requirements gathering / process of understanding the problem into some form of context. If you feel like you get this already you can skip to "Assumptions that need review" for the real ask here.
Context
FASTQ file validation previously compared all FASTQ files in a given directory against each other if the filenames matched a certain regex. That regex looked for the presence of a lane number and read type (see Filenames below for explanation). This seems to have worked fine because directories seem to have contained 2* files that needed to be compared against each other.
Recently, a case came up where files had trailing set numbers (001, 002, etc.)**. All files within a given set should be compared against each other, but not against other sets. Our plugin was not set up to accommodate this, hence this change.
* In the case of read_type R, since the regex looked for only R1 and R2, any R3 files that may have been present would have been ignored / not compared to the other two. I believe SNARE-seq can have R3 files (the dir schema references this as being part of ATACseq sequencing).
** These set numbers actually just denote that a file has been split into n chunks. So you want to compare chunk 001 to chunk 001, etc.
Filenames
The filenames we are considering should look something like this:
arbitrary_prefix_L001_R1_001.fastq
arbitrary_prefix = anything preceding the lane number
L001 = lane number, identified as L followed by digits followed by an underscore
R1 = read type (can be R, I, or potentially read; see below) followed by a digit followed by an underscore
001 = set number
.fastq = extension, can also be .fq / .fastq.gz but we leave this validation up to CMU's fastq_utils module
Comparison example
Given the following files:
we want to compare line counts between those with matching prefixes and set numbers.
So we want those in set 001 compared to each other:
And those in set 002 compared to each other:
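The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the pattern `FASTQ_NAME`, the function `group_by_set`, and the sample filenames are all hypothetical. Including the lane number in the grouping key is an assumption drawn from the filename breakdown, and the extension is deliberately left unchecked since that validation is left to fastq_utils.

```python
import re
from collections import defaultdict

# Hypothetical pattern (not the plugin's real regex):
# <prefix>_L<digits>_<R|I|read><digit>_<set number>, extension not validated.
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+)_(?P<lane>L\d+)_(?P<read_type>R|I|read)(?P<read_num>\d)_(?P<set_num>\d+)"
)


def group_by_set(filenames):
    """Group filenames so that only files in the same set are compared."""
    groups = defaultdict(list)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            # No lane / read type / set number found: not captured for comparison.
            continue
        # Assumption: files are comparable when prefix, lane, and set number agree.
        key = (match["prefix"], match["lane"], match["set_num"])
        groups[key].append(name)
    return dict(groups)


# Hypothetical filenames, for illustration only.
files = [
    "sample_L001_R1_001.fastq",
    "sample_L001_R2_001.fastq",
    "sample_L001_R1_002.fastq",
    "sample_L001_R2_002.fastq",
]
groups = group_by_set(files)
```

With these inputs, the two `_001` files end up under one key and the two `_002` files under another, so line counts would only ever be compared within a set.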
Assumptions that need review
Changes from previous functionality, based on assumptions that probably need co-signing:
read in addition to R / I as read types (i.e. read1, read2, read3). I mimicked this.
arbitrary_prefix_L001_R1.fastq would not be captured for comparison, even if there was another matching file (e.g. arbitrary_prefix_L001_R2.fastq) in the same directory.
Other changes that do not conflict with previous functionality, based on other assumptions:
arbitrary_prefix_L001_R1_001_blahblahblah.fastq would get compared against any other files starting with the same arbitrary_prefix_L001_R#_001 pattern.
...files 1 & 2 should have matching line counts, but file 3 has nothing to be compared to and therefore will be passed over. Not sure if this should be allowed to happen but currently it is logged and permitted.