FASTQ plugin fix #63

Merged
merged 12 commits into from
Oct 25, 2024

@gesinaphillips gesinaphillips commented Oct 16, 2024

Apologies for the extensive note. This is kinda noodly stuff so I wanted to pull together the results of my requirements gathering / process of understanding the problem into some form of context. If you feel like you get this already you can skip to "Assumptions that need review" for the real ask here.

Context

FASTQ file validation previously compared all FASTQ files in a given directory against each other if the filenames matched a certain regex. That regex looked for the presence of a lane number and read type (see Filenames below for explanation). This appears to have worked fine because each directory contained only 2* files that needed to be compared against each other.

Recently, a case came up where files had trailing set numbers (001, 002, etc.)**. All files within a given set should be compared against each other, but not against files in other sets. Our plugin was not set up to accommodate this, hence this change.

* In the case of read_type R, since the regex looked for only R1 and R2, any R3 files that may have been present would have been ignored / not compared to the other two. I believe SNARE-seq can have R3 files (the dir schema references this as being part of ATACseq sequencing).
** These set numbers actually just denote that a file has been split into n chunks. So you want to compare chunk 001 to chunk 001, etc.

Filenames

The filenames we are considering should look something like this:

arbitrary_prefix_L001_R1_001.fastq

arbitrary_prefix = anything preceding the lane number
L001 = lane number, identified as L followed by digits followed by an underscore
R1 = read type (can be R, I, or potentially read--see below) followed by a digit followed by an underscore
001 = set number
.fastq = extension, can also be .fq / .fastq.gz but we leave this validation up to CMU's fastq_utils module
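
To make the breakdown above concrete, here is a minimal sketch of a regex that splits a filename into those parts. The pattern, the group names, and `parse_fastq_name` are all hypothetical, written from the description in this note rather than copied from the plugin, so treat them as an illustration of the intent only.

```python
import re

# Hypothetical pattern reconstructed from the filename description;
# this is NOT the regex used by the plugin.
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.*)_"           # anything preceding the lane number
    r"(?P<lane>L\d+)_"            # lane number: L, digits, underscore
    r"(?P<read_type>R|I|read)"    # read type
    r"(?P<read_num>\d)_"          # single digit, then underscore
    r"(?P<set_num>\d+)"           # set (chunk) number, treated as required
    r"(?P<rest>.*)$"              # remaining text, including the extension
)


def parse_fastq_name(filename):
    """Return the named parts of a filename, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_fastq_name("arbitrary_prefix_L001_R1_001.fastq"))
print(parse_fastq_name("arbitrary_prefix_L001_R1.fastq"))  # no set number
```

Extension checking is deliberately left out of the sketch, since that validation is left to fastq_utils.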

Comparison example

Given the following files:

20147_Healthy_PA_S1_L001_R1_001.fastq.gz
20147_Healthy_PA_S1_L001_R1_002.fastq.gz
20147_Healthy_PA_S1_L001_R2_001.fastq.gz
20147_Healthy_PA_S1_L001_R2_002.fastq.gz

we want to compare line counts between those with matching prefixes and set numbers.

So we want those in set 001 compared to each other:

20147_Healthy_PA_S1_L001_R1_001.fastq.gz
20147_Healthy_PA_S1_L001_R2_001.fastq.gz

And those in set 002 compared to each other:

20147_Healthy_PA_S1_L001_R1_002.fastq.gz
20147_Healthy_PA_S1_L001_R2_002.fastq.gz

Assumptions that need review

Changes from previous functionality, based on assumptions that probably need co-signing:

  • fastq_utils allows for read in addition to R/I as read types (i.e. read1, read2, read3). I mimicked this.
  • Set number suffix treated as non-optional. This is a change from previous functionality which ignored everything after the read number.
    • This means that a file named arbitrary_prefix_L001_R1.fastq would not be captured for comparison, even if there was another matching file (e.g. arbitrary_prefix_L001_R2.fastq) in the same directory.

Other changes that do not conflict with previous functionality, based on other assumptions:

  • Allows arbitrary text between set_num and fastq extension.
    • Meaning that arbitrary_prefix_L001_R1_001_blahblahblah.fastq would get compared against any other files starting with the same arbitrary_prefix_L001_R#_001 pattern.
  • Allows files that match prefix/read/set pattern but do not have a pair to compare against. In the case where you have the following files:
  1. 20147_Healthy_PA_S1_L001_R1_001.fastq.gz
  2. 20147_Healthy_PA_S1_L001_R2_001.fastq.gz
  3. 20147_Healthy_PA_S1_L001_R1_002.fastq.gz

...files 1 & 2 should have matching line counts, but file 3 has nothing to be compared to and therefore will be passed over. Not sure if this should be allowed to happen but currently it is logged and permitted.
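
For reviewers weighing that last point, the current behavior could be sketched roughly like this. `check_group` and its arguments are hypothetical and the line counts are passed in directly instead of being read from disk, so this shows the decision logic only, not the plugin's code.

```python
import logging

logger = logging.getLogger(__name__)


def check_group(group_key, line_counts):
    """Compare line counts within one (prefix, set number) group.

    line_counts maps filename -> number of lines (hypothetical input shape).
    Returns a list of error messages; an empty list means nothing to report.
    """
    if len(line_counts) < 2:
        # A lone file has nothing to be compared to: log it and move on.
        logger.info("No counterpart for %s in group %s; skipping comparison",
                    list(line_counts), group_key)
        return []
    if len(set(line_counts.values())) > 1:
        return [f"Line counts differ in group {group_key}: {line_counts}"]
    return []


# Files 1 and 2 share set 001; file 3 is alone in set 002.
print(check_group(("20147_Healthy_PA_S1_L001", "001"),
                  {"..._R1_001.fastq.gz": 400, "..._R2_001.fastq.gz": 400}))
print(check_group(("20147_Healthy_PA_S1_L001", "002"),
                  {"..._R1_002.fastq.gz": 400}))
```

If unpaired files should instead fail validation, the first branch would return an error message rather than an empty list.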

@gesinaphillips gesinaphillips marked this pull request as ready for review October 24, 2024 20:52
@jpuerto-psc jpuerto-psc merged commit 73d7aff into devel Oct 25, 2024
8 checks passed