-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create R9.4 E. coli profile #13
Comments
Hi Michael, great, thank you so much for digging into this. The easiest way would be to put it here: https://github.com/karel-brinda/NanoSim-H/tree/master/nanosimh/profiles and create a pull request. Do you think I should make it the "default default" profile? :) |
I would like to make sure it isn't producing rubbish first. Let me simulate some reads from something and plot it. I'll put it up here and see what you think. |
tl;dr There doesn't seem to be any problems with the model. TrainingOk, so I trained the model on E. coli data basecalled with Guppy (v3.1.5) high accuracy model that was sequenced for the ultra-long read protocol from Loman et al.. The reference I used in the training was * container="docker://quay.io/biocontainers/nanosim-h:1.1.0.4--pyr351h24bf2e0_1"
reads="loman_k12.fa"
reference="NC_000913.3.fa"
profile_dir="profile/"
singularity exec "$container" nanosim-h-train \
--infile "$reads" \
"$reference" \
"$profile_dir" This took ~4 days and there was an error at the end saying that nanosim-h couldn't find the R CMD BATCH '--args prefix="profile/"' "model_fitting.R" TestingTo test whether the output is crazy I simulated reads from the TB reference NC_000962.3 ref="NC_000962.3.fa"
container="docker://quay.io/biocontainers/nanosim-h:1.1.0.4--pyr351h24bf2e0_1"
profile="nanosim_training/ecoli/profile"
singularity exec "$container" nanosim-h \
--circular \
"$ref" \
--profile "$profile" The read length distribution is Happy for someone to correct me but this doesn't seem too unusual. Especially given the dataset I simulated this sample from. And if I map the reads to the reference they were simulated from minimap2 -ax map-ont NC_000962.3.fa simulated.fa | samtools sort -o output.bam the percent identity looks like In case there are questions about how the percent identity it calculated I will put the code below def sam_percent_identity(filename: str):
"""Opens a SAM/BAM file and extracts the read percent identity for all
mapped reads that are not supplementary or secondary alignments.
Args:
filename: Path to SAM/BAM file.
Returns:
A list of the percent identity for all valid reads.
"""
# get pysam read option depending on whether file is sam or bam
file_ext = os.path.splitext(filename)[-1]
read_opt = "rb" if file_ext == ".bam" else "r"
# open file
samfile = pysam.AlignmentFile(filename, read_opt)
perc_identities = []
for record in samfile:
# make sure read is mapped, and is not a suppl. or secondary alignment
if record.is_unmapped or record.is_supplementary or record.is_secondary:
continue
pid = get_percent_identity(record)
if pid:
perc_identities.append(pid)
samfile.close()
return perc_identities
def get_percent_identity(read: pysam.AlignedSegment):
"""Calculates the percent identity of a read based on the NM tag if present
, if not calculate from MD tag and CIGAR string.
Args:
read: A read within a sam file (pysam class).
Returns:
The percent identity or None if required fields are not present.
"""
try:
return 100 * (1 - read.get_tag("NM") / read.query_alignment_length)
except KeyError:
try:
return 100 * (
1
- (_parse_md_flag(read.get_tag("MD")) + _parse_cigar(read.cigartuples))
/ read.query_alignment_length
)
except KeyError:
return None
except ZeroDivisionError:
return None
perc_identities = sam_percent_identity("output.bam") |
Given R9.4 is much newer than R9 and not many people really work with R9 now it probably makes sense to make it the default. |
Hi Michael, thank you very much for the update and all the associated work! It looks like a very good improvement. I can't fully review the PR now (from time reasons), but I will do that as soon I can (probably July) and then I'll also release a new version. Thanks. |
Hi Michael, thank you for the absolutely amazing work you did!! My only question is the following – do you think the data that the model was trained on, are "typical"? In my experience, data produced by leading experts on nanopore sequencing are often of much higher quality than what a typical user gets. I can think of two reasons:
Given your experience, does the simulation well correspond to what you get in your lab (from your comment it seems that yes)? Btw. I'm thinking how to incorporate your scripts and text in the repository – it will be very useful for others as well. Thanks again for all the work! |
Hi Karel, I mean obviously, Nic and Josh are leading experts. I think at the time this data was sequenced I would say it was of much higher quality than most people would have been producing. But from being at London Calling recently it seems to me like most people use this protocol now. So this data, whilst not "typical" at the time, would now probably be considered "typical" (or close to). Happy for someone to correct me though as I am dealing a lot with M. tuberculosis data and the data from that organism is much different (read length-wise) than E. coli. If I am being a harsh judge of my own work I would maybe say it would be nice to have done this with a variety of different samples, but sadly it is difficult to get nanopore data with known truth. Within the next year though I should have a treasure trove of TB data with known truth and I would obviously be very happy to update/add a new model with that. Very happy for you to include the scripts etc. |
Hi @karel-brinda ,
I have just finished training a new profile with some E. coli R9.4 data that was basecalled with the latest Guppy high accuracy algorithm (v3.1.5).
Is there a way of "testing" this? Also, what would be the best way for me to share this with you so you can add it to the default profiles?
The text was updated successfully, but these errors were encountered: