Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False negatives - some sort of interaction with gzip / bgzip #26

Open
rmcolq opened this issue Aug 6, 2024 · 2 comments
Open

False negatives - some sort of interaction with gzip / bgzip #26

rmcolq opened this issue Aug 6, 2024 · 2 comments

Comments

@rmcolq
Copy link

rmcolq commented Aug 6, 2024

Strange bug, and might be more of a user error

I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.

If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.

All works as expected with the second file and it finds all the kmers for each of the reference sequences.

broken.fa.gz
works.fa.gz

minimal script:

#!/usr/bin/env python

import cobs_index as cobs
from Bio import SeqIO
import gzip

p = cobs.CompactIndexParameters()
p.term_size = 31               # k-mer size
p.clobber = True               # overwrite output and temporary files
p.false_positive_rate = 0.4    # higher false positive rate -> smaller index

broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")

with gzip.open(broken, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")

with gzip.open(works, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)
@shenwei356
Copy link

shenwei356 commented Aug 6, 2024

I tried to index with cobs 0.3.0 with these two files. It funny to see that the numbers of sequences are different.

Works (10 sequences)

> cobs compact-construct works.fa.gz  works.cobs_compact

--- document list (1 entries) ---
FastaFile: loading index works.fa.gz.cobs_cache [10 subsequences]
document[0] size 25796 31-mers 92788 : works.fa.gz : works
--- end of document list (1 entries) ---
documents: 1
minimum 31-mers: 92788
maximum 31-mers: 92788
average 31-mers: 92788
total 31-mers: 92788

Broken (7 sequences).

$ cobs compact-construct broken.fa.gz  broken.cobs_compact

--- document list (1 entries) ---
FastaFile: loading index broken.fa.gz.cobs_cache [7 subsequences]
document[0] size 26790 31-mers 63522 : broken.fa.gz : broken
--- end of document list (1 entries) ---
documents: 1
minimum 31-mers: 63522
maximum 31-mers: 63522
average 31-mers: 63522
total 31-mers: 63522

But the contents of the two files are identical

zcat works.fa.gz | md5sum 
ef23c6df0195e5b4610c22aa30e5b70d  -
$ zcat broken.fa.gz | md5sum 
ef23c6df0195e5b4610c22aa30e5b70d  -

I guess there's something wrong with the zlib or the sequence parser of COBS (kseq from Heng).

@rmcolq
Copy link
Author

rmcolq commented Aug 6, 2024

Maybe bgzip defines the first block part way through the 7th sequence. But seems like the sort of error which shouldn't be silent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants