Skip to content

Fastest implementation of BED file sorting [genomics]

Notifications You must be signed in to change notification settings

balwierz/pioSortBed

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Fastest implementation of BED file sorting [genomics]

It replaces the UNIX sort -k1,1 -k2,2n file.bed (in LC_ALL=C) and the bedtools sort commands.

Input files: BED3, BED6+n etc

It uses (what I believe is called) bucket sorting. So effectively does not do any coordinate comparisons, just the indexing. In this way it is not anymore O(n*log(n)) problem, but O(n+m) problem, where n is the numer of reads and m it the maximum length of a chromosome/contig. So for large datasets it runs in linear time of the number of reads. But it sucks on really small files

pioSortBed needs to store all data in the memory. Roughly twice as much memory needed than the size of the BED file. It is possible to add some swapping in the future.

There are some compilation-time limits on the lenghts of lines [1024], chromosome name lengths [256] and chromosome length limits [1Gbp]. You can change these and recompile.

It can do some trivial operations like collapsing regions if they are multiple lines regions with the same coordinates. It does not aim at replacing bedops, bedtools, GenomicRanges etc

It is probably compatible with Unicode characters in read names :-) Uses Boost for hast tables and command line opions.

Piotr Balwierz Imperial College London

About

Fastest implementation of BED file sorting [genomics]

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published