Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter stats option. #172

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Filter stats option. #172

wants to merge 5 commits into from

Conversation

agalitsyna
Copy link
Member

When binning coolers, the user might provide a set of filters to include certain types of pairs into final matrices. However, for stats, there was no such option. In this PR, we propose two additional distiller processes that will allow collecting and merging the stats for the same set of filters.

Disclaimer: this is an option that does not affect the rest of distiller. The addition of two filter+stats processes is controlled by params.stats.use_filters==True, which is set to False by default.
The addon is backward-compatible with previous versions of params files.

@Phlya
Copy link
Member

Phlya commented Jan 28, 2022

Ran this on one of my real projects. Here are the lines of two files for the same library until the by-distance cis counts, old file and mapq_30 filtered new.

total	740052127
total_unmapped	43208972
total_single_sided_mapped	99919934
total_mapped	596923221
total_dups	52495969
total_nodups	544427252
cis	419698314
trans	124728938
pair_types/MM	22105219
pair_types/WW	18023654
pair_types/NM	1943706
pair_types/RU	487212
pair_types/MR	2706565
pair_types/NU	17540428
pair_types/NN	1136393
pair_types/DD	52495969
pair_types/UU	543450198
pair_types/MU	68737651
pair_types/NR	10935290
pair_types/UR	489842
total	559048104
total_unmapped	12601339
total_single_sided_mapped	0
total_mapped	546446765
total_dups	48264298
total_nodups	498182467
cis	387357306
trans	110825161
pair_types/DD	48264298
pair_types/WW	12601339
pair_types/RU	438586
pair_types/UU	497302503
pair_types/UR	441378

Some strange differences: different total number and different total_unmapped. 0 total_single_sided_mapped in the new file! Very strange. And I guess related, a lot of pair types are missing... I guess is makes sense with the mapq_30 filter if you think about it, but it's a little confusing that even the total number is affected.

@agalitsyna
Copy link
Member Author

@Phlya Can you provide no_filter stats and full stats file, please?

@Phlya
Copy link
Member

Phlya commented Jan 28, 2022

@agalitsyna
Copy link
Member Author

I suggest checking G1_DMSO_rep2.no_filter.hg38.stats.txt against G1_DMSO_rep2.hg38.stats.txt, not G1_DMSO_rep2.mapq_30.hg38.stats.txt

@Phlya
Copy link
Member

Phlya commented Jan 28, 2022

Ah I didn't have the no_filter in this pipeline, actually! Running it

@agalitsyna
Copy link
Member Author

agalitsyna commented Jan 28, 2022

by the way, the total number of reads is inevitably reduced with mapq30 filter, because pairtools select '(mapq1>=30) and (mapq2>=30)' will remove the pairs that have at least one segment with lower mapq than the threshold. I guess that's the filter that you run.

@Phlya
Copy link
Member

Phlya commented Jan 28, 2022

Yes. Technically it makes sense, but actually that's not what you'd expect... Or at least not what I would expect.

@agalitsyna
Copy link
Member Author

Can you think of the script or pairtools command that will do what you expect to have in filtered stats then?
Maybe marking all the filtered read instead of filtering them out? I'm not sure whether there are existing tools for that, though.

@Phlya
Copy link
Member

Phlya commented Jan 28, 2022

I guess just for the stats purposes we can modify the filter to include all unmapped or ss mapped pairs too? I think then everything above the total_mapped will be the same as without the filter, and that is the most confusing part for me.

Marking would be the best though. Or stats could be modified to apply a filter on the fly and accumulate, in this case, stats at different mapq filtering levels? But all this requires modifying pairtools.

@Phlya
Copy link
Member

Phlya commented Jan 29, 2022

So the cluster was super busy, but finally here is an example of no_filter vs mapq_30
G1_DMSO_2.hg38.mapq_30.stats.txt
G1_DMSO_2.hg38.no_filter.stats.txt

@agalitsyna
Copy link
Member Author

Hi, @Phlya Thanks
I think the only important thing to check is convergence of no_filter stats against the default output of distiller stats. Then it means that the pairs merging filters run correctly and as expected by design of pipeline.
I don't think comparison of mapq_30 filter against no_filter gives us any additional information despite that mapq_30 filter indeed filters out some contacts, which is good, but not enough to test the pipeline update

@Phlya
Copy link
Member

Phlya commented Jan 29, 2022

No filter and no_filter are identical

@agalitsyna
Copy link
Member Author

Thanks, cool!
Interesting counterintuitive thing from your amazing stats: mapq_30 filter results in non-zero "total_unmapped". How come the read can have mapq>30 if it is not mapped?
mapq is bwa mem output, while the pair type is an output of pairs parser. Parser in some regimes cannot parse complex walks (reporting WW types instead), which are counted as "unmapped" by pairtools stats

So, with filter stats one should interpret only the pair types that are mapped and actually ended up as meaningful contact pairs in coolers. Any other thoughts?

@Phlya
Copy link
Member

Phlya commented Jan 29, 2022

For individual libraries and library groups the file names are inconsistent: one has assembly name before filter, the other filter before assembly.

@Phlya
Copy link
Member

Phlya commented Jan 31, 2022

I didn't realize WW counted as unmapped, btw! That's a little misleading actually, if one wanted to compare different pipelines, for example... Perhaps any pair types that are not rescued should be counted in another category, "unresolved" or something like that. That would also make the output of this new stats a bit more sensible. Not suggesting it as part of this PR of course, just a thought...

@agalitsyna
Copy link
Member Author

This can be removed now, @Phlya ?

@agalitsyna
Copy link
Member Author

I think all this branch can be deleted now as we modified the pairtools stats to accept filters. @Phlya @golobor , is it okay with you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants