Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pairtools phase - issue with phasing #253

Open
ddepierre opened this issue Oct 11, 2024 · 0 comments
Open

Pairtools phase - issue with phasing #253

ddepierre opened this issue Oct 11, 2024 · 0 comments

Comments

@ddepierre
Copy link

Hi Open2C team,

I'd like to use pairtools phase and I have some issues running it. I'd like to use it in a first time to reanalyse snHiC data from Collombet et al.,2020.
following this https://pairtools.readthedocs.io/en/latest/examples/pairtools_phase_walkthrough.html

1. Test data set:
Here are my final phase stats (with the small test, i.e. fastq downloaded with --minSpotId 0 --maxSpotId 1000000):


# cat phased.XA.dedup.stats | head -8
total	1429506
total_unmapped	203628
total_single_sided_mapped	278449
total_mapped	947429
total_dups	60249
total_nodups	887180
cis	152576
trans	734604

# cat phased.XA.phase0.stats | head -8
total	3704
total_unmapped	0
total_single_sided_mapped	0
total_mapped	3704
total_dups	257
total_nodups	3447
cis	3304
trans	143
# cat phased.XA.phase1.stats | head -8
total	3102
total_unmapped	0
total_single_sided_mapped	0
total_mapped	3102
total_dups	279
total_nodups	2823
cis	2389
trans	434
# cat phased.XA.unphased.stats | head -8
total	1203742
total_unmapped	0
total_single_sided_mapped	264503
total_mapped	939239
total_dups	59712
total_nodups	879527
cis	125709
trans	753818
# cat phased.XA.trans-phase.stats | head -8
total	1384
total_unmapped	0
total_single_sided_mapped	0
total_mapped	1384
total_dups	1
total_nodups	1383
cis	985
trans	398

I don't get why I don't have the same stats than in your tutorial, but the ratio of cis in phase0 / cis total is about the same range, I have here 2.5% (3304/(125709+985+2389+3304)) of cis in [hase0, in your example you have 1% (535/(52294+177+394+535)).

When I run the same example with the full fastq files (stll with SRR8811373), here are my final stats:


# cat phased.XA.dedup.stats | head -35
total	6923823
total_unmapped	953932
total_single_sided_mapped	1401139
total_mapped	4568752
total_dups	816275
total_nodups	3752477
cis	3045355
trans	707122


# cat phased.XA.phase0.stats | head -20
total	17866
total_unmapped	0
total_single_sided_mapped	0
total_mapped	17866
total_dups	4639
total_nodups	13227
cis	12685
trans	542
# cat phased.XA.phase1.stats | head -20
total	14771
total_unmapped	0
total_single_sided_mapped	0
total_mapped	14771
total_dups	4201
total_nodups	10570
cis	9202
trans	1368
# cat phased.XA.trans-phase.stats | head -20
total	6848
total_unmapped	0
total_single_sided_mapped	0
total_mapped	6848
total_dups	27
total_nodups	6821
cis	4867
trans	1954
# cat phased.XA.unphased.stats | head -20
total	5857878
total_unmapped	0
total_single_sided_mapped	1328611
total_mapped	4529267
total_dups	807408
total_nodups	3721859
cis	2529823
trans	1192036


Here the ratio of cis in phase0 / cis total is 0.4% (12685/(2529823+4867+12685+9202))

Do you know why the ratio of contact mapped on phase0 haplotype drops when I use the full dataset? do you have the same? I was expecting this ratio to be the same independently of using a full dataset or a sample.

2. Comparison with processed data:

So I compared with the number or ratio of pairs that they report in the original dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129029)


head GSM3691125_2CSE_70.allValidPairs.txt
NB501547:21:HCVWNAFXX:4:11612:3393:4075	chr1	3000757	+	chr3	14672639	-	242	HIC_chr1_2	HIC_chr3_31231	34	12	0-0
NB501547:21:HCVWNAFXX:3:11402:2702:9569	chr1	3003299	-	chr1	3266239	+	119	HIC_chr1_7	HIC_chr1_786	42	42	2-0
NB501547:21:HCVWNAFXX:3:11612:25249:9112	chr1	3004023	+	chr1	10608406	-	261	HIC_chr1_13	HIC_chr1_20557	38	42	0-2
NB501547:21:HCVWNAFXX:3:21510:8573:6102	chr1	3004278	-	chr4	20056554	-	182	HIC_chr1_15	HIC_chr4_43236	42	42	0-0
NB501547:21:HCVWNAFXX:2:21109:18612:1489	chr1	3004451	-	chr1	3008304	+	175	HIC_chr1_16	HIC_chr1_24	42	42	0-0
NB501547:21:HCVWNAFXX:1:21307:17607:14306	chr1	3005602	-	chr1	4002173	-	149	HIC_chr1_18	HIC_chr1_2746	42	42	0-0
NB501547:21:HCVWNAFXX:3:11507:26752:1374	chr1	3011527	-	chr1	9745573	+	148	HIC_chr1_35	HIC_chr1_18503	42	42	0-0
NB501547:21:HCVWNAFXX:4:11602:20611:17311	chr1	3013889	-	chr3	14672639	-	399	HIC_chr1_40	HIC_chr3_31231	42	30	1-0

# total number of pairs
wc -l GSM3691125_2CSE_70.allValidPairs.txt
1209391 GSM3691125_2CSE_70.allValidPairs.txt

# number of phased pairs (with 1 being BL6, 2 CAST, 0 unassigned)

grep '0-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
456762
grep '1-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
82362
grep '2-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
97820
grep '1-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
2773
grep '2-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
2356
grep '0-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
126764
grep '1-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
127739
grep '2-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
156166
grep '0-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
155665


Then here the ratio of 1-1 over total (= phase0 in pairtools phase) is 6% (82362/1209391)
I guess numbers of aligned reads, detected pairs and cis-contact will vary depending on quality threshold applied and algorithm used, but I don't get why it is so different between HiCPro output and pairtools phase on the full fastq file.

Additional questions:

Can I combine pairtools phase with pairtools dedub by restriction fragment?
Can I combine pairtools phase with filterByCov?
Something like
pairtools parse -> pairtools phase -> pairtools sort -> pairtools dedup amplification duplicates -> pairtools restrict -> pairtools dedup by restriction fragment -> pairtools filterbycov -> pairtools select by phase

Thanks you for your help!

David

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty

pairtools, version 1.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant