How to store protein groups and gene groups in feature and psms files. #51
Comments
I think we have two options plus a combination of both:
I guess if we only do one, I would go for the second option as it is easier. I am sure we can find a representation that can accommodate the output of all programs. |
Related issue: vdemichev/DiaNN#22 |
I'm also in favor of jpfeuffer's idea of directly exporting the reported protein group. It may also be necessary to indicate which tool reported it, because the grouping algorithm may differ between tools. |
As long as the protein groups, which in the vast majority of cases (>95%) map to isoforms from a single gene, denote that the peptide can be found in these proteins, I don't have a strong preference. If you want a comparable format, the format should not depend on the software (option 1). I agree that option 2 is easier to implement, but then it should be clear for a feature file which program (and version) was used to generate it. |
Currently, DIA-NN, Spectronaut and other DIA tools report only one protein group per feature/PSM at the feature level, in the same way discussed in vdemichev/DiaNN#22. Each peptide/feature is reported together with all the protein accessions that the feature maps to. The problem comes with MQ, FragPipe and other tools that not only support the protein group but report two things:
I prefer the option of exporting, in the feature and PSM files, all the proteins that the feature maps to, but I understand that this completely removes the tool's inference, which is also not desirable. |
@jpfeuffer what is your take on Yasset's comment? I see the proteins schema file and wonder if inference information could be preserved there. Would that make sense? |
How about just another file with the protein groups. And keep the "list all proteins a peptide maps to" approach for the other files. Or yes, as Timo hinted, add it to the protein file: add another column in the protein table (indicating the group it belongs to), and if it is the master protein of that group or not. If you need information about the group, such as a name, you might want to have a separate file, though. |
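The protein-table variant described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the column names `group_id` and `is_master`, the separate `group_names` table, and all accessions are hypothetical placeholders, not part of any existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRow:
    accession: str
    group_id: str    # hypothetical column: the group this protein belongs to
    is_master: bool  # hypothetical column: True for the group's master protein


# Illustrative rows only; the accessions are placeholders, not real results.
proteins = [
    ProteinRow("PROT_A", "G1", True),
    ProteinRow("PROT_B", "G1", False),
    ProteinRow("PROT_C", "G2", True),
]

# Optional separate table for group-level information such as a name.
group_names = {"G1": "group one", "G2": "group two"}


def members_by_group(rows):
    """Index accessions by group id so that group lookups avoid a full scan."""
    index = {}
    for row in rows:
        index.setdefault(row.group_id, []).append(row.accession)
    return index


def master_of(rows, group_id):
    """Return the master accession of a group, or None if none is flagged."""
    for row in rows:
        if row.group_id == group_id and row.is_master:
            return row.accession
    return None


group_index = members_by_group(proteins)
```

Building the index once is one way to keep group lookups cheap; whether that matters depends on which queries need to be fast.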
You will have to think about which queries you want to be fast. Looking up groups etc. can become tricky or slow. |
One thing that I recently noticed. In theory, if we do not recalculate things ourselves, we might need to add the relevant settings of the software (and version) that was used to the file as well. E.g. I think DIANN has multiple settings related to protein grouping. Maybe not a problem if people use the whole bundle of outputs from quantms (incl. software/settings report). But maybe it is an issue if someone looks at the pqt files separately. |
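A minimal sketch of how the software, its version and the relevant settings could travel with a single file, so that it remains interpretable on its own. The key names (`software_name`, `software_version`, `software_settings`), the version string and the setting shown are all hypothetical; flat string key/value pairs are assumed here because that is the shape file-level metadata usually takes.

```python
import json


def build_provenance(software, version, settings):
    """Serialise tool provenance into flat string key/value pairs."""
    return {
        "software_name": software,
        "software_version": version,
        # Settings are JSON-encoded so nested options survive as one string.
        "software_settings": json.dumps(settings, sort_keys=True),
    }


def read_settings(metadata):
    """Decode the settings stored by build_provenance."""
    return json.loads(metadata["software_settings"])


# Placeholder values: the version and the setting name are hypothetical.
provenance = build_provenance(
    "DIA-NN",
    "x.y.z",
    {"protein_grouping": "example-setting"},
)
```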
General remarks
What is clear
Back to the inference problem:
Current structure based on columns:
What is pending is three columns:
Struct for inference
The other option is to keep in columns only the very basic information, such as all the protein IDs that the peptide maps to and the gene names, and then model the protein group in a struct. That struct will rarely be queried, and it can contain the start and end positions as well as the protein scores, the anchor protein, etc. The struct can be empty for datasets with no inference, or filled for datasets with protein inference. We can have something like:
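As a rough illustration of such a struct, the Python sketch below keeps the basic lists as plain fields and puts the per-protein inference details in an optional nested list. Every field name here (`protein_accessions`, `gene_names`, `inference`, `start`, `end`, `score`, `is_anchor`) is an assumption made for illustration and is not taken from the quantms.io specification; the peptide and accessions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinMapping:
    accession: str
    start: Optional[int] = None    # peptide start position in the protein
    end: Optional[int] = None      # peptide end position in the protein
    score: Optional[float] = None  # protein-level score, if the tool gives one
    is_anchor: bool = False        # True for the anchor protein of the group


@dataclass
class FeatureRecord:
    peptide: str
    protein_accessions: List[str]  # all proteins the peptide maps to
    gene_names: List[str]          # all genes the peptide maps to
    # Empty when the tool reports no protein inference.
    inference: List[ProteinMapping] = field(default_factory=list)

    def anchor(self) -> Optional[str]:
        """Return the anchor accession, or None when no inference is stored."""
        for mapping in self.inference:
            if mapping.is_anchor:
                return mapping.accession
        return None


# A dataset without inference leaves the struct empty.
no_inference = FeatureRecord("PEPTIDEK", ["PROT_A", "PROT_B"], ["GENE_1"])

# A dataset with inference fills it.
with_inference = FeatureRecord(
    "PEPTIDEK",
    ["PROT_A", "PROT_B"],
    ["GENE_1"],
    inference=[
        ProteinMapping("PROT_A", start=10, end=17, score=0.99, is_anchor=True),
        ProteinMapping("PROT_B", start=42, end=49, score=0.80),
    ],
)
```

Queries that only need the protein or gene lists never touch the nested part, which fits the expectation that the struct is rarely queried.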
|
I think it sounds fair to keep track of the inference result in case this is given. To create a Feature file from MaxQuant this will mean that you need to parse both the evidence.txt
Gene Names: gene group a peptide maps to
|
For a while, we have been avoiding Protein Group modelling in the psm and feature files in the format. @jpfeuffer triggered this issue a long time ago. Our main tools, the DIA-NN, OpenMS TMT, and OpenMS LFQ pipelines, handle protein groups in different ways. In addition, we are using the feature file as input to ibaqpy with MaxQuant. I'm now proposing to handle this as ProteinGroup and GeneGroup, each containing a list of all the proteins and genes that the peptide maps to: https://github.com/bigbio/quantms.io/blob/dev/docs/README.adoc#111-common-peptide-fields
It would be good to have your input here. I'm also trying to understand how DIA-NN handles protein inference. In the previous release of quantms, we were taking the Protein.Ids, but I would like to have this documented. Ideas @jpfeuffer @timosachsenberg @zprobot
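A minimal sketch of the proposal, assuming the peptide-to-protein and protein-to-gene mappings are already available as dictionaries. The function name `group_columns`, the sorting, and all identifiers in the example are hypothetical; only the ProteinGroup and GeneGroup names come from the proposal above.

```python
def group_columns(peptide, peptide_to_proteins, protein_to_gene):
    """Collect every protein and gene a peptide maps to, with no inference."""
    # Sorting and de-duplicating is an assumption, made for stable output.
    proteins = sorted(set(peptide_to_proteins.get(peptide, [])))
    genes = sorted({protein_to_gene[p] for p in proteins if p in protein_to_gene})
    return {"ProteinGroup": proteins, "GeneGroup": genes}


# Placeholder mappings, not real identifiers.
peptide_to_proteins = {"PEPTIDEK": ["PROT_B", "PROT_A", "PROT_A"]}
protein_to_gene = {"PROT_A": "GENE_1", "PROT_B": "GENE_1"}

row = group_columns("PEPTIDEK", peptide_to_proteins, protein_to_gene)
```

Two proteins from the same gene yield a ProteinGroup of two entries and a GeneGroup of one, which matches the common case of isoforms from a single gene.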