MaxQuant scores #82
Comments
@zprobot I already added the score to PSI-MS: id:
|
I have collected them in additional_scores.
|
@zprobot we should discuss the naming of the scores. One idea is to have an additional parquet/csv file, called metadata.csv or metadata.parquet, where we map all the keywords you are using to ontology terms. For example, andromeda_score: Andromeda:delta score, together with the accession in PSI-MS. What do you think? |
Agreed. We can have a mapping table to display these. |
Can you model it? The use case is scores, column names, etc., wherever an acronym is used, for example: It could be called: |
Why do we use acronyms instead of the full name? |
Ah you mean the ontology mapping. But the mapping is defined in the ontology, why would we want to replicate it? |
Yes, for example. We use the following score acronyms right now:
etc. It would be nice to have a mapping table somewhere in which the actual PSI term corresponding to each acronym is annotated, like:
This could help readers understand each column. The idea is that we have to use acronyms because in some cases it is difficult to store the original term from PSI or other ontologies: the terms contain spaces and special characters, so an acronym is the better choice. |
But can't the ontology have synonyms? I feel like this kind of mapping should not be our task. |
Or we say that the name needs to match the ontology_name in snake_case. Only the unnecessary long name of PEP would be a problem here |
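A minimal sketch of the snake_case rule suggested above, assuming that "snake_case" means lowercasing the ontology term name and replacing every run of non-alphanumeric characters with a single underscore. The function name `to_snake_case` is hypothetical and is not part of any existing library; the term names are the ones mentioned in this thread.

```python
import re


def to_snake_case(term_name: str) -> str:
    """Derive a column name from an ontology term name (assumed rule)."""
    # Collapse every run of characters that are not letters or digits
    # (spaces, colons, hyphens, ...) into a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", term_name.strip())
    # Drop underscores left at either end and lowercase the result.
    return cleaned.strip("_").lower()


for name in ("Andromeda:score", "Andromeda:delta score", "posterior error probability"):
    print(f"{name!r} -> {to_snake_case(name)!r}")
```

Under this rule the long name of PEP becomes `posterior_error_probability`, which is the case flagged above as inconveniently long.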
Agreed. But if the terms do not exist now, then I suggest having this table as optional to enable easy search, at least in our toolbox. |
If this is an interim solution, I feel like we can just do without it. It is pretty clear what the score names mean. |
This is why I think it should be optional. These acronyms could be a bigger list, BTW. We use acronyms in scores, table column names, and additional information from the original search engines. |
I still don't like it. Everything that we make optional is an additional if-case for everyone using that format. An additional check to see if that file is just missing or was forgotten. |
This is exactly my point
A lot of terms are not ready for data handling. For example, |
But then, why do you need this table now? Document it for now and hard-code the CV term in a potential validator. |
Because I don't want to hard-code everything in the validator. |
Then you could have the mapping file in your validator, but I would like to avoid a wild west format where people can map score names arbitrarily to some ontology. In the end you will have one dataset where PEP means posterior error probability, and in the other percent endogenous peptide or whatever. |
OK, so your idea is that the format itself, meaning the validator, releases an internal file for the mapping? |
Yes because the mapping should be the same for every dataset out there. |
Then, this file |
Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible. |
I will try to trigger the conversation, but it may take a while 😉. @zprobot the idea is to keep the mapping table within the format library. |
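To illustrate the outcome of this exchange, here is a rough sketch of a single mapping that ships with the format library, so that every dataset resolves a score name to the same term. Everything in it is an assumption for illustration: the names `SCORE_TERMS`, `unknown_score_names`, and `describe` are hypothetical, the two entries are only examples, and the accessions are placeholders rather than real PSI-MS identifiers.

```python
# Hypothetical central mapping, versioned together with the format library.
# "MS:XXXXXXX" is a placeholder, NOT a real PSI-MS accession.
SCORE_TERMS = {
    "andromeda_score": {"cv_name": "Andromeda:score", "accession": "MS:XXXXXXX"},
    "pep": {"cv_name": "posterior error probability", "accession": "MS:XXXXXXX"},
}


def unknown_score_names(names):
    """Return the names that are missing from the central mapping, in input order."""
    return [name for name in names if name not in SCORE_TERMS]


def describe(name):
    """Render one mapping entry as a human-readable line."""
    entry = SCORE_TERMS[name]
    return f"{name} -> {entry['cv_name']} ({entry['accession']})"


# A validator could reject a file whose score names are not in the mapping.
print(unknown_score_names(["andromeda_score", "my_score", "pep"]))
print(describe("pep"))
```

Because the table lives in the library and not next to each dataset, a file cannot redefine `pep` to mean something else, which is the "wild west" concern raised above.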
We can use a unified format to represent the scores given by search engines. like |
Two things:
|
Is the issue with e.g. using
>>> conn.sql("SELECT * FROM test;").show()
┌─────────────────┬─────────────┐
│ Andromeda:score │ scan number │
│      float      │    int32    │
├─────────────────┼─────────────┤
│            10.0 │           1 │
│            24.0 │           2 │
│            -2.0 │           3 │
└─────────────────┴─────────────┘
>>> conn.sql("""SELECT "Andromeda:score" FROM test;""").show()
┌─────────────────┐
│ Andromeda:score │
│      float      │
├─────────────────┤
│            10.0 │
│            24.0 │
│            -2.0 │
└─────────────────┘
I agree with the argument that adding an alias table that introduces a combinatorial expansion of possible names for common columns is a big footgun. The use of CURIEs is maximally stable, but minimally readable. The use of CURIE-backed names is a good compromise between readability and stability. If the CURIE-backed name isn't convenient, synonyms in the controlled vocabulary centralize the aliases, although if every term is heavily aliased we've not done anyone any favors. |
I was thinking of a more |
Backing up a step, aren't scores encoded as pairs?
{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [
      {"name": "score_name", "type": "string"},
      {"name": "score_value", "type": "float32"}
    ]
  }
}
or did this change while I wasn't paying attention? |
This is the way it is implemented:
The point is that that name could be |
I guess for additional_scores you can just add another field "CV term" and then use any "name" you like. But I thought you are also worried about other columns? |
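The suggestion above can be sketched in plain Python, assuming each entry of additional_scores is the (score_name, score_value) pair from the schema earlier in the thread plus one extra field. The field name `cv_term`, the class `AdditionalScore`, and the helper `get_score` are hypothetical and only illustrate the idea; they are not the format's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdditionalScore:
    score_name: str                 # free-form name chosen by the writer
    score_value: float
    cv_term: Optional[str] = None   # hypothetical extra field: the CV term name


def get_score(scores, cv_term):
    """Return the value of the first score annotated with cv_term, or None."""
    for score in scores:
        if score.cv_term == cv_term:
            return score.score_value
    return None


scores = [
    AdditionalScore("andromeda_score", 24.0, cv_term="Andromeda:score"),
    AdditionalScore("my_tool_score", 0.5),  # no CV term available
]
print(get_score(scores, "Andromeda:score"))
```

With this layout a reader looks scores up by CV term and treats `score_name` as a display label only, so the choice of acronym no longer affects interpretation.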
But in that case the score names are strings, all requiring quoting, and where "special characters" do not matter unless you are typing them out by hand repeatedly for an ad hoc query. Assuming a QWERTY layout, for I suppose the goal here is to produce a file format that is suited directly to the |
We can provide a file like this. It is used to describe the information of all fields currently in use. |
@zprobot :
I have been looking at some MaxQuant examples for ms/ms. MaxQuant has the following scores:
Additionally, the delta score needs to be added to PSI-MS before it can be included in the ms/ms: HUPO-PSI/psi-ms-CV#356