MaxQuant scores #82

Open
ypriverol opened this issue Nov 10, 2024 · 32 comments

@ypriverol
Member

@zprobot :

I have been looking at some MaxQuant examples for ms/ms. MaxQuant has the following scores:

Additionally, the delta score needs to be added to PSI-MS before we can include it in the ms/ms: HUPO-PSI/psi-ms-CV#356

@ypriverol
Member Author

@zprobot I already added the score to PSI-MS:

id: MS:1003433
name: Andromeda:delta score

@zprobot
Collaborator

zprobot commented Nov 11, 2024

I have collected them in additional_scores.

- Score -> andromeda_score
- Delta score -> delta_score

@ypriverol
Member Author

@zprobot we should discuss the naming of the scores. One idea I have is an additional parquet/csv file called metadata.csv or metadata.parquet where we map all the keywords you are using to ontology terms. For example andromeda_score: Andromeda:delta score, together with the accession in PSI-MS.

What do you think?

@zprobot
Collaborator

zprobot commented Nov 11, 2024

Agreed. We can have a mapping table to display these.

@ypriverol
Member Author

ypriverol commented Nov 11, 2024

Can you model it? The use case is that for scores, column names, etc., where an acronym is used (for example posterior_error_probability), we can find the correct CV term for each in that table. @jpfeuffer what do you think?

It could be called: psi-ms-terms.parquet

@jpfeuffer
Contributor

Why do we use acronyms instead of the full name?

@jpfeuffer
Contributor

Ah you mean the ontology mapping. But the mapping is defined in the ontology, why would we want to replicate it?
Just use the full/display name of the ontology entry.

@ypriverol
Member Author

Yes, for example, we use the following score acronyms right now:

posterior_error_probability
andromeda_score
msgf_rawscore

etc.

It would be nice to have a mapping table somewhere in which the actual PSI term corresponding to each acronym is annotated, like:

| term | ontology_name | ontology_accession |
| --- | --- | --- |
| posterior_error_probability | posterior error probability from identification based on multiple spectra | MS:1003336 |
| andromeda_score | Andromeda:score | MS:1002338 |
| msgf_rawscore | MS-GF:RawScore | MS:1002049 |
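
A minimal sketch of how such a mapping table could be written and read back, using only the three rows quoted in this comment. The CSV layout and the helper names `write_terms` and `load_terms` are assumptions for illustration, not part of the format.

```python
import csv
import io

# Rows copied from the mapping proposed in this comment.
# The file layout (one CSV row per acronym) is an assumption.
TERMS = [
    ("posterior_error_probability",
     "posterior error probability from identification based on multiple spectra",
     "MS:1003336"),
    ("andromeda_score", "Andromeda:score", "MS:1002338"),
    ("msgf_rawscore", "MS-GF:RawScore", "MS:1002049"),
]


def write_terms(handle):
    """Write the header and all mapping rows to a text handle."""
    writer = csv.writer(handle)
    writer.writerow(["term", "ontology_name", "ontology_accession"])
    writer.writerows(TERMS)


def load_terms(handle):
    """Return {term: (ontology_name, ontology_accession)}."""
    return {
        row["term"]: (row["ontology_name"], row["ontology_accession"])
        for row in csv.DictReader(handle)
    }


# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_terms(buffer)
buffer.seek(0)
lookup = load_terms(buffer)
print(lookup["andromeda_score"])  # ('Andromeda:score', 'MS:1002338')
```

The same three columns would carry over unchanged to a parquet file if that is preferred over CSV.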

This could help users understand each column. The idea is that we have to use acronyms because in some cases it is difficult to store the original term from PSI or other ontologies: they contain spaces and special characters, so an acronym works better.

@jpfeuffer
Contributor

But can't the ontology have synonyms? I feel like this kind of mapping should not be our task.

@jpfeuffer
Contributor

Or we say that the name needs to match the ontology_name in snake_case. Only the unnecessarily long name of PEP would be a problem here.
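
A rough sketch of what "the ontology_name in snake_case" could mean in practice. The function name `ontology_name_to_column` and the exact rule (lowercase, then collapse every run of non-alphanumeric characters into one underscore) are assumptions; the thread does not fix a rule.

```python
import re


def ontology_name_to_column(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into "_"."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


for name in ("Andromeda:score", "Andromeda:delta score", "MS-GF:RawScore"):
    print(name, "->", ontology_name_to_column(name))
# Andromeda:score -> andromeda_score
# Andromeda:delta score -> andromeda_delta_score
# MS-GF:RawScore -> ms_gf_rawscore
```

Note that under this particular rule MS-GF:RawScore becomes ms_gf_rawscore rather than the msgf_rawscore acronym used earlier in the thread, so a purely mechanical conversion would not reproduce every existing name.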

@ypriverol
Member Author

Agreed. But if the terms do not exist now, then I suggest having this table as optional to enable easy search, at least in our toolbox.

@jpfeuffer
Contributor

If this is an interim solution, I feel like we can just do without it. It is pretty clear what the score names mean.
I really want to avoid having yet another table.

@ypriverol
Member Author

ypriverol commented Nov 11, 2024

This is why I think it should be optional. These acronyms could be a bigger list, BTW. We use acronyms in scores, table column names, and additional information from the original search engines.

@jpfeuffer
Contributor

I still don't like it. Everything that we make optional is an additional if-case for everyone using that format. An additional check to see if that file is just missing or was forgotten.
It also allows people to circumvent ontologies and start their own naming schemes, etc.

@ypriverol
Member Author

This is exactly my point

> It also allows people to circumvent ontologies and start their own naming schemes etc

A lot of terms are not ready for data handling. For example, percolator:PEP is awkward if you want to avoid escaping special characters like :, and other terms can be worse. This is why I have started to use acronyms, which make it easy to query and sort by score in DuckDB, etc.

@jpfeuffer
Contributor

But then, why do you need this table now? Document it for now and hard-code the CV term in a potential validator.
Once the synonym is available in the ontology, you can switch the validator from a hard-coded dict to an actual ontology lookup
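
A minimal sketch of that approach: the validator takes a lookup function, which today is backed by a hard-coded dict and could later be replaced by a real ontology lookup. The accessions are the ones quoted in this thread; the names `HARD_CODED_TERMS`, `lookup_hard_coded` and `unknown_score_names` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hard-coded for now; accessions are the ones quoted in this thread.
HARD_CODED_TERMS: Dict[str, str] = {
    "posterior_error_probability": "MS:1003336",
    "andromeda_score": "MS:1002338",
    "msgf_rawscore": "MS:1002049",
}


def lookup_hard_coded(name: str) -> Optional[str]:
    """Resolve a score name to a CV accession, or None if it is unknown."""
    return HARD_CODED_TERMS.get(name)


def unknown_score_names(
    names: Iterable[str],
    lookup: Callable[[str], Optional[str]] = lookup_hard_coded,
) -> List[str]:
    """Return the names that the lookup cannot resolve to a CV accession."""
    return [name for name in names if lookup(name) is None]


print(unknown_score_names(["andromeda_score", "my_custom_pep"]))
# ['my_custom_pep']
```

Switching to an ontology-backed lookup would then only mean passing a different `lookup` callable, without touching the validation logic.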

@ypriverol
Member Author

Because I don't want to hardcode everything in the validator.

@jpfeuffer
Contributor

Then you could have the mapping file in your validator, but I would like to avoid a wild west format where people can map score names arbitrarily to some ontology. In the end you will have one dataset where PEP means posterior error probability, and in the other percent endogenous peptide or whatever.

@ypriverol
Member Author

OK, so your idea is that the format itself, meaning the validator, releases an internal file for the mapping?

@jpfeuffer
Contributor

Yes because the mapping should be the same for every dataset out there.

@ypriverol
Member Author

Then, this file psi-ms-terms.parquet could be an internal file maintained by us?

@jpfeuffer
Contributor

Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

@ypriverol
Member Author

> Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

I will try to trigger the conversation, but it may take a while 😉. @zprobot the idea is to keep the mapping table within the format library.

@zprobot
Collaborator

zprobot commented Nov 11, 2024

We can use a unified format to represent the scores given by search engines, like {software}_score.
I think the mapping table is just for display purposes, used to view the available optional fields.

@ypriverol
Member Author

Two things:

  • Yes @zprobot for the acronyms, you can use that style {software}_score.
  • Yes, the mapping is mainly for display, to help users know what the score is. As @jpfeuffer said, users may understand the score from the acronym alone, but the table makes sure, via an ontology term, that users know exactly which score we are referring to.

@mobiusklein
Contributor

Is the issue with : and space-containing column names that they are impossible or not ergonomic? Most SQL engines support column names that aren't "proper identifiers" when they are enclosed in double quotes. This applies to DuckDB, and I also tested it with pyarrow/pyarrow.parquet and datafusion.

e.g. using duckdb from Python with a test table mocked up for convenience:

>>> conn.sql("SELECT * FROM test;").show()
┌─────────────────┬─────────────┐
│ Andromeda:score │ scan number │
│      float      │    int32    │
├─────────────────┼─────────────┤
│            10.0 │           1 │
│            24.0 │           2 │
│            -2.0 │           3 │
└─────────────────┴─────────────┘

>>> conn.sql("""SELECT "Andromeda:score" FROM test;""").show()
┌─────────────────┐
│ Andromeda:score │
│      float      │
├─────────────────┤
│            10.0 │
│            24.0 │
│            -2.0 │
└─────────────────┘

I agree with the argument that adding an alias table that introduces a combinatorial expansion of possible names for common columns is a big footgun.

The use of CURIEs is maximally stable, but minimally readable. The use of CURIE-backed names is a good compromise between readability and stability. If the CURIE-backed name isn't convenient, synonyms in the controlled vocabulary centralize the aliases, although if every term is heavily aliased we've not done anyone any favors.

@ypriverol
Member Author

I was thinking of something more ergonomic, meaning that users don't need to deal with escaping so many characters.

@mobiusklein
Contributor

Backing up a step, aren't scores encoded as pairs?

{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [
      {"name": "score_name", "type": "string"},
      {"name": "score_value", "type": "float32"}
    ]
  }
}

or did this change while I wasn't paying attention?

@ypriverol
Member Author

ypriverol commented Nov 11, 2024

This is the way it is implemented:

{"name": "additional_scores",
 "type": {"type": "array",
          "items": {"type": "struct",
                    "field": {
                        "name": "string",
                        "value": "float32"
                    }
          }
 }
}

The point is that the name could be Andromeda:score or Andromeda:delta score, which is not nice to filter, group, etc.

@jpfeuffer
Contributor

I guess for additional_scores you can just add another field "CV term" and then use any "name" you like.
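
One way this suggestion could look, sketched with plain Python objects rather than the real parquet schema. The field name `cv_accession`, the `Score` class, and the helper `values_for_accession` are hypothetical, and the numeric values are made up for illustration; the two accessions are the ones quoted earlier in the thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Score:
    name: str          # free-form, query-friendly name
    cv_accession: str  # hypothetical extra field carrying the CV term
    value: float       # made-up value, for illustration only


additional_scores: List[Score] = [
    Score("andromeda_score", "MS:1002338", 10.0),
    Score("delta_score", "MS:1003433", 4.5),
]


def values_for_accession(scores: List[Score], accession: str) -> List[float]:
    """Select score values by CV accession, ignoring the free-form name."""
    return [s.value for s in scores if s.cv_accession == accession]


print(values_for_accession(additional_scores, "MS:1003433"))  # [4.5]
```

With the accession stored next to each score, the free-form name can differ between datasets while the meaning stays anchored to the ontology.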

But I thought you were also worried about other columns?

@mobiusklein
Contributor

mobiusklein commented Nov 11, 2024

But in that case the score names are strings, all requiring quoting, and where "special characters" do not matter unless you are typing them out by hand repeatedly for an ad hoc query.
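
To illustrate the point, here is a small sketch in plain Python that mimics the name/value struct layout and sorts by a score whose name contains a colon. The rows reuse the mock values from the DuckDB example earlier in the thread; the helper `score_value` is hypothetical.

```python
# Each row carries its scores as a list of {"name", "value"} structs,
# mirroring the additional_scores layout discussed above.
rows = [
    {"scan": 1, "additional_scores": [{"name": "Andromeda:score", "value": 10.0}]},
    {"scan": 2, "additional_scores": [{"name": "Andromeda:score", "value": 24.0}]},
    {"scan": 3, "additional_scores": [{"name": "Andromeda:score", "value": -2.0}]},
]


def score_value(row, name):
    """Return the value of the named score in a row, or None if absent."""
    for entry in row["additional_scores"]:
        if entry["name"] == name:
            return entry["value"]
    return None


# The score name is an ordinary string literal, so ":" needs no escaping.
ranked = sorted(rows, key=lambda r: score_value(r, "Andromeda:score"), reverse=True)
print([r["scan"] for r in ranked])  # [2, 1, 3]
```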

Assuming a QWERTY layout, for Andromeda:score vs andromeda_score, you press only one extra key to write the CV name instead of the snake_case'd name due to the shift-key for the uppercase "A", the ":" and "_" both cost a shift. For Andromeda:delta score vs andromeda_delta_score you actually break even because you convert the space into a "_" which costs an extra shift, balancing the cost of the extra capitalization.

I suppose the goal here is to produce a file format that is suited directly to the quantms pipeline's output though, in which case adding a new search engine is a breaking change in any case, so updating an alias table is par for the course.

@zprobot
Collaborator

zprobot commented Nov 13, 2024

We can provide a file like this. It is used to describe the information of all fields currently in use.
Fields
