MaxQuant scores #82

ypriverol · 2024-11-10T08:16:38Z

I have been looking at some MaxQuant examples for ms/ms. MaxQuant has the following scores:

score -> Andromeda:score

Additionally, the delta score needs to be added to PSI-MS before it can be included in the ms/ms: HUPO-PSI/psi-ms-CV#356

ypriverol · 2024-11-11T06:32:13Z

@zprobot I already added the score to PSI-MS:

id: MS:1003433
name: Andromeda:delta score

zprobot · 2024-11-11T06:52:42Z

I have collected them in additional_scores.

- Score -> andromeda_score
- Delta score -> delta_score

ypriverol · 2024-11-11T06:56:47Z

@zprobot we should discuss the naming of the scores. One idea I have is an additional parquet/csv file called metadata.csv or metadata.parquet where we map all the keywords you are using to ontology terms. For example, andromeda_score would map to Andromeda:score, and the table would also provide the accession in PSI-MS.

What do you think?

zprobot · 2024-11-11T07:03:55Z

Agreed. We can have a mapping table to display these.

ypriverol · 2024-11-11T07:36:29Z

Can you model it? The use case is scores, column names, etc. where an acronym is used, for example posterior_error_probability: we can find the correct CV term for each in that table. @jpfeuffer what do you think?

It could be called: psi-ms-terms.parquet

jpfeuffer · 2024-11-11T08:09:54Z

Why do we use acronyms instead of the full name?

jpfeuffer · 2024-11-11T08:36:26Z

Ah you mean the ontology mapping. But the mapping is defined in the ontology, why would we want to replicate it?
Just use the full/display name of the ontology entry.

ypriverol · 2024-11-11T09:18:32Z

Yes, for example, we use the following score acronyms right now:

posterior_error_probability
andromeda_score
msgf_rawscore

etc.

It would be nice to have a mapping table somewhere in which the actual PSI term corresponding to each acronym is annotated, like:

term	ontology_name	ontology_accession
posterior_error_probability	posterior error probability from identification based on multiple spectra	MS:1003336
andromeda_score	Andromeda:score	MS:1002338
msgf_rawscore	MS-GF:RawScore	MS:1002049

This could help users understand each column, etc. The idea is that we have to use acronyms because in some cases it is difficult to store the original term from PSI or other ontologies: they contain spaces and special characters, so an acronym is better.
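A minimal sketch of how such a mapping table could be loaded and queried, assuming it is stored as tab-separated text with the three columns shown above. The `load_term_map` helper and the in-memory `TERMS_TSV` string are hypothetical names used only for illustration; a real table might live in a parquet file instead.

```python
import csv
import io

# Rows copied from the table above; a real file would be read from disk.
TERMS_TSV = (
    "term\tontology_name\tontology_accession\n"
    "posterior_error_probability\tposterior error probability from "
    "identification based on multiple spectra\tMS:1003336\n"
    "andromeda_score\tAndromeda:score\tMS:1002338\n"
    "msgf_rawscore\tMS-GF:RawScore\tMS:1002049\n"
)


def load_term_map(text):
    """Parse the TSV into {term: (ontology_name, ontology_accession)}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["term"]: (row["ontology_name"], row["ontology_accession"])
        for row in reader
    }


term_map = load_term_map(TERMS_TSV)
name, accession = term_map["andromeda_score"]
print(name, accession)
```

The acronym stays the key used in queries, while the ontology name and accession are only looked up when a user needs to know what a column means.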

jpfeuffer · 2024-11-11T09:20:07Z

But can't the ontology have synonyms? I feel like this kind of mapping should not be our task.

jpfeuffer · 2024-11-11T09:21:52Z

Or we say that the name needs to match the ontology_name in snake_case. Only the unnecessarily long name of PEP would be a problem here.
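One way the snake_case rule could look, as a rough sketch. The `to_snake_case` function and its exact rule (lowercase, then collapse every run of non-alphanumeric characters into one underscore) are assumptions, not something agreed in this thread.

```python
import re


def to_snake_case(ontology_name):
    """Lowercase the name and collapse each run of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", ontology_name.lower()).strip("_")


converted = {
    name: to_snake_case(name)
    for name in ("Andromeda:score", "Andromeda:delta score", "MS-GF:RawScore")
}
for original, snake in converted.items():
    print(original, "->", snake)
```

Under this particular rule MS-GF:RawScore becomes ms_gf_rawscore rather than the msgf_rawscore used above, so the exact conversion would have to be pinned down before names can be derived mechanically.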

ypriverol · 2024-11-11T09:22:43Z

Agreed. But if the terms do not exist now, then I suggest having this table as optional to enable easy search, at least in our toolbox.

jpfeuffer · 2024-11-11T09:24:54Z

If this is an interim solution, I feel like we can just do without it. It is pretty clear what the score names mean.
I really want to avoid having yet another table.

ypriverol · 2024-11-11T09:42:23Z

This is why I think it should be optional. These acronyms could be a bigger list, BTW. We use acronyms in scores, table column names, and additional information from the original search engines.

jpfeuffer · 2024-11-11T09:48:56Z

I still don't like it. Everything that we make optional is an additional if-case for everyone using that format. An additional check to see if that file is just missing or was forgotten.
It also allows people to circumvent ontologies and start their own naming schemes, etc.

ypriverol · 2024-11-11T09:52:10Z

This is exactly my point

> It also allows people to circumvent ontologies and start their own naming schemes etc

A lot of terms are not ready for data handling. For example, percolator:PEP is awkward if you want to avoid special characters like ":", and some terms are worse. This is why I have started to use acronyms, which make it easy to query and sort by score in DuckDB, etc.

jpfeuffer · 2024-11-11T09:59:18Z

But then, why do you need this table now? Document it for now and hard-code the CV term in a potential validator.
Once the synonym is available in the ontology, you can switch the validator from a hard-coded dict to an actual ontology lookup
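A sketch of what that could look like, assuming a validator that receives a list of score names. The names `KNOWN_SCORES`, `lookup_hardcoded`, and `validate_score_names` are hypothetical; the accessions are the ones quoted in the mapping table earlier in this thread.

```python
# Hard-coded for now; accessions copied from the mapping table in this thread.
KNOWN_SCORES = {
    "andromeda_score": "MS:1002338",
    "msgf_rawscore": "MS:1002049",
}


def lookup_hardcoded(score_name):
    """Return the CV accession for a score name, or None if it is unknown."""
    return KNOWN_SCORES.get(score_name)


def validate_score_names(score_names, lookup=lookup_hardcoded):
    """Return the score names that the lookup cannot resolve to a CV term."""
    return [name for name in score_names if lookup(name) is None]


unknown = validate_score_names(["andromeda_score", "my_custom_score"])
print(unknown)
```

Because the lookup is passed in as a function, the hard-coded dict could later be replaced by a real ontology synonym lookup without touching the validation logic.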

ypriverol · 2024-11-11T11:20:25Z

Because I don't want to hard-code everything in the validator.

jpfeuffer · 2024-11-11T11:28:33Z

Then you could have the mapping file in your validator, but I would like to avoid a wild west format where people can map score names arbitrarily to some ontology. In the end you will have one dataset where PEP means posterior error probability, and in the other percent endogenous peptide or whatever.

ypriverol · 2024-11-11T11:31:47Z

OK, so your idea is that the format itself, meaning the validator, releases an internal file for the mapping?

jpfeuffer · 2024-11-11T11:34:24Z

Yes because the mapping should be the same for every dataset out there.

ypriverol · 2024-11-11T11:35:39Z

Then, this file psi-ms-terms.parquet could be an internal file maintained by us?

jpfeuffer · 2024-11-11T11:41:32Z

Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

ypriverol · 2024-11-11T11:42:51Z

> Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

I will try to trigger the conversation, but it may take a while 😉. @zprobot the idea is to keep the mapping table within the format library.

zprobot · 2024-11-11T12:43:54Z

We can use a unified format to represent the scores given by search engines, like {software}_score.
I think the mapping table is just for display purposes, used to view the available optional fields.

ypriverol · 2024-11-11T12:50:24Z

Two things:

- Yes @zprobot, for the acronyms you can use that style, {software}_score.
- Yes, the mapping is mainly for display, to help users know what each score is. As @jpfeuffer said, users may understand the score from the acronym, but the table makes sure users know, via an ontology, which score we are referring to.

mobiusklein · 2024-11-11T16:46:59Z

Is the issue with : and space-containing column names that they are impossible or not ergonomic? Most SQL engines support column names that aren't "proper identifiers" enclosed in double quotes. This applies to DuckDB, as well as tested with pyarrow/pyarrow.parquet and datafusion.

e.g. using duckdb from Python with a test table mocked up for convenience:

>>> conn.sql("SELECT * FROM test;").show()
┌─────────────────┬─────────────┐
│ Andromeda:score │ scan number │
│      float      │    int32    │
├─────────────────┼─────────────┤
│            10.0 │           1 │
│            24.0 │           2 │
│            -2.0 │           3 │
└─────────────────┴─────────────┘

>>> conn.sql("""SELECT "Andromeda:score" FROM test;""").show()
┌─────────────────┐
│ Andromeda:score │
│      float      │
├─────────────────┤
│            10.0 │
│            24.0 │
│            -2.0 │
└─────────────────┘
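For readers who want a self-contained version including the mocked-up table, here is a sketch using the Python standard library's sqlite3 instead of DuckDB. This is a stand-in engine, not the setup used above; it only illustrates that double-quoted identifiers with ":" and spaces also work in WHERE and ORDER BY clauses.

```python
import sqlite3

# In-memory table with the same column names and values as the example above.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test ("Andromeda:score" REAL, "scan number" INTEGER)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?)", [(10.0, 1), (24.0, 2), (-2.0, 3)]
)

# Filter and sort on the quoted column names.
rows = conn.execute(
    'SELECT "Andromeda:score" FROM test '
    'WHERE "scan number" > 1 ORDER BY "Andromeda:score" DESC'
).fetchall()
print(rows)
```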

I agree with the argument that adding an alias table that introduces a combinatorial expansion of possible names for common columns is a big footgun.

The use of CURIEs is maximally stable, but minimally readable. The use of CURIE-backed names is a good compromise between readability and stability. If the CURIE-backed name isn't convenient, synonyms in the controlled vocabulary centralize the aliases, albeit if every term is heavily aliased we've not done anyone any favors.

ypriverol · 2024-11-11T16:56:49Z

I was thinking of something more ergonomic, meaning users don't need to deal with so many escaped characters.

mobiusklein · 2024-11-11T17:15:22Z

Backing up a step, aren't scores encoded as pairs?

{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [
      {"name": "score_name", "type": "string"},
      {"name": "score_value", "type": "float32"}
    ]
  }
}

or did this change while I wasn't paying attention?

ypriverol · 2024-11-11T19:37:58Z

This is the way it is implemented:

{
  "name": "additional_scores",
  "type": {
    "type": "array",
    "items": {
      "type": "struct",
      "field": {
        "name": "string",
        "value": "float32"
      }
    }
  }
}

The point is that the name could be Andromeda:score or Andromeda:delta score, which is not nice to filter, group, etc.

jpfeuffer · 2024-11-11T20:04:42Z

I guess for additional_scores you can just add another field "CV term" and then use any "name" you like.
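A small sketch of that idea, assuming each entry in additional_scores gains a third field. The field name `cv_term`, the helper `score_by_cv_term`, and the score values are made up for illustration; the accessions are the ones mentioned earlier in this thread.

```python
# One PSM record whose scores carry a free-form name plus a CV accession.
psm = {
    "additional_scores": [
        {"name": "andromeda_score", "cv_term": "MS:1002338", "value": 87.5},
        {"name": "delta_score", "cv_term": "MS:1003433", "value": 12.25},
    ]
}


def score_by_cv_term(record, accession):
    """Return the value of the first score tagged with the accession, else None."""
    for entry in record["additional_scores"]:
        if entry["cv_term"] == accession:
            return entry["value"]
    return None


delta = score_by_cv_term(psm, "MS:1003433")
print(delta)
```

With the accession stored next to the value, the display name can be anything convenient while lookups stay unambiguous.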

But I thought you are also worried about other columns?

mobiusklein · 2024-11-11T20:14:38Z

But in that case the score names are strings, all requiring quoting, and where "special characters" do not matter unless you are typing them out by hand repeatedly for an ad hoc query.

Assuming a QWERTY layout, for Andromeda:score vs andromeda_score, you press only one extra key to write the CV name instead of the snake_case'd name due to the shift-key for the uppercase "A", the ":" and "_" both cost a shift. For Andromeda:delta score vs andromeda_delta_score you actually break even because you convert the space into a "_" which costs an extra shift, balancing the cost of the extra capitalization.
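The arithmetic can be checked with a short sketch. It assumes a US QWERTY layout and counts every shifted character as two key presses (shift plus the key), with no credit for holding shift across characters; `key_presses` is a hypothetical helper.

```python
import string

# Characters that need the shift key on a US QWERTY layout.
SHIFTED = set(string.ascii_uppercase) | set('~!@#$%^&*()_+{}|:"<>?')


def key_presses(text):
    """Count key presses, charging one extra press for each shifted character."""
    return sum(2 if ch in SHIFTED else 1 for ch in text)


counts = {
    name: key_presses(name)
    for name in (
        "Andromeda:score",
        "andromeda_score",
        "Andromeda:delta score",
        "andromeda_delta_score",
    )
}
for name, presses in counts.items():
    print(name, presses)
```

This gives 17 vs 16 presses for the first pair and 23 vs 23 for the second, matching the one-extra-key and break-even figures above.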

I suppose the goal here is to produce a file format that is suited directly to the quantms pipeline's output though, in which case adding a new search engine is a breaking change in any case, so updating an alias table is par for the course.

zprobot · 2024-11-13T14:38:17Z

We can provide a file like this. It is used to describe the information of all fields currently in use.
Fields

ypriverol assigned zprobot and ypriverol Nov 10, 2024
