Minor inconsistencies in spec #3

Open
bittremieux opened this issue Aug 6, 2021 · 9 comments


@bittremieux

bittremieux commented Aug 6, 2021

There are some minor inaccuracies in some of the examples in the specification draft 12:

  • page 8: EM[R: Methionine sulfone]EVEES[O-phospho-L-serine]PEK -> This term doesn't appear in RESID. Note the leading space, but even without that the name is incorrect. Probably it should be L-methionine sulfone (RESID:AA0251)?
  • page 9: EM[UNIMOD:15]EVEES[UNIMOD:56]PEK -> accession UNIMOD:15 does not exist. In case consistency with the previous examples is desired, UNIMOD:35 corresponds to Oxidation. Same for the invalid example with U:15 just underneath.
  • page 11: EVTSEKC[half-cystine]LEMSC[half-cystine]EFD -> half-cystine should be half cystine (no hyphen).
  • page 14: The mass of HexS is specified with only three decimals, whereas other masses in that list have four decimals. It's also not rounded correctly. Instead use 242.0096 as the mass with four decimals.
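The four-decimal value for HexS can be checked with a few lines of arithmetic. This is a minimal sketch, assuming HexS has the residue composition C6H10O8S (a hexose residue C6H10O5 plus SO3); the names `MONOISOTOPIC`, `HEXS`, and `monoisotopic_mass` are illustrative and not part of the specification.

```python
# Monoisotopic masses of the most abundant isotopes, in Da.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "O": 15.99491461956,
    "S": 31.97207100,
}

# Assumption: HexS = hexose residue (C6H10O5) + SO3.
HEXS = {"C": 6, "H": 10, "O": 8, "S": 1}


def monoisotopic_mass(composition):
    """Sum element masses weighted by their counts."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


mass = monoisotopic_mass(HEXS)
print(f"{mass:.4f}")  # 242.0096
print(f"{mass:.3f}")  # 242.010
```

Under that assumed composition, correct rounding to three decimals gives 242.010 and to four decimals 242.0096.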

More conceptual question:

  • Q: page 14: Parsing glycan compositions is somewhat non-trivial because some labels overlap. It would be easier if spaces between monosaccharides are used (split on space) or cardinality is always specified (split on [a-zA-Z]+\d+). Maybe this can be a bit more strongly recommended in section 4.2.8?
    A: Parsing is possible without enforcing spaces or cardinality by checking for only defined monosaccharides rather than any string.

  • Q: page 18: I'm a bit confused how parsers should interpret that global modifications are isotopes? The examples (13C, 15N, D) don't seem to be specified using a controlled vocabulary, whereas this is the case throughout the rest of the document. Is it that when no @ is used in the global modification part, as specified in section 4.6.2, it should always be considered an isotope instead?
    A: Yes, I currently interpret global modifications of the form INT* LETTER+ SIGNED_INT* as an isotope and global modifications of the form "[" mod "]@" (AA ",")* AA as global amino acid modifications (so square brackets and "@" sign).

  • Q: page 19: How should multiple global modifications on different amino acids be specified? I guess the following example, with a comma separating the global modifications within the angle brackets, would be in line with the spec, but this is not explicitly detailed: <[Carbamidomethyl]@C,[Oxidation]@M>MTPEILTCNSIGCLK.
    A: Multiple global modifications are each specified in their own block between angled brackets.
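The two answers about global modifications can be sketched as a small parser. This is only an illustration of the interpretation given above, not a reference implementation: the regular expressions are one possible reading of `INT* LETTER+ SIGNED_INT*` and `"[" mod "]@" (AA ",")* AA`, and the function name `parse_global_mods` is hypothetical.

```python
import re

# Isotope label, read here as: optional digits, letters, optional signed integer (13C, 15N, D).
ISOTOPE_RE = re.compile(r"\d*[A-Za-z]+(?:[+-]\d+)?")
# Global amino acid modification: "[" mod "]@" followed by comma-separated residues.
FIXED_RE = re.compile(r"\[([^\]]+)\]@([A-Z](?:,[A-Z])*)")
# One angle-bracket block; each global modification gets its own block.
BLOCK_RE = re.compile(r"<([^<>]*)>")


def parse_global_mods(proforma):
    """Return (isotopes, fixed_mods, remainder) for the leading <...> blocks."""
    isotopes, fixed_mods = [], []
    pos = 0
    while (block := BLOCK_RE.match(proforma, pos)) is not None:
        body = block.group(1)
        if (fixed := FIXED_RE.fullmatch(body)) is not None:
            fixed_mods.append((fixed.group(1), fixed.group(2).split(",")))
        elif ISOTOPE_RE.fullmatch(body) is not None:
            isotopes.append(body)
        else:
            raise ValueError(f"unrecognized global modification: <{body}>")
        pos = block.end()
    return isotopes, fixed_mods, proforma[pos:]


print(parse_global_mods("<13C><15N>ATPEILTVNSIGQLK"))
print(parse_global_mods("<[Carbamidomethyl]@C><[Oxidation]@M>MTPEILTCNSIGCLK"))
```

With this reading, the comma-joined form `<[Carbamidomethyl]@C,[Oxidation]@M>` is rejected with a `ValueError`, because a comma inside a block may only separate amino acids.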

@mobiusklein
Collaborator

RE Glycan formula parsing, I thought that spaces were required already. Otherwise, without constructing an unambiguous longest-to-shortest testing order, it wouldn't be possible to solve in the general case without extreme look-ahead. It's still doable with a fixed list of monosaccharides.
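The fixed-list approach can be sketched as follows. This is a rough illustration, assuming a small hand-picked set of monosaccharide labels (`MONOSACCHARIDES` is an illustrative subset, not the list from the spec); it tries longer names before shorter ones so that overlapping labels such as Hex and HexNAc are resolved without spaces or explicit cardinality.

```python
import re

# Assumption: illustrative subset of labels, chosen because several share a prefix.
MONOSACCHARIDES = ["Hex", "HexNAc", "HexS", "HexP", "HexNAcS", "dHex", "NeuAc", "NeuGc", "Pen", "Fuc"]

# Longest-to-shortest order: regex alternation takes the first alternative that matches.
_ordered = sorted(MONOSACCHARIDES, key=len, reverse=True)
TOKEN_RE = re.compile(r"\s*(" + "|".join(map(re.escape, _ordered)) + r")(\d+)?\s*")


def parse_glycan(composition):
    """Return {monosaccharide: count}; spaces and cardinalities are optional."""
    counts = {}
    pos = 0
    while pos < len(composition):
        token = TOKEN_RE.match(composition, pos)
        if token is None:
            raise ValueError(f"unknown monosaccharide at {composition[pos:]!r}")
        name, number = token.group(1), token.group(2)
        counts[name] = counts.get(name, 0) + int(number or 1)
        pos = token.end()
    return counts


print(parse_glycan("HexNAc2Hex5"))  # {'HexNAc': 2, 'Hex': 5}
print(parse_glycan("HexNAcHex"))    # {'HexNAc': 1, 'Hex': 1}
```

The sketch only works because the label list is closed. With arbitrary labels, mandatory spaces or cardinalities would still be the simpler rule.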

For multiple global modifications, they should be in separate angle brackets, following the example in 4.6.1?

Both Carbon 13 and Nitrogen 15: <13C><15N>ATPEILTVNSIGQLK

I think this fits similarly to how curly-brace syntax specifies one labile modification, though in that case it takes the place of the square braces. It would make the angle bracket section really laborious to parse if we had to overload "," to be a possible state transition.

@bittremieux
Author

bittremieux commented Aug 9, 2021

  • Glycans: No, the spaces are optional, with the possibility to make this mandatory mentioned in section 4.2.8:

If glycan symbols conflict with themselves or element symbols in such a way that ambiguities occur, we will consider requiring spaces between 'atoms' (see Formula Rule #1).

And formula rule 1 includes:

Pairs SHOULD be separated by spaces but are not required to be.

Maybe this should be revisited?

  • Global modifications: Ok, makes sense, thanks. I glossed too quickly over the example in 4.6.1.

@bittremieux
Author

bittremieux commented Aug 21, 2021

Additionally, I have the following comments about the specification draft 13:

Minor comments:

  • The long example at the top of page 13 should use "//" instead of "\\" to represent the inter-chain crosslink.
  • Example (b) of branched peptides in section 4.2.4 page 13 uses non-existing modification MOD:000134. This should probably be MOD:00134 (one fewer 0).
  • Example {Glycan:Hex}{Glycan:NeuAc}EMEVNESPEK contains an invalid glycan. NeuAc should probably be Neu5Ac?
  • Example MPGLVDSNPAPPESQEKKPLK(PCCACPETKKARDACIIEKGEEHCGHLIEAHKECMRALGFKI)[disulfide][Oxidation][Oxidation] in section 4.5 on page 21 includes the non-existing modification disulfide (in UNIMOD or PSI-MOD).
  • In section 4.9, page 23, the reference to section 4.2.5 should become 4.2.6.
  • On page 32, the example [U:iTRAQ4plex]EM[U:Oxidation]EVNES[U:Phospho]PEK[U:iTRAQ4plex]-[U:Methyl]/3 should probably have the first iTRAQ4plex as an N-terminal modification? The "-" is missing in that case.

Suggestions / questions:

  • In which position should labile modifications be specified? Section 4.3.2 does not explicitly mention this, although the examples all place the labile modification in the beginning. However, how does it relate to modifications with an unknown position (section 4.4.1) and global modifications (section 4.6)? Section 4.6 specifies that global modifications should be written before ambiguous modifications and N-terminal modifications, but the position of labile modifications is not mentioned.
  • I don't fully understand section 4.7 on amino acid sequence ambiguity. What does it mean if a single or multiple amino acids are specified to be ambiguous? What is the position where this should be specified w.r.t. other tags that are included at the start of the string?
  • If a pipe character is used to list multiple options for a modification (section 4.9), can each option have an associated label, specified with #, or should there only be a single label after all options have been listed?

@javizca
Contributor

javizca commented Aug 26, 2021

Thanks a lot, Wout, for all your minor corrections. I think all of them are correct apart from NeuAc, which, as far as I can see, is a valid glycan. I have also taken your previous comments on draft 12 into account.

@javizca
Contributor

javizca commented Aug 26, 2021

In which position should labile modifications be specified? Section 4.3.2 does not explicitly mention this, although the examples all place the labile modification in the beginning. However, how does it relate to modifications with an unknown position (section 4.4.1) and global modifications (section 4.6)? Section 4.6 specifies that global modifications should be written before ambiguous modifications and N-terminal modifications, but the position of labile modifications is not mentioned.

A: I have added in the specification document (Section 4.3.2): "Labile modification MUST be located before the first amino acid sequence and before N-terminal modifications, if applicable".
In Section 4.6.2: "Fixed modifications MUST be written prior to ambiguous and labile modifications".

@bittremieux
Author

bittremieux commented Aug 26, 2021

I think all of them are correct apart from NeuAc, which, as far as I can see, is a valid glycan.

Right, this does seem to be a glycan (shows that I don't know much about it). It failed my validation though because apparently it's listed as a synonym of Neu5Ac in the monosaccharides OBO and I was only considering the default names.
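A validator can avoid this kind of false negative by resolving synonyms to the default name before checking. The sketch below uses a tiny hand-written table (`SYNONYMS` is hypothetical; a real validator would load names and synonyms from the monosaccharides OBO rather than hard-code them).

```python
# Hypothetical excerpt: default name -> list of synonyms.
SYNONYMS = {"Neu5Ac": ["NeuAc"]}


def build_lookup(table):
    """Map every default name and every synonym to its default name."""
    lookup = {}
    for default_name, synonyms in table.items():
        lookup[default_name] = default_name
        for synonym in synonyms:
            lookup[synonym] = default_name
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def resolve(name):
    """Return the default name for a label, or None if it is unknown."""
    return LOOKUP.get(name)


print(resolve("NeuAc"))   # Neu5Ac
print(resolve("Neu5Ac"))  # Neu5Ac
```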

@bittremieux
Author

All right, so the proper order is like this?

<GLOBAL_MOD>[UNKNOWN_POS]?{LABILE_MOD}[N_TERM]-PEPTIDE-[C_TERM]
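To make the proposed order concrete, here is a rough sketch that splits the leading tags off a string in exactly that sequence. It encodes only the order asked about in this comment, which has not been confirmed as the spec's rule; `split_prefix` is a hypothetical helper, and the modification names in the example (Acetyl, Amidated, Phospho) are placeholders.

```python
import re

PREFIX_RE = re.compile(
    r"(?P<global>(?:<[^<>]*>)*)"          # <GLOBAL_MOD>, zero or more blocks
    r"(?P<unknown>(?:\[[^\]]*\])+\?)?"    # [UNKNOWN_POS]?, closed by a question mark
    r"(?P<labile>(?:\{[^{}]*\})*)"        # {LABILE_MOD}, zero or more
    r"(?:\[(?P<n_term>[^\]]*)\]-)?"       # [N_TERM]-
    r"(?P<rest>.*)"                       # PEPTIDE-[C_TERM] and anything after it
)


def split_prefix(proforma):
    """Split the leading tags from the rest of a single-line ProForma string."""
    match = PREFIX_RE.fullmatch(proforma)
    if match is None:
        raise ValueError("input must be a single line")
    return {
        "global": re.findall(r"<([^<>]*)>", match.group("global")),
        "unknown_pos": re.findall(r"\[([^\]]*)\]", match.group("unknown") or ""),
        "labile": re.findall(r"\{([^{}]*)\}", match.group("labile")),
        "n_term": match.group("n_term"),
        "rest": match.group("rest"),
    }


print(split_prefix("<13C>[Phospho]?{Glycan:Hex}[Acetyl]-PEPTIDE-[Amidated]"))
```

A string that places its tags in a different order is not rejected here. The out-of-order tags simply end up in `rest`, so a strict validator would need an additional check.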

@mobiusklein
Collaborator

Clearly I didn't submit my note about Neu5Ac last night. Neu5Ac is synonymous with NeuAc. Mincing the monosaccharide apart to determine where the acetyl group is attached is also impossible with the current dissociation methods available. You can find NeuAc with additional O-acetyl groups (though they are pretty fragile and are easily lost in sample processing), but GNOme doesn't index them.

The OBO and the generated JSON file list all the synonyms for each monosaccharide, though most monosaccharides aren't listed in the ProForma spec, and a very restricted subset are actually indexed in GNOme.

My parser isn't handling this properly either. I just wrote the common names from memory.

@edeutsch

In which position should labile modifications be specified? Section 4.3.2 does not explicitly mention this, although the examples all place the labile modification in the beginning. However, how does it relate to modifications with an unknown position (section 4.4.1) and global modifications (section 4.6)? Section 4.6 specifies that global modifications should be written before ambiguous modifications and N-terminal modifications, but the position of labile modifications is not mentioned.

A: I have added in the specification document (Section 4.3.2): "Labile modification MUST be located before the first amino acid sequence and before N-terminal modifications, if applicable".
In Section 4.6.2: "Fixed modifications MUST be written prior to ambiguous and labile modifications".

It is my recollection that a {labile} modification can appear anywhere that a [non-labile modification] can appear. The only difference is that the writer is making the statement that there is not (or there is not expected to be) any evidence of the mod in a particular location because it is completely labile. So the peptidoform SMALLS{Sulfo}NACK simply means that the writer believes that the sulfo is on the second S, but there is no trace of that in the associated evidence because the mod is (or is expected to be) completely labile.

And thus it counts when computing the precursor m/z, but it can be ignored when computing a, b, c, x, y, and z ions because it is labile.

Therefore I don't think it is confined to a specific location. {} is equivalent to [] but with a "labile" meaning. Does anyone else remember that or am I confused?
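If that reading holds, a consumer would treat curly-brace modifications differently in the two calculations. The sketch below is only an illustration of that idea: the helper names are hypothetical, and the single entry in `MOD_MASS` is an assumed monoisotopic delta for Sulfo (SO3) included for the example.

```python
import re

LABILE_RE = re.compile(r"\{([^{}]*)\}")

# Assumed monoisotopic mass delta for illustration only.
MOD_MASS = {"Sulfo": 79.956815}


def labile_mass_for_precursor(proforma):
    """Total mass of labile modifications, to be added to the precursor mass."""
    return sum(MOD_MASS[name] for name in LABILE_RE.findall(proforma))


def strip_labile(proforma):
    """Peptidoform as seen by fragment-ion calculations: labile mods removed."""
    return LABILE_RE.sub("", proforma)


peptidoform = "SMALLS{Sulfo}NACK"
print(strip_labile(peptidoform))               # SMALLSNACK
print(labile_mass_for_precursor(peptidoform))  # 79.956815
```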

bittremieux added a commit to bittremieux/spectrum_utils that referenced this issue Sep 29, 2021