deal with phylo data more elegantly #18

JenkeScheen · 2024-05-01T08:49:53Z

In cases like the SARS-CoV-2 N protein we have to use phylogenetics data. Two issues:

1. The data is formatted as JSON. We already have a standalone script for the conversion, but this should be absorbed into the API (see https://github.com/asapdiscovery/choppa/blob/main/choppa/IO/convert.py).
2. Specifically for the N protein (but this will happen for other targets in future too), the PDB is a homodimer. The fitness data (in JSON) is for the monomer, so we need to duplicate the fitness data to make sure the alignment works correctly.

For the second point, I had a go (in IO) by making a copy of the fitness DF, adjusting the residue index column in the copy to start counting from where the first fitness DF ended, and then concatenating both DFs together. This messes up the alignment spectacularly.
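For reference, a minimal sketch of that duplication step, written against plain row dicts rather than the actual fitness DF so it stays self-contained. The function name, the `residue_index` key and the added `chain` label are all hypothetical and not part of choppa; the `renumber=False` variant (keep the monomer numbering, tell the copies apart by chain) is just one alternative worth trying, not a known fix.

```python
def duplicate_fitness(records, n_copies=2, index_key="residue_index", renumber=True):
    """Repeat monomer fitness rows once per chain of a homo-oligomer.

    records  : list of dicts, one per fitness row (stand-in for the fitness DF)
    renumber : if True, each copy continues counting where the previous copy
               ended (the approach described above); if False, the original
               residue numbers are kept and only the chain label differs.
    """
    if not records:
        return []

    indices = [row[index_key] for row in records]
    # Number of residue positions covered by one monomer copy.
    span = max(indices) - min(indices) + 1

    duplicated = []
    for copy_number in range(n_copies):
        chain = chr(ord("A") + copy_number)  # hypothetical chain labels: A, B, ...
        offset = copy_number * span if renumber else 0
        for row in records:
            new_row = dict(row)  # shallow copy so the input rows are not modified
            new_row[index_key] = row[index_key] + offset
            new_row["chain"] = chain
            duplicated.append(new_row)
    return duplicated


monomer = [
    {"residue_index": 1, "fitness": 0.5},
    {"residue_index": 2, "fitness": -1.0},
]
dimer = duplicate_fitness(monomer)
```

With `renumber=True` the second copy gets residue indices 3 and 4 here; whether the aligner copes better with continuous numbering or with per-chain numbering is exactly the open question.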

Perhaps a more elegant approach would be to align the whole phylogenetics JSON to the PDB, then pick the largest overlapping island? Would still have to deal with mini gaps (may just need to define some lag factor). This wouldn't solve the homodimer issue though - perhaps this could just be exposed in the CLI and then a duplication as above is tried?
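A rough sketch of the "largest overlapping island" idea, assuming the alignment has already been reduced to one boolean per PDB position (True where a fitness residue was matched). That input format, the function name and the `max_gap` parameter (my reading of the "lag factor") are assumptions for illustration only.

```python
def largest_island(matched, max_gap=0):
    """Find the longest stretch of matched positions.

    matched : sequence of bools, one per aligned position
    max_gap : runs of up to this many unmatched positions are bridged
              instead of splitting the island

    Returns (start, end) as inclusive indices, or None if nothing matched.
    """
    best = None
    start = None
    last_match = None
    for i, is_match in enumerate(matched):
        if not is_match:
            continue
        # Start a new island if this is the first match, or if the gap since
        # the previous match is larger than we are willing to bridge.
        if start is None or i - last_match - 1 > max_gap:
            start = i
        last_match = i
        if best is None or (last_match - start) > (best[1] - best[0]):
            best = (start, last_match)
    return best


alignment = [True, True, False, True, True, True, False, False, False, True]
strict = largest_island(alignment)               # no gaps allowed
lenient = largest_island(alignment, max_gap=1)   # bridge single-residue gaps
```

On this toy input the strict call picks positions 3 to 5, while allowing one-residue gaps extends the island to positions 0 to 5. As noted, this does nothing for the homodimer case by itself.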

JenkeScheen self-assigned this May 1, 2024
