deal with phylo data more elegantly #18

Open
JenkeScheen opened this issue May 1, 2024 · 0 comments
@JenkeScheen
Contributor

In cases like the SARS-CoV-2 N protein we have to use phylogenetics data. Two issues:

  • this data is formatted as JSON. We already have a standalone script for the conversion, but this should be absorbed into the API (see https://github.com/asapdiscovery/choppa/blob/main/choppa/IO/convert.py)
  • specifically for the N protein (but this will happen for other targets in future too) the PDB structure is a homodimer. The fitness data (in JSON) is for the monomer, so we need to duplicate the fitness data to make sure the alignment works correctly.

For the second point, I had a go (in IO) at making a copy of the fitness DF, adjusting the residue index column in the copy to start counting from where the first fitness DF ended, and then concatenating both DFs. This messes up the alignment spectacularly.
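For reference, the duplication step described above can be sketched roughly as follows. This is a minimal illustration only, not the code in choppa: plain dicts stand in for DataFrame rows, and the names `duplicate_for_homodimer`, `residue_index`, `fitness` and `chain_copy` are all hypothetical. The extra `chain_copy` tag is an assumption on my part, added so each copy could later be aligned against its own chain; it does not by itself fix the alignment problem.

```python
def duplicate_for_homodimer(records, n_chains=2, index_key="residue_index"):
    """Copy monomer fitness records once per chain.

    Each copy's residue index continues counting from where the
    previous copy ended. All names here are hypothetical.
    """
    if not records:
        return []
    # Highest residue index in the monomer data, i.e. where the first copy ends.
    span = max(r[index_key] for r in records)
    duplicated = []
    for chain in range(n_chains):
        offset = chain * span
        for record in records:
            new_record = dict(record)  # shallow copy; the input is left untouched
            new_record[index_key] = record[index_key] + offset
            new_record["chain_copy"] = chain  # remember which copy this row belongs to
            duplicated.append(new_record)
    return duplicated


monomer = [
    {"residue_index": 1, "fitness": 0.1},
    {"residue_index": 2, "fitness": -0.5},
]
dimer = duplicate_for_homodimer(monomer)
```

Keeping a per-copy tag rather than relying only on the shifted index might make it easier to align each copy to one PDB chain separately, but that is untested here.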

Perhaps a more elegant approach would be to align the whole phylogenetics JSON to the PDB, then pick the largest overlapping island? We would still have to deal with small gaps (may just need to define some lag factor). This wouldn't solve the homodimer issue though. Perhaps this could just be exposed as a CLI option, and then a duplication as above is tried?
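A rough sketch of the "largest overlapping island" idea, under some loud assumptions: `difflib.SequenceMatcher` from the standard library stands in for a proper sequence aligner, the "lag factor" is modelled as a `max_gap` parameter (the largest unmatched stretch allowed inside an island), and `Island` and `largest_island` are hypothetical names. The sequences in the usage example are made up.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class Island(NamedTuple):
    fit_start: int  # start in the fitness sequence (inclusive)
    fit_end: int    # end in the fitness sequence (exclusive)
    pdb_start: int  # start in the PDB sequence (inclusive)
    pdb_end: int    # end in the PDB sequence (exclusive)
    matched: int    # number of identical residues inside the island


def largest_island(fitness_seq: str, pdb_seq: str, max_gap: int = 2) -> Optional[Island]:
    """Merge matching blocks separated by at most max_gap residues in
    both sequences, then return the island with the most matches."""
    matcher = SequenceMatcher(None, fitness_seq, pdb_seq, autojunk=False)
    # The final block returned by get_matching_blocks() is a zero-size sentinel.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None

    islands = []
    first = blocks[0]
    current = Island(first.a, first.a + first.size, first.b, first.b + first.size, first.size)
    for blk in blocks[1:]:
        gap_fit = blk.a - current.fit_end
        gap_pdb = blk.b - current.pdb_end
        if gap_fit <= max_gap and gap_pdb <= max_gap:
            # Small gap: extend the current island over this block.
            current = Island(current.fit_start, blk.a + blk.size,
                             current.pdb_start, blk.b + blk.size,
                             current.matched + blk.size)
        else:
            islands.append(current)
            current = Island(blk.a, blk.a + blk.size, blk.b, blk.b + blk.size, blk.size)
    islands.append(current)
    return max(islands, key=lambda island: island.matched)


# Made-up sequences: the PDB fragment differs from the fitness sequence at one position.
fitness_seq = "MKTAYIAKQRQISFVKSHFSRQ"
pdb_seq = "AYIAKQRGISFVKS"
merged = largest_island(fitness_seq, pdb_seq, max_gap=1)
strict = largest_island(fitness_seq, pdb_seq, max_gap=0)
```

With `max_gap=1` the single mismatch is bridged and one island covers the whole PDB fragment; with `max_gap=0` the two matching stretches stay separate and only the longer one is returned. As noted, this does nothing for the homodimer case: a second chain would still need its own pass or a duplicated fitness sequence.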

@JenkeScheen JenkeScheen self-assigned this May 1, 2024