Options to improve annotations and eliminate premature stop codons? #29

Open
sheinasim opened this issue Sep 14, 2024 · 3 comments

@sheinasim

Hello!

I'm having a slight issue with the annotations generated by EGAPx. When I translate the CDS into protein sequences, I'm seeing a lot of premature stop codons. For this species I am supplying my own RNA-seq, RNA-seq from SRA (other people's sequences), and a protein sequence file from a previous assembly of the same species.

I'm trying to annotate the genome of a fly in the family Tephritidae. I'm going to try now to add protein sets from closely related species as well. Will the gene models improve with more evidence even if it is not from the same species?

Is there an option I can use to favor annotations and frames that minimize the number of pseudogenes or premature stop codons?

Thanks!
Sheina

@murphyte

Hi Sheina -- this is partially related to the warning we have about the current version not yet being feature complete or ready for submission. We are working on wrapping up v0.3, which will add functional annotation analysis including logic to classify protein-coding vs pseudogene. That will likely convert a chunk of the CDS annotations into pseudogenes. I'm hoping we'll have that out in early October, depending on if any issues arise in our ongoing pre-release testing.

> a protein sequence file from a previous assembly of the same species.

You generally don't need to do this. The default dipteran protein file should work well in combination with RNA-seq. Including proteins from an automated annotation of the same species can have adverse effects: errors in those proteins get locked in, which is less likely to happen when aligning cross-species proteins. You'll also get more pseudogenes annotated when aligning same-species proteins than when relying only on cross-species proteins, so that might be elevating your pseudogene count. Our goal is to make EGAPx easy, with little need to customize anything, and the default sets are designed to cover almost everything.

Note we do see varying rates of models with internal frameshifts or nonsense codons that EGAP winds up classifying as protein-coding with the designation LOW QUALITY PROTEIN. For dipterans, across 98 genomes currently in RefSeq, it looks like that averages ~250 genes per genome, with a median of ~150. That rate can sometimes be elevated, particularly in genomes based on non-HiFi PacBio reads that haven't been polished, which can have a higher rate of indels that adversely affect gene models. But my suspicion is that what you're seeing is an effect of (a) not yet having the functional annotation logic, and (b) aligning same-species proteins.
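
For anyone who wants to compare an annotation against the per-genome figures above, here is a minimal Python sketch that counts genes with at least one CDS flagged LOW QUALITY PROTEIN in a GFF3 file. It assumes the flag appears in the `product` attribute of CDS rows and that CDS rows carry a `gene` attribute, as in RefSeq-style GFF3; whether the current EGAPx output follows that layout is an assumption, and the function names are made up for illustration.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> percent-decoded value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = unquote(value)
    return attrs


def count_low_quality_genes(gff3_lines, tag="LOW QUALITY PROTEIN"):
    """Count distinct genes having at least one CDS whose product carries the tag.

    Assumption: the tag is stored in the CDS `product` attribute.
    """
    flagged = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if tag in attrs.get("product", ""):
            # Without a `gene` attribute this falls back to Parent or ID,
            # which counts transcripts or CDS records rather than genes.
            flagged.add(attrs.get("gene") or attrs.get("Parent") or attrs.get("ID"))
    return len(flagged)
```

Usage would be along the lines of `count_low_quality_genes(open("annotation.gff"))`, where the file name is a placeholder. A multi-exon CDS spans several rows, which is why the function collects gene names in a set instead of counting rows.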

@sheinasim
Author

Hi Terence,

Thank you for your reply!

For now, I will use the output from the run where I did not supply a protein FASTA from a previous assembly.

Our genome was assembled from PacBio HiFi reads, so I wouldn't expect any CLR-related frameshifts or a need for polishing. Thanks for that pseudogene number; I will compare it to what I found.

I'm using AGAT to translate the CDS to protein sequences. Is there another program you would recommend?

Thanks again, and I look forward to the new release!

Best wishes,
Sheina

@murphyte

Great, PacBio HiFi has made a world of difference.

We'll provide protein FASTA output as part of the v0.3 release.

AGAT works fine for now. It's just that the pseudogenes aren't labeled yet.
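
Until pseudogenes are labeled, a quick independent check of which models contain premature stops can be sketched in plain Python. The example below is not part of EGAPx or AGAT. It assumes a FASTA of spliced CDS sequences that all start in frame 0 and use the standard nuclear genetic code, so partial CDS with a non-zero phase and mitochondrial genes would give misleading counts. All function names are made up for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS in frame 0.

    Codons with ambiguous bases become 'X'; a trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def internal_stops(cds):
    """Count stop codons other than a terminal one."""
    protein = translate(cds)
    # A final '*' is the expected stop; anything earlier is premature
    body = protein[:-1] if protein.endswith("*") else protein
    return body.count("*")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def flag_premature_stops(fasta_lines):
    """Map record name -> number of internal stops, for records with at least one."""
    return {
        name: n
        for name, seq in read_fasta(fasta_lines)
        if (n := internal_stops(seq)) > 0
    }
```

Calling `flag_premature_stops(open("cds.fna"))` (placeholder file name) returns only the affected records. Several transcripts of one gene can each be flagged, so the result should be collapsed to genes before comparing it with per-gene counts.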
