Add CI and example data #30

Merged
merged 3 commits into from
Aug 5, 2023

Member

@victorlin victorlin commented Jul 19, 2023

Description of proposed changes

Generally inspired by existing practices in the monkeypox repo.

Related issue(s)

Testing

  • Checks pass

Post-merge

Add this repo to pathogen repo CI lists:

@victorlin victorlin self-assigned this Jul 19, 2023
@victorlin victorlin marked this pull request as ready for review July 19, 2023 21:30
@victorlin victorlin requested a review from a team July 19, 2023 21:31
Comment on lines +7 to +8
The subset of data is generated by an augur filter call which:
- sets the subsampling size to 50
Member Author

To reviewers: Does anyone know if there are filter options to better subsample the RSV data? The monkeypox filter includes the root and groups by clade + lineage. I'm not sure if those apply here.

Contributor

Unfortunately I'm not seeing 'clade', 'lineage' or similar in the metadata:

$ head -n1 data/*/metadata.tsv
==> data/a/metadata.tsv <==
accession	genbank_accession_rev	strain	date	region	country	division	location	host	date_submitted	sra_accession	abbr_authors	reverse	authors	institution	F_coverage	G_coverage	genome_coverage

==> data/b/metadata.tsv <==
accession	genbank_accession_rev	strain	date	region	country	division	location	host	date_submitted	sra_accession	abbr_authors	reverse	authors	institution	F_coverage	G_coverage	genome_coverage

which would be required for the monkeypox-like filter. Clade assignment isn't currently part of the ingest rules and doesn't happen until here in the build. One way forward might be to add clade assignment in ingest, but I would imagine that as a separate discussion and PR.

To get at your "better subsampling" question, maybe group by year and country? Similar to the rsv config file?

filter:
  group_by: "year country"
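
To make the idea concrete, here is a rough Python sketch of what grouping by year and country before sampling buys us. It is a simplified illustration only, not augur's actual subsampling algorithm: the function names, the round-robin strategy, and the example rows are all hypothetical, and it assumes `date` values start with a four-digit year.

```python
import random
from collections import defaultdict


def group_key(row, group_by):
    """Build a grouping key; 'year' is derived from the 'date' column."""
    key = []
    for column in group_by:
        if column == "year":
            key.append(row["date"][:4])  # assumption: dates look like YYYY-MM-DD
        else:
            key.append(row[column])
    return tuple(key)


def subsample(rows, group_by, max_sequences, seed=0):
    """Pick up to max_sequences rows, spreading picks evenly across groups."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row, group_by)].append(row)
    for members in groups.values():
        rng.shuffle(members)

    picked = []
    # Round-robin: take one row per group per pass until the cap is reached
    # or every group is empty.
    while len(picked) < max_sequences and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(picked) < max_sequences:
                picked.append(groups[key].pop())
    return picked


# Hypothetical metadata rows (made up for illustration, not real RSV records).
rows = [
    {"strain": "s1", "date": "2019-01-05", "country": "USA"},
    {"strain": "s2", "date": "2019-03-11", "country": "USA"},
    {"strain": "s3", "date": "2019-07-20", "country": "USA"},
    {"strain": "s4", "date": "2019-02-02", "country": "Kenya"},
    {"strain": "s5", "date": "2020-05-09", "country": "USA"},
    {"strain": "s6", "date": "2020-06-30", "country": "USA"},
]

# With a cap of 3 and three (year, country) groups, each group contributes one row.
sample = subsample(rows, ["year", "country"], max_sequences=3)
```

Without grouping, a plain random draw of 3 rows could easily return only 2019 USA records; grouping first keeps the smaller groups represented in the example data.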

Others, feel free to chime in.

Member Author

Thanks, that's a good suggestion! Updated to use that config directly in 538442d.

Contributor

Nice! LGTM :D

@victorlin
Member Author

Merging since this is already useful in #29.

@victorlin victorlin merged commit 3d086ca into master Aug 5, 2023
6 checks passed
@victorlin victorlin deleted the victorlin/add-example-data branch August 5, 2023 00:28
Development

Successfully merging this pull request may close these issues.

ENH: Add pathogen-ci to ci
2 participants