Whether you are a seasoned open source contributor or a first-time committer, we welcome and encourage you to contribute code (via Pull Request), documentation (via our docusaurus site), ideas (via Discussions for larger ideas, or issues for specific feature requests), or bug reports (via issues) to this project. You can also contribute via topics on our Discourse.
Before you start a load of work, please note that all Pull Requests (apart from cosmetic fixes like typos) should be associated with an issue that has been approved for development by a maintainer. This is to stop you doing lots of development that may not be accepted into the package for a variety of reasons. Make sure to either raise an issue yourself or look at the existing issues before starting any development.
This document serves as a guide for contributing code changes to `dbt-snowplow-unified`. It is not intended as a guide for using `dbt-snowplow-unified`, and some pieces assume a level of familiarity with Python development (virtualenvs, `pip`, etc.) and dbt package development. Specific code snippets in this guide assume you are using macOS or Linux and are comfortable with the command line.
- CLA: If this is your first time contributing you will be asked to sign the Individual Contributor License Agreement. If you would prefer to read this in advance of submitting your Pull Request you can find it here. If you are unable to sign the CLA, the `dbt-snowplow-unified` maintainers will unfortunately be unable to merge any of your Pull Requests. We welcome you to participate in discussions, open issues, and comment on existing ones.
- Branches: All Pull Requests from community contributors should target the `main` branch (default) and the maintainers will create the appropriate branch to merge this into. Please let us know if you believe your changes are a breaking change or could be done as part of a patch release; if you are unsure, that's fine, just make that clear in your Pull Request.
- Documentation: The majority of the documentation for our dbt packages is in the core Snowplow Docs and as such a Pull Request will need to be raised there to update any docs related to your change. Things such as the deployed dbt site are taken care of automatically.
You will need `git` in order to download and modify the `dbt-snowplow-unified` source code. On macOS, the best way to download git is to just install Xcode.
If you are not a member of the `snowplow` GitHub organization, you can contribute to `dbt-snowplow-unified` by forking the relevant package repository. For a detailed overview on forking, check out the GitHub docs on forking. In short, you will need to:
- Fork this repository
- Clone your fork locally
- Check out a new branch for your proposed changes
- Push changes to your fork
- Open a Pull Request against this repo from your forked repository
If you are a member of the `snowplow` GitHub organization, you will have push access to this repo. Rather than forking to make your changes, just clone the repository, check out a new branch, and push directly to that branch.
Assuming you already have dbt installed, it will be beneficial to create a profile for any warehouse connections you have when it comes to testing your changes to the package. The easiest way to do this, and the one that involves the fewest changes to the testing setup, is to create an `integration_tests` profile and populate it with any connections you have to our supported warehouse types (redshift+postgres, databricks, snowflake, bigquery).
It is recommended you use a custom schema for integration tests.
```yml
integration_tests:
  outputs:
    databricks:
      type: databricks
      ...
    snowflake:
      type: snowflake
      ...
    bigquery:
      type: bigquery
      ...
    redshift:
      type: redshift
      ...
    postgres:
      type: postgres
      ...
  target: postgres
```
In general we try to follow these rules of thumb, but there are possible exceptions:
- Dispatch any macro where it needs to support multiple warehouses (a minimal sketch follows this list).
  - Use inheritance where possible i.e. only define a macro for `redshift` if it is different to `postgres`, the same for `databricks` and `spark`
- Where models need to be different across multiple warehouse types, ensure they are enabled based on the `target.type`
- Make use of macros (ours and dbt's) where possible to avoid duplication and to manage the differences between warehouses
  - Do not reinvent the wheel e.g. make use of `type_*` macros instead of explicit datatypes
  - In the case where a macro may be useful outside of a specific package, we may make the choice to add it to the `dbt-snowplow-utils` repository instead
- Make use of the incremental logic as much as possible to avoid full-scanning large tables
- Where new functionality is being added, or you are touching existing functionality that does not have good/any tests, add tests
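To make the dispatch and cross-database rules above concrete, here is a minimal, hypothetical sketch (the macro name, the column it acts on, and the override shown are illustrative only and are not part of the package; the dispatch namespace is normally the package's project name):

```sql
{# Hypothetical macro, for illustration only: returns SQL for the first
   element of a comma-separated string, dispatching on the active adapter. #}
{% macro first_csv_element(field) %}
  {{ return(adapter.dispatch('first_csv_element', 'snowplow_unified')(field)) }}
{% endmacro %}

{# Default implementation. Dispatch follows adapter inheritance, so redshift
   resolves to the postgres/default version and databricks resolves to spark;
   only add an adapter-specific macro when the SQL genuinely differs. #}
{% macro default__first_csv_element(field) %}
  split_part({{ field }}, ',', 1)
{% endmacro %}

{# BigQuery has no split_part, so it is the one adapter that needs its own
   implementation in this example. #}
{% macro bigquery__first_csv_element(field) %}
  split({{ field }}, ',')[safe_offset(0)]
{% endmacro %}
```

The same thinking applies to models and datatypes: a warehouse-specific model can be switched on with something like `{{ config(enabled=(target.type == 'bigquery')) }}` at the top of the file, and dbt's cross-database macros such as `{{ dbt.type_string() }}` should be preferred over hard-coded datatypes.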
Once you're able to manually test that your code change is working as expected, it's important to run existing automated tests, as well as adding some new ones. These tests will ensure that:
- Your code changes do not unexpectedly break other established functionality
- Your code changes can handle all known edge cases
- The functionality you're adding will keep working in the future
In general our packages all have similar structures, with an `integration_tests` folder that contains a `.scripts/integration_tests.sh` file. This script is run with one argument, the name of your target in the `integration_tests` profile, e.g. `./integration_tests/.scripts/integration_tests.sh -d postgres`, which will run all the tests on your postgres instance. This all means you don't need your own Snowplow data to run the tests.
Tests are one of two kinds:
- Row count/equality tests: these ensure that the seed data, once processed by the package, exactly matches an expected seed file (a simplified sketch of this idea follows this list). If you have made no change to the logic these should not fail; if you have changed the logic, however, you may need to edit the expected seed file and add records to the events input seed file to cover the use case. In some cases it may make sense to add both expected and unexpected data to the test (i.e. to ensure a fix you have deployed actually fixes the issue you have seen).
- Macro-based tests: these are more varied, sometimes checking the output SQL from a macro or otherwise examining database objects. Look at existing tests for more details and for how to edit/create these.
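As a rough illustration of the first kind, a row count/equality test boils down to comparing a derived model against its expected seed. The sketch below is hypothetical (the model and seed names are placeholders, and the package's actual tests may be implemented with `dbt_utils.equality` or its own test macros rather than hand-written SQL), but it shows the idea:

```sql
-- Hypothetical singular test, for illustration only. A singular dbt test
-- passes when the query returns zero rows, so any row present in one
-- relation but not the other causes a failure.
(
    select * from {{ ref('my_derived_model') }}
    {{ dbt.except() }}
    select * from {{ ref('my_derived_model_expected') }}
)
union all
(
    select * from {{ ref('my_derived_model_expected') }}
    {{ dbt.except() }}
    select * from {{ ref('my_derived_model') }}
)
```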
To run the integration tests:
- Ensure the `integration_tests` folder is your working directory (you may need to `cd integration_tests`)
- Run `dbt run-operation post_ci_cleanup` to ensure a clean set of schemas (this will drop the schemas we use, so ensure your profile is only for these tests)
- Run `./.scripts/integration_test.sh -d {target}` with your target name
- Ensure all tests run successfully
If any tests fail, you should examine the outputs and either correct the test or correct your changes.
If you do not have access to all warehouses, do not worry: test what you can, and the remainder will be run when you submit your Pull Request (once enabled by maintainers).
For specific details for running existing integration tests and adding new ones to this package see integration_tests/README.md.
You don't need to worry about which version your change will go into. Just create the changelog entry at the top of CHANGELOG.md, copying the style of those below, but populate the date and version numbers with `x`s, and open your Pull Request against the `main` branch.
A maintainer will review your Pull Request. They may suggest code revision for style or clarity, or request that you add unit or integration test(s). We promise these are good things and it's not personal, we all want to make sure the highest quality of work goes into the packages in a way that will be the least disruptive for our users.
Automated tests run via Github actions. If you're a first-time contributor, all tests (including code checks and unit tests) will require a maintainer to approve. You will not be able to see the output data of these tests, but we can share and explore any failures with you should there be any.
Once all tests are passing and your Pull Request has been approved, a maintainer will merge your changes into the active development branch. And that's it! You're now an Open Source Contributor!