Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
Test SQL code and preview changes by comparing development/staging environment data to production:
- Make a change to some SQL code
- Run the SQL code to create a new dataset
- Compare the dataset with its production version or another iteration
dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration
👀 Watch 4-min demo video
Get started with data-diff & dbt
Also available in a VS Code Extension
Reach out on the dbt Slack in #tools-datafold for advice and support
When comparing the data, data-diff
utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences
More information about the algorithm and performance considerations can be found here
⚡ Looking to use data-diff
in dbt development? Head over to our data-diff
+ dbt
documentation to get started!
🔀 To compare data between databases, install data-diff
with specific database adapters, e.g.:
pip install data-diff 'data-diff[postgresql,snowflake]' -U
Run data-diff
with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using hashdiff algorithm:
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k <primary key column> \
-c <columns to compare> \
-w <filter condition>
Check out documentation for the full command reference.
Database | Status | Connection string |
---|---|---|
PostgreSQL >=10 | 🟢 | postgresql://<user>:<password>@<host>:5432/<database> |
MySQL | 🟢 | mysql://<user>:<password>@<hostname>:5432/<database> |
Snowflake | 🟢 | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
BigQuery | 🟢 | bigquery://<project>/<dataset> |
Redshift | 🟢 | redshift://<username>:<password>@<hostname>:5439/<database> |
Oracle | 🟡 | oracle://<username>:<password>@<hostname>/servive_or_sid |
Presto | 🟡 | presto://<username>:<password>@<hostname>:8080/<database> |
Databricks | 🟡 | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
Trino | 🟡 | trino://<username>:<password>@<hostname>:8080/<database> |
Clickhouse | 🟡 | clickhouse://<username>:<password>@<hostname>:9000/<database> |
Vertica | 🟡 | vertica://<username>:<password>@<hostname>:5433/<database> |
DuckDB | 🟡 | |
ElasticSearch | 📝 | |
Planetscale | 📝 | |
Pinot | 📝 | |
Druid | 📝 | |
Kafka | 📝 | |
SQLite | 📝 |
- 🟢: Implemented and thoroughly tested.
- 🟡: Implemented, but not thoroughly tested yet.
- ⏳: Implementation in progress.
- 📝: Implementation planned. Contributions welcome.
Your database not listed here?
- Contribute a new database adapter – we accept pull requests!
- Get in touch about enterprise support and adding new adapters and features
We thank everyone who contributed so far!
This project is licensed under the terms of the MIT License.