Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Liquid Clustering config for table materialization #398

Conversation

ammarchalifah
Copy link
Contributor

@ammarchalifah ammarchalifah commented Jul 28, 2023

Resolves #399

Description

With this change, dbt user could supply liquid_clustered_by model config with column (or list of columns) to be used as clustering keys using Liquid Clustering feature of Delta. The change introduces a small change in DDL query, while it does nothing for python-based models (as I couldn't find the python API for Liquid Clustering).

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

Ammar Chalifah and others added 5 commits July 28, 2023 18:16
Signed-off-by: Ammar Chalifah <[email protected]>
Signed-off-by: Ammar Chalifah <[email protected]>
Signed-off-by: Ammar Chalifah <[email protected]>
Signed-off-by: Ammar Chalifah <[email protected]>
@ammarchalifah
Copy link
Contributor Author

Tested on private Databricks cluster, works as intended. Resulting table is clustered correctly.

Screenshot 2023-07-28 at 19 00 55 Screenshot 2023-07-28 at 19 01 34

@ammarchalifah ammarchalifah changed the title Feature liquid clustering Liquid Clustering config for table materialization Jul 28, 2023
@ammarchalifah
Copy link
Contributor Author

Hi, could you help review this PR? @andrefurlan-db @susodapop @ueshin

@susodapop susodapop self-assigned this Aug 9, 2023
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good with one question 👍 Thanks for this contribution.

@@ -154,6 +176,7 @@ def test_macros_create_table_as_all_delta(self):
"create or replace table my_table "
"using delta "
"partitioned by (partition_1,partition_2) "
"cluster by (cluster_1,cluster_2)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs (emphasis mine):

Clustering is not compatible with partitioning or ZORDER, and requires that the Azure Databricks client manages all layout and optimization operations for data in your table.

Therefore the output of this self._render_create_as(), while expected, would technically raise an exception. Should we enforce any kind of block or just allow the compute cluster to raise an exception if the dbt user configures both partition_by and liquid_clustered_by?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I'm for leaving the compute cluster to raise an exception, but would be open for suggestion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pinging a couple other engineers internally for an opinion on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My stance has been to let Databricks surface the issues, rather than intercepting in the adapter. I'm not sure if we're doing this consistently everywhere. My reasoning is that, for any rule we hard code in, the Databricks implementation could change, and it would be better for that functionality to be available to customers without us having to spin a new release.

@benc-db
Copy link
Collaborator

benc-db commented Aug 9, 2023

@ammarchalifah can you file an issue for us to add this capability for python models as well? The python model support is currently in flux, so no need for you to make the change, but I would appreciate if you capture the expected behavior (matching your sql tests) in an issue.

@ammarchalifah
Copy link
Contributor Author

@ammarchalifah can you file an issue for us to add this capability for python models as well? The python model support is currently in flux, so no need for you to make the change, but I would appreciate if you capture the expected behavior (matching your sql tests) in an issue.

Yes, opened a new issue here.

@susodapop susodapop changed the base branch from main to staging-liquid-clustering August 9, 2023 18:21
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again! I've updated this to merge to a staging branch on the main repository. This enables the staging branch to run our e2e tests. Once those pass, we'll merge to main

@ammarchalifah-bolt
Copy link

Thanks again! I've updated this to merge to a staging branch on the main repository. This enables the staging branch to run our e2e tests. Once those pass, we'll merge to main

@susodapop Thank you! Really happy to contribute and can't wait to try out Liquid Clustering in our system!

@ammarchalifah
Copy link
Contributor Author

All tests passed. Should we merge the PR? @susodapop

@ammarchalifah
Copy link
Contributor Author

Any update with this PR? @susodapop

@susodapop susodapop merged commit 154e5c6 into databricks:staging-liquid-clustering Aug 14, 2023
18 checks passed
susodapop pushed a commit that referenced this pull request Aug 15, 2023
Signed-off-by: Ammar Chalifah <[email protected]>
Co-authored-by: Ammar Chalifah <[email protected]>
susodapop pushed a commit that referenced this pull request Aug 29, 2023
Signed-off-by: Ammar Chalifah <[email protected]>
Co-authored-by: Ammar Chalifah <[email protected]>
(cherry picked from commit b632484)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Liquid Clustering configuration in adapter
5 participants