
Adding SnowflakeCompare (Snowflake/Snowpark compare) #333

Merged · 21 commits merged into capitalone:develop on Oct 29, 2024

Conversation


@rhaffar rhaffar commented Sep 12, 2024

  • Adding SFTableCompare class for Snowflake table and Snowpark dataframe comparisons
  • Added tests and documentation
  • Adding snowpark session creation in test config

*commit history is pretty bare because I had to reset and update the author

A few considerations that may or may not be worth addressing, but I'd like your takes on these:

  1. Case-sensitive columns are only partially supported. In Snowpark, case-sensitive column names are enclosed in double quotes, so 'A' and 'a' refer to the same column while '"a"' is distinct from both. Handling this flexibly is troublesome for column-inclusion checks, and more importantly for the places where we tag the dataframe name onto the column (the double quotes break this). So for now I convert all columns and join_columns to be case-insensitive. The only scenario where this could be an issue is a table that uses the exact same column name for distinct columns, differentiated only by case. I think this is sufficient, since the vast majority of Snowflake tables are case-insensitive to begin with, and those that aren't are unlikely to repeat the same column name with different casing, but let me know if you have other thoughts.
  2. Running a compare of a table against itself creates some weird issues: at a later point in the comparison, Datacompy starts looking for temporary columns belonging to both tables. There is obviously no use case for running datacompy on a single table, but I thought I'd mention it.
  3. Compares with a float in the join column lead to incorrect results. This isn't implementation-specific; it's the natural outcome of trying to join on floating-point columns. I don't know if you've seen similar things in Spark. I don't consider this an issue, since it's generally bad practice to compare floats with strict equality. At best we could add some sort of relative-error check, but that would be a bit of a headache to do in Snowpark and would just enable bad practice.
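A minimal, library-agnostic sketch of the case folding described in point 1 (the helper name is hypothetical, not datacompy's actual code):

```python
def fold_column_names(columns):
    """Fold column names to upper case so 'A' and 'a' refer to the
    same column, mirroring Snowflake's case-insensitive default.
    The caveat discussed above: quoted, case-sensitive names like
    '"a"' are folded like any other name, losing their distinctness.
    """
    return [c.upper() for c in columns]

# 'a' and 'A' collide after folding
print(fold_column_names(["a", "A", "b_col"]))  # ['A', 'A', 'B_COL']
```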

@fdosani (Member) left a comment

Just my first round of review here. Excellent work, and much appreciated 🥳
A few changes, comments, and nitpicks. 😅 Feel free to comment and discuss.

To address the 3 questions:

  1. I'm OK with not allowing case-sensitive column names, as you stated. We can change that if users have a need later on.
  2. You're right, it doesn't make much sense. I'd be curious to dig into it later, though, to understand what is happening.
  3. Agreed on the join columns for sure. I'll need to check how that works for the other compares; I'm guessing it would be the same issue.
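Point 3 is easy to demonstrate with plain Python floats, no Snowpark needed: a strict-equality join silently fails to match keys that are numerically "the same":

```python
# 0.1 + 0.2 and 0.3 are the same number mathematically, but not
# under strict floating-point equality, so a join on these keys
# would fail to match the rows.
a = 0.1 + 0.2
b = 0.3
print(a == b)             # False: strict equality fails
print(abs(a - b) < 1e-9)  # True: a tolerance check would match
```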

datacompy/sf_sql.py (two review threads, outdated and resolved)
@df1.setter
def df1(self, df1: Union[str, sp.DataFrame]) -> None:
    """Check that df1 is either a Snowpark DF or the name of a valid Snowflake table."""
    if isinstance(df1, str):
Member:
Wondering if we can be consistent with something like we have for polars and push this into _validate_dataframe

Contributor Author:
I think this is a bit different: this code determines how we construct the dataframe, whereas type-checking that the built dataframe is in a valid form does happen in _validate_dataframe.

Member:
Gotcha, that is fair.

abs_tol : float, optional
    Absolute tolerance between two values.
rel_tol : float, optional
    Relative tolerance between two values.
Member:

We should consider including:

 df1_name: str = "df1",
 df2_name: str = "df2",

to be consistent with the other cleaners.

Review threads (outdated, resolved): datacompy/sf_sql.py, tests/conftest.py, tests/test_sf_sql.py

rhaffar commented Sep 18, 2024

Just for documentation purposes: Snowpark local testing is still a relatively new product and seems to be missing a few pieces of functionality. Although all tests pass when using a Snowflake cluster, we get many failures when running locally (usually the same couple of failures affecting a handful of tests). For now I've kept the local testing configuration, since we can choose when to run locally, and hopefully within a few Snowpark updates we'll be able to run these tests locally without issue.

@rhaffar rhaffar changed the title Adding SFTableCompare (Snowflake/Snowpark compare) Adding SnowflakeCompare (Snowflake/Snowpark compare) Sep 19, 2024

rhaffar commented Oct 29, 2024

Conclusion regarding testing:

Due to current CI/CD limitations, as well as a desire to keep testing resources public, we attempted to implement testing using either the Snowpark local runner or a local Snowflake resource emulator. Both show potential, but support is somewhat limited: each option requires either significant mocking or unnecessary modifications to the compare logic to work around missing features. There's also no guarantee that tests run through an emulator or the local runner would properly represent real user usage. So for now we will stick to testing Snowflake/Snowpark comparisons internally and revisit these options at a later date.


fdosani commented Oct 29, 2024

@rhaffar We will need to exclude test_snowflake.py from the pytest calls, since it will cause everything to fail. We need to add --ignore=path/to/file.py when calling pytest in GitHub Actions.
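A sketch of how that might look as a GitHub Actions step (the test path and workflow context here are assumptions, not the repo's actual configuration):

```yaml
# Hypothetical workflow step: run the suite but skip the Snowflake
# tests, which require a live cluster.
- name: Run tests
  run: pytest tests/ --ignore=tests/test_snowflake.py
```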


fdosani commented Oct 29, 2024

@rhaffar nice work. LGTM.

@fdosani fdosani merged commit 1ea649a into capitalone:develop Oct 29, 2024
48 checks passed