Adding SnowflakeCompare (Snowflake/Snowpark compare) #333
Conversation
Just my first round of review here. Excellent work, and much appreciated 🥳
A few changes, comments, and nitpicks. 😅 Feel free to comment and discuss.
To address the 3 questions:
- I'm ok with not allowing case-sensitive column names as you stated. We can change that if users have a need later on.
- You're right, it doesn't make much sense. I'd be curious to dig into that later, though, to understand what is happening.
- Agreed on the join columns for sure. I'll need to check into how that works for the other compares; I'm guessing it would probably be the same issue.
datacompy/sf_sql.py
Outdated
@df1.setter
def df1(self, df1: Union[str, sp.DataFrame]) -> None:
    """Check that df1 is either a Snowpark DF or the name of a valid Snowflake table."""
    if isinstance(df1, str):
Wondering if we can be consistent with something like we have for polars and push this into `_validate_dataframe`.
I think this is a bit different: this is actually used to determine how we construct the dataframe, whereas type-checking that the built dataframe is in a valid form does occur in `_validate_dataframe`.
gotcha, that is fair.
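For context, the pattern under discussion can be sketched as follows. This is an illustrative mock-up of a setter that routes a string through the session while passing a prebuilt dataframe straight through; the class name and attributes here are assumptions, not the actual datacompy implementation (a real Snowpark `Session.table(name)` call would return a `sp.DataFrame`):

```python
# Hypothetical sketch: accept either a table name (str) or an
# already-built dataframe object in the df1 setter.
class SnowflakeCompare:
    def __init__(self, session, df1):
        self.session = session
        self.df1 = df1  # assignment routes through the setter below

    @property
    def df1(self):
        return self._df1

    @df1.setter
    def df1(self, df1):
        """Build the dataframe from a table name, or store it as given."""
        if isinstance(df1, str):
            # a string is treated as the name of a Snowflake table,
            # loaded via the session (session.table is the Snowpark API)
            self._df1 = self.session.table(df1)
        else:
            self._df1 = df1
```

This is why the `isinstance` check lives in the setter rather than in `_validate_dataframe`: it decides *how* the dataframe is constructed, not whether the constructed dataframe is valid.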
datacompy/sf_sql.py
Outdated
abs_tol : float, optional
    Absolute tolerance between two values.
rel_tol : float, optional
    Relative tolerance between two values.
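For readers unfamiliar with how these two tolerances interact: datacompy's numeric comparisons follow the `numpy.isclose` convention, where the absolute and relative tolerances are combined into a single bound. A minimal sketch, assuming that semantic (function name is illustrative):

```python
def values_match(a: float, b: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    # numpy.isclose-style check: |a - b| <= abs_tol + rel_tol * |b|
    # abs_tol handles values near zero; rel_tol scales with magnitude.
    return abs(a - b) <= abs_tol + rel_tol * abs(b)
```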
We should consider including:
df1_name: str = "df1",
df2_name: str = "df2",
to be consistent with the other cleaners.
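To illustrate the suggestion, a constructor carrying the `df1_name`/`df2_name` aliases alongside the tolerances might look like this. The class name and exact parameter list below are assumptions for the sketch, not the merged implementation:

```python
# Illustrative signature mirroring datacompy's other compare classes.
class SnowflakeCompareSketch:
    def __init__(
        self,
        df1,
        df2,
        join_columns,
        abs_tol: float = 0,
        rel_tol: float = 0,
        df1_name: str = "df1",
        df2_name: str = "df2",
    ):
        self.df1 = df1
        self.df2 = df2
        self.join_columns = join_columns
        self.abs_tol = abs_tol
        self.rel_tol = rel_tol
        # human-readable aliases used when labeling the two sides in reports
        self.df1_name = df1_name
        self.df2_name = df2_name
```

The aliases let users label the two sides meaningfully (e.g. "prod" vs "staging") instead of the generic "df1"/"df2" in report output.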
Just for documentation purposes: Snowpark local testing is still a relatively new product and seems to have a few pieces of functionality that have yet to be implemented. So although all tests pass when using a Snowflake cluster, we get many failures when running locally (usually the same couple of failures affecting a handful of tests). For now I've kept the local testing configuration, since we can choose when to run it locally, and hopefully within a few Snowpark updates we'll be able to run these tests locally without issue.
Conclusion regarding testing: Due to current CI/CD limitations, as well as a desire to keep testing resources public, we've attempted to implement testing using either the Snowpark local runner or a local Snowflake resource emulator. Both show potential, but support is somewhat limited: each option requires either significant mocking or unnecessary modifications of the compare logic to work around missing features. There's also no guarantee that tests run through an emulator or the local runner would properly represent real user usage. So for now we will stick to testing Snowflake/Snowpark comparisons internally, and revisit these options at a later date.
@rhaffar We will need to exclude test_snowflake.py from the pytest calls since it will cause everything to fail. Need to add
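One way to keep `test_snowflake.py` out of the default pytest run is pytest's documented `collect_ignore` hook in `conftest.py` (the path below assumes the test file sits next to the conftest; adjust to the repo layout):

```python
# conftest.py — pytest reads this list and skips collecting the
# named files entirely, so the Snowflake suite never runs by default.
collect_ignore = ["test_snowflake.py"]
```

Alternatively, `pytest --ignore=tests/test_snowflake.py` achieves the same per invocation.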
@rhaffar nice work. LGTM.
*commit history is pretty bare because I had to reset and update the author
A few considerations that may or may not be worth addressing, but I'd like your takes on these:
In Snowflake, `'A'` and `'a'` are the same column, but `'"a"'` is distinct from these. This can be troublesome to handle flexibly for column inclusion checks, and more importantly for instances where you need to tag the dataframe name to the column (the double quotes screw with this). So for now I just convert all columns and `join_columns` to be case-insensitive. The only way this could be an issue is if users are using the exact same column name for different columns, differentiated only by case. I think this is sufficient since the vast majority of Snowflake tables are case-insensitive to begin with, and those that aren't likely don't repeat the same column with different casing, but let me know if you have other thoughts.
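The folding described above can be sketched as a small helper. Snowflake folds unquoted identifiers to upper case, while double-quoted identifiers keep their exact case; the PR sidesteps the quoting problem by normalizing every column name. The helper name here is illustrative, not the PR's actual function:

```python
def normalize_column(col: str) -> str:
    # Strip any surrounding double quotes, then fold to Snowflake's
    # default upper-case form, so 'A', 'a', and '"a"' all map to 'A'.
    # This is exactly the collision the comment above accepts as a
    # trade-off for simpler column handling.
    return col.strip('"').upper()
```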