
Optimize Snowflake/Snowpark Compare #354

Open · rhaffar wants to merge 3 commits into develop
Conversation

@rhaffar (Contributor) commented Nov 12, 2024

  • Make the max_diff, null_diff, and match-count calculations non-blocking so that Snowflake can process them all simultaneously.
  • Multithread the column-stat calculation so that several column pairs can forward their calculation requests to Snowflake simultaneously.
  • Patched an intermittent bug where the Snowflake optimizer, while attempting to push down filtering, would apply an incorrect filter and corrupt the pre-merge df1 and df2 dataframes.
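The multithreaded column-stat idea can be sketched as below. This is a minimal illustration, not the PR's actual code: `column_pair_stats` is a hypothetical stand-in for the aggregate queries (match count, max_diff, null_diff) that would be forwarded to Snowflake for each column pair.

```python
from concurrent.futures import ThreadPoolExecutor

def column_pair_stats(pair):
    # Hypothetical stand-in for the per-column-pair Snowflake queries
    # (match count, max_diff, null_diff); the real version would run
    # Snowpark aggregations against the intersect dataframe.
    base_col, compare_col = pair
    return {"column": base_col, "matches": 0, "max_diff": 0.0, "null_diff": 0}

def compute_all_stats(column_pairs, max_workers=4):
    # Each thread forwards its queries to Snowflake independently; results
    # are collected here rather than written to a shared intersect
    # dataframe, which keeps the per-column work thread-safe.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(column_pair_stats, column_pairs))

stats = compute_all_stats([("a", "a"), ("b", "b")])
```

Because the workers only read their inputs and return results, no lock is needed; the single caller owns all writes to shared state.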

Additional context:

  • The multithreading solution required separating the step that adds match rows to the intersect dataframe from the calculation of matches/max_diff/null_diff. To keep the function thread-safe, we want to avoid updating the intersect dataframe from multiple threads simultaneously. Any pointers on making the code there a little cleaner are appreciated.
  • The intermittent issue was (and still is) hard to pin down. Ultimately, it was resolved by forcing caching on df1 and df2 prior to the join that generates the temp columns, since that join was the one being corrupted by the Snowflake optimizer. Forcing a collection prior to the join might have been enough, but caching works fine as well.
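The caching workaround can be sketched as follows. `cache_result()` is the real Snowpark `DataFrame` method that materializes a dataframe into a temporary table; everything else here (the `FakeSnowparkDF` class and `materialize_before_join` helper) is a hypothetical stand-in so the sketch runs without a Snowflake session.

```python
class FakeSnowparkDF:
    """Stand-in for snowflake.snowpark.DataFrame so the sketch runs offline."""
    def __init__(self, cached=False):
        self.cached = cached

    def cache_result(self):
        # The real Snowpark cache_result() evaluates the dataframe and
        # returns a new DataFrame backed by a temporary table.
        return FakeSnowparkDF(cached=True)

def materialize_before_join(df1, df2):
    # Caching forces evaluation up front, so the Snowflake optimizer cannot
    # push a later filter back down into df1/df2 and corrupt the
    # temp-column join that follows.
    return df1.cache_result(), df2.cache_result()

df1, df2 = materialize_before_join(FakeSnowparkDF(), FakeSnowparkDF())
```

The design trade-off is extra temp-table storage in exchange for pinning the join inputs, which removes the optimizer's opportunity to rewrite them.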

Benchmark in the following comment.

@rhaffar (Contributor, Author) commented Nov 12, 2024

[benchmark attachment]

Attached is a brief benchmark of the new optimizations. There is very little performance difference for the 2-8 column cases at 1-10 million rows, but a very significant difference when comparing a model monitoring table (15 join columns and 95 compare columns). The model monitoring table compare's runtime is dominated by the intersect portion, where columns are compared against each other, rather than the merge portion, where the provided tables/dataframes are merged. This is expected, as all performance changes in this PR target the intersect component specifically, making comparisons on large tables with many columns far more feasible from a performance perspective.

Factors that determine whether the runtime of the compare is dominated by the intersect include:

  1. A large number of columns.
  2. Columns that are more expensive to compare.

The comparison data used for the 2-8 column, 1-10 million row benchmark consist of float and int columns that take 0.1-0.2 seconds per match calculation (max_diff, null_diff, match count). At best this would shave 2-3 seconds off the 8-column comparisons, which is insignificant next to Snowflake's performance volatility. The model monitoring table, on the other hand, contains a wide array of columns that can take up to 1.5 seconds per match calculation per column, depending primarily on the datatype and the size of the intersect dataframe.

With these new performance improvements, the Snowflake Compare runtime is actually shorter than the time it would take just to load the Snowflake tables into dataframes to set up a Pandas Compare. In other words, if you have Snowflake tables to compare, a Snowflake compare will almost always be faster than loading the tables into Pandas dataframes and running a Pandas compare (and it avoids holding two entire tables as dataframes in memory). The larger the tables, the more pronounced this advantage.

@fdosani (Member) commented Nov 13, 2024

@rhaffar can you also post the results from running the Snowflake tests? Just make sure to omit any sensitive info in paths, etc., so we have some record that those tests also work.

@gladysteh99 (Contributor) left a comment

LGTM! Thanks for the great work.
