FugueSQL implementation #259

Merged
merged 22 commits into from
May 23, 2024

goodwanghan
Contributor

@goodwanghan goodwanghan commented Jan 13, 2024

This change includes a completely redesigned Fugue solution based on FugueSQL. For some distributed backends, such as Ray, we will use a map + FugueSQL solution.
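
To illustrate the two ideas in this description, here is a minimal sketch, not the code in this PR. It uses the standard library's `sqlite3` as a stand-in SQL engine (the PR itself uses FugueSQL), and the `(id, value)` table layout plus the names `compare_counts`, `partition`, and `compare_partitioned` are hypothetical, chosen only for the example. The first function compares two keyed tables with plain SQL; the last one shows the map idea: hash-partition both sides by key, run the same SQL on each partition, and add up the counts.

```python
import sqlite3
from collections import defaultdict


def compare_counts(rows_a, rows_b):
    """Count key-level differences between two lists of (id, value) rows via SQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, value TEXT)")
    con.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, value TEXT)")
    con.executemany("INSERT INTO a VALUES (?, ?)", rows_a)
    con.executemany("INSERT INTO b VALUES (?, ?)", rows_b)

    # Keys present on one side only (anti-joins).
    only_a = con.execute(
        "SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL"
    ).fetchone()[0]
    only_b = con.execute(
        "SELECT COUNT(*) FROM b LEFT JOIN a ON a.id = b.id WHERE a.id IS NULL"
    ).fetchone()[0]

    # Shared keys, and how many of them carry different values.
    # IS NOT is SQLite's null-safe inequality, so NULL vs NULL counts as equal.
    shared, unequal = con.execute(
        "SELECT COUNT(*), COALESCE(SUM(a.value IS NOT b.value), 0) "
        "FROM a JOIN b ON a.id = b.id"
    ).fetchone()
    con.close()
    return {"only_a": only_a, "only_b": only_b, "shared": shared, "unequal": unequal}


def partition(rows, n):
    """Group rows into n buckets by key so equal keys share a bucket."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[0] % n].append(row)
    return parts


def compare_partitioned(rows_a, rows_b, n=2):
    """Map step: compare each partition independently, then sum the counts."""
    parts_a, parts_b = partition(rows_a, n), partition(rows_b, n)
    total = {"only_a": 0, "only_b": 0, "shared": 0, "unequal": 0}
    for p in range(n):
        for key, count in compare_counts(parts_a[p], parts_b[p]).items():
            total[key] += count
    return total
```

Because rows with the same `id` always land in the same partition, the per-partition counts add up to the same totals as one global comparison, which is what makes the map step safe to distribute.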

This change also includes the standardized perf tests.
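
As a rough sketch of what a standardized timing run involves (the actual perf tests in this PR are not shown here), the snippet below times a function once per round for 10 rounds and reports the same kind of columns that appear in the tables in this description. It uses only the standard library; `bench` and `workload` are hypothetical names, and `workload` is a placeholder for a real comparison call.

```python
import statistics
import time


def bench(func, *args, rounds=10):
    """Call func(*args) once per round and summarize the timings in milliseconds."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        func(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    mean = statistics.mean(samples)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean,
        "stddev": statistics.stdev(samples) if rounds > 1 else 0.0,
        "median": statistics.median(samples),
        "ops": 1000.0 / mean,  # operations per second = 1 / mean time in seconds
        "rounds": rounds,
    }


def workload(n):
    """Placeholder for the code under test."""
    return sum(i * i for i in range(n))


stats = bench(workload, 10_000)
```

The `ops` value is simply the reciprocal of the mean time in seconds, which matches the tables here: a mean of 257.9363 ms corresponds to about 3.8769 operations per second.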

The benchmark was run on an 8-CPU, 230 GB VM:

The first batch contains 1k- and 1m-row dataframes. In this case the FugueSQL solution has a fixed overhead of around 200ms, so in the 1k case the native Pandas and Polars solutions are faster. In the 1m case, however, every Fugue variant has a lower mean time than both native solutions.

-------------------------------------------------------------------------------- benchmark 'Fugue Duckdb': 2 tests --------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                 IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                    230.8344 (1.0)        360.4439 (1.0)        257.9363 (1.0)      46.2692 (1.0)        237.1728 (1.0)        8.0511 (1.0)           2;2  3.8769 (1.0)          10           1
1000000               1,270.9667 (5.51)     1,435.2726 (3.98)     1,345.2778 (5.22)     60.4833 (1.31)     1,330.7293 (5.61)     100.2451 (12.45)         4;0  0.7433 (0.19)         10           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'Fugue From Path': 2 tests ------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                    231.7124 (1.0)        357.9333 (1.0)        257.1750 (1.0)      43.4362 (1.08)       238.5309 (1.0)      10.4215 (1.0)           2;2  3.8884 (1.0)          10           1
1000000               1,288.5981 (5.56)     1,429.7279 (3.99)     1,323.4801 (5.15)     40.4048 (1.0)      1,314.4691 (5.51)     21.1441 (2.03)          1;1  0.7556 (0.19)         10           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------- benchmark 'Fugue Pandas': 2 tests --------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                    244.0393 (1.0)        354.2560 (1.0)        259.8368 (1.0)      33.3450 (1.0)        248.6499 (1.0)       6.0624 (1.0)           1;1  3.8486 (1.0)          10           1
1000000               1,011.3400 (4.14)     1,226.8940 (3.46)     1,076.6292 (4.14)     59.1444 (1.77)     1,063.0359 (4.28)     45.9311 (7.58)          2;1  0.9288 (0.24)         10           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------- benchmark 'Fugue Polars': 2 tests ----------------------------------------------------------------------------
Name (time in ms)          Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                  245.5637 (1.0)      364.3286 (1.0)      264.3582 (1.0)      35.7350 (1.0)      252.3867 (1.0)      12.0981 (1.0)           1;1  3.7827 (1.0)          10           1
1000000               720.2136 (2.93)     962.4623 (2.64)     787.8628 (2.98)     68.9155 (1.93)     770.7164 (3.05)     57.7406 (4.77)          1;1  1.2693 (0.34)         10           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'Pandas': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                     49.1814 (1.0)         59.1568 (1.0)         53.2441 (1.0)       3.2670 (1.0)         52.0470 (1.0)       3.3041 (1.0)           3;0  18.7814 (1.0)          10           1
1000000               1,980.7163 (40.27)    2,277.0961 (38.49)    2,133.1111 (40.06)    88.0153 (26.94)    2,135.2367 (41.03)    69.4043 (21.01)         3;2   0.4688 (0.02)         10           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'Polars': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)            Min                   Max                  Mean             StdDev                Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000                     29.9618 (1.0)         34.7551 (1.0)         31.9761 (1.0)       1.6170 (1.0)         31.9876 (1.0)       2.4362 (1.0)           5;0  31.2734 (1.0)          10           1
1000000               1,450.6187 (48.42)    1,573.4601 (45.27)    1,509.4765 (47.21)    37.5102 (23.20)    1,498.0326 (46.83)    25.0244 (10.27)         3;3   0.6625 (0.02)         10           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
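
To make the 1k versus 1m comparison concrete, the mean times from the tables above can be compared directly. This is only arithmetic on the reported numbers; `MEANS_MS`, `fastest`, and `speedup_over_native` are names made up for the sketch.

```python
# Mean times in milliseconds, copied from the 1k and 1m benchmark tables.
MEANS_MS = {
    "Fugue Duckdb": {1_000: 257.9363, 1_000_000: 1345.2778},
    "Fugue From Path": {1_000: 257.1750, 1_000_000: 1323.4801},
    "Fugue Pandas": {1_000: 259.8368, 1_000_000: 1076.6292},
    "Fugue Polars": {1_000: 264.3582, 1_000_000: 787.8628},
    "Pandas": {1_000: 53.2441, 1_000_000: 2133.1111},
    "Polars": {1_000: 31.9761, 1_000_000: 1509.4765},
}

# Which native benchmark each Fugue variant is compared against.
NATIVE_FOR = {"Fugue Pandas": "Pandas", "Fugue Polars": "Polars"}


def fastest(size):
    """Name of the benchmark with the lowest mean time at this row count."""
    return min(MEANS_MS, key=lambda name: MEANS_MS[name][size])


def speedup_over_native(fugue_name, size):
    """Native mean divided by Fugue mean; values above 1 favor Fugue."""
    native = NATIVE_FOR[fugue_name]
    return MEANS_MS[native][size] / MEANS_MS[fugue_name][size]
```

At 1k rows the ratios are well below 1 (the fixed overhead dominates), while at 1m rows both Fugue variants come out a little under twice as fast as their native counterparts.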

The following are the tests on 10m-row dataframes. Now the Fugue tests are a few times faster than the native ones.

----------------------------------- benchmark 'Fugue Duckdb': 1 tests -----------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
10000000             3.2077  3.4287  3.2726  0.0740  3.2498  0.0882       2;0  0.3056      10           1
---------------------------------------------------------------------------------------------------------

---------------------------------- benchmark 'Fugue From Path': 1 tests ---------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
10000000             3.0992  3.3413  3.1877  0.0715  3.1871  0.1059       3;0  0.3137      10           1
---------------------------------------------------------------------------------------------------------

----------------------------------- benchmark 'Fugue Pandas': 1 tests -----------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
10000000             7.3088  7.8993  7.5519  0.1991  7.5137  0.3183       3;0  0.1324      10           1
---------------------------------------------------------------------------------------------------------

----------------------------------- benchmark 'Fugue Polars': 1 tests -----------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
10000000             4.3440  5.2962  4.5521  0.2909  4.4308  0.1875       1;1  0.2197      10           1
---------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'Pandas': 1 tests ----------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
10000000             29.4323  30.0649  29.7274  0.2342  29.6526  0.4064       4;0  0.0336      10           1
-------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'Polars': 1 tests ----------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
10000000             16.8016  17.7913  17.3864  0.3588  17.5418  0.4802       3;0  0.0575      10           1
-------------------------------------------------------------------------------------------------------------

The following are the tests on 100m-row dataframes.

------------------------------------- benchmark 'Fugue Duckdb': 1 tests -------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
100000000            24.3345  25.3882  24.6994  0.3147  24.6405  0.3174       3;1  0.0405      10           1
-------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark 'Fugue From Path': 1 tests -----------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
100000000            24.4954  25.4659  24.8331  0.4173  24.5519  0.7804       3;0  0.0403      10           1
-------------------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'Fugue Polars': 1 tests -------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
100000000            47.3120  54.0295  49.5615  2.0719  48.7384  3.1286       2;0  0.0202      10           1
-------------------------------------------------------------------------------------------------------------

------------------------------------------ benchmark 'Polars': 1 tests -------------------------------------------
Name (time in s)          Min       Max      Mean  StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
100000000            195.6008  213.6346  204.2704  6.5701  204.6616  10.4155       4;0  0.0049      10           1
------------------------------------------------------------------------------------------------------------------
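
One way to read the 10m and 100m results together is to look at how each mean time grows when the row count grows tenfold. The sketch below is just arithmetic on the reported means; `MEANS_S`, `growth`, and `speedup_vs_polars` are hypothetical names, and Pandas and Fugue Pandas are left out because no 100m results were posted for them.

```python
# Mean times in seconds, copied from the 10m and 100m benchmark tables.
MEANS_S = {
    "Fugue Duckdb": {10_000_000: 3.2726, 100_000_000: 24.6994},
    "Fugue From Path": {10_000_000: 3.1877, 100_000_000: 24.8331},
    "Fugue Polars": {10_000_000: 4.5521, 100_000_000: 49.5615},
    "Polars": {10_000_000: 17.3864, 100_000_000: 204.2704},
}


def growth(name):
    """How many times longer the 100m run took than the 10m run."""
    return MEANS_S[name][100_000_000] / MEANS_S[name][10_000_000]


def speedup_vs_polars(name, size):
    """Native Polars mean divided by this benchmark's mean at the given size."""
    return MEANS_S["Polars"][size] / MEANS_S[name][size]
```

On these numbers the two DuckDB-backed runs grow by less than 10x for 10x the rows, while native Polars grows by more than 10x, so the gap widens at 100m rows.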

@goodwanghan changed the title from "FugueSQL implementation" to "WIP: FugueSQL implementation" on Jan 13, 2024
@fdosani
Member

fdosani commented Mar 25, 2024

@goodwanghan Sorry for the delay here. Been traveling and want to wait for #275 to get merged so I can focus on this PR. Again sorry for the delay, haven't forgotten about this.

@fdosani
Member

fdosani commented Mar 25, 2024

@goodwanghan I'm going to push a rebase to the PR as there have been some downstream changes.

Member

@fdosani fdosani left a comment

A few initial questions. Will clone locally and dive a bit deeper into the sql logic. Thank you for this! 🎉

datacompy/_fsql_utils.py (outdated, resolved)
datacompy/fsql.py (outdated, resolved)
.github/workflows/test-package.yml (resolved)
pyproject.toml (outdated, resolved)
@fdosani
Member

fdosani commented Apr 15, 2024

Hey @goodwanghan no rush just wanted to touch base with some of the initial questions on the review.

@goodwanghan
Contributor Author

Hi @fdosani my apologies, I forgot to reply.

@fdosani
Member

fdosani commented Apr 16, 2024

No worries at all! Not going to lie, I've been a bit busy myself. Plan on spending some more time on the PR tomorrow. Appreciate your help as always.

@fdosani
Member

fdosani commented Apr 18, 2024

@goodwanghan Been looking through the code and playing around with it locally. Looks good!
Couple of questions and comments:

  • Is the thought the SQL version would just replace the current fugue implementation?
  • I like the CompareResult class. Wondering if we can have it mimic the report a bit closer to the native implementations.
  • Also it seems like duckdb is pretty performant. Kind of a tangent to this, but what are your thoughts on maybe re-implementing the pandas (native) version to actually use duckdb in the background?
  • Just to confirm is this still a WIP? Wasn't 100% sure.

Thanks again!

@jdawang not sure if you have some time to review and tinker here but would be nice to get another set of eyes on this.

@goodwanghan
Contributor Author

I think the SQL version will ultimately replace the current Fugue version for datacompy.
We can mimic the native implementation; we just need to add some adapters. The CompareResult idea is a more general solution, in my opinion.
I really think the duckdb solution should be the "native" solution. The installation is very lightweight and the speed is superb, so I don't see any concern with doing that.

This is no longer a WIP. It is ready to merge.

Thanks :)
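
The adapter idea mentioned in this reply could look roughly like the sketch below: a general result object that holds only counts, plus a separate function that renders those counts as a text report. Everything here is hypothetical. `CountsResult`, its fields, and `render_report` are invented for illustration and are neither datacompy's `CompareResult` nor the report format of the native implementations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountsResult:
    """Hypothetical general-purpose result: raw counts only, no formatting."""

    only_in_df1: int
    only_in_df2: int
    shared_rows: int
    unequal_rows: int

    def matches(self) -> bool:
        """True when both sides have the same keys and no shared row differs."""
        return (
            self.only_in_df1 == 0
            and self.only_in_df2 == 0
            and self.unequal_rows == 0
        )


def render_report(result: CountsResult, df1_name: str = "df1", df2_name: str = "df2") -> str:
    """Hypothetical adapter: turn the general result into a text report."""
    lines = [
        "Comparison summary",
        "------------------",
        f"Rows only in {df1_name}: {result.only_in_df1}",
        f"Rows only in {df2_name}: {result.only_in_df2}",
        f"Rows in common: {result.shared_rows}",
        f"Rows in common with unequal values: {result.unequal_rows}",
        f"Matches: {result.matches()}",
    ]
    return "\n".join(lines)
```

Keeping the rendering outside the result class is the point of the design: a different adapter could produce a different report layout from the same counts without touching the comparison logic.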

@fdosani
Member

fdosani commented Apr 22, 2024

Sounds good. I think there is a conflict with pyproject.toml. I'm thinking it might make sense to merge once fugue 0.9.0 is released? I think I also had some suggestions, if we are ok to include those (copyright dates). I might propose some more changes if that's cool.

@fdosani
Member

fdosani commented Apr 22, 2024

Tests will fail since 0.9.0 is not released yet.

@fdosani
Member

fdosani commented Apr 30, 2024

Aiming for this PR to go into release v0.12.1.

@goodwanghan
Contributor Author

Cool, Fugue 0.9.0 was released yesterday. I will double check tonight to see if there is anything remaining.

@fdosani
Member

fdosani commented May 1, 2024

Might just need to rebase the PR

@ak-gupta ak-gupta left a comment

Just doing a review now. Can you run black on the PR? So far the code looks good but the linter and adding some whitespace to separate logical chunks of code might help me review faster.

@fdosani changed the title from "WIP: FugueSQL implementation" to "FugueSQL implementation" on May 20, 2024
@fdosani changed the base branch from develop to fugue-sql-updates on May 23, 2024 15:26
@fdosani
Member

fdosani commented May 23, 2024

Merging this into a local branch. Need to do some docstring updates and clean up.

@fdosani fdosani merged commit 84a6dee into capitalone:fugue-sql-updates May 23, 2024
3 checks passed