-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FugueSQL implementation #259
Conversation
@goodwanghan Sorry for the delay here. Been traveling and want to wait for #275 to get merged so I can focus on this PR. Again sorry for the delay, haven't forgotten about this. |
@goodwanghan I'm going to push a rebase to the PR as there have been some downstream changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few initial questions. Will clone locally and dive a bit deeper into the sql logic. Thank you for this! 🎉
Hey @goodwanghan no rush just wanted to touch base with some of the initial questions on the review. |
Hi @fdosani my apology I forgot to reply. |
No worries at all! Not going to lie, I've been a bit busy myself. Plan on spending some more time on the PR tomorow. Appreciate your help as always. |
@goodwanghan Been looking through the code and playing around with it locally. Looks good!
Thanks again! @jdawang not sure if you have some time to review and tinker here but would be nice to get another set of eyes on this. |
I think the SQL version will ultimately replace the current fugue version for datacompy This is no longer a WIP. It is ready to merge. Thanks :) |
Sounds good. I think there is a conflict with |
Tests will fail since 0.9.0 is not released yet. |
Aiming for this PR to go into release |
Cool, Fugue 0.9.0 was released yesterday. I will double check tonight to see if there is anything remaining. |
Might just need to rebase the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just doing a review now. Can you run black
on the PR? So far the code looks good but the linter and adding some whitespace to separate logical chunks of code might help me review faster.
Merging this into a local branch. Need to do some docstring updates and clean up. |
This change includes the totally redesigned Fugue solution based on Fugue SQL. For some distributed backends such as Ray, we will use map + FugueSQL solution.
This change also includes the standardized perf tests.
The benchmark is on a 8cpu 230GB vm:
The first batch contains 1k and 1m dataframes. In this case Fugue SQL solution has a fixed overhead around 200ms. So you can see in the 1k case, the native Pandas and Polars solutions are better. However, in 1m case, Fugue outperforms the native solutions in every way.
This following are the tests on 10m dataframes. Now Fugue tests are a few times faster
This following are the tests on 100m dataframes.