Pandas on Spark refactor #275
Conversation
Just want to share some benchmarks from my personal computer (8 cores, 32GB). 5 million rows, 10 columns:
10 million rows, 10 columns:
Polars seems to win the horse race here and can handle fairly large data frames on my modest machine.
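The exact benchmark harness isn't shown, so here is a rough sketch of how such a run might look (data shape, join key, and timing approach are all assumptions):

import time

import numpy as np
import pandas as pd
import polars as pl

import datacompy

n = 5_000_000
base = pd.DataFrame(np.random.rand(n, 9), columns=[f"c{i}" for i in range(9)])
base["id"] = np.arange(n)  # tenth column doubles as the join key
other = base.copy()

start = time.perf_counter()
datacompy.Compare(base, other, join_columns="id").report()
print(f"pandas: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
datacompy.PolarsCompare(pl.from_pandas(base), pl.from_pandas(other), join_columns="id").report()
print(f"polars: {time.perf_counter() - start:.1f}s")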
Converting this back to a draft. The minimal Spark version support is a bit concerning. I think I might take another stab at the native PySpark version over Pandas on PySpark. UPDATE: I've spent more time than I'd like to admit trying to get some form of backwards compatibility with older Spark versions and Pandas 2.0+. Not sure it is worth spending more time on this. @ak-gupta @jdawang @NikhilJArora if you don't have any objections to only being Spark 3.5+ compatible from 0.12.0 onwards, we can flag this PR as ready for review. The other option is to focus on the following branch: https://github.com/capitalone/datacompy/tree/spark-compare-rewrite
I actually was able to track down the support ticket. So Pandas 2.0 isn't supported until Spark 4. See here: https://issues.apache.org/jira/browse/SPARK-44101
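In other words, pandas-on-Spark below Spark 4 is built against pandas 1.x. A minimal sketch of the guard that implies (naive version parsing, message is mine):

import pandas as pd
import pyspark

# pandas 2.x with pandas-on-Spark is only expected to work from Spark 4 (SPARK-44101)
if int(pd.__version__.split(".")[0]) >= 2 and int(pyspark.__version__.split(".")[0]) < 4:
    raise ImportError("pin pandas<2.0, or upgrade to Spark 4, to use the pandas-on-Spark API")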
20 executors, 8 cores, 32 GB
LGTM
I have a few nitpicky comments, if you'll indulge me, and one question. We're purposefully removing the LegacySparkCompare docs because we don't want people to start using it, right?
datacompy/__init__.py
Outdated
@@ -25,4 +25,4 @@
     unq_columns,
 )
 from datacompy.polars import PolarsCompare
-from datacompy.spark import NUMERIC_SPARK_TYPES, SparkCompare
+from datacompy.spark import SparkCompare
Add newline
Done
pass  # Let non-Spark people at least enjoy the loveliness of the pandas datacompy functionality
...
warn(
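For context, the `pass` above sits in a conditional import; in spirit (exact module layout assumed) it reads something like:

try:
    import pyspark  # noqa: F401

    from datacompy.spark import SparkCompare
except ImportError:
    pass  # Let non-Spark people at least enjoy the loveliness of the pandas datacompy functionality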
The warning message doesn't seem accurate, if I understand correctly? This is in the legacy module, so it seems like we should either delete this message from this module or change the string to a warning that we're going to delete legacy functionality in a vx.x.x release.
Made a tweak.
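For reference, a deprecation warning along these lines is presumably what the tweak landed on (exact wording assumed):

from warnings import warn

warn(
    "LegacySparkCompare is deprecated and will be removed in a future release; "
    "please migrate to the new SparkCompare",
    DeprecationWarning,
    stacklevel=2,
)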
@@ -19,11 +19,10 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      python-version: [3.8, 3.9, '3.10', '3.11']
-      spark-version: [3.1.3, 3.2.4, 3.3.4, 3.4.2, 3.5.0]
+      python-version: [3.9, '3.10', '3.11']
If we're not testing on py38, I think this would warrant a formal declaration of dropping support + changing supported versions in the pyproject.toml. Learned this the hard way haha.
Making changes in pyproject.toml
datacompy/spark.py
Outdated
-class SparkCompare:
-    """Comparison class used to compare two Spark Dataframes.
+class SparkCompare(BaseCompare):
+    """Comparison class to be used to compare whether two Pandas on Spark dataframes as equal.
Nitpicky grammar, but as equal -> are equal
Done
datacompy/spark.py
Outdated
            .cumcount()
        )
    else:
        return dataframe[join_columns].groupby(join_columns).cumcount()
Add newline
Done
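For context on the snippet above: groupby().cumcount() assigns a per-duplicate index within each join-key group, which is how duplicate join keys get disambiguated. A tiny pandas illustration:

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "a": [10, 20, 30]})
# prints 0, 1, 0: each row is numbered within its duplicate join-key group
print(df[["id"]].groupby(["id"]).cumcount())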
docs/source/spark_usage.rst
Outdated
DataComPy's Panads on Spark implementation ``SparkCompare`` (new in ``v0.12.0``)
Panads -> Pandas
Done
docs/source/spark_usage.rst
Outdated
DataComPy's Panads on Spark implementation ``SparkCompare`` (new in ``v0.12.0``)
is very similar port of the Pandas version
is very -> is a very
Done
docs/source/spark_usage.rst
Outdated
Until then we will not be supporting Pandas 2 for the Pandas on Spark API implementaion.
For Fugue and the Native Pandas implementation Pandas 2 is support.
support -> supported?
Done
Yeah exactly. I want to keep it in case someone needs or wants it, but I don't want to support it moving forward. The new implementation is better aligned to Pandas in terms of logic, so we should favour that.
Just going to add a note here for the future: currently seeing a small difference in pandas vs spark report sample rows when there are rows only in one dataframe.
- There is an additional _merge_right column which is not in the original dataframes, which could cause a bit of confusion for users.
- We're displaying the column names as their aliases, which could also be a bit confusing. It would be best to translate them back to their original names.
Not a blocker for this, but we should open a follow-up issue to keep track of this.
import numpy as np
import pandas as pd
import pyspark.pandas as ps

pdf1 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5], "a": [2, 3, 2, 3, 2], "b": ["a", "b", "c", "d", ""]})
pdf2 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5, 6], "a": [2, 3, 2, 3, 2, np.nan], "b": ["a", "b", "c", "d", "", pd.NA]})
df1 = ps.DataFrame(pdf1)
df2 = ps.DataFrame(pdf2)
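Both reports below were presumably produced with calls along these lines (the join column follows from the "Matched on: id" line; exact signatures assumed):

from datacompy import Compare
from datacompy.spark import SparkCompare

# Spark report: the new pandas-on-Spark implementation
print(SparkCompare(df1, df2, join_columns="id").report())

# Pandas report: the plain pandas implementation on the source frames
print(Compare(pdf1, pdf2, join_columns="id").report())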
Spark
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 df1 3 5
1 df2 3 6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
Column df1 dtype df2 dtype # Unequal Max Diff # Null Diff
0 a int64 float64 0 0.0 0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
id_df2 a_df2 b_df2 _merge_right
5 6 NaN None True
Pandas
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 df1 3 5
1 df2 3 6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
Column df1 dtype df2 dtype # Unequal Max Diff # Null Diff
0 a int64 float64 0 0.0 0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
id a b
0 6 NaN <NA>
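For the follow-up issue, the alias translation being suggested might look roughly like this (`sample` is a hypothetical frame holding the rows shown above; the suffix and column names are taken from the Spark report):

# hypothetical: strip the "_df2" alias suffix and drop the internal merge indicator
sample = sample.rename(columns=lambda c: c.removesuffix("_df2")).drop(columns=["_merge_right"])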
Ahhh good catch!
Yeah, up to you, I don't think this is big enough to be a blocker if you want to merge :)
* refactor SparkCompare
* tweaking SparkCompare and adding back Legacy
* conditional import
* cleaning up tests and using pytest-spark for legacy
* adding docs
* caching and some typo fixes
* adding in doc and pandas 2 changes
* adding pandas to testing matrix
* drop 3.8
* drop 3.8
* refactoring ^
* rebase fix for capitalone#277
* fixing legacy unicode column names
* unicode fix for legacy
* unicode test for new spark logic
* typo fix
* changes from PR review
So a major update here.
Closes #274
Closes #13