bug fix when all columns match but no rows match #277

Merged
fdosani merged 2 commits into develop from spark-no-rows-match
Mar 12, 2024

@fdosani fdosani commented Mar 12, 2024

Fixes #276

@SimonBFrank would you mind pulling down this branch and checking whether it fixes your issue? Long story short, no rows match, so there are no results, and the dictionary of values that gets displayed in the report cannot be generated because of None values.

This is more of a hack, since we will be deprecating this legacy Spark implementation (#275) in favor of a version that is much more readable and logically closer to the Pandas one. The fix just catches the TypeError in this situation and outputs {} for columns_with_any_diffs and columns_fully_matching.
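
The workaround described above can be sketched roughly as follows. This is a minimal illustration and not the actual datacompy code: the helper name `summarize_columns` and the shape of `match_counts` are hypothetical, and it assumes the per-column match counts come back as None when the join produces no rows.

```python
def summarize_columns(match_counts, row_count):
    """Split compared columns into differing and fully matching ones.

    Hypothetical helper: match_counts maps a column name to its number of
    matching rows, and is assumed to hold None values when no rows joined.
    """
    columns_with_any_diffs = {}
    columns_fully_matching = {}
    try:
        for column, matches in match_counts.items():
            # Subtracting None from an int raises TypeError.
            mismatches = row_count - matches
            entry = {"match": matches, "mismatch": mismatches}
            if mismatches:
                columns_with_any_diffs[column] = entry
            else:
                columns_fully_matching[column] = entry
    except TypeError:
        # No rows in common: fall back to empty dictionaries so the
        # report can still be rendered.
        columns_with_any_diffs = {}
        columns_fully_matching = {}
    return columns_with_any_diffs, columns_fully_matching


# With no joined rows the counts are None and both results are empty.
print(summarize_columns({"tmp": None}, 0))
```

Catching the exception keeps the change small, which fits a code path that is about to be deprecated; an explicit None check would be the more conventional choice in longer-lived code.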

@fdosani fdosani added the bug Something isn't working label Mar 12, 2024
@fdosani fdosani marked this pull request as ready for review March 12, 2024 00:30
@SimonBFrank

When I compare the two dataframes with join columns ["id", "label"] this is the result:

Dataframe 1:

id  label  tmp
--  -----  ---
1   foo    1
2   bar    1

Dataframe 2:

id  label  tmp
--  -----  ---
3   foo    1
4   bar    1

****** Column Summary ******
Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0

****** Row Summary ******
Number of rows in common: 0
Number of rows in base but not compare: 2
Number of rows in compare but not base: 2
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 0
Number of rows with all columns equal: 0

****** Column Comparison ******
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 0

****** Columns with Unequal Values ******
Base Column Name  Compare Column Name  Base Dtype     Compare Dtype  # Matches  # Mismatches
----------------  -------------------  -------------  -------------  ---------  ------------

I believe Number of rows with some columns unequal and Number of columns compared with some values unequal should be 2 since only the values in the column id are different. Additionally, Columns with Unequal Values should have id.

@fdosani
Member Author

fdosani commented Mar 12, 2024

> I believe Number of rows with some columns unequal and Number of columns compared with some values unequal should be 2 since only the values in the column id are different. Additionally, Columns with Unequal Values should have id.

I might not be following right, but I think since none of the join columns match (["id", "label"]) in this situation it should all be 0. It is joining on both the fields, not just one.
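
The counts in the report follow from treating (id, label) as a single composite key. A minimal sketch in plain Python, using sets of key tuples instead of Spark (the variable and function names are illustrative only):

```python
# Rows from the two example dataframes as (id, label, tmp) tuples.
base = [(1, "foo", 1), (2, "bar", 1)]
compare = [(3, "foo", 1), (4, "bar", 1)]


def join_keys(rows):
    """Return the set of composite (id, label) keys for the given rows."""
    return {(row[0], row[1]) for row in rows}


base_keys = join_keys(base)
compare_keys = join_keys(compare)

# A row is "in common" only when both id and label match.
rows_in_common = base_keys & compare_keys
only_in_base = base_keys - compare_keys
only_in_compare = compare_keys - base_keys

print(len(rows_in_common), len(only_in_base), len(only_in_compare))  # 0 2 2
```

Because no key pair appears in both sets, there are no joined rows left to compare column by column, so every comparison count is 0. Joining on label alone would instead put both rows in common.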

@SimonBFrank

> > I believe Number of rows with some columns unequal and Number of columns compared with some values unequal should be 2 since only the values in the column id are different. Additionally, Columns with Unequal Values should have id.
>
> I might not be following right, but I think since none of the join columns match (["id", "label"]) in this situation it should all be 0. It is joining on both the fields, not just one.

Whoops, it must've been late and I didn't understand it correctly. LGTM

@fdosani fdosani merged commit 930e038 into develop Mar 12, 2024
28 checks passed
@fdosani fdosani deleted the spark-no-rows-match branch March 12, 2024 14:10
fdosani pushed a commit that referenced this pull request Mar 12, 2024
fdosani pushed a commit that referenced this pull request Mar 25, 2024
fdosani added a commit that referenced this pull request Mar 25, 2024
* refactor SparkCompare

* tweaking SparkCompare and adding back Legacy

* conditional import

* cleaning up tests and using pytest-spark for legacy

* adding docs

* caching and some typo fixes

* adding in doc and pandas 2 changes

* adding pandas to testing matrix

* drop 3.8

* drop 3.8

* refactoring ^

* rebase fix for #277

* fixing legacy unicode column names

* unicode fix for legacy

* unicode test for new spark logic

* typo fix

* changes from PR review
rhaffar pushed a commit to rhaffar/datacompy that referenced this pull request Sep 12, 2024
Linked issue: report throws an exception when all columns match but no rows match