Datacompy: Object/string misinterpreted as float -> false equal result #121
Comments
Thanks for flagging this. I can take a closer look tomorrow morning.
On Tue, Nov 2, 2021 at 6:35 AM, petrafakler wrote:
While using datacompy.Compare, a string/object column was misinterpreted as float (because the strings contain only digits). The strings are more than 30 characters long and differ only in the last digit. The misinterpreted float lost precision, so the comparison reports the values as equal.
Example:
import pandas as pd
import datacompy
df1 = pd.DataFrame({'ID': [1], 'REFER_NR': ['9998700990704001708177961516923014']})
df2 = pd.DataFrame({'ID': [1], 'REFER_NR': ['9998700990704001708177961516923015']})
compare = datacompy.Compare(
    df1,
    df2,
    join_columns='ID',  # You can also specify a list of columns
    abs_tol=0,  # Optional, defaults to 0
    rel_tol=0,  # Optional, defaults to 0
    df1_name='TEST',  # Optional, defaults to 'df1'
    df2_name='INTE',  # Optional, defaults to 'df2'
)
print(compare.report())
result:
Column Summary
Number of columns in common: 2
Number of columns in TEST but not in INTE: 0
Number of columns in INTE but not in TEST: 0
Row Summary
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 1
Number of rows in TEST but not in INTE: 0
Number of rows in INTE but not in TEST: 0
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 1
Column Comparison
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 0
Maybe the number of digits could help decide whether a value should be interpreted as a float or kept as an object.
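To illustrate why the two values in the report above end up equal, here is a minimal standalone sketch in plain Python. It does not use datacompy; it only assumes that the digit strings get parsed as 64-bit floats somewhere along the way, as described in the issue. A float64 carries roughly 15 to 17 significant decimal digits, so two 34-digit numbers that differ only in the last digit parse to the same float.

```python
a = "9998700990704001708177961516923014"
b = "9998700990704001708177961516923015"

# Compared as strings, the values differ in the last character.
strings_equal = a == b

# Parsed as float64, both strings round to the same double,
# because the difference is far below float precision at this magnitude.
floats_equal = float(a) == float(b)

# Python integers have arbitrary precision, so an exact numeric
# comparison still tells the two values apart.
ints_equal = int(a) == int(b)

print(strings_equal, floats_equal, ints_equal)  # False True False
```

So the comparison itself is exact (tolerances are 0); the information is lost earlier, at the point where the strings are converted to floats.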
Looking into this issue, it seems to be happening here due to the following code. @jborchma @elzzhu @ak-gupta @theianrobertson any thoughts/opinions on this?
Thanks for the analysis so far :-)
I think it makes sense to add a flag to not cast, since there are instances where IDs are numerical but you don't necessarily want to treat them as such. If the flag were added, would the default behaviour be the current behaviour?
@elzzhu I think it would default to the current behaviour:
This should solve the issue and keep the existing behaviour. The only thing is that, for now, it might need to apply to all columns rather than picking and choosing.
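The flag idea discussed above could look roughly like the sketch below. This is not datacompy's code or API: the function name `values_equal` and the parameter `cast_numeric` are hypothetical, chosen only to show a flag that defaults to the current casting behaviour and can be switched off for digit-only identifiers.

```python
def values_equal(left, right, cast_numeric=True):
    """Compare two values (hypothetical helper, not part of datacompy).

    With cast_numeric=True (the default, mimicking the current behaviour),
    values that look numeric are compared as floats. With cast_numeric=False
    the values are compared exactly as given.
    """
    if cast_numeric:
        try:
            return float(left) == float(right)
        except (TypeError, ValueError):
            pass  # not numeric-looking: fall through to the exact comparison
    return left == right


a = "9998700990704001708177961516923014"
b = "9998700990704001708177961516923015"

default_result = values_equal(a, b)                      # casts to float
no_cast_result = values_equal(a, b, cast_numeric=False)  # exact string comparison
print(default_result, no_cast_result)  # True False
```

Defaulting the flag to the casting path keeps existing results unchanged for current users, while callers with long numeric IDs can opt out.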
I’m going to take a stab at a fix for this this week. Sorry, it fell off my radar.
Any word on this one? We are experiencing this issue as well.
@james-stead sorry about that. This sort of fell off the radar a bit. I'm assuming you have some numbers (as strings) which are being cast into a float type, correct? If you are OK with the above proposal, we can add a new optional flag to not cast certain columns.