Inequality when comparing two empty numpy arrays #58
Comments
Hey @simonwongwong hope all is well! |
@jborchma Just want to circle back on this. Thoughts on just checking if both are empty and throwing an exception? This might be something which is never encountered (comparing 2 empty dataframes) |
So technically two empty dataframes should be equal. Maybe we could return |
I'm aligned with that. I'll try and do a quick PR here. |
@jborchma So it seems like @simonwongwong is comparing arrays here. So obviously empty arrays make sense. But the following also doesn't work
Mainly due to the fact that:
|
@simonwongwong Are you comparing a lot of np.arrays? (Could you have arrays of > 1 length?) I'd like to think about the use case a bit more if you have thoughts. |
My use case was reading CSV files with empty arrays using pandas -- pandas will read arrays as numpy arrays and two empty numpy arrays cannot be equal |
Makes sense. I think the issue boils down to how Pandas internalises the dtype for an array. It will be an |
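For reference, a minimal sketch of the dtype point above, assuming a column built by hand from NumPy arrays rather than read from a CSV (the variable name `col` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column whose cells are NumPy arrays, one of them empty.
col = pd.Series([np.array([1, 2]), np.array([]), np.array([3])])

# pandas stores such a column with the generic object dtype,
# so the dtype alone does not reveal that the cells are arrays.
print(col.dtype)

# Only the individual cells show what the column really holds.
print(type(col.iloc[0]))
```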
In my case it wasn't always empty. |
@theianrobertson Thoughts on this issue? Dataframes with numpy arrays in columns. |
So what Simon really wants is elementwise comparison of the arrays, right? |
Yup that’s exactly it. Not an actual empty data frame.
|
I guess we would want to use something like the numpy function to compare arrays. |
Yeah, on non-empty arrays it'll do a normal element-wise comparison, but empty np.arrays will never be equal |
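A sketch of that difference, assuming `np.array_equal` is the kind of NumPy function meant above (the thread does not name it):

```python
import numpy as np

a, b = np.array([1, 2, 3]), np.array([1, 2, 3])
empty_a, empty_b = np.array([]), np.array([])

# Element-wise == returns one boolean per element.
print(a == b)

# For empty inputs the result is itself empty, so there is
# no element left to decide whether the arrays are "equal".
print((empty_a == empty_b).size)

# np.array_equal compares shape and contents and returns a single bool;
# two empty arrays of the same shape count as equal.
print(np.array_equal(a, b))
print(np.array_equal(empty_a, empty_b))
```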
The main issue is detecting whether it is a np array, since it comes up as an
object dtype in pandas. We'd need to inspect the actual element to
know for sure what we are dealing with. But then the question comes up of
whether we should support other things.
|
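Inspecting the actual elements could look roughly like this. It is only a sketch: `is_array_column` is a hypothetical helper, not something datacompy provides, and it makes no attempt to handle nulls or mixed cells.

```python
import numpy as np
import pandas as pd


def is_array_column(col: pd.Series) -> bool:
    """Hypothetical helper: True if every cell of an object column is a NumPy array."""
    # Non-object columns cannot hold arrays as cells; an empty column tells us nothing.
    if col.dtype != object or len(col) == 0:
        return False
    # The dtype is not enough, so look at each cell.
    return all(isinstance(cell, np.ndarray) for cell in col)


arrays = pd.Series([np.array([1, 2]), np.array([])])
numbers = pd.Series([1, 2, 3])
print(is_array_column(arrays), is_array_column(numbers))
```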
If you look at my above example I’m not sure datacompy will automatically work. It will complain and suggest any or all. |
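The "any or all" complaint can be reproduced with plain NumPy. This is a minimal sketch of the error itself, not of what datacompy does internally:

```python
import numpy as np

left = np.array([1, 2, 3])
right = np.array([1, 2, 3])

message = ""
try:
    # == yields a boolean array; asking for its truth value is ambiguous.
    bool(left == right)
except ValueError as err:
    message = str(err)

# The error text suggests reducing with a.any() or a.all().
print(message)

# Reducing explicitly gives a single answer.
print((left == right).all())
```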
@jborchma Any further thoughts on this? I think the main issue is where and if you draw the line of things to compare. |
@simonwongwong This was a while back now. I'm going to close this issue, but feel free to reopen if it seems like something we need to rehash. Trying to organize our backlog and work through some of these older issues if needed. |
Hi everyone! This is a feature we're missing and I'm happy to spend some time implementing a solution (and also coming up with a proposal how to move forward, if you want). |
Hey @jonashaag yes please. Would love contributions and thoughts from others. Happy to have you take this on. Appreciate you willing to help out. 🚀 |
Had a look into the implementation -- the actual column comparison code (
A) Add new fixed logic for NumPy arrays: try to detect NumPy array columns by looking at the actual series values. Use
B) Add a new system for custom declaration of "comparators", i.e. give more flexibility to the user to configure how columns are compared. We would ship a default configuration that mimics the current behavior, and users would be free to change the configuration to their liking. This could be as simple as giving a list of comparators that are tried in order until one of them "understands" the data, i.e. the user could pass something like:
columns_equal(..., comparators=[
    FloatComparator(rtol=1e-3),
    StringComparator(case_sensitive=False),
    ArrayComparator(aggregate="all")  # calls .all()
])
Or it could be an explicit list of comparators for each column, or something similar. |
@jonashaag I'll take a look at this on Monday. Been on vacation all week. Thanks for your help with this. I do think datacompy could be ready for a major refactor, to be honest. Especially aligning the Spark and pandas APIs |
Reopening this issue. I like the idea of option B @jonashaag. But that seems like a bit of a refactor and something I've been thinking about with the package. I'd like to revisit it and see if there are opportunities to both make it more flexible and have it play nicer with Spark/Pandas all in one spot. I was thinking maybe koalas might be a good option here. Option A would be the quickest and, it seems, would solve this direct issue immediately. Thoughts? @jonashaag @jborchma @elzzhu @theianrobertson ? |
I have little experience with Spark and I'm not sure if I'll be able to invest the learning time right now. |
That is perfectly fine, that is something I can lean into. |
Comparison of two empty numpy arrays currently returns False, which results in showing diffs where there shouldn't be.
This is due to the way numpy compares empty arrays. Running bool(np.array([]) == np.array([])) returns False and throws this warning:
Reproduce this bug with:
output: