Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Index.join with different names doesn't return None name #57074

Closed
wants to merge 6 commits into from

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Index Related to the Index class or subclasses labels Jan 25, 2024
@mroeschke mroeschke added this to the 3.0 milestone Jan 25, 2024
@lukemanley
Copy link
Member

lukemanley commented Jan 25, 2024

I think this introduces an inconsistency between the non-unique and monotonic join paths.

xref #55815 where @rhshadrach and I tried to determine the intended behavior.

When left/right names differ, the choices seem to be:

  • set name to none
  • set name to caller (left)
  • set name to caller (left) in all cases except how="right"

As of 2.2.0 I think there are cases of 1 and 3 occuring in various places. #56948 attempted to implement case 3 consistently but maybe it should be case 1 as you're doing in this PR.

@mroeschke
Copy link
Member Author

Thanks for the reference @lukemanley

For set name to caller (left) in all cases except how="right", is there any case where "caller" is "ambiguous" with a function based join like pd.concat or pd.merge? Should that apply to arithmetic operations too idx1 + idx2 ~ idx1.__add__(idx2)?

I slightly prefer the return None approach as it's already being done in arithmetic cases and appears simpler, but not sure if there's a use case where joins specifically should keep names.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 26, 2024

I slightly prefer the return None approach as it's already being done in arithmetic cases and appears simpler, but not sure if there's a use case where joins specifically should keep names.

Are we meaning to have different rules for Index vs Series?

idx1 = Index([1, 1, 2, 3], name="a")
idx2 = Index([2, 5, 3, 4], name="b")

ser1 = pd.Series(1, index=idx1)
ser2 = pd.Series(2, index=idx2)

print(idx1+idx2)
# Index([3, 6, 5, 7], dtype='int64')

print(ser1+ser2)
# a
# 1    NaN
# 1    NaN
# 2    3.0
# 3    3.0
# 4    NaN
# 5    NaN
# dtype: float64

I can't say I've ever directly used Index.join, only DataFrame.join. Likewise with arithmetic. But if there isn't a good reason why these two behaviors should diverge (not sure), I would think users expect them to match.

@lukemanley
Copy link
Member

I slightly prefer the return None approach as it's already being done in arithmetic cases and appears simpler, but not sure if there's a use case where joins specifically should keep names.

Keeping consistency with implicit alignment (e.g. arithmetic cases) seems reasonable to me.

@mroeschke
Copy link
Member Author

OK with merging here?

@lukemanley
Copy link
Member

OK with merging here?

It looks like this only impacts the _join_non_unique path. Doesn't this create an inconsistency with _join_monotonic? For example, what would this return:

import pandas as pd

a = pd.Index([1, 2, 3], name="a")
b = pd.Index([2, 3, 4], name="b")

a.join(b, how="inner")

@rhshadrach
Copy link
Member

OK with merging here?

I think it'd be good to decide if we are okay with the discrepancy between Index and Series behavior (#57074 (comment)) that this is further widening. It seems to me changing Series behavior would be quite disruptive, while I personally never do arithmetic with index objects.

@mroeschke
Copy link
Member Author

I guess this needs more discussion first so I'll close for now

@mroeschke mroeschke closed this Mar 18, 2024
@mroeschke mroeschke deleted the bug/join/name branch March 18, 2024 18:56
@rhshadrach
Copy link
Member

@mroeschke - any thoughts on the discrepancy between the resulting name of idx1 + idx2 and ser1 + ser2 in #57074 (comment)?

@mroeschke
Copy link
Member Author

I also think there shouldn't be a result-name discrepancy between idx1 + idx2 and ser1 + ser2, but namely get_op_result_name should also be used in both these cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Index.join preserving names incorrectly from pandas-2.x
3 participants