Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Index.map adding back tz to tz-agnostic result #57475

Closed
wants to merge 4 commits into from

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 17, 2024

It seems to me we should skip casting back to the caller's dtype when the caller dtype is datetime, since these are reliably represented as objects (unlike ints, which loses the size). But datetimes/tzs are definitely not in my wheelhouse.

cc @jbrockmendel

@rhshadrach rhshadrach added Regression Functionality that used to work in a prior pandas version Apply Apply, Aggregate, Transform, Map Index Related to the Index class or subclasses labels Feb 17, 2024
@rhshadrach rhshadrach added this to the 2.2.1 milestone Feb 17, 2024
new_values = maybe_cast_pointwise_result(
new_values, self.dtype, same_dtype=same_dtype
)
if self.inferred_type != "datetime64":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what other dtypes might arise here - need to check

@rhshadrach rhshadrach marked this pull request as draft February 19, 2024 23:03
@jbrockmendel
Copy link
Member

I'm pretty sure the problem is in DatetimeArray._from_scalars, where we need to check that dont have tzawareness-mismatch. instead of

if lib.infer_dtype(scalars, skipna=True) not in ["datetime", "datetime64"]:
    raise ValueError

something like

if isinstance(dtype, DatetimeTZDtype) and not lib.is_datetime_with_singletz_array(scalars);
     raise ValueError
elif isinstance(dtype, np.dtype) and not lib.is_naive_datetime_array(scalars):
    raise ValueError

is_naive_datetime_array doesn't actually exist but seems like it should.

@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@rhshadrach
Copy link
Member Author

rhshadrach commented Mar 9, 2024

Thanks @jbrockmendel - this is much better. One thing I'm running into is that is_datetime_with_singletz_array doesn't check if any of the values are actually datetimes (will always return True when there are no datetimes), so I think this still needs to have the line with lib.infer_dtype(scalars, skipna=True) or maybe use lib.is_datetime_or_datetime64_array? Any recommendations?

@jbrockmendel
Copy link
Member

I'm trying to get rid of uses of lib.infer_dtype as it is very rarely the right tool to reach for.

Seems like is_datetime_with_singletz_array(values) and not isna(values).all() is the simplest way to get what you've described. Could try to combine those into a single function i guess but i wouldn't bother at this point.

# TODO: if dtype is passed, check for tzawareness compat?
# TODO: require any NAs be valid-for-DTA
# TODO: if dtype is passed, check for tzawareness compat?
if not lib.is_datetime64_array(scalars):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need this (or something like it) because e.g. lib.is_datetime_with_singletz_array does not check if there are datetimes:

print(lib.is_datetime_with_singletz_array(np.array([1, 2, 3])))
# True

However, I do not currently understand the behavior differences between this and the previous lib.infer_dtype and need to look into this more. Any guidance here is much appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we get that behavior from is_datetime_with_singletz_array bc it implicitly assumes it is called from maybe_convert_objects, where it will only get called if the first non-NaT it sees is a datetime with a tz.

I'm not a fan of infer_dtype bc the string returns are clunky (and hard to reason about, e.g. when do you expect "datetime" vs "datetime64") and fragile when we add new return values (e.g. "integer-na"). Also in cases when you only care about datetimes, it will not short-circuit when it sees a non-datetime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - makes perfect sense to move away from infer_dtype. For the old behavior, I'm seeing the use of infer_dtype allow the following to not raise on a non-empty input:

  • If the scalars has the attribute type, kind, name, or base set to one of "datetime64[ns]", "M", datetime
  • is_datetime64_array returns True
  • is_datetime_array returns True

As far as I can tell, the input scalars here must be an ndarray, Index, or ExtensionArray, but I don't feel confident about this from the code path in Index.map.

With this, I'm wondering if the following is a good equivalent:

if (
    not is_datetime64_any_dtype(scalars)
    and not lib.is_datetime64_array(scalars)
    and not lib.is_datetime_array(scalars)
):
    raise ValueError

@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024
@rhshadrach rhshadrach closed this Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas 2.2.0 DataFrame.groupby regression when group labels of type pd.Timestamp
3 participants