Comparison of two pandas "period"-Columns is always False/not working #336

Closed
Salfiii opened this issue Oct 10, 2024 · 4 comments

Salfiii commented Oct 10, 2024

Hi,

We have multiple cases where we have to work with dates that don't fit into the pandas "datetime64[ns]" type, so we are using pandas Period as a replacement.

When comparing dataframes with Period columns, datacompy always reports all rows as non-matching, even if they should match. Below is an example:

import datetime

import datacompy
import pandas as pd

"""
source_df: pd.DataFrame = pd.DataFrame(
    columns=[
        "PK", "STRING_COLUMN", "INT_COLUMN", "TIMESTAMP_COLUMN", "DECIMAL_COLUMN", "CHAR_COLUMN",
        "FLOAT_COLUMN", "BOOLEAN_COLUMN", "PERIOD_COLUMN"
    ], data=[
        [0, "same", 0, datetime.datetime(year=2024, month=1, day=1), 1.2345, "CHAR", 1.23, True,
         datetime.datetime(year=9999, month=1, day=1, second=1)],
        [1, "same", 1, datetime.datetime(year=2024, month=2, day=1), 2.2345, "CHAR", 2.23, False,
         datetime.datetime(year=9999, month=1, day=1, second=1)],
        [2, "different", 2, datetime.datetime(year=2024, month=3, day=1), 3.2345, "CHAR", 3.23, False,
         datetime.datetime(year=9999, month=2, day=1, second=1)],
        [3, "same", 3, datetime.datetime(year=2024, month=4, day=1), 4.2345, "CHAR", 4.23, False,
         datetime.datetime(year=9999, month=3, day=1, second=1)],
        [4, "different", 4, datetime.datetime(year=2024, month=5, day=1), 5.2345, "CHAR", 5.23, True,
         datetime.datetime(year=9999, month=4, day=1, second=1)],
        [5, "different", 5, datetime.datetime(year=2024, month=6, day=1), 6.2345, "CHAR", 6.23, True,
         datetime.datetime(year=9999, month=5, day=1, second=1)]

    ]
)
"""

source_df: pd.DataFrame = pd.DataFrame(
    columns=[
        "PK",  "PERIOD_COLUMN"
    ], data=[
        [1,
         datetime.datetime(year=9999, month=1, day=1, second=1)],
        [2,
         datetime.datetime(year=9999, month=1, day=1, second=1)],
        [3,
         datetime.datetime(year=9999, month=2, day=1, second=1)],
        [4,
         datetime.datetime(year=9999, month=3, day=1, second=1)],
        [5,
         datetime.datetime(year=9999, month=4, day=1, second=1)],
        [6,
         datetime.datetime(year=9999, month=5, day=1, second=1)]

    ]
)


dtypes: dict = {"PK": pd.Int64Dtype(),
                #"STRING_COLUMN": pd.StringDtype(),
                #"INT_COLUMN": pd.Int64Dtype(),
                #"TIMESTAMP_COLUMN": "datetime64[ns]",
                #"DECIMAL_COLUMN": pd.Float64Dtype(), "CHAR_COLUMN": pd.StringDtype(),
                #"FLOAT_COLUMN": pd.Float64Dtype(), "BOOLEAN_COLUMN": pd.BooleanDtype(),
                "PERIOD_COLUMN": "period[S]"}

source_df = source_df.astype(dtypes)

compare_df: pd.DataFrame = source_df.copy(deep=True)

compare = datacompy.Compare(
    df1=source_df,
    df2=compare_df,
    join_columns='PK', 
    ignore_spaces=True,
    df1_name='source',
    df2_name="compare")

# The report always shows all rows as non-matching, but the values should be the same
print(compare.report())
overlap = compare.all_rows_overlap()
print(overlap)

I've left a larger dataframe with additional columns in the example code (commented out). The behaviour does not change if additional columns are present.

I tried to adjust "abs_tol" with no luck.
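As a cross-check that does not involve datacompy at all, the Period column can be compared with plain pandas. The snippet below is only a minimal sketch: it rebuilds the two-column frame from the example with `pd.Period` values constructed from fields (an assumption made here to keep it short, not the `astype` route used above) and uses the lowercase `"s"` frequency alias. If pandas itself reports the copies as equal, the mismatch is coming from the comparison library rather than from the data.

```python
import pandas as pd

# Build Period values directly from fields, so no datetime64[ns]
# conversion (and its year-2262 upper bound) is involved.
periods = [
    pd.Period(year=9999, month=month, day=1, second=1, freq="s")
    for month in (1, 1, 2, 3, 4, 5)
]

source_df = pd.DataFrame(
    {
        "PK": pd.array([1, 2, 3, 4, 5, 6], dtype="Int64"),
        "PERIOD_COLUMN": pd.PeriodIndex(periods, freq="s"),
    }
)
compare_df = source_df.copy(deep=True)

# Element-wise comparison of the two Period columns using pandas only.
elementwise_equal = source_df["PERIOD_COLUMN"] == compare_df["PERIOD_COLUMN"]

print(source_df["PERIOD_COLUMN"].dtype)
print(elementwise_equal.all())
print(source_df["PERIOD_COLUMN"].equals(compare_df["PERIOD_COLUMN"]))
```

Both checks should come out as equal for a deep copy, which is the result the report above was expected to show.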

If you can point me in the right direction in your codebase, I'm willing to try to provide a PR.

Best regards

Member

fdosani commented Oct 10, 2024

I'm out of the office today but will take a look at this tomorrow.

Member

fdosani commented Oct 16, 2024

Sorry for the delay; I've finally been able to look into this. Can you tell me which version you are using? I think the latest dev branch already contains a fix that solves this issue: #335

Running this with the MVE I get:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0    source        2     6
1   compare        2     6

Column Summary
--------------

Number of columns in common: 2
Number of columns in source but not in compare: 0
Number of columns in compare but not in source: 0

Row Summary
-----------

Matched on: pk
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 6
Number of rows in source but not in compare: 0
Number of rows in compare but not in source: 0

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 6

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 0

Member

fdosani commented Oct 16, 2024

This should be fixed in the latest v0.14.0 release.
Please feel free to reopen if you still have issues after upgrading.
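For anyone unsure whether their environment already includes that release, here is a rough sketch of a version check using only the standard library. The helper names `version_tuple` and `has_fix` are made up for this illustration and are not part of datacompy; the parsing is deliberately simple and only looks at the leading numeric parts of the version string.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text: str) -> tuple:
    """Turn '0.14.3' into (0, 14, 3), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed: str, minimum: str = "0.14.0") -> bool:
    """Hypothetical helper: True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("datacompy")
    print(installed, "includes the fix:", has_fix(installed))
except PackageNotFoundError:
    print("datacompy is not installed in this environment")
```

A two-part version string such as "0.14" would compare as lower than "0.14.0" with this tuple comparison, so a real project may prefer a dedicated version-parsing library.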

fdosani closed this as completed Oct 16, 2024

Author

Salfiii commented Nov 5, 2024

Hi @fdosani,

Thanks for your reply. I can confirm it's fixed in 0.14.3.

Thank you!
