Staging/main/0.10.4 (#1029)
* modified the assignees for issue creation (#1016)

* Minor: Profiler Path Fix in Example Notebook (#1021)

* Bump actions/checkout from 3 to 4 (#1024)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Turner <[email protected]>

* Make sure random_state is a list before indexed assignment (#968)

* Make sure random_state is a list before indexed assignment

Currently, a mypy error occurs because we attempt to assign to
random_state[1] when random_state has type
Union[list[Any], tuple[Any]]. Tuples are immutable so this is a type
error.

We fix this by making random_state into a list before doing indexed
assignment on it.

* Add type guards for random_state

* Check random_state before random_state[1]

Co-authored-by: Michael Davis <[email protected]>

* Reorder conditions for consistency

Co-authored-by: Taylor Turner <[email protected]>

---------

Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: Taylor Turner <[email protected]>

* added psi calculation to categorical columns (#1027)

* added psi calculation to categorical columns

* Changed test value to non-calculated assignment

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Navid Nafiuzzaman <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Turner <[email protected]>
Co-authored-by: Junho Lee <[email protected]>
Co-authored-by: Michael Davis <[email protected]>
6 people authored Sep 21, 2023
1 parent b0b8510 commit acae2da
Showing 10 changed files with 67 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -3,7 +3,7 @@ name: Bug report
 about: Create a report to help us improve
 title: ''
 labels: Bug
-assignees: JGSweets, ksneab7, micdavis, taylorfturner
+assignees: ksneab7, micdavis, taylorfturner, tyfarnan

 ---

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/documentation_issue.md
@@ -3,7 +3,7 @@ name: Documentation Issue
 about: Is there an issue with the documentation?
 title: ''
 labels: Documentation
-assignees: JGSweets, ksneab7, micdavis, taylorfturner
+assignees: ksneab7, micdavis, taylorfturner, tyfarnan

 ---

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.md
@@ -3,7 +3,7 @@ name: Feature request
 about: Suggest an idea for this project
 title: ''
 labels: New Feature
-assignees: JGSweets, ksneab7, micdavis, taylorfturner
+assignees: ksneab7, micdavis, taylorfturner, tyfarnan

 ---

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/open_issue.md
@@ -3,6 +3,6 @@ name: Open Issue
 about: Open an issue other than a bug, feature, or documentation issue
 title: ''
 labels: ''
-assignees: JGSweets, ksneab7, micdavis, taylorfturner
+assignees: ksneab7, micdavis, taylorfturner, tyfarnan

 ---
2 changes: 1 addition & 1 deletion .github/workflows/publish-python-package.yml
@@ -16,7 +16,7 @@ jobs:
     runs-on: ubuntu-latest

     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python
       uses: actions/setup-python@v4
       with:
2 changes: 1 addition & 1 deletion .github/workflows/test-python-package.yml
@@ -19,7 +19,7 @@ jobs:
         python-version: [3.8, 3.9, "3.10"]

     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
       uses: actions/setup-python@v4
       with:
8 changes: 4 additions & 4 deletions dataprofiler/labelers/data_processing.py
@@ -1555,8 +1555,8 @@ def __init__(
             random_state = random.Random(random_state)
         elif isinstance(random_state, (list, tuple)) and len(random_state) == 3:
             # tuple required for random state to be set, lists do not work
-            if isinstance(random_state[1], list):
-                random_state[1] = tuple(random_state[1])  # type: ignore
+            if isinstance(random_state, list) and isinstance(random_state[1], list):
+                random_state[1] = tuple(random_state[1])
             if isinstance(random_state, list):
                 random_state = tuple(random_state)
             temp_random_state = random.Random()
@@ -1894,8 +1894,8 @@ def __init__(
             random_state = random.Random(random_state)
         elif isinstance(random_state, (list, tuple)) and len(random_state) == 3:
             # tuple required for random state to be set, lists do not work
-            if isinstance(random_state[1], list):
-                random_state[1] = tuple(random_state[1])  # type: ignore
+            if isinstance(random_state, list) and isinstance(random_state[1], list):
+                random_state[1] = tuple(random_state[1])
             if isinstance(random_state, list):
                 random_state = tuple(random_state)
             temp_random_state = random.Random()
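
The two hunks above guard the indexed assignment so that it only runs when `random_state` is a list. As a rough illustration of why the conversion to tuples matters, the sketch below rebuilds a `random.Random` from a state that has been through a JSON round trip. It is a variant, not the committed code: it copies the state into a list up front instead of using `isinstance` guards, and the helper name `restore_random_state` is hypothetical.

```python
import json
import random


def restore_random_state(state):
    """Rebuild a random.Random from a 3-element state (hypothetical helper).

    The state may be a list or a tuple, and its second element may be a
    list, for example after being serialized to JSON and loaded again.
    """
    if not (isinstance(state, (list, tuple)) and len(state) == 3):
        raise ValueError("expected a 3-element list or tuple")
    # Copy into a list so indexed assignment is legal even for tuple input.
    parts = list(state)
    if isinstance(parts[1], list):
        # The internal state vector has to be a tuple for setstate().
        parts[1] = tuple(parts[1])
    rng = random.Random()
    rng.setstate(tuple(parts))
    return rng


original = random.Random(0)
# A JSON round trip turns every tuple in the state into a list.
round_tripped = json.loads(json.dumps(original.getstate()))
restored = restore_random_state(round_tripped)
```

Copying first avoids mutating the caller's object, at the cost of one extra list allocation; the committed change keeps the in-place assignment and narrows the type instead.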
16 changes: 16 additions & 0 deletions dataprofiler/profilers/categorical_column_profile.py
@@ -1,6 +1,8 @@
 """Contains class for categorical column profiler."""
 from __future__ import annotations

+import math
+import warnings
 from collections import defaultdict
 from operator import itemgetter
 from typing import cast
@@ -304,6 +306,20 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
                     other_profile._categories.items(), key=itemgetter(1), reverse=True
                 )
             )
+            if cat_count1.keys() == cat_count2.keys():
+                total_psi = 0.0
+                for key in cat_count1.keys():
+                    perc_A = cat_count1[key] / self.sample_size
+                    perc_B = cat_count2[key] / other_profile.sample_size
+                    total_psi += (perc_B - perc_A) * math.log(perc_B / perc_A)
+                differences["statistics"]["psi"] = total_psi
+            else:
+                warnings.warn(
+                    "psi was not calculated due to the differences in categories "
+                    "of the profiles. Differences:\n"
+                    f"{set(cat_count1.keys()) ^ set(cat_count2.keys())}",
+                    RuntimeWarning,
+                )

             differences["statistics"][
                 "categorical_count"
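
The PSI loop added above can be checked by hand against the value used in the accompanying test. The following is a minimal standalone sketch, assuming each profile's `sample_size` equals the sum of its category counts (which holds for the test data, where there are no nulls). The function name `population_stability_index` is hypothetical and not part of the library.

```python
import math


def population_stability_index(counts_a, counts_b):
    """Return the PSI between two category-count dicts (hypothetical helper).

    Returns None when the two dicts do not share the same categories,
    mirroring the branch in the diff that skips the calculation.
    """
    if counts_a.keys() != counts_b.keys():
        return None
    # Assumption: sample size is the sum of the category counts.
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    psi = 0.0
    for key in counts_a:
        perc_a = counts_a[key] / total_a
        perc_b = counts_b[key] / total_b
        # Each term is non-negative: both factors share the same sign.
        psi += (perc_b - perc_a) * math.log(perc_b / perc_a)
    return psi


# Category counts from the two series in the new test case.
sample_a = {"y": 4, "n": 3, "maybe": 1}
sample_b = {"y": 3, "n": 2, "maybe": 2}
psi = population_stability_index(sample_a, sample_b)
```

With these counts the result agrees with the `"psi": 0.16814961527477595` expected by the test. A category with a zero count would make the ratio or logarithm undefined; this sketch does not handle that case.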
41 changes: 39 additions & 2 deletions dataprofiler/tests/profilers/test_categorical_column_profile.py
@@ -728,8 +728,13 @@ def test_categorical_diff(self):
                 },
             },
         }
-
-        self.assertDictEqual(expected_diff, profile.diff(profile2))
+        with self.assertWarnsRegex(
+            RuntimeWarning,
+            "psi was not calculated due to the differences in categories "
+            "of the profiles. Differences:\n{'maybe'}",
+        ):
+            test_profile_diff = profile.diff(profile2)
+        self.assertDictEqual(expected_diff, test_profile_diff)

         # Test with one categorical column matching
         df_not_categorical = pd.Series(
@@ -756,6 +761,38 @@ def test_categorical_diff(self):
         }
         self.assertDictEqual(expected_diff, profile.diff(profile2))
+
+        # Test diff with psi enabled
+        df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n", "maybe"])
+        profile = CategoricalColumn(df_categorical.name)
+        profile.update(df_categorical)
+
+        df_categorical = pd.Series(["y", "maybe", "y", "y", "n", "n", "maybe"])
+        profile2 = CategoricalColumn(df_categorical.name)
+        profile2.update(df_categorical)
+
+        # chi2-statistic = sum((observed-expected)^2/expected for each category in each column)
+        # df = categories - 1
+        # psi = (% of records based on Sample (A) - % of records Sample (B)) * ln(A/ B)
+        # p-value found through using chi2 CDF
+        expected_diff = {
+            "categorical": "unchanged",
+            "statistics": {
+                "unique_count": "unchanged",
+                "unique_ratio": -0.05357142857142855,
+                "chi2-test": {
+                    "chi2-statistic": 0.6122448979591839,
+                    "df": 2,
+                    "p-value": 0.7362964551863367,
+                },
+                "categories": "unchanged",
+                "gini_impurity": -0.059311224489795866,
+                "unalikeability": -0.08333333333333326,
+                "psi": 0.16814961527477595,
+                "categorical_count": {"y": 1, "n": 1, "maybe": -1},
+            },
+        }
+        self.assertDictEqual(expected_diff, profile.diff(profile2))

     def test_unalikeability(self):
         df_categorical = pd.Series(["a", "a"])
         profile = CategoricalColumn(df_categorical.name)
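
The first test hunk asserts that a `RuntimeWarning` naming the mismatched category is raised when the two profiles do not share the same categories. Outside of `unittest`, the same behaviour can be observed with the standard `warnings` module. The sketch below is illustrative only: `warn_if_categories_differ` is a hypothetical stand-in for the branch inside `diff`, and it uses the symmetric difference directly rather than comparing the key views first.

```python
import warnings


def warn_if_categories_differ(counts_a, counts_b):
    """Warn and return False when the category sets differ (hypothetical helper)."""
    # Symmetric difference: categories present in exactly one of the two dicts.
    mismatch = set(counts_a) ^ set(counts_b)
    if mismatch:
        warnings.warn(
            "psi was not calculated due to the differences in categories "
            f"of the profiles. Differences:\n{mismatch}",
            RuntimeWarning,
        )
        return False
    return True


with warnings.catch_warnings(record=True) as caught:
    # Make sure the warning is recorded even if it was already shown once.
    warnings.simplefilter("always")
    ok = warn_if_categories_differ(
        {"y": 4, "n": 3, "maybe": 1},
        {"y": 3, "n": 2},
    )
```

Here `caught` holds a single `RuntimeWarning` whose message ends with the set containing `'maybe'`, which is the text the test's regular expression looks for.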
4 changes: 2 additions & 2 deletions examples/merge_profile_list.ipynb
@@ -37,10 +37,10 @@
     "try:\n",
     "    sys.path.insert(0, '..')\n",
     "    import dataprofiler as dp\n",
-    "    from dataprofiler.profilers.utils import merge_profile_list\n",
+    "    from dataprofiler.profilers.profiler_utils import merge_profile_list\n",
     "except ImportError:\n",
     "    import dataprofiler as dp\n",
-    "    from dataprofiler.profilers.utils import merge_profile_list\n",
+    "    from dataprofiler.profilers.profiler_utils import merge_profile_list\n",
     "\n",
     "# remove extra tf loggin\n",
     "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)"
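
The notebook fix above switches the import from `dataprofiler.profilers.utils` to `dataprofiler.profilers.profiler_utils`. Code that has to run against versions on either side of that rename could try the module paths in order. The sketch below shows one way to do that with `importlib`; the helper `import_first_available` is hypothetical, and it is exercised with stand-in module names so that it runs without `dataprofiler` installed.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in the list that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError; try the next name.
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Stand-in names: the first does not exist, so the helper falls back to json.
module = import_first_available(["module_that_does_not_exist_xyz", "json"])
```

For the notebook's case the candidate list would presumably be `["dataprofiler.profilers.profiler_utils", "dataprofiler.profilers.utils"]`, followed by `getattr(module, "merge_profile_list")`.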
