PKLM implementation #157

Closed
wants to merge 15 commits
8 changes: 6 additions & 2 deletions docs/analysis.rst
@@ -11,7 +11,7 @@ The analysis module provides tools to characterize the type of holes.

The MNAR case is the trickiest: the user must first consider whether their missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e., it is not MNAR). If an MNAR mechanism is suspected, please see this article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.

Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.
Then Qolmat proposes two tests to determine whether the missing data mechanism is MCAR or MAR.

2. How to use the results
-------------------------
@@ -50,7 +50,11 @@ The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, and it h
b. PKLM Test
^^^^^^^^^^^^

The :ref:`PKLM [2]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not implemented yet in Qolmat.
The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. The :class:`PKLMTest` is now implemented in Qolmat.
To carry out this test, we perform random projections in the variable space of the data. These random projections allow us to construct a fully observed sub-matrix and an associated set of missing patterns.
The idea is then to compare the distributions of the missing patterns through the Kullback-Leibler divergence.
To do this, the distributions for each pattern are estimated using random forests.
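
A minimal usage sketch (the data here are illustrative; the :class:`PKLMTest` API is the one used in the tutorial):

.. code-block:: python

    import numpy as np
    import pandas as pd

    from qolmat.analysis.holes_characterization import PKLMTest

    rng = np.random.RandomState(42)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
    # Punch MCAR holes in one column
    df.loc[rng.rand(200) < 0.2, "a"] = np.nan

    pklm_test = PKLMTest(random_state=rng)
    p_value = pklm_test.test(df)  # a large p-value means we cannot reject MCAR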


References
----------
154 changes: 154 additions & 0 deletions examples/pklm/p_value_validity/p_value_validity.ipynb

Large diffs are not rendered by default.

260 changes: 239 additions & 21 deletions examples/tutorials/plot_tuto_mcar.py
@@ -3,7 +3,7 @@
Tutorial for Testing the MCAR Case
============================================

In this tutorial, we show how to test the MCAR case using the Little's test.
In this tutorial, we show how to test the MCAR case using the Little and PKLM tests.
"""

# %%
@@ -14,7 +14,7 @@
import pandas as pd
from scipy.stats import norm

from qolmat.analysis.holes_characterization import LittleTest
from qolmat.analysis.holes_characterization import LittleTest, PKLMTest
from qolmat.benchmark.missing_patterns import UniformHoleGenerator

plt.rcParams.update({"font.size": 12})
@@ -31,22 +31,32 @@
q975 = norm.ppf(0.975)

# %%
# 1. Testing the MCAR case with the Little's test and the PKLM test.
# ------------------------------------------------------------------
#
# The Little's test
# ---------------------------------------------------------------
# =================
#
# First, we need to introduce the concept of a missing pattern. A missing pattern, also called a
# pattern, is the structure of observed and missing values in a dataset. For example, in a
# dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1
# (0) indicates that the column value is missing (observed).
#
# The null hypothesis, H0, is: "The means of observations within each pattern are similar."
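
# %%
# As a quick illustration (plain pandas, no Qolmat involved; the small dataframe below is made up
# for this sketch), the missing patterns of a dataset can be enumerated directly:

df_patterns_demo = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, np.nan], "B": [np.nan, np.nan, 6.0, 7.0]}
)
# 1 means "missing", 0 means "observed"; each distinct row is one pattern
print(df_patterns_demo.isna().astype(int).value_counts())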

# %%
# The PKLM test
# =============
# The test compares distributions of different missing patterns.
#
# The null hypothesis, H0, is: "Distributions within each pattern are similar."
# We choose to use the classic threshold of 5%. If the test p-value is below this threshold,
# we reject the null hypothesis.
#
# This notebook shows how the Little's test performs on a simplistic case and its limitations. We
# instanciate a test object with a random state for reproducibility.
# This notebook shows how the Little and PKLM tests perform on a simplistic case and their
# limitations. We instantiate a test object with a random state for reproducibility.
Collaborator comment: instantiate


test_mcar = LittleTest(random_state=rng)
little_test_mcar = LittleTest(random_state=rng)
pklm_test_mcar = PKLMTest(random_state=rng)

# %%
# Case 1: MCAR holes (True negative)
@@ -77,11 +87,13 @@
plt.show()

# %%
result = test_mcar.test(df_nan)
print(f"Test p-value: {result:.2%}")
little_result = little_test_mcar.test(df_nan)
pklm_result = pklm_test_mcar.test(df_nan)
print(f"The p-value of the Little's test is: {little_result:.2%}")
print(f"The p-value of the PKLM test is: {pklm_result:.2%}")
# %%
# The p-value is larger than 0.05, therefore we don't reject the HO MCAR assumption. In this case
# this is a true negative.
# The two p-values are larger than 0.05, therefore we don't reject the H0 MCAR assumption.
# In this case this is a true negative.

# %%
# Case 2: MAR holes with mean bias (True positive)
@@ -110,11 +122,13 @@

# %%

result = test_mcar.test(df_nan)
print(f"Test p-value: {result:.2%}")
little_result = little_test_mcar.test(df_nan)
pklm_result = pklm_test_mcar.test(df_nan)
print(f"The p-value of the Little's test is: {little_result:.2%}")
print(f"The p-value of the PKLM test is: {pklm_result:.2%}")
# %%
# The p-value is smaller than 0.05, therefore we reject the HO MCAR assumption. In this case
# this is a true positive.
# The two p-values are smaller than 0.05, therefore we reject the H0 MCAR assumption.
# In this case this is a true positive.

# %%
# Case 3: MAR holes without any mean bias (False negative)
@@ -149,17 +163,221 @@

# %%

result = test_mcar.test(df_nan)
print(f"Test p-value: {result:.2%}")
little_result = little_test_mcar.test(df_nan)
pklm_result = pklm_test_mcar.test(df_nan)
print(f"The p-value of the Little's test is: {little_result:.2%}")
print(f"The p-value of the PKLM test is: {pklm_result:.2%}")
# %%
# The p-value is larger than 0.05, therefore we don't reject the HO MCAR assumption. In this case
# this is a false negative since the missingness mechanism is MAR.
# The Little's test p-value is larger than 0.05; therefore, using this test, we don't reject the
# H0 MCAR assumption. In this case this is a false negative, since the missingness mechanism is
# MAR.
#
# However, the PKLM test p-value is smaller than 0.05, therefore we reject the H0 MCAR
# assumption. In this case this is a true positive.

# %%
# Limitations
# -----------
# Limitations and conclusion
# ==========================
# In this tutorial, we can see that Little's test fails to detect covariance heterogeneity between
# patterns.
#
# We also note that Little's test does not handle categorical data or temporally
# correlated data.
#
# This is why we have implemented the PKLM test, which makes up for the shortcomings of Little's
# test. We present this test in more detail in the next section.

# %%
# 2. The PKLM test
# ------------------------------------------------------------------
#
# The PKLM test is powerful for several reasons. Firstly, it addresses the shortcoming of Little's
# test described above (covariance heterogeneity between patterns). Secondly, it is currently the
# only MCAR test applicable to mixed data. Finally, it proposes a concept of partial p-values,
# which enables a variable-by-variable diagnosis to identify the potential causes of a MAR
# mechanism.
#
# There is a parameter in the paper called size.res.set. The authors of the paper recommend
# setting this parameter to 2. We have chosen to follow this advice and not expose this parameter;
# the results are satisfactory and the code is simpler.
#
# The test does have one disadvantage, however: its calculation time.
#

# %%

"""
Calculation time
================

+------------+------------+----------------------+
| **n_rows** | **n_cols** | **Calculation time** |
+============+============+======================+
| 200 | 2 | 2"12 |
+------------+------------+----------------------+
| 500 | 2 | 2"24 |
+------------+------------+----------------------+
| 500 | 4 | 2"18 |
+------------+------------+----------------------+
| 1000 | 4 | 2"48 |
+------------+------------+----------------------+
| 1000 | 6 | 2"42 |
+------------+------------+----------------------+
| 10000 | 6 | 20"54 |
+------------+------------+----------------------+
| 10000 | 10 | 14"48 |
+------------+------------+----------------------+
| 100000 | 10 | 4'51" |
+------------+------------+----------------------+
| 100000 | 15 | 3'06" |
+------------+------------+----------------------+
"""
Collaborator comment: Why, for the same number of rows, does the calculation time decrease when
the number of columns increases (the difference being more noticeable the more rows we have)?

Collaborator (author) reply: I think that's because the .fit() of the classifier takes less time
when there are more features.

# %%
# 2.1 Parameters and Hyperparameters
# ================================================
#
# To use the PKLM test properly, it may be necessary to understand its hyperparameters. A
# configuration sketch follows the list below.
#
# * ``nb_projections``: Number of projections on which the test statistic is calculated. This
#   parameter has the greatest influence on test calculation time. Its default value is
#   ``nb_projections=100``.
# Should we give useful orders of magnitude here? I had already done some of this work.
Collaborator (author) comment: I had done a small probability calculation to estimate the right
number of projections for a given dataset size. Would that be useful?

#
# * ``nb_permutation``: Number of permutations of the projected targets. The higher, the better.
#   This parameter has little impact on calculation time.
#   Its default value is ``nb_permutation=30``.
#
# * ``nb_trees_per_proj``: The number of trees in each fitted random forest. In order to
#   estimate the Kullback-Leibler divergence, we need to obtain probabilities of belonging to
#   certain missing patterns. Random forests are used to estimate these probabilities. This
#   hyperparameter has a significant impact on test calculation time. Its default
#   value is ``nb_trees_per_proj=200``.
#
# * ``compute_partial_p_values``: Boolean that indicates whether to compute the partial
#   p-values. These partial p-values can help the user identify the variables responsible for
#   the MAR missing-data mechanism. Please see section 2.3 for examples. Its default value is
#   ``compute_partial_p_values=False``.
#
# * ``encoder``: Scikit-learn encoder used to encode non-numerical values.
#   Its default value is ``encoder=sklearn.preprocessing.OneHotEncoder()``.
#
# * ``random_state``: Controls the randomness. Pass an int for reproducible output across
#   multiple function calls. Its default value is ``random_state=None``.
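
# %%
# For instance, a faster but noisier configuration could look like the following sketch, built
# only from the parameters listed above (not a recommended setting):

fast_pklm_test = PKLMTest(
    nb_projections=50,      # fewer projections: faster, but a less stable statistic
    nb_permutation=30,      # the default value
    nb_trees_per_proj=100,  # smaller forests: faster probability estimates
    random_state=rng,
)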

# %%
# 2.2 Application on mixed data types
# ================================================
#
# As we have seen, Little's test only applies to quantitative data. In real life, however, it is
# common to have to deal with mixed data. Here's an example of how to use the PKLM test on a dataset
# with mixed data types.

# %%
n_rows = 100

col1 = rng.rand(n_rows) * 100
col2 = rng.randint(1, 100, n_rows)
col3 = rng.choice([True, False], n_rows)
modalities = ['A', 'B', 'C', 'D']
col4 = rng.choice(modalities, n_rows)

df = pd.DataFrame({
    'Numeric1': col1,
    'Numeric2': col2,
    'Boolean': col3,
    'Object': col4
})

hole_gen = UniformHoleGenerator(
    n_splits=1,
    ratio_masked=0.2,
    subset=['Numeric1', 'Numeric2', 'Boolean', 'Object'],
    random_state=rng
)
df_mask = hole_gen.generate_mask(df)
df_nan = df.where(~df_mask, np.nan)
df_nan.dtypes

# %%
pklm_result = pklm_test_mcar.test(df_nan)
print(f"The p-value of the PKLM test is: {pklm_result:.2%}")

# %%
# To perform the PKLM test over mixed data types, non-numerical features need to be encoded. The
# default encoder in the :class:`~qolmat.analysis.holes_characterization.PKLMTest` class is the
# default OneHotEncoder from scikit-learn. If you wish to use an encoder adapted to your data, you
# can perform this encoding step beforehand and then run the test, or pass the encoder directly
# (see the sketch after this list). Currently, we do not support the following types:
#
# - datetimes
#
# - timedeltas
#
# - Pandas datetimetz
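
# %%
# A sketch of passing a custom scikit-learn encoder through the ``encoder`` parameter described
# in section 2.1 (the ``handle_unknown`` choice here is illustrative):

from sklearn.preprocessing import OneHotEncoder

custom_pklm_test = PKLMTest(
    encoder=OneHotEncoder(handle_unknown="ignore"),
    random_state=rng,
)
pklm_result = custom_pklm_test.test(df_nan)
print(f"The p-value of the PKLM test is: {pklm_result:.2%}")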

# %%
# 2.3 Partial p-values
# ================================================
#
# In addition, the PKLM test can be used to calculate partial p-values. We compute as many partial
# p-values as there are columns in the input dataframe. The k-th “partial” p-value corresponds to
# the effect of removing the patterns induced by variable k.
#
# Let's take a look at an example of how to use this feature.

# %%
data = rng.multivariate_normal(
    mean=[0, 0, 0, 0],
    cov=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    size=400
)
df = pd.DataFrame(data=data, columns=["Column 1", "Column 2", "Column 3", "Column 4"])

df_mask = pd.DataFrame(
    {
        "Column 1": False,
        "Column 2": df["Column 1"] > q975,
        "Column 3": False,
        "Column 4": False,
    },
    index=df.index
)
df_nan = df.where(~df_mask, np.nan)

# %%
# The missing-data mechanism is clearly MAR. Intuitively, if we remove the second column from the
# matrix, the missing-data mechanism is MCAR. Let's see how the PKLM test can help us identify the
# variable responsible for the MAR mechanism.

# %%
pklm_test = PKLMTest(random_state=rng, compute_partial_p_values=True)
p_value, partial_p_values = pklm_test.test(df_nan)
print(f"The p-value of the PKLM test is: {p_value:.2%}")

# %%
# The test result confirms that we can reject the null hypothesis and therefore assume that the
# missing-data mechanism is MAR.
# Let's now take a look at what partial p-values can tell us.

# %%
for col_index, partial_p_v in enumerate(partial_p_values):
    print(f"The partial p-value for the column index {col_index + 1} is: {partial_p_v:.2%}")

# %%
# As a result, by removing the missing patterns induced by variable 2, the p-value rises
# above the significance threshold set beforehand. Thus, in this sense, the test detects that the
# main culprit of the MAR mechanism lies in the second variable.

