
Pklm implementation #157

Closed
wants to merge 15 commits into from
Conversation

adriencrtr
Collaborator

No description provided.

@adriencrtr adriencrtr added the enhancement New feature or request label Aug 9, 2024
@adriencrtr adriencrtr self-assigned this Aug 9, 2024
nb_permutation: int = 30,
nb_trees_per_proj: int = 200,
exact_p_value: bool = False,
encoder: Union[None, OneHotEncoder] = None, # We could define more encoders.
Collaborator Author

We could add more encoders; we would need to decide which ones. For now, only the OneHotEncoder is available.
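
A sketch of how additional scikit-learn encoders could be accepted alongside OneHotEncoder; OrdinalEncoder and the `_validate_encoder` helper are illustrative assumptions, not part of this PR.

```python
# Sketch only: accept a small set of scikit-learn encoders besides OneHotEncoder.
# OrdinalEncoder and this helper are assumptions, not something the PR implements.
from typing import Optional, Union

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

SUPPORTED_ENCODERS = (OneHotEncoder, OrdinalEncoder)


def _validate_encoder(
    encoder: Optional[Union[OneHotEncoder, OrdinalEncoder]],
) -> Union[OneHotEncoder, OrdinalEncoder]:
    """Return a usable encoder, defaulting to one-hot when none is given."""
    if encoder is None:
        # handle_unknown="ignore" keeps transform from failing on unseen categories.
        return OneHotEncoder(handle_unknown="ignore")
    if not isinstance(encoder, SUPPORTED_ENCODERS):
        raise TypeError(f"Unsupported encoder type: {type(encoder).__name__}")
    return encoder
```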

self.exact_p_value = exact_p_value
self.encoder = encoder

if self.exact_p_value:
Collaborator Author

The implementation of the "exact" computation of this p_value does not seem to be correct (see the notebook).

I do not know where it goes wrong; maybe here.
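
For reference, a minimal sketch of the two p-value conventions that could be at play here; `observed_stat`, `perm_stats` and the +1 correction are assumptions about the intended computation, not a quote from the PR. Mixing the two conventions would bias the result, which could be one explanation for the discrepancy seen in the notebook.

```python
# Minimal sketch, under assumptions: `observed_stat` is the statistic on the real
# missingness pattern, `perm_stats` the statistics on permuted patterns.
import numpy as np


def monte_carlo_p_value(observed_stat: float, perm_stats: np.ndarray) -> float:
    """Permutation p-value with the usual +1 correction (never exactly zero)."""
    return (1.0 + np.sum(perm_stats >= observed_stat)) / (1.0 + perm_stats.size)


def plug_in_p_value(observed_stat: float, perm_stats: np.ndarray) -> float:
    """Uncorrected proportion; only appropriate when all permutations are enumerated."""
    return float(np.mean(perm_stats >= observed_stat))
```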

TypeNotHandled
If any column has a data type that is not numeric, string, or boolean.
"""
allowed_types = [
Collaborator Author

Currently we keep the numeric types and encode these types:

  • str
  • bool

We could also add categories, and the question of dates is also worth considering.
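
A sketch of the column filter described above, with pandas "category" added as the possible extension mentioned; the `_split_columns` name and the local `TypeNotHandled` definition are assumptions made for self-containment.

```python
# Sketch only: keep numeric columns, collect str/bool (and, as a possible
# extension, category) columns for encoding, and reject anything else.
import pandas as pd
from pandas.api import types as pdt


class TypeNotHandled(TypeError):
    """Raised when a column dtype is neither numeric, string, nor boolean."""
    # Assumption: in the PR this exception presumably lives elsewhere in qolmat.


def _split_columns(df: pd.DataFrame) -> tuple[list, list]:
    cols_num, cols_to_encode = [], []
    for col in df.columns:
        dtype = df[col].dtype
        if pdt.is_numeric_dtype(dtype) and not pdt.is_bool_dtype(dtype):
            cols_num.append(col)
        elif (
            pdt.is_string_dtype(dtype)
            or pdt.is_bool_dtype(dtype)
            or isinstance(dtype, pd.CategoricalDtype)
        ):
            cols_to_encode.append(col)
        else:
            raise TypeNotHandled(f"Column '{col}' has unsupported dtype {dtype}")
    return cols_num, cols_to_encode
```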

@adriencrtr
Collaborator Author

adriencrtr commented Aug 9, 2024

What remains to be done:

  • Fix the reproducibility issue (random_state).
  • Look into the question of the "exact" p-value computation.
  • If needed, develop the encoding part: add other encoders and/or support more types.
  • Once these points are validated, include this feature in the documentation + tutorial (reuse the notebook).
  • Rerun the computation-time notebook on a Mac M1 (see whether there is a notable difference).
  • How do we explain the behaviour when n_perm -> inf?
  • State that we restrict ourselves to size.resp.set = 2.

# * ``nb_projections``: Number of projections on which the test statistic is calculated. This
# parameter has the greatest influence on test calculation time. Its default value is
# ``nb_projections=100``.
# Do we give useful orders of magnitude? I had done a bit of that work.
Collaborator Author

I had done a small probability exercise to estimate the right number of projections for a given dataset size. Would that be useful?
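
One possible back-of-the-envelope estimate (an assumption, not the author's unpublished calculation): if each projection draws k of the p features uniformly at random, a union bound on the chance that some feature is never selected gives a rough lower bound on the number of projections.

```python
# Sketch: require that the probability of any feature never appearing in the B
# projections stays below `alpha`, i.e. p * (1 - k/p)**B <= alpha, and solve for B.
import math


def min_projections(n_features: int, k_per_proj: int, alpha: float = 0.05) -> int:
    p_miss = 1.0 - k_per_proj / n_features  # chance one projection skips a given feature
    return math.ceil(math.log(alpha / n_features) / math.log(p_miss))


# Example: 15 features, 3 per projection -> about 26 projections.
print(min_projections(15, 3))
```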

def _encode_dataframe(self, df: pd.DataFrame) -> np.ndarray:
"""
Encodes the DataFrame by converting numeric columns to a numpy array
and applying one-hot encoding to non-numeric columns.
Collaborator Author

Simply remove "one-hot"! It will not necessarily always be one-hot.
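
An encoder-agnostic version of the docstring and body could look like the sketch below; the signature and the `self.encoder` attribute mirror the PR, everything else is an assumption.

```python
# Sketch of an encoder-agnostic _encode_dataframe (method body shown as a plain
# function for readability): numeric columns pass through, the rest go through
# whichever encoder was configured.
import numpy as np
import pandas as pd


def _encode_dataframe(self, df: pd.DataFrame) -> np.ndarray:
    """Encode the DataFrame: numeric columns are converted to a numpy array and
    non-numeric columns are transformed with the configured encoder."""
    df_num = df.select_dtypes(include="number")
    df_cat = df.drop(columns=df_num.columns)
    parts = [df_num.to_numpy()]
    if not df_cat.empty:
        encoded = self.encoder.fit_transform(df_cat)
        # Some encoders (e.g. OneHotEncoder) return sparse matrices by default.
        if hasattr(encoded, "toarray"):
            encoded = encoded.toarray()
        parts.append(np.asarray(encoded))
    return np.hstack(parts)
```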

def _check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
"""
Checks if the drawn features and target are valid.
# TODO : Need to develop ?
Collaborator Author

Developing this part would require going into the details of the paper. I am not sure it is worth it.
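
For the record, one plausible reading of the validity condition (an assumption about what the paper requires, not a quote from it): among the rows where all drawn features are observed, the target column must contain both missing and observed values, otherwise the per-projection classifier would see a single class.

```python
# Hedged sketch of that condition; df is assumed to be a float array where
# missing entries are NaN, matching the signature quoted above.
import numpy as np


def _check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
    rows_complete = ~np.isnan(df[:, features_idx]).any(axis=1)
    target_is_nan = np.isnan(df[rows_complete, target_idx])
    return target_is_nan.any() & (~target_is_nan).any()
```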

@adriencrtr
Collaborator Author

adriencrtr commented Aug 29, 2024

  • Remove the condition on the patterns.
  • Move the input type-checking functions to utils/input_check.py.
  • Consider how the NaNs should be encoded (a Categorical_encoder with NaN handling by multiplying the NaNs), using the handle_missing parameter, and use the "cols_cat" argument so the df does not have to be split.
  • Be explicit in the types (df vs X).
  • Change the draw function to something either exhaustive or without replacement (see the sketch after this list).
  • Change "feature_index" to a list type.
  • In the docstrings, give the correspondence between the variable names and the names used in the article.
  • Explain in detail why we iterate over the projections and not over the permutations.
  • Look into the question of the explicit expectation computation.
  • Reverse the order in the code.
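
For the "exhaustive or without replacement" item above, a minimal sketch of one way to do it; the names (`n_features_per_proj`, `rng`) and the tuple representation are assumptions.

```python
# Sketch only: enumerate all (feature subset, target) pairs, shuffle them, and
# keep the first n_projections, so no projection is ever drawn twice. Only
# practical when the number of columns is small enough for full enumeration.
from itertools import combinations

import numpy as np


def draw_projections_without_replacement(
    n_cols: int,
    n_features_per_proj: int,
    n_projections: int,
    rng: np.random.Generator,
) -> list:
    all_draws = [
        (features, target)
        for features in combinations(range(n_cols), n_features_per_proj)
        for target in range(n_cols)
        if target not in features
    ]
    rng.shuffle(all_draws)
    return all_draws[:n_projections]


# Example usage with a fixed seed for reproducibility.
draws = draw_projections_without_replacement(6, 2, 10, np.random.default_rng(42))
```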

# This notebook shows how the Little's test performs on a simplistic case and its limitations. We
# instanciate a test object with a random state for reproducibility.
# This notebook shows how the Little and PKLM tests perform on a simplistic case and their
# limitations. We instanciate a test object with a random state for reproducibility.
Collaborator

instantiate

+------------+------------+----------------------+
| 100000 | 15 | 3'06" |
+------------+------------+----------------------+
"""
Collaborator

Why, for the same number of rows, does the computation time decrease as the number of columns increases (the difference is more noticeable the more rows there are)?

Collaborator Author

I think that's because the .fit() of the classifier takes less time when there are more features.
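
A quick way to sanity-check that claim (a measurement sketch, not part of the PR); the row and column counts and the RandomForestClassifier settings are arbitrary.

```python
# Time RandomForestClassifier.fit on the same number of rows while varying the
# number of columns, to see how fit time actually scales with the feature count.
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 20_000
for n_cols in (10, 15, 30):
    X = rng.normal(size=(n_rows, n_cols))
    y = rng.integers(0, 2, size=n_rows)
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y)
    print(f"{n_cols} columns: {time.perf_counter() - start:.2f} s")
```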

@adriencrtr adriencrtr closed this Sep 26, 2024