
Pklm implementation #157

Closed
wants to merge 15 commits into from
Conversation

adriencrtr
Collaborator

No description provided.

@adriencrtr adriencrtr added the enhancement New feature or request label Aug 9, 2024
@adriencrtr adriencrtr self-assigned this Aug 9, 2024
nb_permutation: int = 30,
nb_trees_per_proj: int = 200,
exact_p_value: bool = False,
encoder: Union[None, OneHotEncoder] = None, # We could define more encoders.
Collaborator Author

We could add more encoders; we would need to decide which ones. For now, only the OneHotEncoder is available.
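
A sketch of how additional scikit-learn encoders could be accepted alongside OneHotEncoder; OrdinalEncoder and the `_validate_encoder` helper are illustrative assumptions, not part of this PR.

```python
# Sketch only: accept a small set of scikit-learn encoders besides OneHotEncoder.
# OrdinalEncoder and this helper are assumptions, not something the PR implements.
from typing import Optional, Union

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

SUPPORTED_ENCODERS = (OneHotEncoder, OrdinalEncoder)


def _validate_encoder(
    encoder: Optional[Union[OneHotEncoder, OrdinalEncoder]],
) -> Union[OneHotEncoder, OrdinalEncoder]:
    """Return a usable encoder, defaulting to one-hot when none is given."""
    if encoder is None:
        # handle_unknown="ignore" keeps transform from failing on unseen categories.
        return OneHotEncoder(handle_unknown="ignore")
    if not isinstance(encoder, SUPPORTED_ENCODERS):
        raise TypeError(f"Unsupported encoder type: {type(encoder).__name__}")
    return encoder
```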

self.exact_p_value = exact_p_value
self.encoder = encoder

if self.exact_p_value:
Collaborator Author

The implementation of the "exact" computation of this p_value does not seem to be correct (see the notebook).

I do not know where it goes wrong; maybe here.
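
For reference, a minimal sketch of the two p-value conventions that could be at play here; `observed_stat`, `perm_stats` and the +1 correction are assumptions about the intended computation, not a quote from the PR. Mixing the two conventions would bias the result, which could be one explanation for the discrepancy seen in the notebook.

```python
# Minimal sketch, under assumptions: `observed_stat` is the statistic on the real
# missingness pattern, `perm_stats` the statistics on permuted patterns.
import numpy as np


def monte_carlo_p_value(observed_stat: float, perm_stats: np.ndarray) -> float:
    """Permutation p-value with the usual +1 correction (never exactly zero)."""
    return (1.0 + np.sum(perm_stats >= observed_stat)) / (1.0 + perm_stats.size)


def plug_in_p_value(observed_stat: float, perm_stats: np.ndarray) -> float:
    """Uncorrected proportion; only appropriate when all permutations are enumerated."""
    return float(np.mean(perm_stats >= observed_stat))
```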

TypeNotHandled
If any column has a data type that is not numeric, string, or boolean.
"""
allowed_types = [
Collaborator Author

Currently we keep the numeric types and encode these types:

  • str
  • bool

We could also add categories, and the question of dates is also worth considering.
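
A sketch of the column filter described above, with pandas "category" added as the possible extension mentioned; the `_split_columns` name and the local `TypeNotHandled` definition are assumptions made for self-containment.

```python
# Sketch only: keep numeric columns, collect str/bool (and, as a possible
# extension, category) columns for encoding, and reject anything else.
import pandas as pd
from pandas.api import types as pdt


class TypeNotHandled(TypeError):
    """Raised when a column dtype is neither numeric, string, nor boolean."""
    # Assumption: in the PR this exception presumably lives elsewhere in qolmat.


def _split_columns(df: pd.DataFrame) -> tuple[list, list]:
    cols_num, cols_to_encode = [], []
    for col in df.columns:
        dtype = df[col].dtype
        if pdt.is_numeric_dtype(dtype) and not pdt.is_bool_dtype(dtype):
            cols_num.append(col)
        elif (
            pdt.is_string_dtype(dtype)
            or pdt.is_bool_dtype(dtype)
            or isinstance(dtype, pd.CategoricalDtype)
        ):
            cols_to_encode.append(col)
        else:
            raise TypeNotHandled(f"Column '{col}' has unsupported dtype {dtype}")
    return cols_num, cols_to_encode
```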

@adriencrtr
Collaborator Author

adriencrtr commented Aug 9, 2024

What remains to be done:

  • Fix the reproducibility issue (random_state).
  • Look into the question of the "exact" p-value computation.
  • If needed, develop the encoding part: add other encoders and/or support more types.
  • Once these points are validated, include this feature in the documentation + tutorial (reuse the notebook).
  • Rerun the computation-time notebook on a Mac M1 (see whether there is a notable difference).
  • How do we explain the behaviour when n_perm -> inf?
  • State that we restrict ourselves to size.resp.set = 2.

# * ``nb_projections``: Number of projections on which the test statistic is calculated. This
# parameter has the greatest influence on test calculation time. Its default value is
# ``nb_projections=100``.
# Do we give useful orders of magnitude? I had done a bit of that work.
Collaborator Author

I had done a small probability exercise to estimate the right number of projections for a given dataset size. Would that be useful?
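
One possible back-of-the-envelope estimate (an assumption, not the author's unpublished calculation): if each projection draws k of the p features uniformly at random, a union bound on the chance that some feature is never selected gives a rough lower bound on the number of projections.

```python
# Sketch: require that the probability of any feature never appearing in the B
# projections stays below `alpha`, i.e. p * (1 - k/p)**B <= alpha, and solve for B.
import math


def min_projections(n_features: int, k_per_proj: int, alpha: float = 0.05) -> int:
    p_miss = 1.0 - k_per_proj / n_features  # chance one projection skips a given feature
    return math.ceil(math.log(alpha / n_features) / math.log(p_miss))


# Example: 15 features, 3 per projection -> about 26 projections.
print(min_projections(15, 3))
```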

def _encode_dataframe(self, df: pd.DataFrame) -> np.ndarray:
"""
Encodes the DataFrame by converting numeric columns to a numpy array
and applying one-hot encoding to non-numeric columns.
Collaborator Author

Simply remove "one-hot"! It will not necessarily always be one-hot.
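
An encoder-agnostic version of the docstring and body could look like the sketch below; the signature and the `self.encoder` attribute mirror the PR, everything else is an assumption.

```python
# Sketch of an encoder-agnostic _encode_dataframe (method body shown as a plain
# function for readability): numeric columns pass through, the rest go through
# whichever encoder was configured.
import numpy as np
import pandas as pd


def _encode_dataframe(self, df: pd.DataFrame) -> np.ndarray:
    """Encode the DataFrame: numeric columns are converted to a numpy array and
    non-numeric columns are transformed with the configured encoder."""
    df_num = df.select_dtypes(include="number")
    df_cat = df.drop(columns=df_num.columns)
    parts = [df_num.to_numpy()]
    if not df_cat.empty:
        encoded = self.encoder.fit_transform(df_cat)
        # Some encoders (e.g. OneHotEncoder) return sparse matrices by default.
        if hasattr(encoded, "toarray"):
            encoded = encoded.toarray()
        parts.append(np.asarray(encoded))
    return np.hstack(parts)
```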

def _check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
"""
Checks if the drawn features and target are valid.
# TODO : Need to develop ?
Collaborator Author

Developing this part would require going into the details of the paper. I am not sure it is worth it.
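
For the record, one plausible reading of the validity condition (an assumption about what the paper requires, not a quote from it): among the rows where all drawn features are observed, the target column must contain both missing and observed values, otherwise the per-projection classifier would see a single class.

```python
# Hedged sketch of that condition; df is assumed to be a float array where
# missing entries are NaN, matching the signature quoted above.
import numpy as np


def _check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
    rows_complete = ~np.isnan(df[:, features_idx]).any(axis=1)
    target_is_nan = np.isnan(df[rows_complete, target_idx])
    return target_is_nan.any() & (~target_is_nan).any()
```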

@adriencrtr
Collaborator Author

adriencrtr commented Aug 29, 2024

  • Remove the condition on the patterns.
  • Move the input type-checking functions to utils/input_check.py.
  • Consider how the NaNs should be encoded (a Categorical_encoder with NaN handling by multiplying the NaNs), using the handle_missing parameter, and use the "cols_cat" argument so the df does not have to be split.
  • Be explicit in the types (df vs X).
  • Change the draw function to something either exhaustive or without replacement (see the sketch after this list).
  • Change "feature_index" to a list type.
  • In the docstrings, give the correspondence between the variable names and the names used in the article.
  • Explain in detail why we iterate over the projections and not over the permutations.
  • Look into the question of the explicit expectation computation.
  • Reverse the order in the code.
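
For the "exhaustive or without replacement" item above, a minimal sketch of one way to do it; the names (`n_features_per_proj`, `rng`) and the tuple representation are assumptions.

```python
# Sketch only: enumerate all (feature subset, target) pairs, shuffle them, and
# keep the first n_projections, so no projection is ever drawn twice. Only
# practical when the number of columns is small enough for full enumeration.
from itertools import combinations

import numpy as np


def draw_projections_without_replacement(
    n_cols: int,
    n_features_per_proj: int,
    n_projections: int,
    rng: np.random.Generator,
) -> list:
    all_draws = [
        (features, target)
        for features in combinations(range(n_cols), n_features_per_proj)
        for target in range(n_cols)
        if target not in features
    ]
    rng.shuffle(all_draws)
    return all_draws[:n_projections]


# Example usage with a fixed seed for reproducibility.
draws = draw_projections_without_replacement(6, 2, 10, np.random.default_rng(42))
```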

# This notebook shows how the Little's test performs on a simplistic case and its limitations. We
# instanciate a test object with a random state for reproducibility.
# This notebook shows how the Little and PKLM tests perform on a simplistic case and their
# limitations. We instanciate a test object with a random state for reproducibility.
Collaborator

instantiate

+------------+------------+----------------------+
| 100000 | 15 | 3'06" |
+------------+------------+----------------------+
"""
Collaborator

Why, for the same number of rows, does the computation time decrease as the number of columns increases (the difference is more noticeable the more rows there are)?

Collaborator Author

I think that's because the .fit() of the classifier takes less time when there are more features.
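
A quick way to sanity-check that claim (a measurement sketch, not part of the PR); the row and column counts and the RandomForestClassifier settings are arbitrary.

```python
# Time RandomForestClassifier.fit on the same number of rows while varying the
# number of columns, to see how fit time actually scales with the feature count.
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 20_000
for n_cols in (10, 15, 30):
    X = rng.normal(size=(n_rows, n_cols))
    y = rng.integers(0, 2, size=n_rows)
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y)
    print(f"{n_cols} columns: {time.perf_counter() - start:.2f} s")
```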

@adriencrtr adriencrtr closed this Sep 26, 2024