X-Learner: Use the same sample splits in all base models. #84

Draft
kklein wants to merge 11 commits into main
Conversation

@kklein (Collaborator) commented Aug 15, 2024

TODOs:

  • Make sure that row indices not present in cv are actually not used for training when passing cv to cross_validate (see the Observations below).
  • Assess whether the fold-specific nuisance prediction should move to CrossFitEstimator.
  • Devise a plan as to whether synchronize_cross_fitting should be allowed to be False for the X-Learner.

Observations

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_validate


# A minimal estimator which memorizes its training targets and prints
# how many samples it was fit on.
class Memorizer(BaseEstimator):
    def fit(self, X, y):
        self._y = y
        print(len(y))
        return self

    def score(self, X, y):
        return 0


n_samples = 100
n_folds = 4
# We define cvs such that combining the training and test set of every
# 'split' yields a strict subset of the dataset (X, y): only indices
# 0-3 (train) and 50-53 (test) are referenced at all.
cvs = [
    (np.array([fold_index]), np.array([fold_index + 50]))
    for fold_index in range(n_folds)
]
estimator = Memorizer()

X = np.random.normal(size=(n_samples, 2))
y = np.random.normal(size=n_samples)
cross_validate(
    estimator,
    X,
    y,
    cv=cvs,
)

yields the following output:

1
1
1
1

Checklist

  • Added a CHANGELOG.rst entry

@kklein linked an issue Aug 15, 2024 that may be closed by this pull request
@kklein (Collaborator, Author) commented Aug 15, 2024

FYI @MatthiasLoefflerQC I created a first draft of how the same splits could be used for all base learners, including treatment models. As of now the estimates are still clearly awry, e.g. an RMSE of ~13 compared to ~0.05 for the status quo. This happens for both in-sample and out-of-sample estimation. I currently have no real idea of what's going wrong; I will try to make some more progress.

if synchronize_cross_fitting:
    cv_split_indices = self._split(
        index_matrix(X, self._treatment_variants_indices[treatment_variant])
    )
treatment_indices = np.where(
@kklein (Collaborator, Author) Aug 15, 2024


This is an opaque way of turning an array [True, True, False, False, True] into an array [0, 1, 4]. Not sure if there's a neater way of doing that.

@MatthiasLoefflerQC (Contributor) Aug 16, 2024


[index for index, value in enumerate(vector) if value] would work too, I guess, and is more verbose, but I like the np.where :)
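
For reference, a minimal comparison of the options; np.flatnonzero is a suggestion introduced here, not something used in the PR:

import numpy as np

mask = np.array([True, True, False, False, True])

np.where(mask)[0]                     # array([0, 1, 4])
np.flatnonzero(mask)                  # array([0, 1, 4]); arguably the most direct spelling
[i for i, v in enumerate(mask) if v]  # [0, 1, 4], the pure-Python variant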

@kklein changed the title from "Use the same sample splits in all base models." to "X-Learner: Use the same sample splits in all base models." on Aug 15, 2024
@kklein (Collaborator, Author) commented Aug 15, 2024

As of now the estimates are still clearly awry, e.g. an RMSE of ~13 compared to ~0.05.

The base models all seem to be doing fine with respect to their individual targets. Yet, when I compare the two treatment effect model estimates at prediction time, which estimate the same CATE and should therefore agree on average, it becomes blatantly apparent that something is going wrong:

np.mean(tau_hat_control - tau_hat_treatment)
>>> 27.051119307766754
np.mean(tau_hat_control)
>>> 14.104902455634836
np.mean(tau_hat_treatment)
>>> -12.946216852131919

Update: These discrepancies have been substantially reduced by bbfff15. The RMSEs on the true CATEs are still massive when compared to the status quo.
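
For context, a minimal sketch of why the two estimates should agree, following the X-Learner formulation of Künzel et al. rather than this repo's actual API: both second-stage models estimate the same CATE and are blended via the propensity score, so a mean difference of ~27 between them signals a bug rather than noise.

import numpy as np

def combine_cate(tau_hat_control, tau_hat_treatment, propensity):
    # X-Learner: tau(x) = g(x) * tau_control(x) + (1 - g(x)) * tau_treatment(x),
    # where g(x) is the propensity score. Both inputs target the same quantity.
    return propensity * tau_hat_control + (1 - propensity) * tau_hat_treatment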

        model_kind=CONTROL_EFFECT_MODEL,
        model_ord=treatment_variant - 1,
        is_oos=False,
    )
)[control_indices]
A Contributor commented:

do we need is_oos=False below (and likewise for tau_hat_treatment)? Might be worth a try.
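
For context, and as an assumption about the codebase rather than something stated in this thread: in CrossFitEstimator.predict, is_oos=False appears to request cross-fitted predictions, where each row is predicted by the fold model that did not train on it, while is_oos=True treats the data as unseen and aggregates the per-fold models' predictions. The question above is whether the in-sample path is actually needed at this call site.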

Successfully merging this pull request may close these issues:

  • Leakage in X-Learner in-sample prediction