
Improve calibration #114

Merged

danieleongari merged 3 commits from el-hult:nanhunt into scikit-learn-contrib:master on Jul 9, 2024
Conversation

@el-hult (Contributor) commented on Jun 19, 2024

The gfit function is brittle and fails in some cases. Instead of giving error messages or warnings, the calibration routine just spits out nonsense like infinite, nan, or zero variance. Examples of failure modes are listed in #93.

This PR suggests changes to help identify these failures and, to some degree, mitigate them. In the process of making these updates, I also made minor changes to improve readability and maintainability.

My ambition was to:

  1. Make it correct
  2. Make it throw warnings and errors when applicable

Minimal reproduction

The simplest way to reproduce some aspects of the problems is to run the example https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/master/examples/plot_mpg.py with 100 trees (instead of 1000). This gives rise to overflow errors and more.

I could isolate the call to gfit as in the snippet below.

```python
from forestci.calibration import gfit

X = [1.28250709e+00, -9.62199807e-01, -2.04701193e+00, -1.02443034e+00,
     6.38396847e-01, 6.93655605e-01, 2.02769335e-01, 2.28704442e+00,
     1.28821259e+00, -2.07737984e+00, 2.45556492e-02, 2.89656660e-01,
     4.53925023e-01, 3.35682503e+00, -4.54257770e-01, -1.23714287e-01,
     3.77470001e+00, -4.23614898e-01, 7.46533342e-01, -8.81119202e-01,
     2.08394853e+00, -1.05411916e-01, 3.31927296e+00, 7.17650800e-01,
     2.25908057e+00, -7.53290424e-01, -2.21822602e-01, -6.27804326e-01,
     9.58914997e-02, -1.14970276e+00, 1.70686211e+00, 1.47393397e-01,
     1.29608780e+00, 2.94860970e-01, 9.31508065e-01, 9.68559862e+00,
     3.47850459e+00, 5.20463355e-01, 1.07368578e-02, 6.57683255e-01,
     1.45860815e+00, 5.14249866e+00, 2.57982131e+01, 3.19180999e-01,
     -3.30013368e+00, 3.49159227e-01, 4.65446974e-02, -2.08369433e+00,
     8.59651030e-01, 5.86331704e-01, 3.32865616e+00, 1.82593891e-01,
     7.18148878e-01, 5.89549645e-02, 5.39916639e-01, 1.02031244e+00,
     2.64314510e+00, 1.10181223e-01, 1.05365282e+00, 1.84280809e-01,
     2.04459181e+00, -6.69517305e-01, 4.63034132e+00, -2.03630094e+00,
     -4.16557994e-01, 6.05384570e-01, -7.41193944e-01, 3.28397015e-01,
     -2.02724573e-01, 2.81343215e-01, -4.52882044e-02, -5.11185513e-01,
     1.45608437e-02, -1.81530128e+00, -2.56268846e+00, -3.21805107e-01,
     8.91628062e+00, 9.19464777e-01, -1.45044659e+00, -4.56431159e+00,
     7.72050142e-01, 5.69059211e+00, -7.16028668e-01, 1.48268265e+00,
     6.25840167e-02, 1.59672239e-01]
sigma = 4.545817017854692
gfit(X, sigma)
```

Analysis

The "calibration" routine is not well documented. There is a reference in the python code to the Wager/Athey paper, but that article does not mention calibration. The original R code by Wager do the calibration, but has references to any article on how it is motivated. So I worked through the code to understand it.

The calibrateEB function is a denoising function. It assumes that the inputs are noisy observations and tries to find denoised observations. The modelling approach is mu ~ G, X ~ N(mu, sigma^2). If the prior distribution G were known, then for each observed variance x we would take the expectation over the posterior distribution: x_calib = E[mu | x]. Since G is not known, we model it with empirical Bayes. The function gfit does this fitting for us.
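As a minimal sketch of that posterior-mean step, assuming the prior is already available as a density `g_prior` evaluated on a grid `mu_grid` (both names are hypothetical, for illustration only):

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, mu_grid, g_prior, sigma):
    """E[mu | x] under mu ~ G (density g_prior on mu_grid) and x ~ N(mu, sigma^2)."""
    likelihood = norm.pdf(x, loc=mu_grid, scale=sigma)  # N(x; mu, sigma^2) at each grid point
    posterior = likelihood * g_prior                    # unnormalized posterior on the grid
    posterior /= posterior.sum()
    return np.sum(mu_grid * posterior)                  # x_calib = E[mu | x]
```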

The things I found while analyzing:

  1. In calibrateEB, there is a use of list() and map() with a functools.partial. List comprehensions are typically faster, need fewer imports, and are arguably easier to read (see the sketch after this list).
  2. The docstring for gbayes had a mistake in the g_est parameter.
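For illustration, the two styles side by side; the gbayes signature used here is an assumption for the sketch, not copied from the library:

```python
from functools import partial

# current style: list() + map() with functools.partial
calibrated = list(map(partial(gbayes, g_est=g_est, sigma=sigma), X))

# suggested style: a list comprehension, one import fewer
calibrated = [gbayes(x, g_est=g_est, sigma=sigma) for x in X]
```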

Now we can focus on gfit:

  1. The limits in parameter space are unnecessarily low. The lower limit is currently chosen as min(min(...), 0), but all nonpositive values are useless, since the mask variable effectively cancels them out in calculations further down in the code. I suppose this was a mistake in the original code, and it should be max(min(...), 0):

```python
min_x = min(min(X) - 2 * np.std(X, ddof=1), 0)
max_x = max(max(X) + 2 * np.std(X, ddof=1),
            np.std(X, ddof=1))
```
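The proposed fix clamps the lower limit at zero instead of extending below it:

```python
min_x = max(min(X) - 2 * np.std(X, ddof=1), 0)
```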

  2. The model for G is a mixture model: a uniform 'slab' with probability 0.1, plus probability 0.9 proportional to exp(eta[0]*x + eta[1]*x**2 + ... + eta[p-1]*x**p). The linear combination of the slab and the power expansion is not weighted properly, so the resulting density does not normalize to 1. It is done correctly in the neg_loglik function, but the parentheses are wrong in the following snippet. The problem can be simplified by introducing a g_slab density separately, reducing the number of parentheses later in the code:

```python
g_eta_main = g_eta_raw / sum(g_eta_raw)
g_eta = ((1 - unif_fraction) * g_eta_main +
         unif_fraction * mask) / sum(mask)
```
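A sketch of the corrected mixture, with a separate g_slab density as suggested (names follow the snippet above):

```python
g_slab = mask / sum(mask)                # the uniform 'slab', normalized to a density
g_eta_main = g_eta_raw / sum(g_eta_raw)  # the power-expansion part, normalized
g_eta = (1 - unif_fraction) * g_eta_main + unif_fraction * g_slab  # now sums to 1
```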

  3. The convolution kernel used to produce the data distribution f_eta is called noise_rotate. If xvals is not symmetric around 0, it will differ from the noise kernel in gbayes, making the whole procedure invalid. I think a better definition of the noise kernel would be along the lines of:

```python
noise_rotate = norm(scale=sigma, loc=xvals.mean()).pdf(xvals)
noise_rotate /= noise_rotate.sum()
```
  4. The expansion of the parametric model is numerically unstable. It is currently done like:

```python
XX = np.zeros((p, len(xvals)), dtype="float")
for ind, exp in enumerate(range(1, p + 1)):
    mask = np.ones_like(xvals)
    mask[np.where(xvals <= 0)[0]] = 0
    XX[ind, :] = pow(xvals, exp) * mask
XX = XX.T
```

but this has numerical issues in the later optimization for eta. With p=5, which is the default in this library, you get a matrix whose columns span vastly different orders of magnitude (quantified in the illustration at the end of this item). Normalizing the columns improves the situation somewhat, and I think the code should do this. If xvals >= 0 we can also skip the multiplication with the mask:
```python
XX = np.column_stack([pow(xvals, exp) for exp in range(1, p + 1)])
# normalize each feature column for better numerical stability
XX /= np.sum(XX, axis=0, keepdims=True)
```

This does not solve the whole problem; relaxing the tolerance on the solver might be required as well. I relaxed the tolerance until the MPG example completed successfully.
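To illustrate the scale disparity (hypothetical numbers, with xvals spanning roughly the range of the X above):

```python
import numpy as np

xvals = np.linspace(0, 25, 200)
XX = np.column_stack([xvals ** e for e in range(1, 6)])  # powers 1..5, as with p = 5
print(XX.max(axis=0))  # approx [2.5e+01, 6.3e+02, 1.6e+04, 3.9e+05, 9.8e+06]
```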
  5. The defaults differ in this code compared to the R code, without motivation. That code uses p=2 and nbins=1000, while this code uses p=5 and nbins=200. I think the R code defaults are more sensible.
  6. The warnings on arithmetic overflows are harmless and should be suppressed.
  7. The optimization routine fails quite often, and I think the user should get a warning in those cases. Both points are sketched below.
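A sketch of both mitigations, under the assumption that the eta fit goes through scipy.optimize.minimize; neg_loglik is the objective named earlier, while eta0 and the tolerance value are illustrative:

```python
import warnings
import numpy as np
from scipy.optimize import minimize

with np.errstate(over="ignore"):  # overflow while evaluating the likelihood is harmless here
    res = minimize(neg_loglik, eta0, tol=5e-5)  # relaxed tolerance (illustrative value)
if not res.success:
    warnings.warn("Fitting the empirical bayes prior failed "
                  f"with message {res.message}.")
```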

Impact of the PR

Impact on the minimal example

I implemented the suggestions above, and the previously failing call to gfit now succeeds, with no warnings. The following diagnostic plots, using the same X and sigma, are good for understanding the problem better.

[figure: diagnostic plots for the minimal example]

It is clear that the slab is responsible in large part for the inflated variance in the calibration compared to the observed data. The EB model gives us a model of the variances that is not really good for this observed dataset, but it is the best it can do. The outlier at about 27 is likely the culprit; it shows up more clearly in the scatter plot below.

[figure: scatter plot of the same data, highlighting the outlier near 27]

Impact on the library examples

With my updated gfit, the MPG example looks like this:

[figure: MPG example plot with the updated gfit]

The same plot before my changes was:

[figure: MPG example plot before the changes]

The results are clearly very similar. The different empirical Bayes priors had no dramatic impact here.

@danieleongari (Collaborator) commented on Jul 2, 2024

I tested your PR with a simple quadratic function; find the notebook HERE in PDF.

In these plots the CI is shown relative to the CI computed with the largest RF (5000 trees), on the training and test sets.
[figure: relative CI on the training set]

[figure: relative CI on the test set]

I must admit that I'm still skeptical about calibration, as:

  • it still frequently raised `UserWarning: Fitting the empirical bayes prior failed with message Desired error not necessarily achieved due to precision loss.`,
  • few-trees RF models show some weird deviance,
  • it is not very evident that they reach convergence faster.

I would recommend that new users simply use many trees.
However, there was a lack of documentation for the calibration, and you did a very appreciable job rationalizing it. Since the code is running fine, I propose to merge the PR as it is, and take the occasion to make a new release that also includes the previous PRs that were not in the 0.6 version.

@arokem, ok with you?

@danieleongari changed the title from "attemped fix of the EB computation" to "Improve calibration" on Jul 3, 2024
@danieleongari merged commit 0dc01e2 into scikit-learn-contrib:master on Jul 9, 2024
3 checks passed
@el-hult (Contributor, Author) commented on Jul 14, 2024

I totally agree that the whole calibration thing is questionable, and that the better thing to do is to increase the number of trees in the forest.

N.B. the intervals are only asymptotically valid anyway, as the size of the dataset and the number of trees go to infinity.

@el-hult deleted the nanhunt branch on July 14, 2024 13:32