
Improve calibration #114

Merged

danieleongari merged 3 commits from el-hult:nanhunt into scikit-learn-contrib:master on Jul 9, 2024
Conversation

@el-hult (Contributor) commented on Jun 19, 2024

The gfit function is brittle and fails in some cases. Instead of giving error messages or warnings, the calibration routine just spits out nonsense like infinite, nan, or zero variance. Examples of failure modes are listed in #93.

This PR suggests changes to help identify these failures and, to some degree, mitigate them. In the process of making these updates, I also made minor changes to improve readability and maintainability.

My ambition was to:

  1. Make it correct
  2. Make it throw warnings and errors when applicable

Minimal reproduction

The simplest way to reproduce some aspects of the problems is to run the example https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/master/examples/plot_mpg.py with 100 trees (instead of 1000). This gives rise to overflow errors and more.

I could isolate the call to gfit as in the snippet below.

```python
from forestci.calibration import gfit

X = [1.28250709e+00, -9.62199807e-01, -2.04701193e+00, -1.02443034e+00,
     6.38396847e-01, 6.93655605e-01, 2.02769335e-01, 2.28704442e+00,
     1.28821259e+00, -2.07737984e+00, 2.45556492e-02, 2.89656660e-01,
     4.53925023e-01, 3.35682503e+00, -4.54257770e-01, -1.23714287e-01,
     3.77470001e+00, -4.23614898e-01, 7.46533342e-01, -8.81119202e-01,
     2.08394853e+00, -1.05411916e-01, 3.31927296e+00, 7.17650800e-01,
     2.25908057e+00, -7.53290424e-01, -2.21822602e-01, -6.27804326e-01,
     9.58914997e-02, -1.14970276e+00, 1.70686211e+00, 1.47393397e-01,
     1.29608780e+00, 2.94860970e-01, 9.31508065e-01, 9.68559862e+00,
     3.47850459e+00, 5.20463355e-01, 1.07368578e-02, 6.57683255e-01,
     1.45860815e+00, 5.14249866e+00, 2.57982131e+01, 3.19180999e-01,
     -3.30013368e+00, 3.49159227e-01, 4.65446974e-02, -2.08369433e+00,
     8.59651030e-01, 5.86331704e-01, 3.32865616e+00, 1.82593891e-01,
     7.18148878e-01, 5.89549645e-02, 5.39916639e-01, 1.02031244e+00,
     2.64314510e+00, 1.10181223e-01, 1.05365282e+00, 1.84280809e-01,
     2.04459181e+00, -6.69517305e-01, 4.63034132e+00, -2.03630094e+00,
     -4.16557994e-01, 6.05384570e-01, -7.41193944e-01, 3.28397015e-01,
     -2.02724573e-01, 2.81343215e-01, -4.52882044e-02, -5.11185513e-01,
     1.45608437e-02, -1.81530128e+00, -2.56268846e+00, -3.21805107e-01,
     8.91628062e+00, 9.19464777e-01, -1.45044659e+00, -4.56431159e+00,
     7.72050142e-01, 5.69059211e+00, -7.16028668e-01, 1.48268265e+00,
     6.25840167e-02, 1.59672239e-01]
sigma = 4.545817017854692
gfit(X, sigma)
```

Analysis

The "calibration" routine is not well documented. There is a reference in the python code to the Wager/Athey paper, but that article does not mention calibration. The original R code by Wager do the calibration, but has references to any article on how it is motivated. So I worked through the code to understand it.

The calibrateEB function is a denoising function. It assumes that the inputs are noisy observations and tries to find denoised observations. The modelling approach is mu ~ G, X ~ N(mu, sigma^2). If the prior distribution G were known, then for each observed variance x we would take the expectation over the posterior distribution: x_calib = E[mu | x]. Since G is not known, we model it with empirical Bayes. The function gfit does this fitting for us.
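As a minimal sketch of that posterior-mean step, assuming the prior is already available as a density `g_prior` evaluated on a grid `mu_grid` (both names are hypothetical, for illustration only):

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, mu_grid, g_prior, sigma):
    """E[mu | x] under mu ~ G (density g_prior on mu_grid) and x ~ N(mu, sigma^2)."""
    likelihood = norm.pdf(x, loc=mu_grid, scale=sigma)  # N(x; mu, sigma^2) at each grid point
    posterior = likelihood * g_prior                    # unnormalized posterior on the grid
    posterior /= posterior.sum()
    return np.sum(mu_grid * posterior)                  # x_calib = E[mu | x]
```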

The things I found while analyzing:

  1. In calibrateEB, there is a use of list() and map() with a functools.partial. List comprehensions are typically faster, need fewer imports, and are arguably easier to read (see the sketch after this list).
  2. The docstring for gbayes had a mistake in the g_est parameter.
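For illustration, the two styles side by side; the gbayes signature used here is an assumption for the sketch, not copied from the library:

```python
from functools import partial

# current style: list() + map() with functools.partial
calibrated = list(map(partial(gbayes, g_est=g_est, sigma=sigma), X))

# suggested style: a list comprehension, one import fewer
calibrated = [gbayes(x, g_est=g_est, sigma=sigma) for x in X]
```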

Now we can focus on gfit:

  1. The limits in parameter space are unnecessarily low. The lower limit is currently chosen as min(min(...), 0), but all nonpositive values are useless, since the mask variable effectively cancels them out in calculations further down in the code. I suppose this was a mistake in the original code, and it should be max(min(...), 0):

```python
min_x = min(min(X) - 2 * np.std(X, ddof=1), 0)
max_x = max(max(X) + 2 * np.std(X, ddof=1),
            np.std(X, ddof=1))
```
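The proposed fix clamps the lower limit at zero instead of extending below it:

```python
min_x = max(min(X) - 2 * np.std(X, ddof=1), 0)
```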

  2. The model for G is a mixture model: a uniform 'slab' with probability 0.1, plus probability 0.9 proportional to exp(eta[0]*x + eta[1]*x**2 + ... + eta[p-1]*x**p). The linear combination of the slab and the power expansion is not weighted properly, so the resulting density does not normalize to 1. It is done correctly in the neg_loglik function, but the parentheses are wrong in the following snippet. The problem can be simplified by introducing a g_slab density separately, reducing the number of parentheses later in the code:

```python
g_eta_main = g_eta_raw / sum(g_eta_raw)
g_eta = ((1 - unif_fraction) * g_eta_main +
         unif_fraction * mask) / sum(mask)
```
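A sketch of the corrected mixture, with a separate g_slab density as suggested (names follow the snippet above):

```python
g_slab = mask / sum(mask)                # the uniform 'slab', normalized to a density
g_eta_main = g_eta_raw / sum(g_eta_raw)  # the power-expansion part, normalized
g_eta = (1 - unif_fraction) * g_eta_main + unif_fraction * g_slab  # now sums to 1
```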

  3. The convolution kernel used to produce the data distribution f_eta is called noise_rotate. If xvals is not symmetric around 0, it will differ from the noise kernel in gbayes, making the whole procedure invalid. I think a better definition of the noise kernel would be along the lines of:

```python
noise_rotate = norm(scale=sigma, loc=xvals.mean()).pdf(xvals)
noise_rotate /= noise_rotate.sum()
```
  4. The expansion of the parametric model is numerically unstable. It is currently done like:

```python
XX = np.zeros((p, len(xvals)), dtype="float")
for ind, exp in enumerate(range(1, p + 1)):
    mask = np.ones_like(xvals)
    mask[np.where(xvals <= 0)[0]] = 0
    XX[ind, :] = pow(xvals, exp) * mask
XX = XX.T
```

but this has numerical issues in the later optimization for eta. With p=5, which is the default in this library, you get a matrix whose columns span vastly different orders of magnitude (quantified in the illustration at the end of this item). Normalizing the columns improves the situation somewhat, and I think the code should do this. If xvals >= 0 we can also skip the multiplication with the mask:
```python
XX = np.column_stack([pow(xvals, exp) for exp in range(1, p + 1)])
# normalize each feature column for better numerical stability
XX /= np.sum(XX, axis=0, keepdims=True)
```

This does not solve the whole problem; relaxing the tolerance on the solver might be required as well. I relaxed the tolerance until the MPG example completed successfully.
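To illustrate the scale disparity (hypothetical numbers, with xvals spanning roughly the range of the X above):

```python
import numpy as np

xvals = np.linspace(0, 25, 200)
XX = np.column_stack([xvals ** e for e in range(1, 6)])  # powers 1..5, as with p = 5
print(XX.max(axis=0))  # approx [2.5e+01, 6.3e+02, 1.6e+04, 3.9e+05, 9.8e+06]
```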
  5. The defaults differ in this code compared to the R code, without motivation. That code uses p=2 and nbins=1000, while this code uses p=5 and nbins=200. I think the R code defaults are more sensible.
  6. The warnings on arithmetic overflows are harmless and should be suppressed.
  7. The optimization routine fails quite often, and I think the user should get a warning in those cases. Both points are sketched below.
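A sketch of both mitigations, under the assumption that the eta fit goes through scipy.optimize.minimize; neg_loglik is the objective named earlier, while eta0 and the tolerance value are illustrative:

```python
import warnings
import numpy as np
from scipy.optimize import minimize

with np.errstate(over="ignore"):  # overflow while evaluating the likelihood is harmless here
    res = minimize(neg_loglik, eta0, tol=5e-5)  # relaxed tolerance (illustrative value)
if not res.success:
    warnings.warn("Fitting the empirical bayes prior failed "
                  f"with message {res.message}.")
```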

Impact of the PR

Impact on the minimal example

I implemented the suggestions above, and the previously failing call to gfit now succeeds, with no warnings. The following diagnostic plots, using the same X and sigma, are good for understanding the problem better.

[figure: diagnostic plots for the minimal example]

It is clear that the slab is responsible in large part for the inflated variance in the calibration compared to the observed data. The EB model gives us a model of the variances that is not really good for this observed dataset, but it is the best it can do. The outlier at about 27 is likely the culprit; it shows up more clearly in the scatter plot below.

[figure: scatter plot of the same data, highlighting the outlier near 27]

Impact on the library examples

With my updated gfit, the MPG example looks like this:

[figure: MPG example plot with the updated gfit]

The same plot before my changes was:

[figure: MPG example plot before the changes]

The results are clearly very similar. The different empirical Bayes priors had no dramatic impact here.

@danieleongari (Collaborator) commented on Jul 2, 2024

I tested your PR with a simple quadratic function; find the notebook HERE in PDF.

In these plots the CI is shown relative to the CI computed with the largest RF (5000 trees), on the training and test sets.
[figure: relative CI on the training set]

[figure: relative CI on the test set]

I must admit that I'm still skeptical about calibration, as:

  • it still frequently raised `UserWarning: Fitting the empirical bayes prior failed with message Desired error not necessarily achieved due to precision loss.`,
  • few-trees RF models show some weird deviance,
  • it is not very evident that they reach convergence faster.

I would recommend that new users simply use many trees.
However, there was a lack of documentation for the calibration, and you did a very appreciable job rationalizing it. Since the code is running fine, I propose to merge the PR as it is, and take the occasion to make a new release that also includes the previous PRs that were not in the 0.6 version.

@arokem, ok with you?

@danieleongari changed the title from "attemped fix of the EB computation" to "Improve calibration" on Jul 3, 2024
@danieleongari merged commit 0dc01e2 into scikit-learn-contrib:master on Jul 9, 2024
3 checks passed
@el-hult (Contributor, Author) commented on Jul 14, 2024

I totally agree that the whole calibration thing is questionable, and that the better thing to do is to increase the number of trees in the forest.

N.B. the intervals are only asymptotically valid anyway, as the size of the dataset and the number of trees go to infinity.

@el-hult deleted the nanhunt branch on July 14, 2024 13:32