The `gfit` function is brittle and fails in some cases. Instead of giving error messages or warnings, the calibration routine just spits out nonsense such as infinite, NaN, or zero variance. Examples of failure modes are listed in #93.

This PR suggests changes to help identify these failures and, to some degree, mitigate them. In the process of making these updates, I also made minor changes to improve readability and maintainability.
## Minimal reproduction
The simplest way to reproduce some aspects of the problems is to run the example
https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/master/examples/plot_mpg.py
with 100 trees (instead of 1000). This triggers the overflow errors and more. I could isolate the call to `gfit` as in the snippet below.

## Analysis
The "calibration" routine is not well documented. There is a reference in the Python code to the Wager/Athey paper, but that article does not mention calibration. The original R code by Wager does the calibration as well, but has no reference to any article motivating it. So I worked through the code to understand it.
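To make the failure mode concrete before digging into the code, here is a small, self-contained illustration of why the overflow errors appear and why normalization helps. The grid and coefficients below are made up for demonstration; this is not the library's code.

```python
import numpy as np

# Illustration only (made-up grid and coefficients, not the library's code):
# a degree-5 polynomial basis over the variance grid spans wildly different
# magnitudes, and exponentiating it easily overflows float64.
xvals = np.linspace(0.0, 30.0, 5)                      # variance grid
XX = np.column_stack([xvals**k for k in range(1, 6)])  # columns span ~1 .. ~2.4e7

with np.errstate(over="ignore"):
    dens = np.exp(XX @ np.ones(5) * 1e-4)              # tiny coefficients, still inf
print(np.isinf(dens).any())   # → True

# Column-normalizing the basis (one of the mitigations proposed in this PR)
# keeps the exponent small:
XX_norm = XX / np.linalg.norm(XX, axis=0)
dens_norm = np.exp(XX_norm @ np.ones(5) * 1e-4)
print(np.isinf(dens_norm).any())  # → False
```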
The `calibrateEB` function is a denoising function. It assumes that the inputs are noisy observations and tries to recover denoised ones. The modelling approach is `mu ~ G`, `X ~ N(mu, sigma^2)`. If the prior distribution `G` were known, then for each observed variance `x` we would take the expectation over the posterior distribution: `x_calib = E[mu | x]`. Since `G` is not known, we model it with empirical Bayes. The function `gfit` does this fitting for us.

The things I found while analyzing:
1. In `calibrateEB`, there is a use of `list()` and `map()` with a `functools.partial`. List comprehensions are typically faster, need fewer imports, and are arguably easier to read.
2. `gbayes` had a mistake in the `g_est` parameter.

Now we can focus on `gfit`:

3. The lower end of the grid is computed with `min(min(...), 0)` (forestci/calibration.py, lines 61 to 63 in ffa7227), but all nonpositive values are useless, since the `mask` variable effectively cancels them out further down in the code. I suppose this was a mistake in the original code, and it should be `max(min(...), 0)`.
4. `G` is a mixture model: a uniform "slab" with probability 0.1, plus probability 0.9 proportional to `exp(eta[0]*x + eta[1]*x**2 + ... + eta[p]*x**p)`. The linear combination of the slab and the power expansion is not weighted properly, so the posterior does not normalize to 1. It is done correctly in the `neg_loglik` function, but the parentheses are wrong in the snippet at forestci/calibration.py, lines 102 to 104 in ffa7227. The problem can be simplified by introducing a `g_slab` density separately, reducing the number of parentheses later in the code.
5. The noise kernel `f_eta` is called `noise_rotate`. If `xvals` is not symmetric around 0, it will differ from the noise kernel in `gbayes`, making the whole procedure invalid. I think a better definition of the noise kernel would be along the lines of forestci/calibration.py, lines 76 to 81 in ffa7227, but this has numerical issues in the later optimization for `eta`.
6. With `p=5`, which is the setting in this library, you have a matrix with vastly different orders of magnitude. Normalizing the columns improves the situation somewhat, and I think the code should do this.
7. If `xvals >= 0`, we can also skip the multiplication with the mask.
8. This does not solve the whole problem; relaxing the tolerance on the solver a bit might be required as well. I reduced the tolerance until the MPG example completed successfully.
9. The defaults in this code differ from the R code, without motivation. That code uses `p=2` and `nbins=1000`, while this code uses `p=5` and `nbins=200`. I think the R-code defaults are more sensible.
10. The code warning on arithmetic overflows is harmless. It should be ignored.
11. The optimization routine fails quite often, and I think the user should get a warning in those cases.
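As a sketch of the properly weighted slab-plus-exponential mixture discussed above (illustrative values only; `eta` here is arbitrary rather than a fitted result, and this is not the library's exact code):

```python
import numpy as np

# Sketch of a properly weighted slab-plus-exponential mixture prior:
# G = 0.1 * uniform slab + 0.9 * normalized exp(polynomial).
xvals = np.linspace(0.0, 10.0, 200)
g_slab = np.full_like(xvals, 1.0 / len(xvals))         # uniform slab, sums to 1

XX = np.column_stack([xvals**k for k in range(1, 6)])  # degree-5 basis
XX /= np.linalg.norm(XX, axis=0)                       # column-normalized

eta = np.array([1.0, -2.0, 0.5, 0.0, 0.0])             # arbitrary, for illustration
g_eta = np.exp(XX @ eta)
g_eta /= g_eta.sum()                                   # normalized exp-polynomial part

# Weighting the two parts explicitly makes the mixture a proper density
# on the grid, so the posterior built from it normalizes to 1:
g_mix = 0.1 * g_slab + 0.9 * g_eta
print(np.isclose(g_mix.sum(), 1.0))  # → True
```

Keeping `g_slab` as its own density, as suggested above, also removes the nesting of parentheses that caused the bug in the original snippet.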
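The posterior-expectation step that `calibrateEB` performs once a prior is available can be sketched in a few lines. This is a hypothetical stand-in for `gbayes`, using a toy Gaussian prior on a grid instead of a fitted one:

```python
import numpy as np

# With mu ~ G and x | mu ~ N(mu, sigma^2), the calibrated value is
#   x_calib = E[mu | x] = sum_i mu_i G(mu_i) phi((x - mu_i)/sigma) / Z.
def posterior_mean(x, mu_grid, g_prior, sigma):
    lik = np.exp(-0.5 * ((x - mu_grid) / sigma) ** 2)  # Gaussian kernel (unnormalized)
    post = g_prior * lik
    post /= post.sum()                                 # normalize the posterior
    return float(np.sum(mu_grid * post))

mu_grid = np.linspace(0.0, 10.0, 1001)
g_prior = np.exp(-0.5 * (mu_grid - 4.0) ** 2)          # toy prior: N(4, 1) on the grid
g_prior /= g_prior.sum()

# An observation at 8 with sigma=2 is shrunk toward the prior mean:
x_calib = posterior_mean(8.0, mu_grid, g_prior, sigma=2.0)
print(round(x_calib, 2))  # → 4.8, the analytic answer (4*1 + 8*0.25) / (1 + 0.25)
```

Note that the Gaussian kernel here is the same one that must appear in the prior fitting; this is why an asymmetric `noise_rotate` in `gfit` invalidates the procedure.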
## Impact of the PR
### Impact on the minimal example
I implemented the suggestions above, and the previously failing call to `gfit` is now successful, with no warnings. The following diagnostic plots, using the same `X` and `sigma`, are good for understanding the problem better.

It is clear that the slab is responsible in large part for the inflated variance in the calibration compared to the observed data. The EB model gives us a model of the variances that is not really good for this observed dataset, but that is the best it can do. The outlier at ca 27 is likely the culprit; it shows up more clearly in the scatter plot below.

### Impact on the library examples
With my updated `gfit`, the MPG example looks like below.

The same plot before my changes was:

The results are clearly very similar. The change of empirical Bayes prior had no dramatic impact here.