
Poor performance on a simple dataset #62

Open
Kodiologist opened this issue Oct 30, 2018 · 3 comments

Kodiologist commented Oct 30, 2018

I'm not familiar with the statistical approach taken by mlrMBO, so excuse me if I'm missing something. Anyway, I was going to ask a question about overfitting in autoxgboost, but it looks like I actually have an underfitting problem. Below is a simple example using DART and mostly default settings. Training and test error for the autoxgboost model hover near the SD of the outcome and are much worse than those of linear regression. Increasing the number of iterations to 500 didn't seem to help. What am I doing wrong?

library(autoxgboost)
library(mlr)
library(ParamHelpers)

set.seed(456)
xgb.threads = 10

# Hyperparameter search space for the DART booster
autoxgbparset.dart = makeParamSet(
    makeNumericParam("eta", lower = 0.01, upper = 0.2),
    makeNumericParam("gamma", lower = -7, upper = 6, trafo = function(x) 2^x),
    makeIntegerParam("max_depth", lower = 3, upper = 20),
    makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
    makeNumericParam("colsample_bylevel", lower = 0.5, upper = 1),
    makeNumericParam("lambda", lower = -10, upper = 10, trafo = function(x) 2^x),
    makeNumericParam("alpha", lower = -10, upper = 10, trafo = function(x) 2^x),
    makeNumericParam("subsample", lower = 0.5, upper = 1),
    makeDiscreteParam("booster", values = "dart"),
    makeDiscreteParam("sample_type", values = c("uniform", "weighted")),
    makeDiscreteParam("normalize_type", values = c("tree", "forest")),
    makeNumericParam("rate_drop", lower = 0, upper = 1),
    makeNumericParam("skip_drop", lower = 0, upper = 1),
    makeLogicalParam("one_drop"))

# Simulate data: y is linear in x2 plus a step function of x3, with unit noise
N = 2000
d = transform(data.frame(
    x1 = rnorm(N),
    x2 = rnorm(N),
    x3 = rnorm(N)),
    y = 2*x2 + (abs(x3) < 1) + rnorm(N))
# Use the first 1,000 rows for training and the rest for testing
train = (1 : N) <= 1000

# Baseline: ordinary linear regression
m.lm = lm(y ~ ., data = d[train,])
# autoxgboost with the DART search space and mostly default settings
m.axgb = autoxgboost(
    task = makeRegrTask(target = "y", data = d[train,]),
    measure = rmse,
    par.set = autoxgbparset.dart,
    design.size = 30L,
    nthread = xgb.threads)

# RMSE
f = function(a, b) sqrt(mean((a - b)^2))
print(t(data.frame(
    SD = sd(d[!train, "y"]),
    perfect.train = f(d[train, "y"],
        with(d[train,], 2*x2 + (abs(x3) < 1))),
    perfect.test = f(d[!train, "y"],
        with(d[!train,], 2*x2 + (abs(x3) < 1))),
    lm.train = f(d[train, "y"], predict(m.lm)),
    lm.test = f(d[!train, "y"], predict(m.lm, newdata = d[!train,])),
    axgb.train = f(d[train, "y"],
        predict(m.axgb, newdata = d[train,])$data$response),
    axgb.test = f(d[!train, "y"],
        predict(m.axgb, newdata = d[!train,])$data$response))))

Output:

                  [,1]
SD            2.258125
perfect.train 1.001122
perfect.test  1.021544
lm.train      1.119062
lm.test       1.137798
axgb.train    2.114853
axgb.test     2.234926

CCing my coworkers: @allanjust, @liuyanguu


liuyanguu commented Nov 1, 2018

I have an interesting further finding: the problem might lie in autoxgboost's predict function. If I extract the tuned parameters using mlr::getHyperPars and run a separate xgboost::xgboost, both the test and training error go back to around 1.1, which looks right. I am glad I always extracted the hyperparameters... Strangely enough, I never noticed such a problem before; indeed, the difference is not obvious on some larger datasets.

Following Kodi's code above, if we continue to run:

# Extract the tuned hyperparameters and refit with plain xgboost
param_dart <- mlr::getHyperPars(m.axgb$final.learner)
set.seed(1234)
m.xgboost <- xgboost::xgboost(data = as.matrix(d[train, 1:3]),
                              label = d[train, "y"],
                              params = param_dart, nrounds = param_dart$nrounds,
                              verbose = T, print_every_n = param_dart$nrounds)
# For DART, fix the number of trees used at prediction time
xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]), ntreelimit = param_dart$nrounds)
(rmse_xgb <- sqrt(mean((d[!train, "y"] - xgb_pred)^2)))

we get 1.193 as the test RMSE.

An extra issue: to use 'dart' correctly, we need to pass ntreelimit = param_dart$nrounds to the predict function; otherwise the results are inconsistent, because DART randomly drops trees at prediction time unless the tree limit is set:

# This is what happens if we do not set `ntreelimit`: repeated predict
# calls give different results, because trees are randomly dropped.
n0 <- 1000
rmse_xgb <- rep(NA, n0)
set.seed(1234)
for (i in 1:n0){
  xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]))
  rmse_xgb[i] <- sqrt(mean((d[!train, "y"] - xgb_pred)^2))
}
summary(rmse_xgb)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.193   1.193   1.236   1.254   1.290   1.658

After setting ntreelimit = param_dart$nrounds, we consistently get 1.193.
This is a minor side issue, but I guess there is no way to pass an extra argument through autoxgboost's predict (a possible workaround sketch follows). And of course, avoiding DART does not fix the main problem above (the discrepancy in predict output).
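Not tested, but one workaround sketch: assuming the fitted mlr model is stored in m.axgb$final.model, mlr::getLearnerModel(..., more.unwrap = TRUE) should unwrap it to the raw xgb.Booster, whose predict method does accept ntreelimit. Note this bypasses autoxgboost's preprocessing pipeline, so the columns must already match what the booster was trained on.

# Hypothetical workaround (assumes the result stores the fitted mlr model
# in $final.model): unwrap to the raw xgb.Booster and predict directly,
# so that ntreelimit can be passed. Bypasses autoxgboost's preprocessing.
booster <- mlr::getLearnerModel(m.axgb$final.model, more.unwrap = TRUE)
xgb_pred2 <- predict(booster, as.matrix(d[!train, 1:3]),
                     ntreelimit = param_dart$nrounds)
sqrt(mean((d[!train, "y"] - xgb_pred2)^2))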

liuyanguu commented

@ja-thomas, I am not very familiar with mlr... I am curious: how is the predict function called when the object is an autoxgboost result?
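(One way to poke at this yourself, just using base R's S3 tools: predict() dispatches on the class of the result object, so you can look up and read the method that handles it.)

# Inspect which predict method S3 dispatch selects for the result object
class(m.axgb)
# Read the body of the matching method, e.g. predict.AutoxgbResult;
# getS3method finds registered methods even if they are not exported
getS3method("predict", class(m.axgb)[1])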


ja-thomas commented Nov 5, 2018

Hi,

Sorry for the late reply; I was away for a few days.

Thanks for the issue. This is indeed very surprising: I found the problem to be that we call cpoDropConstants, which seems to drop features that are far from constant. This is a bug in mlrCPO.

For now I'll drop this step from the preprocessing until it is fixed in mlrCPO.

See here: mlr-org/mlrCPO#59
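A quick way to check the effect on this dataset (a sketch, assuming mlrCPO's %>>% operator applies a CPO directly to a data.frame): run cpoDropConstants alone on Kodi's training data and see which of the clearly non-constant predictors disappear.

library(mlrCPO)
# Apply only the constant-dropping step; with the bug, non-constant
# features can go missing here.
d.dropped <- d[train, ] %>>% cpoDropConstants()
setdiff(names(d[train, ]), names(d.dropped))  # features that were dropped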
