categorical variables mishandled by varimp, variable.step, bart.step #31

Open
AMBarbosa opened this issue Dec 5, 2022 · 4 comments

@AMBarbosa

AMBarbosa commented Dec 5, 2022

When the predictors include categorical variables, dbarts::bart includes them but embarcadero removes them. This appears to be because embarcadero decides which variables to remove based on unlist(attr(model$fit$data@x, "drop")), where the categorical variables are split and renamed to reflect their categories, so their names no longer match the original predictor names. This leads to an error in varimp (which fails for models that include categorical predictors) and to categorical variables being automatically excluded a priori by variable.step and bart.step, with a message unfairly blaming dbarts. Here is some reproducible code:

# load the packages used below:
library(dbarts)
library(embarcadero)

# generate some data as in ?bart examples:

f <- function(x) {
  10 * sin(pi * x[,1] * x[,2]) + 20 * (x[,3] - 0.5)^2 +
    10 * x[,4] + 5 * x[,5]
}

set.seed(99)
sigma <- 1.0
n     <- 100

x  <- matrix(runif(n * 10), n, 10)
Ey <- f(x)
y  <- rnorm(n, Ey, sigma)


# make 'y' binary:
y <- ifelse(y > mean(y), 1, 0)

# make one of the x variables categorical:
x <- data.frame(x)
x[,1] <- ifelse(x[,1] > mean(x[,1]), "high", "low")
head(x)


# fit a bart model:
set.seed(99)
bartFit <- bart(x, y, keeptrees = TRUE)

summary(bartFit)  # notice 10 variables (i.e. including the categorical one) in predictor list

bartFit$fit$data
unlist(attr(bartFit$fit$data@x, "drop"))  # notice X1 (categorical variable) named here as X11 and X12 (one for each category)
# X11 X12  X2  X3  X4  X5  X6  X7  X8  X9 X10 
#  52  48   0   0   0   0   0   0   0   0   0 

# attempt to compute variable importance with 'embarcadero':
varimp(bartFit)  # Error in data.frame(names, varimps) : arguments imply differing number of rows: 9, 10

# but the variable importance info is there, including for the categorical variable (though it's also renamed here):
rel_imp <- bartFit$varcount / rowSums(bartFit$varcount)
colnames(rel_imp)
# [1] "X1.low"   "X2"     "X3"     "X4"     "X5"     "X6"     "X7"     "X8"     "X9"     "X10"

# attempt to simplify the model with 'embarcadero':
variable.step(x, y)
# X1 (the categorical variable) is said to be dropped by 'dbarts', but it wasn't really:
# it was dropped by 'embarcadero', which expects unlist(attr(bartFit$fit$data@x, "drop"))
# to have the original variables' names
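
The name mismatch described above can be illustrated without fitting a model. What follows is only a minimal sketch in base R, not the fix that went into any 'embarcadero' branch: map_to_original() is a hypothetical helper, the expanded names are copied from the output printed above, and the small varcount matrix is made-up toy data standing in for bartFit$varcount. It assumes every expanded column name starts with the name of the predictor it came from, which can be ambiguous (e.g. a factor X1 with a level "0" would collide with X10).

```r
# Hypothetical helper (not part of 'embarcadero' or 'dbarts'):
# map expanded column names back to the original predictor names.
map_to_original <- function(expanded, original) {
  vapply(expanded, function(nm) {
    # an exact match means the column was not expanded
    if (nm %in% original) return(nm)
    # otherwise keep the original names that are a prefix of the expanded name
    hits <- original[startsWith(nm, original)]
    if (length(hits) == 0) return(NA_character_)
    # if several prefixes match, prefer the longest one
    hits[which.max(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

original <- paste0("X", 1:10)
expanded <- c("X11", "X12", paste0("X", 2:10))  # names as printed above
mapped <- map_to_original(expanded, original)
unique(mapped)  # one entry per original predictor, categorical one included

# toy counts standing in for bartFit$varcount (made-up numbers)
varcount <- matrix(c(4, 1, 5,
                     2, 3, 5), nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, c("X1.low", "X2", "X3")))
rel_imp <- varcount / rowSums(varcount)
groups <- map_to_original(colnames(rel_imp), original)
# sum the mean relative importance of all columns belonging to one predictor
imp_by_var <- tapply(colMeans(rel_imp), groups, sum)
imp_by_var
```

Matching on exact names first keeps X10 from being mistaken for a category of X1; a more robust approach would take the mapping from the model's own metadata rather than from name prefixes.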
@AMBarbosa

I can attempt to fix this and submit a pull request for your consideration once I've managed to.

@charleygros

> I can attempt to fix this and submit a pull request for your consideration once I've managed to.

@AMBarbosa, did you manage to fix this? I'd be interested, and it would be very much appreciated.

@AMBarbosa

I'm working on it. I forked 'embarcadero', and if you install my branch with install_github('AMBarbosa/embarcadero') (the 'install_github' function is available from the 'devtools' or the 'remotes' package) you can try it out already. I'd actually appreciate some feedback on how it's working. I still haven't finished testing, nor adapting this to 'rbart' models.

@charleygros

@AMBarbosa: many thanks for this. I installed your branch and tested it on my data: the results all made sense to me and the functions worked as expected. I haven't looked at the changes themselves, though. Great work!
