Support for eval(validation) data #189

Open
talegari opened this issue Nov 26, 2022 · 4 comments

@talegari

The following problem arose in a setup where one of the preprocessing steps was embed::step_lencode_glm (supervised target encoding via a GLM) and the model was xgboost.

From the documentation of parsnip::xgb_train, it appears that separate evaluation data cannot be supplied for early stopping. While the validation argument sets aside some validation (eval) data for early stopping, it's not clear whether the recipe is applied before or after the split into train and validation parts. How does this work?

It might be a good idea to support something like this:

# case 1: User specifies train and eval
workflow() %>% 
    add_recipe(some_recipe) %>% 
    add_model(some_model) %>% 
    fit(train_data = A, eval_data = B, use_eval_in_early_stopping = TRUE)

# case 2: Use the existing 'initial_split' class object
splitter <- initial_split(dataset, prop = 0.7)

workflow() %>% 
    add_recipe(some_recipe) %>% 
    add_model(some_model) %>% 
    fit(splitter, use_eval_in_early_stopping = TRUE)

where

  1. the recipe is always trained (prepped) on the train part and applied (baked) to the eval (validation) part, and
  2. the eval data is used for early stopping if the algorithm supports it and the flag is set to TRUE.
@simonpcouch
Contributor

Thanks for the issue!

> From the documentation of parsnip::xgb_train, it appears that separate evaluation data cannot be supplied for early stopping. While the validation argument sets aside some validation (eval) data for early stopping, it's not clear whether the recipe is applied before or after the split into train and validation parts. How does this work?

While fitting that workflow on some training set A, the recipe is first prepped on and applied to all of A, and the result, bake(A), is passed to parsnip::xgb_train. The validation argument in that function can then be used to allot some of bake(A) for use in a watchlist (i.e., as a validation set, possibly for early stopping).

> the eval data is used for early stopping if the algorithm supports it and the flag is set to TRUE.

The functionality that (I believe) the proposed use_eval_in_early_stopping argument would provide is already available by passing validation and setting a non-NULL early_stop engine argument:

some_model <-
  boost_tree() %>%
  set_engine("xgboost", validation = .2, early_stop = 10)

If this doesn't cover your use case, could you please provide a minimal reprex (reproducible example) demonstrating the functionality you'd like to see? Could you also clarify your notation "eval(validation)" and "validation(eval)"?

@talegari
Author

talegari commented Dec 3, 2022

Simon, thanks for your response.

The problem is right here:

> While fitting that workflow on some training set A, the recipe is first prepped on and applied to all of A, and the result, bake(A), is passed to parsnip::xgb_train. The validation argument in that function can then be used to allot some of bake(A) for use in a watchlist (i.e., as a validation set, possibly for early stopping).

Current flow:

data (A) 
--> train(prep) recipe on A 
--> apply recipe on A to get A_new
--> split A_new into A_train and A_validation
--> model on A_train and use A_validation for early stopping

Expected flow:

data (A) 
--> split into A_train and A_validation 
--> train(prep) a recipe on A_train
--> apply(bake) trained(prepped) recipe on A_train and A_validation to obtain train_new and validation_new
--> model on train_new and validation_new for early stopping

In the current flow, there is a data leak: the recipe learns from the combined dataset, including the rows that are later used for validation.
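
For reference, the expected flow can be written out manually with rsample and recipes, using the placeholder names A and some_recipe from this thread:

library(rsample)
library(recipes)

split <- initial_split(A, prop = 0.8)
A_train <- training(split)
A_validation <- testing(split)

# the recipe learns only from A_train, so the validation rows cannot leak
prepped <- prep(some_recipe, training = A_train)
train_new <- bake(prepped, new_data = NULL)              # processed training data
validation_new <- bake(prepped, new_data = A_validation) # processed validation data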

@topepo
Member

topepo commented Dec 6, 2022

So there is the possibility that the validation set used inside xgb_train() might lead to optimistic results (due to the preprocessing, not the model).

If that is a potential issue for your recipe, I would use rsample::validation_split() instead of the validation argument of xgb_train(). That will quarantine the holdout data from both the model and the preprocessor. You can then tune trees to use this external validation set to determine when to stop boosting.

The API that you suggest is difficult to implement, since xgb_train() does not have access to the recipe and so could not preprocess that data separately.
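
For concreteness, a sketch of the validation_split() route, assuming rsample and tune are available (the data, recipe, and grid values here are illustrative):

library(rsample)
library(tune)
library(workflows)
library(parsnip)

# hold out 20% of A; resampled fits prep the recipe on the analysis
# portion only, so the validation rows stay quarantined
val_set <- validation_split(A, prop = 0.8)

wf <-
  workflow() %>%
  add_recipe(some_recipe) %>%
  add_model(
    boost_tree(trees = tune()) %>%
      set_mode("regression") %>%
      set_engine("xgboost")
  )

# evaluate a few candidate numbers of boosting rounds on the validation set
res <- tune_grid(wf, resamples = val_set, grid = data.frame(trees = c(250, 500, 1000)))
select_best(res, metric = "rmse")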

@talegari
Author

talegari commented Dec 7, 2022

Max, thanks for your response.

> You can then tune trees to use this external validation set to determine when to stop boosting.

Would you mind sharing some example code showing how to achieve this?

PS: tidymodels offers a great system for building and using models.
I would like to keep the workflow clean and avoid calling xgboost::xgb.train directly with validation data in the watchlist parameter, as that would take me outside the tidymodels paradigm.
