
[Bug] Wrong output shapes for Shapley values #342

Open
RAMitchell opened this issue Mar 10, 2023 · 3 comments

Comments

@RAMitchell
Contributor

In the case of a binary classification model from sklearn, we expect output for both the positive and negative classes (this would be consistent with the normal prediction output). Because the model is transferred in the treelite format, it has num_classes set to 1 in the xgboost style, so the Shapley values are written as a single column. There is no way to detect whether the model is a regression or binary classification model from the information given in treelite, so we cannot simply mirror the output to correct the result on the triton-fil side without also doing the same for every regression model.
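A minimal sketch of the shape mismatch described above, using a toy sklearn random forest (the data and model sizes are illustrative only, not from triton-fil):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 20 rows, 4 features (sizes are illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# sklearn reports probabilities for BOTH classes of a binary model:
print(clf.predict_proba(X).shape)  # (20, 2)

# Shapley output consistent with that prediction would therefore need one
# (n_features + 1)-wide block per class, i.e. shape (20, 5, 2), not the
# single column produced when treelite carries num_classes == 1.
```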

@RAMitchell
Contributor Author

At this stage I think the path of least resistance is to output the Shapley values only for the positive class. This is not ideal, because generally we want Shapley values to add up to the normal prediction output, e.g. shapley_values.sum(axis=-1) == prediction_output.
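The additivity property referred to here can be checked numerically; a small sketch with hypothetical Shapley values (the numbers are made up, and the last column holds the base value, following the xgboost pred_contribs convention):

```python
import numpy as np

# Hypothetical Shapley matrix: 2 rows x (3 features + 1 bias column).
shapley_values = np.array([
    [0.10, -0.05, 0.20, 0.40],
    [-0.15, 0.25, 0.05, 0.40],
])

# Additivity: each row of Shapley values sums to that row's raw prediction.
prediction_output = shapley_values.sum(axis=-1)
print(prediction_output)  # [0.65 0.55]
```

When only the positive-class column is emitted for a binary classifier, this check can no longer be performed against the two-column probability output.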

@hcho3
Collaborator

hcho3 commented Mar 15, 2023

There is no way to detect if the model is a regression or binary classification model from the information given in treelite

Would it be useful if Treelite stored a flag indicating whether the model is a regression model? I can get it in for Treelite 4.0.

@RAMitchell
Contributor Author

Yes, this is a good idea. This is currently only a problem for random forest models. In the case of xgboost we can tell from the output transformation that it is classification, but random forest classification uses the identity transform, so we can't actually tell the difference.

I think this will become a problem in the future for multi-output regression models as well. The current implementation assumes that all multi-output models are classification models; it may be helpful for downstream applications to be able to differentiate these cases.
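The ambiguity described in this comment can be sketched as a (hypothetical) detection heuristic over treelite-style metadata; `looks_like_binary_classifier` and its parameters are names invented here for illustration, not part of any real API:

```python
# Hypothetical heuristic: with only treelite-style metadata available,
# an xgboost binary classifier is identifiable by its sigmoid transform,
# but a random-forest classifier uses the identity transform and is
# therefore indistinguishable from a regressor.
def looks_like_binary_classifier(pred_transform: str, num_class: int) -> bool:
    if num_class > 1:
        return True                     # multiclass: unambiguous
    return pred_transform == "sigmoid"  # xgboost-style binary classifier

print(looks_like_binary_classifier("sigmoid", 1))   # True  (xgboost binary)
print(looks_like_binary_classifier("identity", 1))  # False (RF classifier OR regressor)
```

The `"identity"` case is exactly the gap an explicit regression flag in Treelite 4.0 would close.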
