A framework for Bayesian model selection (BMS) and Bayesian model averaging (BMA).
This package applies Bayesian model selection and model averaging to linear regression models. It can be used to identify the best model among several competing linear regression models. When model uncertainty is high, it can instead perform Bayesian model averaging to give more reliable and robust estimates of the quantities of interest.
To use this package, you need the following packages:
numpy>=1.12
sklearn>=0.20.1
matplotlib>=3.1
pandas>=0.25.1
Once all the required dependencies are installed, you can download the code from GitHub using:
git clone https://github.com/mamunm/BayesianFramework
Then, you can run setup.py to install:
python setup.py install
Alternatively, you can add the file location to your PYTHONPATH:
export PYTHONPATH="/path/to/BayesianFramework:${PYTHONPATH}"
What can this project do?
- linreg: performs linear regression on data
- linzoo: constructs the linear regression zoo from the dataframe
- bayesframe: performs Bayesian model selection and averaging, and can be used to make future predictions.
Here is a simple code snippet showing how to use bayesframe.LinReg on any data:
#import module
from bayesframe import LinReg
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston
#get the data
data = load_boston()
X = data['data'][:, 1]
y = data['target']
# Initialize the model
lin = LinReg(X=X, y=y, val_scheme="leave_one_out")
#print model data
print(lin.get_model_data())
Running this script will produce the following output:
{'slope': array([0.14213999]), 'intercept': 20.917579117799832, 'rmse': 8.60490557714858, 'n_dp': 506}
Here, slope and intercept are the model parameters, n_dp is the number of data points used to fit the model, and rmse is the model's root mean squared error. If val_scheme is set to None or default, the sample variance is computed. To approximate the population variance, use the leave_one_out or k-Fold cross-validation scheme. In the k-Fold scheme, replace k with an integer, e.g., use 5-Fold for 5-fold cross-validation.
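The difference between an in-sample error and a cross-validated error can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the package's implementation: the helper names (fit_line, in_sample_rmse, k_fold_rmse) and the synthetic data are hypothetical, and the folds are taken in order without shuffling.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept


def rmse(residuals):
    """Root mean squared error of a residual vector."""
    return float(np.sqrt(np.mean(np.square(residuals))))


def in_sample_rmse(x, y):
    """RMSE measured on the same points that were used for the fit."""
    slope, intercept = fit_line(x, y)
    return rmse(y - (slope * x + intercept))


def k_fold_rmse(x, y, k):
    """RMSE of held-out predictions; k == len(x) gives leave-one-out."""
    folds = np.array_split(np.arange(len(x)), k)
    residuals = np.empty(len(x))
    for test_idx in folds:
        train = np.ones(len(x), dtype=bool)
        train[test_idx] = False  # hold this fold out of the fit
        slope, intercept = fit_line(x[train], y[train])
        residuals[test_idx] = y[test_idx] - (slope * x[test_idx] + intercept)
    return rmse(residuals)


# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)

rmse_in = in_sample_rmse(x, y)
rmse_loo = k_fold_rmse(x, y, k=len(x))
rmse_5fold = k_fold_rmse(x, y, k=5)
print(rmse_in, rmse_loo, rmse_5fold)
```

For a least-squares fit, the held-out error is never smaller than the in-sample error, which is why a cross-validated RMSE is the more conservative estimate of how the model behaves on unseen data.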
Now, a simple demonstration of the bayesframe.LinZoo class:
#import module
from bayesframe import load_data
from bayesframe import LinZoo
#get the data
data = load_data()
#Initialize the model
linzoo = LinZoo(df=data, target="Target", val_scheme=None, bic_scheme="per_n")
#Build the zoo
linzoo.build_zoo()
#Plot and show the envelope
linzoo.plot_envelope().show()
This will show the following plot in a matplotlib GUI:
To save the plot, use the following one-liner:
linzoo.plot_envelope().savefig('BIC_Envelope.png')
In the above LinZoo snippet, df is the dataframe, target is the column name of the target property, val_scheme is the same validation scheme described in the LinReg section, and bic_scheme is the scheme used to compute the Bayesian Information Criterion (BIC). If every model is fit on the same number of data points, use None; if the number of data points varies between models, use per_n. Instead of passing a dataframe to instantiate the class, you can also pass the fpath of the csv file.
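To make the idea of a model zoo ranked by BIC concrete, here is a minimal sketch. It rests on several assumptions that the text above does not confirm: that the zoo holds one linear model per non-empty subset of descriptor columns, that the BIC takes the standard Gaussian least-squares form n ln(RSS/n) + k ln(n), and that per_n simply divides that value by n. The package may compute its BIC differently. The functions gaussian_bic and build_zoo and the column names are hypothetical, and a plain dict of arrays stands in for the dataframe.

```python
from itertools import combinations

import numpy as np


def gaussian_bic(y, y_pred, n_params, per_n=False):
    """BIC for a least-squares fit with Gaussian errors: n*ln(RSS/n) + k*ln(n).

    per_n=True divides by n; this is an assumed reading of the "per_n" scheme.
    """
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    bic = n * np.log(rss / n) + n_params * np.log(n)
    return bic / n if per_n else bic


def build_zoo(columns, target, per_n=False):
    """Fit one linear model per non-empty subset of descriptor columns."""
    y = columns[target]
    names = [name for name in columns if name != target]
    zoo = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Descriptor columns plus a column of ones for the intercept.
            design = np.column_stack([columns[c] for c in subset] + [np.ones(len(y))])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            bic = gaussian_bic(y, design @ coef, n_params=len(coef), per_n=per_n)
            zoo["_".join(subset)] = {"BIC": float(bic), "n_dp": len(y)}
    # Delta_BIC is measured relative to the best (lowest-BIC) model.
    best = min(entry["BIC"] for entry in zoo.values())
    for entry in zoo.values():
        entry["Delta_BIC"] = entry["BIC"] - best
    return zoo


# Synthetic data (hypothetical): the target depends on A and B but not on C.
rng = np.random.default_rng(1)
n = 40
columns = {"A": rng.normal(size=n), "B": rng.normal(size=n), "C": rng.normal(size=n)}
columns["Target"] = 1.5 * columns["A"] - 0.7 * columns["B"] + rng.normal(scale=0.1, size=n)

zoo = build_zoo(columns, target="Target", per_n=True)
for name, entry in sorted(zoo.items(), key=lambda item: item[1]["BIC"]):
    print(name, round(entry["BIC"], 3), round(entry["Delta_BIC"], 3))
```

Dividing by n only matters when models are fit on different numbers of data points; with equal n it rescales every BIC by the same factor and leaves the ranking unchanged.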
Now we are all set to move on to Bayesian model selection and averaging. The following code illustrates how to use it:
#Import module
from bayesframe import BayesFrame
from bayesframe import load_data
from bayesframe import load_test_data
#Load the data
data = load_data()
#Initialize the model
bframe = BayesFrame(df=data, target="Target", val_scheme=None,
                    bic_scheme="per_n", model_scheme=["selection"])
#Print the best model
print(bframe.zoo)
It will print:
{'O_N_CH2_NH': {'BIC': 1.1925678847679801,
'Delta_BIC': 0.0,
'intercept': 1.5896455262621,
'n_dp': 20,
'rmse': 0.08917024884254064,
'slope': array([-1.04318739, 0.52604511, 0.39987429, 1.14262368])}}
In the above code, df (you can also use fpath), target, val_scheme, and bic_scheme have the same meaning as in the LinZoo class. model_scheme specifies which scheme to use for model deployment. ["selection"] uses the best model to make future predictions. To use Bayesian Model Averaging (BMA), pass ["averaging", "all"], which uses all models in the averaging scheme. If you want to use only the low-lying models, pass a numerical value instead of "all", e.g., 0.5 to take the models that fall within the σo to σo + 0.5 σo Occam's window.
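The weighting behind model averaging can be illustrated with the common BIC approximation to posterior model probabilities, where each weight is proportional to exp(-Delta_BIC / 2). This is a sketch of that general technique, not a description of what BayesFrame does internally. In particular, the window below is a plain cutoff on Delta_BIC, which is a simplification and does not reproduce the σo-based definition above; the approximation is also meant for an unnormalized BIC, so it may not carry over directly to the per_n scheme. The function names and Delta_BIC values are hypothetical.

```python
import math


def bma_weights(delta_bic):
    """Approximate posterior model weights: w_i proportional to exp(-Delta_BIC_i / 2)."""
    raw = {name: math.exp(-0.5 * d) for name, d in delta_bic.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def occam_window(delta_bic, width):
    """Keep only models whose Delta_BIC is at most `width` (assumed window rule)."""
    return {name: d for name, d in delta_bic.items() if d <= width}


# Hypothetical Delta_BIC values for three competing models.
delta_bic = {"model_a": 0.0, "model_b": 0.4, "model_c": 6.0}

weights_all = bma_weights(delta_bic)  # analogous to averaging over "all"
weights_window = bma_weights(occam_window(delta_bic, width=0.5))
print(weights_all)
print(weights_window)
```

With the window applied, the poorly supported model drops out entirely and the remaining weights are renormalized to sum to one.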
To make predictions with the model:
#Load the test data
t_data = load_test_data()
#Make prediction on the test data
bframe(data=t_data, outpath="out.csv", target="Target", print_rmse=True)
It will print out the following line in the console:
Computed RMSE: 0.0902776566069057
Here, data is the dataframe to make predictions on. Alternatively, you can specify the path to a csv file using fpath. outpath is the path to the csv file where the output will be written, target is the column of the target property, and print_rmse prints the RMSE to the console.
To use averaging, change a few lines:
#Import module
from bayesframe import BayesFrame
from bayesframe import load_data
from bayesframe import load_test_data
#Load the data
data = load_data()
#Initialize the model
bframe = BayesFrame(df=data, target="Target", val_scheme=None,
                    bic_scheme="per_n", model_scheme=["averaging", "all"])
#Load the test data
t_data = load_test_data()
#Make prediction on the test data
bframe(data=t_data, outpath="out.csv", target="Target", print_rmse=True)
The printed RMSE:
Computed RMSE: 0.10988286281396
As expected, model averaging performs slightly worse than model selection here, because it averages over all the models rather than using only the best one. In return, the predictions are more robust than those of the model selection scheme.
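The contrast between the two schemes comes down to how the per-model predictions are combined. The sketch below assumes that an averaged prediction is the weight-averaged sum of the individual model predictions, and that selection keeps only the highest-weight model. All numbers, model names, and weights are hypothetical and are not taken from the package or its data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical predictions from three models on four test points.
predictions = {
    "model_a": np.array([1.0, 2.0, 3.0, 4.0]),
    "model_b": np.array([1.2, 1.9, 3.1, 4.3]),
    "model_c": np.array([0.5, 2.5, 2.5, 4.5]),
}
# Hypothetical model weights, assumed to sum to one.
weights = {"model_a": 0.6, "model_b": 0.3, "model_c": 0.1}
y_true = np.array([1.1, 2.0, 3.0, 4.1])

# Selection: use only the model with the largest weight.
selected = predictions[max(weights, key=weights.get)]
# Averaging: weight every model's prediction and sum them.
averaged = sum(weights[name] * predictions[name] for name in predictions)

print("selection RMSE:", rmse(y_true, selected))
print("averaging RMSE:", rmse(y_true, averaged))
```

Which scheme scores better depends on the data; the averaged prediction hedges across models instead of committing to a single one.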