This repository compares the performance of multiple machine learning (ML) algorithms when the data distribution is highly imbalanced, with one overwhelming response category. The dataset is randomly divided into two parts: a training set and a test set. I then fit each model on the training set and apply it to the test set, recording the misclassification error.
Furthermore, I use ROC curves and AUC to compare performance and conclude that KNN, a non-parametric method, outperforms the others when the distribution is highly imbalanced.
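The workflow described above (random split, fit on the training set, misclassification error and AUC on the test set) can be sketched roughly as follows. This is a minimal illustration and not the repository's actual script: the data are synthetic, and the variable names, the 80/20 split ratio, and k = 5 are all assumptions made for the example. AUC is computed with the rank-sum formula so that only the `class` package is needed.

```r
library(class)  # provides knn()

set.seed(42)  # fixed seed so the random split is reproducible

# Synthetic imbalanced data (an assumption for illustration):
# roughly 10% of the responses are "yes".
n <- 1000
y <- factor(ifelse(runif(n) < 0.10, "yes", "no"), levels = c("no", "yes"))
x <- data.frame(
  x1 = rnorm(n, mean = ifelse(y == "yes", 1.5, 0)),
  x2 = rnorm(n, mean = ifelse(y == "yes", 1.0, 0))
)

# Random 80/20 split into training and test sets (ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

# Classify the test set with KNN; k = 5 is an arbitrary choice here.
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 5, prob = TRUE)

# Misclassification error: share of test cases predicted incorrectly.
test_error <- mean(pred != y_test)

# knn() reports the vote share of the winning class; convert it to a
# score for the "yes" class so the cases can be ranked.
win_share <- attr(pred, "prob")
score <- ifelse(pred == "yes", win_share, 1 - win_share)

# AUC via the rank-sum (Mann-Whitney) formula; ties get average ranks.
pos   <- y_test == "yes"
n_pos <- sum(pos)
n_neg <- sum(!pos)
auc <- (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

cat("Test misclassification error:", round(test_error, 3), "\n")
cat("Test AUC:", round(auc, 3), "\n")
```

With a rare positive class, a low misclassification error can simply reflect always predicting the majority class, which is why a ranking measure such as AUC is reported alongside it.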
For the full analysis, please refer to my Medium post, "A Pain in the Neck: Predict A Rare Event using 5 Machine Learning Methods": https://towardsdatascience.com/classifying-rare-events-using-five-machine-learning-techniques-fab464573233.
This project is conducted in R, and the following libraries must be installed beforehand: readr, knitr, dplyr, plyr, class, reshape2, tree, randomForest, car, and e1071.
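One way to check which of these libraries are missing before running the analysis is sketched below. The variable names `required` and `missing_pkgs` are made up for this example, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages listed in this README.
required <- c("readr", "knitr", "dplyr", "plyr", "class",
              "reshape2", "tree", "randomForest", "car", "e1071")

# Compare against what is already installed in the current R library.
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```

When loading both plyr and dplyr in the same session, attaching plyr first avoids plyr masking dplyr functions of the same name.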
This dataset was collected by a Portuguese banking institution to assess the effect of direct marketing campaigns (phone calls) and to predict whether a client will subscribe to a term deposit. The data can be accessed at https://archive.ics.uci.edu/ml/datasets/bank+marketing.
Leihua Ye is a Ph.D. Researcher at UC Santa Barbara. He has received extensive training in Causal Inference, Research Design, Machine Learning, and Big Data.
He received his B.A. and M.A. from the University of Nottingham.
Email: [email protected]
LinkedIn: www.linkedin.com/in/leihuaye
Tech Blog: https://leihua-ye.medium.com