Machine Learning (ML) model for price prediction using Linear Regression.
This code was written in MATLAB for a Kaggle competition. The proposed ML model represents one possible solution to the housing price prediction problem.
The dataset provided by Kaggle consists of 2919 samples with 79 features each. It is originally split into training and testing datasets with 1460 and 1459 samples, respectively. To assess our model's performance, the training dataset is split into two subsets. One subset contains 86% of the original training data and is used to train our model; the second, called the validation subset, contains the remaining 14% and is used to validate it. The accuracy on this validation subset gives us an estimate of how well our design performs.
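The 86/14 split described above can be sketched as follows. This is a minimal illustration in Python rather than the project's MATLAB code. The function name `split_train_validation`, the fixed seed, and the choice to shuffle before splitting are assumptions made for the example; the original text does not say whether the samples were shuffled.

```python
import random


def split_train_validation(samples, train_fraction=0.86, seed=0):
    """Split samples into a training subset and a validation subset."""
    indices = list(range(len(samples)))
    # Assumption: shuffle first so both subsets are drawn at random.
    random.Random(seed).shuffle(indices)
    n_train = round(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:n_train]]
    validation = [samples[i] for i in indices[n_train:]]
    return train, validation


# Stand-in for the 1460 training samples: just their row indices.
sample_ids = list(range(1460))
train_ids, validation_ids = split_train_validation(sample_ids)
print(len(train_ids), len(validation_ids))  # 1256 204
```

With 1460 samples, 86% rounds to 1256 training rows, leaving 204 rows for validation.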
Data preprocessing consists of the following steps:
- Features with more than 50% missing data are removed;
- All categorical features are transformed into numerical features;
- The features are sorted so that they can be described linearly;
- Some features with very low variance are deleted;
- Outliers are deleted;
- All missing values are found and replaced with either 0 or the most frequent value of the corresponding feature, whichever makes sense;
- The data is normalized, where necessary.
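Several of the steps above can be sketched on a toy table. This is a hedged illustration in Python, not the project's MATLAB code. The helper names, the use of `None` for a missing value, the category order passed to `encode_categories`, and the choice of min-max scaling for normalization are all assumptions. The column names and values are illustrative only.

```python
from collections import Counter


def drop_sparse_features(columns, max_missing=0.5):
    """Keep only columns whose fraction of missing (None) values is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missing:
            kept[name] = values
    return kept


def fill_missing(values, strategy="mode"):
    """Replace None with 0 ("zero") or with the most frequent present value ("mode")."""
    if strategy == "zero":
        fill = 0
    else:
        present = [v for v in values if v is not None]
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def encode_categories(values, order):
    """Map category labels to integers 0, 1, 2, ... following the given order."""
    mapping = {label: i for i, label in enumerate(order)}
    return [mapping[v] for v in values]


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy data with illustrative column names and values.
raw = {
    "PoolQC": [None, None, None, "Gd"],     # 75% missing, so it is dropped
    "GarageCars": [2, None, 2, 1],          # one gap, filled with the mode
    "ExterQual": ["TA", "Gd", "TA", "Ex"],  # categorical, encoded by order
}
cleaned = drop_sparse_features(raw)
garage = fill_missing(cleaned["GarageCars"])
quality = encode_categories(cleaned["ExterQual"], order=["Po", "Fa", "TA", "Gd", "Ex"])
scaled = min_max_normalize(quality)
```

Encoding categories in a meaningful order is one way to make a categorical feature usable by a linear model, since the resulting integers then rise with the underlying quality.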
The training dataset is used to calculate "w" and "b". This is done by solving the equation $\hat{w} = (\hat{X}^T \hat{X} + \varepsilon I_{68})^{-1} \hat{X}^T y$, where $\hat{X}$ is a modified version of the training dataset, $y$ is a vector that contains the labels (prices), $I_{68}$ is the 68×68 identity matrix, and $\varepsilon$ is a small value, in our case 0.01. The term $\varepsilon I_{68}$ is necessary to ensure that the inverse $(\hat{X}^T \hat{X} + \varepsilon I_{68})^{-1}$ exists. Without it, the matrix can be badly scaled and the results may be inaccurate. Solving the equation yields the optimal parameters "w*" and "b*". They are used to create a linear model that predicts house prices from their features: $Y = w^T X + b$.
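The closed-form solution can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the project's MATLAB code. It assumes that the modified matrix is the feature matrix with a column of ones appended, so that the last entry of the solution is "b". It also assumes that ε is added to the diagonal, which is what keeps the matrix invertible. The function names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_linear_model(features, prices, eps=0.01):
    """Return (w, b) from the regularized normal equations."""
    # Assumption: append a column of ones so the bias is learned with the weights.
    x_hat = [list(row) + [1.0] for row in features]
    n = len(x_hat[0])
    gram = [[sum(row[i] * row[j] for row in x_hat) for j in range(n)] for i in range(n)]
    for i in range(n):
        gram[i][i] += eps  # the eps * I term
    rhs = [sum(row[i] * price for row, price in zip(x_hat, prices)) for i in range(n)]
    w_hat = solve_linear_system(gram, rhs)
    return w_hat[:-1], w_hat[-1]


def predict(features, w, b):
    """Evaluate the linear model w^T x + b for each row."""
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in features]


# Toy data generated from price = 3 * x + 2.
features = [[0.0], [1.0], [2.0], [3.0], [4.0]]
prices = [3.0 * row[0] + 2.0 for row in features]
w, b = fit_linear_model(features, prices)
```

On this toy data the recovered parameters are close to 3 and 2 but not exact, because the ε term shrinks the solution slightly.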
Linear regression was chosen as the prediction method because the given data can be described linearly, and it turned out to be fairly accurate. Two evaluation metrics are used: root mean squared logarithmic error (RMSLE) and relative prediction error percentage. In the final result, RMSLE showed an error of 14% and the relative prediction error percentage was 12%, so the final accuracy of the algorithm is approximately 87%.
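The two metrics can be sketched as follows, again in Python rather than the project's MATLAB. RMSLE is written here in its common form using log(1 + x). The definition of relative prediction error as the mean of |predicted - actual| / actual, expressed in percent, is an assumption, since the text does not give the exact formula. The prices are toy values.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_error_percent(predicted, actual):
    """Mean of |predicted - actual| / actual, in percent (assumed definition)."""
    ratios = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy prices, not results from the competition data.
actual_prices = [100.0, 100.0]
predicted_prices = [110.0, 90.0]
error_log = rmsle(predicted_prices, actual_prices)
error_pct = relative_error_percent(predicted_prices, actual_prices)
```

Because RMSLE compares logarithms, it penalizes relative differences rather than absolute ones, so an error on an expensive house does not dominate the score.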