This notebook shows how you can:
- run a processing job to run a Scikit-Learn script to clean, pre-process, perform feature engineering, and split the input data into train and test sets.
- run a training job on the pre-processed training data to train a model model
- run a processing job on the pre-processed test data to evaluate the trained model's performance
- use your own custom container with to run processing jobs with your own Python libraries and dependencies.
The dataset used is the Census-Income KDD Dataset. We will select features from this dataset, clean the data, and turn the data into features that our training algorithm can use to train a binary classification model, and split the data into train and test sets.
The task is to predict whether rows representing census responders have an income greater than $50K
, or less than 50K
. The dataset is heavily class imbalanced, with most records being labeled as earning less than $50K
. After training a logistic regression model, we will evaluate the model against a hold-out test dataset, and save the classification evaluation metrics, including precision, recall, and F1 score for each label, and accuracy and ROC AUC for the model.