This repository contains a set of tools to automatically fine-tune a RoBERTa base model (from Hugging Face) on Twitter sentiment data (more info on Kaggle).
The overarching goal is to enable the model to correctly classify Twitter sentiment as positive or negative. To achieve this, the training comprises two steps:
- Fine-tuning on the masked language modelling (MLM) task (for more info, see Hugging Face)
- Fine-tuning on the classification task
Please note: both tasks are trained on the same dataset.
The code comprises three main sections:
- the main one, which contains all the scripts needed to run everything on an Ubuntu 22.04 EC2 machine (with GPU)
- notebooks: contains some exploratory work done in Jupyter notebooks (TO BE DELETED)
- training_language_model: contains all the relevant code used to build the Docker images for the two fine-tuning steps
This is intended to run on a GPU-enabled EC2 machine with Ubuntu 22.04. First, we need to bootstrap our environment (a sketch of the full sequence follows the list):
- run `prepare_docker.sh`
- run `prepare_docker_compose.sh`
- run `install_NVIDIA_docker_toolkit.sh`
- run `install_ubuntu_NVIDIA_drivers.sh`
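Putting the four scripts together, a minimal bootstrap sequence might look like the sketch below; it assumes the scripts sit in the repository root and are not yet executable, and a reboot may be needed after the NVIDIA driver installation:

```bash
# Bootstrap a fresh GPU-enabled Ubuntu 22.04 EC2 instance.
# Assumes the four scripts live in the repository root.
chmod +x ./*.sh
./prepare_docker.sh
./prepare_docker_compose.sh
./install_NVIDIA_docker_toolkit.sh
./install_ubuntu_NVIDIA_drivers.sh   # a reboot may be required afterwards
```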
Then we can call Docker Compose:
- generate a .env file (see the template further below)
- run `docker compose up <service_name>` (see the example after the service list)
The two services are built from two Docker images:
- training_mlm: fine-tunes on masked language modelling
- training_clf: fine-tunes on classification
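For example, assuming the service names match the image names above and the .env file sits next to the compose file, the two steps can be launched one after the other:

```bash
# Run the MLM fine-tuning first; Compose picks up the .env file automatically.
docker compose up training_mlm

# Once the MLM step has finished, run the classification fine-tuning.
docker compose up training_clf
```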
Please note: you can build new images using the code inside the training_language_model folder.
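A rough sketch of rebuilding one of the images is shown below; the build context and tag are assumptions, so adapt them to the actual layout of the folder:

```bash
# Rebuild the MLM fine-tuning image from the code in training_language_model.
# The Dockerfile location and image tag here are assumptions about the layout.
docker build -t training_mlm ./training_language_model
```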
Generate an .env file with the information below.

Input necessary:
- HF_USER= your Hugging Face username, used to push the model to the Hugging Face Hub
- HF_TOKEN= your Hugging Face token, used to push the model to the Hugging Face Hub

Need to change only if you want to change the code base (new model, different mount point, new dataset):
- DIRPATH=data
- MODEL_VERSION_MLM=roberta-fine-tuned-twitter
- MODEL_VERSION_CLF=roberta-fine-tuned-twitter-sentiment
- DATASET_VERSION=TwitterSentiment140
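A complete example .env could therefore look like this; HF_USER and HF_TOKEN are placeholders to replace with your own credentials:

```bash
# Example .env file; the two Hugging Face values are placeholders.
HF_USER=your-hf-username
HF_TOKEN=your-hf-token
DIRPATH=data
MODEL_VERSION_MLM=roberta-fine-tuned-twitter
MODEL_VERSION_CLF=roberta-fine-tuned-twitter-sentiment
DATASET_VERSION=TwitterSentiment140
```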