This repo is my attempt to experiment with a few aspects of GPT and to get hands-on experience with my theoretical learnings. I used Karpathy's nanoGPT for this experimentation; you can find more about it here: https://github.com/karpathy/nanoGPT
I focused on the following tasks:
Setting up the basic flow for the experiment and generating some initial results.
Prepare:

```sh
python data/shakespeare_char/prepare.py
```

Train:

```sh
python train.py config/train_shakespeare_char.py --device=mps --compile=False
```

Sample:

```sh
python sample.py --out_dir=out-shakespeare-char --device=cpu
```
Example generated by the model:

```
BRUTUS:
For the devil'd the beast torm:
When could I should be saw the pride
That should be thou not be subject.'
MERCUTIO:
For what, comes the way
Methink of my company?
MERCUTIO:
Tranio, Romeo, go, tyrant, and since to speak.
SIRREY:
Then did your hearts first,
For I make more call them again.
BRUTUS:
Come, sir, my lord.
SICINIUS:
Sir, sir.
CORIOLANUS:
Well, let us murderer?
First Servingman:
Take me to have better.
First Citizen:
I can perfort you are thou wert not the good?
CORIOLAN
```
Modifying hyperparameters such as the number of heads and layers to find a setting that produces the lowest validation loss.
Use the following command to train nanoGPT on the Shakespeare data using your Mac's on-chip GPU. The hyperparameters are set low so that the run takes no more than ~10 minutes; feel free to play around with them.

```sh
python train.py config/train_shakespeare_char.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=3000 --lr_decay_iters=3000 --dropout=0.0
```
We get the following losses with different numbers of heads and layers:
| Layers | Heads | Train Loss | Validation Loss |
|---|---|---|---|
| 4 | 4 | 1.6169 | 1.7946 |
| 8 | 8 | 1.6393 | 1.7759 |
| 16 | 16 | 1.5978 | 1.7658 |
| 32 | 32 | 1.5765 | 1.7662 |
This metric is meant to capture how close the generated data distribution is to the training distribution.
I used the BLEU score as my specific metric. The BLEU (Bilingual Evaluation Understudy) score is computed by comparing n-grams (contiguous sequences of n words or characters) between the generated text and reference texts. BLEU typically considers n-grams of several lengths, from unigrams (single words) up to higher-order n-grams such as bigrams and trigrams. In this way BLEU measures the similarity between the generated text and the reference text (here, the training text), which makes it a good fit for our goal of comparing how close the generated data distribution is to the training distribution.
I wrote a script `evaluation_bleu.py` which uses the nltk library to compute the BLEU score.

```sh
python evaluation_bleu.py --data_dir=shakespeare_char
```

Result: BLEU Score: 0.5455284552845528
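For reference, here is a minimal sketch of how such a BLEU evaluation can be written with nltk. The file paths, whitespace tokenization, and smoothing choice are assumptions on my part; the actual `evaluation_bleu.py` may differ.

```python
# Minimal BLEU sketch (assumptions: file locations, whitespace tokenization,
# default uniform 1- to 4-gram weights; the real evaluation_bleu.py may differ).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical paths: training text as reference, sampled text as hypothesis.
with open("data/shakespeare_char/input.txt") as f:
    reference = f.read().split()
with open("out-shakespeare-char/samples.txt") as f:  # hypothetical samples file
    hypothesis = f.read().split()

# sentence_bleu expects a list of references, each a list of tokens.
smoothing = SmoothingFunction().method1  # avoids zero scores on missing n-grams
score = sentence_bleu([reference], hypothesis, smoothing_function=smoothing)
print(f"BLEU Score: {score}")
```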
This metric is meant to capture how well our model performs at text generation in general, regardless of the data it has been trained on.
I used a simple spell-check function to test whether our model produces words that actually exist in the English language. This is a sensible check because we are training a character-level GPT: if the model learns to assemble characters into meaningful words, that is already a win.

I wrote a script `evaluation_spell_check.py` which uses the nltk words corpus to spell-check the words generated by our model.

```sh
python evaluation_spell_check.py --data_dir=shakespeare_char
```

Result: Percentage of words not correctly spelled: 8.292682926829269 %
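A rough sketch of this kind of spell check using the nltk words corpus might look as follows; the sample-file path and tokenization are assumptions, and the actual `evaluation_spell_check.py` may differ:

```python
# Spell-check sketch (assumption: generated samples live in samples.txt;
# the real evaluation_spell_check.py may tokenize differently).
import re
import nltk

nltk.download("words", quiet=True)
from nltk.corpus import words

vocabulary = set(w.lower() for w in words.words())

with open("out-shakespeare-char/samples.txt") as f:  # hypothetical path
    tokens = re.findall(r"[a-zA-Z']+", f.read())

misspelled = [t for t in tokens if t.lower().strip("'") not in vocabulary]
pct = 100 * len(misspelled) / len(tokens)
print(f"Percentage of words not correctly spelled: {pct} %")
```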
Here I experiment with training nanoGPT on my favourite dataset: the screenplay scripts from the popular TV series Breaking Bad.

I downloaded this dataset from: https://bulletproofscreenwriting.tv/breaking-bad-tv-script-download/
In particular, I use scripts from the following episodes:
Follow these steps to experiment on this dataset:

Prepare:

```sh
python data/breaking_bad_char/prepare.py --num_scripts=1
```

Train:

```sh
python train.py config/train_breaking_bad_char.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=16 --n_head=16 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --ckpt_file=breaking_bad_ckpt.pt
```

Sample:

```sh
python sample.py --out_dir=out-shakespeare-char --ckpt_file=breaking_bad_ckpt.pt --device=cpu
```
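For context, a character-level prepare script generally follows the pattern of nanoGPT's `data/shakespeare_char/prepare.py`: build a character vocabulary, encode the text, and write train/val binaries plus a `meta.pkl`. A minimal sketch is below; the `input.txt` path and 90/10 split are assumptions, and the actual `data/breaking_bad_char/prepare.py` additionally handles the `--num_scripts` flag:

```python
# Sketch of a character-level prepare script, modeled on nanoGPT's
# data/shakespeare_char/prepare.py (input path and 90/10 split are assumptions).
import os
import pickle
import numpy as np

data_path = os.path.join(os.path.dirname(__file__), "input.txt")
with open(data_path, "r") as f:
    data = f.read()

# Build the character-level vocabulary and encoder/decoder maps.
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# 90/10 train/val split, exported as uint16 binaries for train.py.
n = len(data)
train_ids = np.array(encode(data[: int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), "train.bin"))
val_ids.tofile(os.path.join(os.path.dirname(__file__), "val.bin"))

# meta.pkl lets sample.py decode generated ids back to characters.
meta = {"vocab_size": len(chars), "itos": itos, "stoi": stoi}
with open(os.path.join(os.path.dirname(__file__), "meta.pkl"), "wb") as f:
    pickle.dump(meta, f)
```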
An interesting experiment I perform here is to vary the number of characters in the input and observe how the evaluation metrics above change. You can vary the input size using the num_scripts flag in the prepare command above. To produce the evaluation metric results, use the following commands:
```sh
python evaluation_bleu.py --data_dir=breaking_bad_char
```

| Length of Dataset in Characters (1K scale) | BLEU Score |
|---|---|
| 74 | 0.4557877814 |
| 149 | 0.4930498774 |
| 221 | 0.4955752212 |
| 281 | 0.5 |
| 353 | 0.5193415638 |
| 415 | 0.5375203915 |
| 478 | 0.5366795367 |
| 553 | 0.4962593516 |
```sh
python evaluation_spell_check.py --data_dir=breaking_bad_char
```

| Length of Dataset in Characters (1K scale) | Spell Check Score |
|---|---|
| 74 | 0.813 |
| 149 | 0.827 |
| 221 | 0.832 |
| 281 | 0.843 |
| 353 | 0.859 |
| 415 | 0.868 |
| 478 | 0.864 |
| 553 | 0.832 |
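A small driver script can automate this sweep. The sketch below is one way to do it, assuming the exact commands shown above (some training flags are trimmed for brevity, and the 1-to-8 range of `num_scripts` is illustrative):

```python
# Sketch: automate the num_scripts sweep used for the tables above
# (assumes the commands shown earlier; flag list trimmed for brevity).
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for n in range(1, 9):  # hypothetical: 1..8 scripts
    run(["python", "data/breaking_bad_char/prepare.py", f"--num_scripts={n}"])
    run(["python", "train.py", "config/train_breaking_bad_char.py",
         "--device=mps", "--compile=False", "--max_iters=2000",
         "--lr_decay_iters=2000", "--ckpt_file=breaking_bad_ckpt.pt"])
    run(["python", "sample.py", "--out_dir=out-shakespeare-char",
         "--ckpt_file=breaking_bad_ckpt.pt", "--device=cpu"])
    run(["python", "evaluation_bleu.py", "--data_dir=breaking_bad_char"])
    run(["python", "evaluation_spell_check.py", "--data_dir=breaking_bad_char"])
```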
Now we fine-tune the model trained on the Shakespeare dataset on our Breaking Bad dataset. We will also carry out some evaluation to see how much data is required to go from Shakespearean output to something that resembles our dataset.

Fine-tune training:

```sh
python train.py config/finetune_breaking_bad.py --device=mps --compile=False
```
Comparing the pre-train and fine-tune data distributions:

```sh
python compare_similarity.py
```
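As a sketch of what this comparison can look like, the snippet below computes BLEU of the fine-tuned model's samples against both corpora. The paths and tokenization are my assumptions, not necessarily what `compare_similarity.py` does:

```python
# Sketch: BLEU of fine-tuned samples vs. both corpora (paths are assumptions).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_against(reference_path, hypothesis_tokens):
    """BLEU of the generated tokens against one reference corpus."""
    with open(reference_path) as f:
        reference = f.read().split()
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference], hypothesis_tokens,
                         smoothing_function=smoothing)

with open("out-shakespeare-char/samples.txt") as f:  # hypothetical path
    generated = f.read().split()

print("BLEU vs Shakespeare:",
      bleu_against("data/shakespeare_char/input.txt", generated))
print("BLEU vs Breaking Bad:",
      bleu_against("data/breaking_bad_char/input.txt", generated))
```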
Results:

| Data in Characters (1K scale) | Fine-Tune Iterations (max_iters) | BLEU of Output vs. Shakespeare | BLEU of Output vs. Breaking Bad (Our Dataset) |
|---|---|---|---|
| 74 | 20 | 0.45340501 | 0.3324372 |
| 74 | 50 | 0.42239858 | 0.3959435 |
| 74 | 100 | 0.3683274 | 0.4092526 |
| 149 | 20 | 0.4535809 | 0.3545534 |
| 149 | 50 | 0.3669391 | 0.3687556 |
| 149 | 100 | 0.38263950 | 0.402125 |
| 281 | 20 | 0.4276672 | 0.342676 |
| 281 | 50 | 0.39730941 | 0.3757847 |
| 281 | 100 | 0.3564266 | 0.4020054 |
| 553 | 20 | 0.45182 | 0.3509 |
| 553 | 50 | 0.412078 | 0.3925399 |
| 553 | 100 | 0.38356 | 0.392694 |