From 73688419394c7afe66f6cc1d9b0f28227bc211a5 Mon Sep 17 00:00:00 2001
From: Sebastian Raschka
Date: Mon, 23 Sep 2024 12:41:38 -0700
Subject: [PATCH] Add pretrain conversion (#1735)

---
 tutorials/pretrain.md | 56 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/tutorials/pretrain.md b/tutorials/pretrain.md
index f2901c25e3..6e37f2c8ea 100644
--- a/tutorials/pretrain.md
+++ b/tutorials/pretrain.md
@@ -96,6 +96,12 @@ litgpt pretrain pythia-14m \
 
 Often, it makes sense to adopt an existing pretrained model and further pretrain it on our own custom data. The existing pretrained model can be either our own pretrained model or a model downloaded from a model hub.
 
+The following subsections illustrate three typical scenarios:
+
+1. Starting from a downloaded base model
+2. Continuing the pretraining after interruption
+3. Further pretraining on a different dataset
+
 &nbsp;
 
 > [!NOTE]
@@ -107,7 +113,10 @@ Often, it makes sense to adopt an existing pretrained model and further pretrain
 > Using this approach is only recommended for small datasets. Since text data is highly compressible, it is often stored in compressed format, and often in file formats where documents can be loaded row by row without having to load entire files at once. In other words, this `TextFiles` approach is only feasible to store the data in plain text files due to the limited size.
 > For datasets that take up multiple gigabytes, we recommend preprocessing it with [LitData](https://github.com/Lightning-AI/litdata) and then reading it from a local directory or S3 connection using `--data LitData --data.path path/to/your/data`.
+
 &nbsp;
 
+### 1) Continued pretraining when starting from a downloaded base model
+
 For instance, let's assume we download a Pythia model:
@@ -118,20 +127,55 @@ litgpt download EleutherAI/pythia-14m
 
 Next, assume we have a custom dataset stored in text files similar to the *Pretrain on custom data* above. We can further pretrain the Pythia model via the `--initial_checkpoint_dir` setting as follows:
 
 ```bash
-litgpt pretrain pythia-14m \
-  --initial_checkpoint_dir EleutherAI/pythia-14m \
-  --tokenizer_dir EleutherAI/pythia-14m \
-  --out_dir new_phi-2_checkpoint \
+litgpt pretrain pythia-160m \
+  --initial_checkpoint_dir EleutherAI/pythia-160m \
+  --tokenizer_dir EleutherAI/pythia-160m \
+  --out_dir ./new_pretrained_checkpoint \
   --data TextFiles \
   --data.train_data_path custom_pretraining_data \
-  --train.lr_warmup_steps=200
-  --optimizer.lr 0.005
+  --train.max_tokens 1_000_000
 ```
 
 &nbsp;
 
 > [!TIP]
 > Use the `litgpt pretrain --data.help TextFiles` command to list additional dataset options.
+
+&nbsp;
+
+### 2) Continued pretraining after interruption
+
+In case you interrupted a training run, you can continue it with the `--resume` option, for example:
+
+```bash
+litgpt pretrain pythia-160m \
+  --resume "auto" \
+  --tokenizer_dir EleutherAI/pythia-160m \
+  --out_dir ./new_pretrained_checkpoint \
+  --data TextFiles \
+  --data.train_data_path custom_pretraining_data \
+  --train.max_tokens 1_000_000
+```
+
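+Instead of `"auto"`, which picks up the latest checkpoint saved in `--out_dir`, the `--resume` option can also point to one specific checkpoint. The exact checkpoint layout depends on your LitGPT version and checkpoint-saving settings, so the path below is only an illustrative placeholder:
+
+```bash
+litgpt pretrain pythia-160m \
+  --resume ./new_pretrained_checkpoint/step-00001000/lit_model.pth \
+  --tokenizer_dir EleutherAI/pythia-160m \
+  --out_dir ./new_pretrained_checkpoint \
+  --data TextFiles \
+  --data.train_data_path custom_pretraining_data \
+  --train.max_tokens 1_000_000
+```
+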
+&nbsp;
+
+### 3) Continued pretraining on a new dataset
+
+Suppose you pretrained a model using the examples above.
+To further pretrain the model on a new dataset, you first need to convert the pretrained checkpoint via the following command:
+
+```bash
+litgpt convert_pretrained_checkpoint ./new_pretrained_checkpoint/final ./new_pretrained_checkpoint_converted
+```
+
+Then, you can pretrain the converted model on the new dataset as follows:
+
+```bash
+litgpt pretrain pythia-160m \
+  --initial_checkpoint_dir ./new_pretrained_checkpoint_converted \
+  --tokenizer_dir EleutherAI/pythia-160m \
+  --out_dir ./new_pretrained_checkpoint_2 \
+  --data TextFiles \
+  --data.train_data_path custom_pretraining_data_2 \
+  --train.max_tokens 1_000_000
+```
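+
+If you later want to repeat this process with yet another dataset, the same conversion step applies to the output of this run as well: convert its `final` checkpoint before passing it to `--initial_checkpoint_dir` again. The output directory name below is just a placeholder following the naming scheme above:
+
+```bash
+litgpt convert_pretrained_checkpoint ./new_pretrained_checkpoint_2/final ./new_pretrained_checkpoint_2_converted
+```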