Add online trainers #204
Conversation
open_instruct/online_sft_trainer.py (outdated)
with torch.no_grad():
    queries = data["input_ids"].to(device)
    # repeat interleave [q1, q2, q3] -> [q1, q1, q1, q2, q2, q2, q3, q3, q3]
    queries = queries.repeat_interleave(args.num_generation_per_prompt, 0)
Here is how the online SFT logic works: we first repeat-interleave the queries, i.e. [q1, q2, q3] -> [q1, q1, q1, q2, q2, q2, q3, q3, q3]. This means queries[0] == queries[1].
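For illustration, a tiny standalone snippet (toy tensors, not from the trainer) showing the ordering that repeat_interleave produces:

import torch

# toy illustration: 3 queries, 3 generations per prompt
queries = torch.tensor([[1], [2], [3]])          # stand-ins for q1, q2, q3
repeated = queries.repeat_interleave(3, dim=0)   # -> q1, q1, q1, q2, q2, q2, q3, q3, q3
print(repeated.squeeze(1).tolist())              # [1, 1, 1, 2, 2, 2, 3, 3, 3]
assert torch.equal(repeated[0], repeated[1])     # queries[0] == queries[1] after interleaving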
open_instruct/online_sft_trainer.py (outdated)
# online SFT logic:
# at this point, we have interleave-repeated the queries `args.num_generation_per_prompt` times
# say, we have 64 queries repeated 10 times, so we have 640 responses
# we reshape the scores to (64, 10), so each row holds one query's 10 scores
# we then find the index of each query's best response
scores_per_query = scores.reshape(args.local_batch_size, args.num_generation_per_prompt)
best_idxes = scores_per_query.argmax(1)
worst_idxes = scores_per_query.argmin(1)
best_idxes_offset = (
    best_idxes + torch.arange(args.local_batch_size, device=device) * args.num_generation_per_prompt
)
worst_idxes_offset = (
    worst_idxes + torch.arange(args.local_batch_size, device=device) * args.num_generation_per_prompt
)
best_query_responses = query_responses[best_idxes_offset]
# worst_query_responses = query_responses[worst_idxes_offset]  # TODO: maybe interesting to see the worst responses
best_scores = scores[best_idxes_offset]
worst_scores = scores[worst_idxes_offset]
scores_margin = best_scores - worst_scores
Then we find the indices corresponding to the best scores and take those best query responses for online SFT.
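A quick toy check of this indexing (hypothetical numbers: 2 prompts, 3 generations each, assuming the interleaved layout described above):

import torch

# scores arrive in interleaved order: [p1_g1, p1_g2, p1_g3, p2_g1, p2_g2, p2_g3]
num_generation_per_prompt, local_batch_size = 3, 2
scores = torch.tensor([0.1, 0.9, 0.4, 0.7, 0.2, 0.3])

# one row per prompt, one column per generation
scores_per_query = scores.reshape(local_batch_size, num_generation_per_prompt)
best_idxes = scores_per_query.argmax(1)  # tensor([1, 0])
# map each per-prompt winner back to a flat index into the interleaved batch
best_idxes_offset = best_idxes + torch.arange(local_batch_size) * num_generation_per_prompt
print(best_idxes_offset.tolist())        # [1, 3] -> p1's 2nd generation, p2's 1st

With args.local_batch_size = 64 and args.num_generation_per_prompt = 10, this is exactly the (64, 10) reshape in the snippet above.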
Code looks good to me! I'll test it out soon
732e3e4 to 34d5bbc
Looks great Costa! I have a small nitpick about dpo_utils.py for the DPO data processing at L173, which was contributed by Nathan and Jacob, I guess. I usually prefer processing the data (in this case, concatenating the chosen and rejected ids) before passing it to the DataLoader, to avoid looping over the batch. But all good since the tensors are not placed on the GPUs yet.
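For reference, a minimal sketch of the kind of refactor being suggested (hypothetical helper and field names, not the PR's code): concatenate the chosen and rejected ids once in a collate_fn, so the training loop never iterates over the batch.

import torch
from torch.utils.data import DataLoader

# hypothetical collate_fn: concatenate chosen and rejected ids once per batch,
# before the tensors ever reach the training loop / GPU
def concat_chosen_rejected_collate(batch, pad_token_id=0):
    # batch is a list of dicts with "chosen_input_ids" and "rejected_input_ids" (1-D tensors)
    sequences = [ex["chosen_input_ids"] for ex in batch] + [ex["rejected_input_ids"] for ex in batch]
    # pad to the longest sequence; chosen examples occupy the first half, rejected the second
    concatenated = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=pad_token_id)
    return {"concatenated_input_ids": concatenated}

# usage sketch with toy data
toy_dataset = [
    {"chosen_input_ids": torch.tensor([1, 2, 3]), "rejected_input_ids": torch.tensor([1, 2])},
    {"chosen_input_ids": torch.tensor([4, 5]), "rejected_input_ids": torch.tensor([4, 5, 6, 7])},
]
loader = DataLoader(toy_dataset, batch_size=2, collate_fn=concat_chosen_rejected_collate)
print(next(iter(loader))["concatenated_input_ids"].shape)  # torch.Size([4, 4])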
@nouhadziri, that's a good point. We can look into refactoring. Merging as is for now. Thanks @nouhadziri and @ValentinaPy for the review.
The online trainers are ready for review! The docs are available here if you want to try them out: https://github.com/allenai/open-instruct/blob/online-trainers/docs/algorithms/online_dpo.md
Check out this wandb report: https://wandb.ai/ai2-llm/open_instruct_internal/reports/PPO-vs-online-DPO--Vmlldzo5MzM3NDU0
Screen.Recording.2024-09-11.at.4.40.59.PM.mov
Implemented auto resume as well, but only tested it with small models. Larger models may take more time to save, and testing them is blocked by https://github.com/allenai/beaker/issues/5420.