Commit

quick change
vwxyzjn committed Sep 13, 2024
1 parent 368e402 commit 732e3e4
Showing 2 changed files with 106 additions and 5 deletions.
57 changes: 53 additions & 4 deletions docs/algorithms/online_dpo.md
@@ -94,7 +94,7 @@ python mason.py \
--preemptible \
--budget ai2/allennlp \
--gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \
- open_instruct/ppo_vllm_thread.py \
+ open_instruct/online_dpo_vllm_thread.py \
--dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
--dataset_train_splits train \
--dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
@@ -139,8 +139,8 @@ python mason.py \
--preemptible \
--budget ai2/allennlp \
--gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \
- open_instruct/ppo_vllm_thread.py \
- --exp_name "ppo_vllm_thread_beta_0.03" \
+ open_instruct/online_dpo_vllm_thread.py \
+ --exp_name "online_dpo_vllm_thread_beta_0.03" \
--dataset_mixer '{"HuggingFaceH4/no_robots": 1.0}' \
--dataset_train_splits train \
--dataset_eval_mixer '{"HuggingFaceH4/no_robots": 1.0}' \
@@ -152,7 +152,6 @@ python mason.py \
--chat_template tulu \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
- --no_async_mode \
--gradient_accumulation_steps 32 \
--local_rollout_forward_batch_size 1 \
--vllm_device cuda:7 \
@@ -179,6 +178,56 @@ python mason.py \
* Trained model: https://huggingface.co/vwxyzjn/online_dpo_vllm_thread__cleanrl_EleutherAI_pythia-1b-deduped__sft__tldr/tree/online_dpo_vllm_thread__1__1726080959


### LEVEL 3: 8 GPU; Training on ultrafeedback RM

```bash
# for running chat based models you should use an 8xH100 node.
# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale
python mason.py \
--cluster ai2/jupiter-cirrascale-2 \
--image costah/open_instruct_onlinedpo2 --pure_docker_mode \
--workspace ai2/tulu-3-dev \
--priority high \
--preemptible \
--budget ai2/allennlp \
--gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \
open_instruct/online_dpo_vllm_thread.py \
--exp_name "online_dpo_vllm_thread_beta_0.03" \
--dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
--sft_messages_key chosen \
--dataset_train_splits train_prefs \
--dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
--dataset_eval_splits test_prefs \
--max_token_length 1024 \
--max_prompt_token_lenth 512 \
--learning_rate 8e-7 \
--output_dir /output/ \
--chat_template tulu \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--local_rollout_forward_batch_size 1 \
--vllm_device cuda:7 \
--num_epochs 1 \
--num_mini_batches 1 \
--total_episodes 300000 \
--model_name_or_path allenai/open_instruct_dev \
--model_revision costa_finetune_tulu3_8b_norobot__meta-llama_Meta-Llama-3.1-8B__42__1725559869 \
--reward_model_path vwxyzjn/reward_modeling__allenai_open_instruct_dev \
--reward_model_revision reward_modeling__1__1725760619 \
--non_stop_penalty \
--stop_token eos \
--penalty_reward_value -10.0 \
--beta 0.03 \
--num_evals 1 \
--seed 3 \
--response_length 1024 \
--gradient_checkpointing \
--with_tracking \
--push_to_hub
```
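
The `--gpus 8`, `--num_processes 7`, and `--vllm_device cuda:7` flags in the command above suggest that seven GPUs run training processes while the eighth is reserved for vLLM generation. As a rough sketch, assuming the script derives its global batch size as processes × per-device batch size × gradient accumulation steps × mini-batches (an assumption about the script, not something this diff confirms), the number of policy updates implied by `--total_episodes` can be estimated as follows. The `effective_batch` helper is hypothetical and only illustrates the arithmetic; the second call uses the values from the PPO Level 3 command in `docs/algorithms/ppo.md`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of open_instruct.
# Assumes: global batch = processes * per-device batch * grad-accum steps * mini-batches.
effective_batch() {
  local procs=$1 per_device=$2 grad_accum=$3 mini_batches=$4
  echo $(( procs * per_device * grad_accum * mini_batches ))
}

total_episodes=300000   # --total_episodes in both Level 3 commands

# Online DPO Level 3: --num_processes 7, --per_device_train_batch_size 2,
# --gradient_accumulation_steps 32, --num_mini_batches 1
online_dpo_batch=$(effective_batch 7 2 32 1)

# PPO Level 3: same flags, but --per_device_train_batch_size 1
ppo_batch=$(effective_batch 7 1 32 1)

# Integer division: a partial final batch is assumed to be dropped.
online_dpo_updates=$(( total_episodes / online_dpo_batch ))
ppo_updates=$(( total_episodes / ppo_batch ))

echo "online DPO: ${online_dpo_batch} episodes per update, about ${online_dpo_updates} updates"
echo "PPO: ${ppo_batch} episodes per update, about ${ppo_updates} updates"
```

Under that assumption, the Online DPO command works out to 448 episodes per update (about 669 updates) and the PPO command to 224 episodes per update (about 1339 updates).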




### Quality of life tools
54 changes: 53 additions & 1 deletion docs/algorithms/ppo.md
@@ -184,7 +184,6 @@ python mason.py \
--chat_template tulu \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
- --no_async_mode \
--gradient_accumulation_steps 32 \
--local_rollout_forward_batch_size 1 \
--vllm_device cuda:7 \
@@ -211,6 +210,59 @@ python mason.py \
* Trained model: https://huggingface.co/vwxyzjn/online_dpo_vllm_thread_beta_0.03__allenai_open_instruct_dev/tree/online_dpo_vllm_thread_beta_0.03__3__1726101734


### LEVEL 3: 8 GPU; Training on ultrafeedback RM

```bash
# for running chat based models you should use an 8xH100 node.
# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale
python mason.py \
--cluster ai2/pluto-cirrascale \
--image costah/open_instruct_onlinedpo2 --pure_docker_mode \
--workspace ai2/tulu-3-dev \
--priority high \
--preemptible \
--budget ai2/allennlp \
--gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \
open_instruct/ppo_vllm_thread.py \
--exp_name "ppo_vllm_thread_beta_0.03" \
--dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
--sft_messages_key chosen \
--dataset_train_splits train_prefs \
--dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
--dataset_eval_splits test_prefs \
--max_token_length 1024 \
--max_prompt_token_lenth 512 \
--learning_rate 8e-7 \
--output_dir /output/ \
--chat_template tulu \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--local_rollout_forward_batch_size 1 \
--vllm_device cuda:7 \
--num_epochs 1 \
--num_mini_batches 1 \
--total_episodes 300000 \
--model_name_or_path allenai/open_instruct_dev \
--model_revision finetune__meta-llama_Meta-Llama-3.1-8B__42__1725751338 \
--reward_model_path vwxyzjn/reward_modeling__allenai_llama-3-tulu-2-8b \
--reward_model_revision reward_modeling__1__1726175049 \
--non_stop_penalty \
--stop_token eos \
--penalty_reward_value -10.0 \
--beta 0.03 \
--num_evals 3 \
--seed 3 \
--response_length 1024 \
--gradient_checkpointing \
--with_tracking \
--push_to_hub
```

* Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/nmdf1z7z
* Trained model: https://huggingface.co/vwxyzjn/online_dpo_vllm_thread_beta_0.03__allenai_open_instruct_dev/tree/online_dpo_vllm_thread_beta_0.03__3__1726101734




### Quality of life tools