diff --git a/docs/algorithms/online_dpo.md b/docs/algorithms/online_dpo.md
new file mode 100644
index 000000000..76a227049
--- /dev/null
+++ b/docs/algorithms/online_dpo.md
@@ -0,0 +1,366 @@
+# Online DPO training
+
+`open_instruct/online_dpo_vllm_thread.py` contains the script for training online DPO models.
+
+
+## Get started
+
+In the sections below, we include some examples of how to train models and demonstrate different features. A couple of notes:
+
+* You should adjust your `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly to maximize throughput on a particular GPU type.
+* For the examples below, we use `mason.py` to invoke experiment orchestration on Ai2's cluster. For external users, you can copy the command after the `--` and run it on your system or debug locally. For example, the documentation will show commands like the following; you can just run `$YOUR_COMMAND` on your system, making sure your GPU count matches `$NUM_GPUS`.
+    * You can use `--image costah/open_instruct_onlinedpo2` to specify a custom image; if you don't specify one, the default image is used.
+    * If your Python environment lives on NFS, you can run in a debug mode by **not toggling** `--pure_docker_mode`, which mounts your Python environment inside the Docker container.
+
+```bash
+python mason.py \
+    --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \
+    --image costah/open_instruct_onlinedpo2 --pure_docker_mode \
+    --priority preemptible \
+    --budget ai2/allennlp \
+    --gpus $NUM_GPUS -- $YOUR_COMMAND
+```
+
+
+### Level 0: single GPU; quick debug. Should take less than 10 minutes to finish
+
+```bash
+python open_instruct/online_dpo_vllm_thread.py \
+    --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
+    --dataset_train_splits train \
+    --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
+    --dataset_eval_splits validation \
+    --max_token_length 1024 \
+    --max_prompt_token_lenth 512 \
+    --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
+    --reward_model_path cleanrl/reward_modeling__EleutherAI_pythia-1b-deduped_sentiment \
+    --chat_template simple_concat_with_space \
+    --learning_rate 3e-6 \
+    --total_episodes 4000 \
+    --per_device_train_batch_size 2 \
+    --per_device_eval_batch_size 2 \
+    --gradient_accumulation_steps 64 \
+    --max_token_length 2048 \
+    --max_prompt_token_lenth 512 \
+    --num_train_epochs 1 \
+    --stop_token period \
+    --beta 0.1 \
+    --output_dir models/rm/rm_sentiment_1b \
+    --vllm_device cuda:0 \
+    --vllm_gpu_memory_utilization 0.1 \
+    --with_tracking \
+    --push_to_hub \
+
+# LEVEL 0.1: two GPU; quick debug; using 1 GPU for training and 1 GPU for vllm generation via --vllm_device cuda:1
+python open_instruct/online_dpo_vllm_thread.py \
+    --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
+    --dataset_train_splits train \
+    --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \
+    --dataset_eval_splits validation \
+    --max_token_length 1024 \
+    --max_prompt_token_lenth 512 \
+    --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
+    --reward_model_path cleanrl/reward_modeling__EleutherAI_pythia-1b-deduped_sentiment \
+    --chat_template simple_concat_with_space \
+    --learning_rate 3e-6 \
+    --total_episodes 3000 \
+    --per_device_train_batch_size 2 \
+    --per_device_eval_batch_size 2 \
+    --gradient_accumulation_steps 64 \
+    --max_token_length 1024 \
+    --max_prompt_token_lenth 512 \
+    --num_train_epochs 1 \
+ --stop_token period \ + --beta 0.1 \ + --output_dir models/rm/rm_sentiment_1b \ + --vllm_device cuda:1 \ + --with_tracking \ + --push_to_hub \ +``` + + + + +### LEVEL 1: 8 GPU; TL;DR summarization + +Here we are using --vllm_device cuda:7 to say we want to launch the vllm generation engine on the 8th GPU (or GPU_7 using 0 index) +```bash +# for running TL;DR you can likely use GPUs with less memory +python mason.py \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale ai2/general-cirrascale \ + --priority normal \ + --resumable \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_vllm_thread_tldr \ + --per_device_train_batch_size 16 \ + --local_rollout_forward_batch_size 32 \ + --gradient_accumulation_steps 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --response_length 53 \ + --with_tracking \ + --push_to_hub \ + --hf_metadata_dataset '""' \ + --no_try_launch_beaker_eval_jobs \ + --vllm_device cuda:7 +``` + +* Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/fub45jhm +* Trained model: https://huggingface.co/vwxyzjn/online_dpo_vllm_thread__cleanrl_EleutherAI_pythia-1b-deduped__sft__tldr/tree/online_dpo_vllm_thread__1__1726080959 + + +### LEVEL 2: 8 GPU; Huggingface no robot + +```bash +# for running chat based models you should use an 8xH100 node. 
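+# note: with 8 GPUs we run 7 training processes (--num_processes 7) and reserve the 8th GPU (--vllm_device cuda:7) for the vLLM generation engine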
+# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale +python mason.py \ + --cluster ai2/jupiter-cirrascale-2 \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --workspace ai2/tulu-3-dev \ + --priority high \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --exp_name "online_dpo_vllm_thread_beta_0.03" \ + --dataset_mixer '{"HuggingFaceH4/no_robots": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"HuggingFaceH4/no_robots": 1.0}' \ + --dataset_eval_splits test \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 8e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 100000 \ + --model_name_or_path allenai/open_instruct_dev \ + --model_revision costa_finetune_tulu3_8b_norobot__meta-llama_Meta-Llama-3.1-8B__42__1725559869 \ + --reward_model_path vwxyzjn/reward_modeling__allenai_open_instruct_dev \ + --reward_model_revision reward_modeling__1__1725760619 \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 3 \ + --seed 3 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub +``` + +* Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/do4nuqhh +* Trained model: https://huggingface.co/vwxyzjn/online_dpo_vllm_thread_beta_0.03__allenai_open_instruct_dev/tree/online_dpo_vllm_thread_beta_0.03__3__1726200312 + + +### LEVEL 3: 8 GPU; Training on ultrafeedback RM + +```bash +# for running chat based models you should use an 8xH100 node. +# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale +python mason.py \ + --cluster ai2/jupiter-cirrascale-2 \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --workspace ai2/tulu-3-dev \ + --priority high \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --exp_name "online_dpo_vllm_thread_beta_0.03" \ + --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --sft_messages_key chosen \ + --dataset_train_splits train_prefs \ + --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --dataset_eval_splits test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 8e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 300000 \ + --model_name_or_path allenai/open_instruct_dev \ + --model_revision costa_finetune_tulu3_8b_norobot__meta-llama_Meta-Llama-3.1-8B__42__1725559869 \ + --reward_model_path vwxyzjn/reward_modeling__allenai_open_instruct_dev \ + --reward_model_revision reward_modeling__1__1725760619 \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 1 \ + --seed 3 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub +``` + +TBD. 
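+
+To make the `--beta` flag in the commands above concrete, below is a minimal, illustrative sketch (not the actual code in `open_instruct/online_dpo_vllm_thread.py`; the variable names and numbers are made up) of the default sigmoid online DPO loss for a single prompt with two sampled responses, where the reward model score decides which response is treated as chosen:
+
+```python
+import torch
+import torch.nn.functional as F
+
+beta = 0.03  # e.g., --beta 0.03
+# hypothetical summed log probs of two sampled responses under the policy and the reference model
+logprob_a, logprob_b = torch.tensor(-120.0), torch.tensor(-135.0)
+ref_logprob_a, ref_logprob_b = torch.tensor(-118.0), torch.tensor(-130.0)
+score_a, score_b = 1.2, 0.4  # hypothetical reward model scores
+
+# the higher-scored response is "chosen", the lower-scored one is "rejected"
+if score_a >= score_b:
+    chosen_logratio = logprob_a - ref_logprob_a      # log pi(chosen) - log pi_ref(chosen)
+    rejected_logratio = logprob_b - ref_logprob_b    # log pi(rejected) - log pi_ref(rejected)
+else:
+    chosen_logratio = logprob_b - ref_logprob_b
+    rejected_logratio = logprob_a - ref_logprob_a
+
+# sigmoid DPO loss on the implicit reward margin
+loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
+print(round(loss.item(), 3))  # ~0.649
+```
+
+The quantities `beta * chosen_logratio` and `beta * rejected_logratio` roughly correspond to the implicit DPO rewards reported as `reward/chosen`, `reward/rejected`, and `reward_margin` in the metrics below.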
+
+
+### Quality of life tools
+
+
+Note that when running with `--push_to_hub` and `--with_tracking`, the HF repo is automatically tracked to wandb, so we link the tracked run and the trained model.
+
+![reward modeling tracked hf repo](reward_modeling_hf_repo.png)
+
+
+Furthermore, we also track the dataset length visualization in wandb (see details [here](#dataset-processing)).
+
+
+![token length visualization in wandb](reward_modeling_token_wandb.png)
+
+
+Finally, we also include sample texts:
+
+![reward modeling preference sample texts](reward_modeling_preference_sample_texts.png)
+
+
+## Explanation of the logged metrics
+
+
+* `episode`: the global episode number training has gone through (e.g., `3000` means we have trained on 3000 data points already)
+* `lr`: the current learning rate
+* `epoch`: the fraction or multiple of the epoch (e.g., `2.7` means we have trained on the dataset for 2 epochs and 70% of the third epoch)
+* `objective/kl`: the KL divergence between the current policy and the reference policy (summed over the response tokens)
+* `objective/scores`: the scores of the current response, rated by a reward model
+* `objective/rlhf_reward`: the RLHF reward, which is `objective/scores` - `beta` * `objective/kl`
+* `objective/non_score_reward`: `beta` * `objective/kl`
+* `objective/entropy`: the entropy of the current policy
+* `objective/scores_margin`: the difference between the chosen response scores and the rejected response scores; the chosen response is the one with the higher score and the rejected response is the one with the lower score
+* `objective/loss`: the DPO loss
+* `logps/chosen`: the log probability of the chosen response
+* `logps/rejected`: the log probability of the rejected response
+* `reward/chosen`: the implicit DPO reward of the chosen response
+* `reward/rejected`: the implicit DPO reward of the rejected response
+* `reward_margin`: the difference between the implicit DPO chosen reward and the implicit DPO rejected reward
+* `time/from_scratch`: the time taken to train the model from scratch
+* `time/training`: the time taken to do one training step
+* `val/sequence_lengths`: the length of the sequences in the generated responses
+* `val/num_stop_token_ids`: the number of stop tokens in the generated responses
+
+
+
+
+## Implementation details
+
+These are the relevant implementation details of the online DPO training setup:
+
+1. The tokenizer pads from the left, so it's straightforward to do generations.
+1. Disable dropout in the model: this is an implementation detail in PPO training (see p.3. in https://arxiv.org/pdf/1909.08593).
+1. Layer initialization: we initialize the score head's weight according to `std=1 / np.sqrt(model.config.hidden_size + 1)` (see p. 11 in https://arxiv.org/abs/2009.01325)
+1. Vocab size for RM and Policy: we use the same vocab size for the reward model and the policy model. This ensures that the reward model can score every token the policy model can produce. We raise a `ValueError` when `policy.config.vocab_size != reward_model.config.vocab_size`.
+1. Retrain on the same prompts: if we only have 10k prompts but specify `--total_episodes 100k`, we reshuffle the prompts every 10k episodes and train on them again.
+1. Truncate responses at the stop token: we truncate the responses at `--stop_token eos` to ensure generation stops at the stop token.
+1. Non-stop penalty: we penalize responses that do not stop at the stop token. For example, if the model does not end at the stop token, we penalize it by `-10.0` (see `--penalty_reward_value -10.0`).
+1. Async training and generation: we follow the architecture in https://arxiv.org/abs/2310.00036 to run rollouts and training asynchronously, so that training is not bottlenecked by generation.
+
+```python
+import queue
+import threading
+import time
+
+class Agent():
+    def __init__(self):
+        self.param = 1
+
+    def learn(self, data):
+        self.param += 1
+
+def query_generator_fn():
+    for i in range(1, 100):
+        yield i
+
+
+ITER = 7
+batch_size = 32
+agent = Agent()
+data_Q = queue.Queue(maxsize=1)
+param_and_query_Q = queue.Queue(maxsize=1)
+def actor():
+    for i in range(1, ITER + 1):
+        params, query = param_and_query_Q.get()
+        data = params
+        print(f"[actor] generating data π_{params} -> p_{query} D_π_{data}")
+        time.sleep(1) # simulate data generation
+        data_Q.put((query, data))
+
+actor_thread = threading.Thread(target=actor)
+actor_thread.start()
+
+# initial param put
+generator = query_generator_fn()
+next_queries = next(generator)
+param_and_query_Q.put((agent.param, next_queries))
+
+# Cleanba-style async training loop
+async_mode = True
+start_time = time.time()
+for g in range(1, ITER + 1):
+    queries = next_queries
+    if async_mode:
+        if g != 1:
+            next_queries = next(generator)
+        param_and_query_Q.put((agent.param, queries))
+    else:
+        if g != 1:
+            next_queries = next(generator)
+            param_and_query_Q.put((agent.param, next_queries)) # note the indent here is different
+    _, data = data_Q.get()
+    old_param = agent.param
+    agent.learn(data)
+    time.sleep(1) # simulate training
+    print(f"--[learner] get π_{old_param} -> p_{queries} D_π_{data} -> π_{agent.param}, time: {time.time() - start_time}")
+actor_thread.join()
+```
+```
+[actor] generating data π_1 -> p_1 D_π_1
+[actor] generating data π_1 -> p_1 D_π_1
+--[learner] get π_1 -> p_1 D_π_1 -> π_2, time: 2.0022709369659424
+[actor] generating data π_2 -> p_1 D_π_2
+--[learner] get π_2 -> p_1 D_π_1 -> π_3, time: 3.003502607345581
+[actor] generating data π_3 -> p_2 D_π_3
+--[learner] get π_3 -> p_2 D_π_2 -> π_4, time: 4.004725933074951
+[actor] generating data π_4 -> p_3 D_π_4
+--[learner] get π_4 -> p_3 D_π_3 -> π_5, time: 5.005916118621826
+[actor] generating data π_5 -> p_4 D_π_5
+--[learner] get π_5 -> p_4 D_π_4 -> π_6, time: 6.007085800170898
+[actor] generating data π_6 -> p_5 D_π_6
+--[learner] get π_6 -> p_5 D_π_5 -> π_7, time: 7.007669448852539
+--[learner] get π_7 -> p_6 D_π_6 -> π_8, time: 8.009439706802368
+```
+
+
diff --git a/docs/algorithms/ppo.md b/docs/algorithms/ppo.md
new file mode 100644
index 000000000..6217c54b1
--- /dev/null
+++ b/docs/algorithms/ppo.md
@@ -0,0 +1,397 @@
+# PPO training
+
+`open_instruct/ppo_vllm_thread.py` contains the script for training PPO models.
+
+
+## Get started
+
+In the sections below, we include some examples of how to train models and demonstrate different features. A couple of notes:
+
+* You should adjust your `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly to maximize throughput on a particular GPU type.
+* For the examples below, we use `mason.py` to invoke experiment orchestration on Ai2's cluster. For external users, you can copy the command after the `--` and run it on your system or debug locally. For example, the documentation will show commands like the following; you can just run `$YOUR_COMMAND` on your system, making sure your GPU count matches `$NUM_GPUS`.
+ * You can you `--image costah/open_instruct_onlinedpo2` to specify a custom image or if you don't specify any it's going to use the default image. + * If you installed your python on NFS you can run a debug mode by **not toggling** `--pure_docker_mode` and it will mount your python environment on the docker container. + +```bash +python mason.py \ + --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --priority preemptible \ + --budget ai2/allennlp \ + --gpus $NUM_GPUS -- $YOUR_COMMAND +``` + + +### Level 0: single GPU; quick debug. Should take less than 10 minutes to finish + +```bash +python open_instruct/ppo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/reward_modeling__EleutherAI_pythia-1b-deduped_sentiment \ + --chat_template simple_concat_with_space \ + --learning_rate 3e-6 \ + --total_episodes 4000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 64 \ + --max_token_length 2048 \ + --max_prompt_token_lenth 512 \ + --num_train_epochs 1 \ + --stop_token period \ + --beta 0.1 \ + --output_dir models/rm/rm_sentiment_1b \ + --vllm_device cuda:0 \ + --vllm_gpu_memory_utilization 0.1 \ + --with_tracking \ + --push_to_hub \ + +# LEVEL 0.1: two GPU; quick debug; using 1 GPU for training and 1 GPU for vllm generation via --vllm_device cuda:1 +python open_instruct/ppo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/reward_modeling__EleutherAI_pythia-1b-deduped_sentiment \ + --chat_template simple_concat_with_space \ + --learning_rate 3e-6 \ + --total_episodes 3000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 64 \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --num_train_epochs 1 \ + --stop_token period \ + --beta 0.1 \ + --output_dir models/rm/rm_sentiment_1b \ + --vllm_device cuda:1 \ + --with_tracking \ + --push_to_hub \ + +# LEVEL 0.2: three GPU; quick debug; using 2 GPU for training and 1 GPU for vllm generation via --vllm_device cuda:2 +accelerate launch --num_processes 2 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/ppo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 3e-6 \ + --output_dir models/minimal/ppo_vllm_thread_tldr \ + --per_device_train_batch_size 2 \ + --local_rollout_forward_batch_size 2 \ + --gradient_accumulation_steps 2 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 10 \ + --sanity_check_max_samples 10 
\ + --sanity_check \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --response_length 53 \ + --with_tracking \ + --push_to_hub \ + --hf_metadata_dataset "" \ + --no_try_launch_beaker_eval_jobs \ + --gradient_checkpointing \ + --vllm_device cuda:2 +``` + + + + +### LEVEL 1: 8 GPU; TL;DR summarization + +Here we are using --vllm_device cuda:7 to say we want to launch the vllm generation engine on the 8th GPU (or GPU_7 using 0 index) +```bash +# for running TL;DR you can likely use GPUs with less memory +python mason.py \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale ai2/general-cirrascale \ + --priority normal \ + --resumable \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/ppo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 3e-6 \ + --output_dir models/minimal/ppo_vllm_thread_tldr \ + --per_device_train_batch_size 16 \ + --local_rollout_forward_batch_size 32 \ + --gradient_accumulation_steps 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --response_length 53 \ + --with_tracking \ + --push_to_hub \ + --hf_metadata_dataset '""' \ + --no_try_launch_beaker_eval_jobs \ + --vllm_device cuda:7 +``` + +* Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/by8j2ejp +* Trained model: https://huggingface.co/vwxyzjn/ppo_vllm_thread__cleanrl_EleutherAI_pythia-1b-deduped__sft__tldr/tree/ppo_vllm_thread__1__1726110645 + + +### LEVEL 2: 8 GPU; Huggingface no robot + +```bash +# for running chat based models you should use an 8xH100 node. 
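+# note: with 8 GPUs we run 7 training processes (--num_processes 7) and reserve the 8th GPU (--vllm_device cuda:7) for the vLLM generation engine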
+# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale +python mason.py \ + --cluster ai2/jupiter-cirrascale-2 \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --workspace ai2/tulu-3-dev \ + --priority high \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/ppo_vllm_thread.py \ + --exp_name "ppo_vllm_thread_beta_0.03" \ + --dataset_mixer '{"HuggingFaceH4/no_robots": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"HuggingFaceH4/no_robots": 1.0}' \ + --dataset_eval_splits test \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 8e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 100000 \ + --model_name_or_path allenai/open_instruct_dev \ + --model_revision costa_finetune_tulu3_8b_norobot__meta-llama_Meta-Llama-3.1-8B__42__1725559869 \ + --reward_model_path vwxyzjn/reward_modeling__allenai_open_instruct_dev \ + --reward_model_revision reward_modeling__1__1725760619 \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 3 \ + --seed 3 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub +``` + +TBD + +### LEVEL 3: 8 GPU; Training on ultrafeedback RM + +```bash +# for running chat based models you should use an 8xH100 node. +# use ai2/jupiter-cirrascale-2 or ai2/pluto-cirrascale +python mason.py \ + --cluster ai2/pluto-cirrascale \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --workspace ai2/tulu-3-dev \ + --priority high \ + --preemptible \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/ppo_vllm_thread.py \ + --exp_name "ppo_vllm_thread_beta_0.03" \ + --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --sft_messages_key chosen \ + --dataset_train_splits train_prefs \ + --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --dataset_eval_splits test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 8e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 300000 \ + --model_name_or_path allenai/open_instruct_dev \ + --model_revision finetune__meta-llama_Meta-Llama-3.1-8B__42__1725751338 \ + --reward_model_path vwxyzjn/reward_modeling__allenai_llama-3-tulu-2-8b \ + --reward_model_revision reward_modeling__1__1726175049 \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 3 \ + --seed 3 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub +``` + +TBD + + + +### Quality of life tools + + +Note that when running with `--push_to_hub` and `--with_tracking`, the HF repo is automatically tracked to wandb, so we link the tracked run and the trained model. 
+ +![reward modeling tracked hf repo](reward_modeling_hf_repo.png) + + +Furthermore, we also track the dataset length visualization in wandb (see detail in [here](#dataset-processing)) + + +![token length visualization in wandb](reward_modeling_token_wandb.png) + + +Finally, we also include samples + +![reward modeling preference sample texts](reward_modeling_preference_sample_texts.png) + + +## Explanation of the logged metrics + + +* `episode`: the global episode number training has gone through (e.g., `3000` means we have trained on 3000 data points already) +* `lr`: the current learning rate +* `epoch`: the fraction or multiple of the epoch (e.g., `2.7` means we have trained on the dataset for 2 epochs and 70% of the third epoch) +* `objective/kl`: the KL divergence between the current policy and the reference policy (sum of the KL divergence of each response token) +* `objective/scores`: the scores of the current response, rated by a reward model +* `objective/rlhf_reward`: the RLHF reward, which is `objective/scores` - `beta` * `objective/kl` +* `objective/non_score_reward`: `beta` * `objective/kl` +* `objective/entropy`: the entropy of the current policy +* `objective/scores_margin`: the difference between the chosen response scores and the rejected response scores. We pick the chosen response to be the response with higher scores, and the rejected response to be the response with lower scores +* `objective/loss`: the DPO loss +* `logps/chosen`: the log probability of the chosen response +* `logps/rejected`: the log probability of the rejected response +* `reward/chosen`: the implicit DPO reward of the chosen response +* `reward/rejected`: the implicit DPO reward of the rejected response +* `reward_margin`: the difference between the implicit PDO chosen reward and the implicit rejected reward +* `time/from_scratch`: the time taken to train the model from scratch +* `time/training`: the time taken to do one training step +* `val/sequence_lengths`: the length of the sequences in the generated responses +* `val/num_stop_token_ids`: the number of stop tokens in the generated responses + + + + +## Implementation details + +These are relevant implementation details on reward modeling: + +1. The tokenizer pads from the left, so it's straightforward to do generations. +1. Disable dropout in the model: this is an implementation detail in PPO training (see p.3. in https://arxiv.org/pdf/1909.08593). +1. Layer initialization: we initialize the score's weight according to `std=1 / np.sqrt(model.config.hidden_size + 1)` (see p. 11 in https://arxiv.org/abs/2009.01325) +1. Vocab size for RM and Policy: we use the same vocab size for the reward model and the policy model. This is to ensure that the reward model can score all the tokens in the policy model. We added a `ValueError` for situations when `policy.config.vocab_size != reward_model.config.vocab_size`. +1. Retrain on the same prompts: say we only have 10k prompts but we specified `--episodes 100k`, we will shuffle the prompts at every 10k episodes and retrain on them. +1. Truncate responses at the stop token: we truncate the responses at the `--stop_token eos` to ensure the generation is stopped at the stop token. +1. Non-stop penalty: we use a non-stop penalty to the reward model to penalize the model for not stopping at the stop token. For example, if the model does not end at the stop token, we penalize the model by `-10.0` (see `--penalty_reward_value -10.0`). +1. 
Async training and generation: we follow the architecture in https://arxiv.org/abs/2310.00036 to do rollout and training asynchronously. This is to ensure that the training is not bottlenecked by the generation. + +```python +import queue +import threading +import time + +class Agent(): + def __init__(self): + self.param = 1 + + def learn(self, data): + self.param += 1 + +def query_generator_fn(): + for i in range(1, 100): + yield i + + +ITER = 7 +batch_size = 32 +agent = Agent() +data_Q = queue.Queue(maxsize=1) +param_and_query_Q = queue.Queue(maxsize=1) +def actor(): + for i in range(1, ITER + 1): + params, query = param_and_query_Q.get() + data = params + print(f"[actor] generating data π_{params} -> p_{query} D_π_{data}") + time.sleep(1) # simulate data generation + data_Q.put((query, data)) + +actor_thread = threading.Thread(target=actor) +actor_thread.start() + +# initial param put +generator = query_generator_fn() +next_queries = next(generator) +param_and_query_Q.put((agent.param, next_queries)) + +# cleanba style stuff +async_mode = True +start_time = time.time() +for g in range(1, ITER + 1): + queries = next_queries + if async_mode: + if g != 1: + next_queries = next(generator) + param_and_query_Q.put((agent.param, queries)) + else: + if g != 1: + next_queries = next(generator) + param_and_query_Q.put((agent.param, next_queries)) # note the indent here is different + _, data = data_Q.get() + old_param = agent.param + agent.learn(data) + time.sleep(1) # simulate training + print(f"--[leaner] get π_{old_param} -> p_{queries} D_π_{data} -> π_{agent.param}, time: {time.time() - start_time}") +actor_thread.join() +``` +``` +[actor] generating data π_1 -> p_1 D_π_1 +[actor] generating data π_1 -> p_1 D_π_1 +--[leaner] get π_1 -> p_1 D_π_1 -> π_2, time: 2.0022709369659424 +[actor] generating data π_2 -> p_1 D_π_2 +--[leaner] get π_2 -> p_1 D_π_1 -> π_3, time: 3.003502607345581 +[actor] generating data π_3 -> p_2 D_π_3 +--[leaner] get π_3 -> p_2 D_π_2 -> π_4, time: 4.004725933074951 +[actor] generating data π_4 -> p_3 D_π_4 +--[leaner] get π_4 -> p_3 D_π_3 -> π_5, time: 5.005916118621826 +[actor] generating data π_5 -> p_4 D_π_5 +--[leaner] get π_5 -> p_4 D_π_4 -> π_6, time: 6.007085800170898 +[actor] generating data π_6 -> p_5 D_π_6 +--[leaner] get π_6 -> p_5 D_π_5 -> π_7, time: 7.007669448852539 +--[leaner] get π_7 -> p_6 D_π_6 -> π_8, time: 8.009439706802368 +``` + + diff --git a/docs/algorithms/reward_modeling.md b/docs/algorithms/reward_modeling.md index 14dee7331..34f1c0530 100644 --- a/docs/algorithms/reward_modeling.md +++ b/docs/algorithms/reward_modeling.md @@ -5,72 +5,23 @@ ## Get started -In the sections below, we will include some examples on how to train reward models and demonstrating different features. A couple of notes: +In the sections below, we will include some examples on how to train online DPO models and demonstrating different features. A couple of notes: -* You should adjust your `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly to maximize throughput -* To launch jobs using docker and beaker, you should run the following, where `$NUM_GPUS` is the number of GPUs you want the job to use and $YOUR_COMMAND is the command to invoke training +* You should adjust your `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly to maximize throughput on a particular GPU type. +* For the examples below, we use `mason.py` to invoke experiment orchastration on Ai2's cluster. 
For external users, you can copy the command after the `--` and run it on your system or debug locally. For example: the documentation will have commands like the following, but you can just run `$YOUR_COMMAND` on your system and make sure it matches `$NUM_GPUS`. + * You can you `--image costah/open_instruct_onlinedpo2` to specify a custom image or if you don't specify any it's going to use the default image. + * If you installed your python on NFS you can run a debug mode by **not toggling** `--pure_docker_mode` and it will mount your python environment on the docker container. ```bash python mason.py \ --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ - --pure_docker_mode --no_mount_nfs --no_hf_cache_env \ - --priority preemptible \ + --image costah/open_instruct_onlinedpo2 --pure_docker_mode \ + --priority normal \ --budget ai2/allennlp \ --gpus $NUM_GPUS -- $YOUR_COMMAND ``` -For example: - -```bash -# single GPU -python mason.py \ - --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ - --pure_docker_mode --no_mount_nfs --no_hf_cache_env \ - --priority preemptible \ - --budget ai2/allennlp \ - --gpus 1 -- python open_instruct/reward_modeling.py \ - --dataset_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0}' \ - --dataset_train_splits train \ - --dataset_eval_splits test \ - --model_name_or_path EleutherAI/pythia-14m \ - --chat_template simple_concat_with_space \ - --learning_rate 3e-6 \ - --per_device_train_batch_size 1 \ - --per_device_eval_batch_size 1 \ - --gradient_accumulation_steps 32 \ - --max_token_length 1024 \ - --max_prompt_token_lenth 1024 \ - --num_train_epochs 1 \ - --output_dir models/rm/rm \ - --sanity_check \ - --push_to_hub -# 8 GPU -python mason.py \ - --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ - --pure_docker_mode --no_mount_nfs --no_hf_cache_env \ - --priority preemptible \ - --budget ai2/allennlp \ - --gpus 8 -- accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ - open_instruct/reward_modeling.py \ - --dataset_mixer '{"trl-internal-testing/tldr-preference-trl-style": 1.0}' \ - --dataset_train_splits train \ - --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-trl-style": 1.0}' \ - --dataset_eval_splits validation \ - --model_name_or_path EleutherAI/pythia-1b-deduped \ - --chat_template simple_concat_with_space \ - --learning_rate 3e-6 \ - --per_device_train_batch_size 32 \ - --per_device_eval_batch_size 32 \ - --gradient_accumulation_steps 1 \ - --max_token_length 1024 \ - --max_prompt_token_lenth 512 \ - --num_train_epochs 1 \ - --output_dir models/rm/rm_tldr_1b \ - --with_tracking \ - --push_to_hub -``` - ### Level 0: Debug @@ -103,7 +54,12 @@ Here is a command to train a simple reward model on the sentiment dataset taken ```bash -python open_instruct/reward_modeling.py \ +python mason.py \ + --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ + --image costah/open_instruct_dev --pure_docker_mode \ + --priority normal \ + --budget ai2/allennlp \ + --gpus 1 -- python open_instruct/reward_modeling.py \ --dataset_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0}' \ --dataset_train_splits train \ --dataset_eval_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0}' \ @@ -111,15 +67,15 @@ python open_instruct/reward_modeling.py \ --model_name_or_path EleutherAI/pythia-1b-deduped \ --chat_template simple_concat_with_space \ --learning_rate 3e-6 \ - --per_device_train_batch_size 32 \ - --per_device_eval_batch_size 32 \ - 
--gradient_accumulation_steps 1 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --gradient_accumulation_steps 4 \ --max_token_length 1024 \ - --max_prompt_token_lenth 1024 \ + --max_prompt_token_lenth 512 \ --num_train_epochs 1 \ --output_dir models/rm/rm_sentiment_1b \ --with_tracking \ - --push_to_hub \ + --push_to_hub ``` * Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/091a0tix * Trained model: https://huggingface.co/vwxyzjn/reward_modeling__EleutherAI_pythia-1b-deduped/tree/reward_modeling__1__1725461002 @@ -130,7 +86,12 @@ You can run the following commands to launch experiments. Note that you can mix ```bash -python open_instruct/reward_modeling.py \ +python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ + --image costah/open_instruct_dev --pure_docker_mode \ + --priority normal \ + --budget ai2/allennlp \ + --gpus 1 -- python open_instruct/reward_modeling.py \ --dataset_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0, "ai2-adapt-dev/summarize_from_feedback_small": 1.0}' \ --dataset_train_splits train train \ --dataset_eval_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0}' \ @@ -146,7 +107,7 @@ python open_instruct/reward_modeling.py \ --num_train_epochs 1 \ --output_dir models/rm/rm_sentiment_1b \ --with_tracking \ - --push_to_hub + --push_to_hub ``` * Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/hop8gzww @@ -155,24 +116,29 @@ python open_instruct/reward_modeling.py \ ### LEVEL 2: multi-gpu training using DS2 with the TL;DR summarization dataset ```bash -accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ +python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale \ + --image costah/open_instruct_dev --pure_docker_mode \ + --priority normal \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ open_instruct/reward_modeling.py \ - --dataset_mixer '{"trl-internal-testing/tldr-preference-trl-style": 1.0}' \ + --dataset_mixer '{"trl-internal-testing/hh-rlhf-trl-style": 1.0}' \ --dataset_train_splits train \ - --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-trl-style": 1.0}' \ - --dataset_eval_splits validation \ + --dataset_eval_mixer '{"trl-internal-testing/hh-rlhf-trl-style": 1.0}' \ + --dataset_eval_splits test \ --model_name_or_path EleutherAI/pythia-1b-deduped \ - --chat_template simple_concat_with_space \ + --chat_template simple_chat \ --learning_rate 3e-6 \ - --per_device_train_batch_size 32 \ - --per_device_eval_batch_size 32 \ - --gradient_accumulation_steps 1 \ - --max_token_length 1024 \ - --max_prompt_token_lenth 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --gradient_accumulation_steps 4 \ + --max_token_length 2048 \ + --max_prompt_token_lenth 1024 \ --num_train_epochs 1 \ - --output_dir models/rm/rm_tldr_1b \ + --output_dir models/rm/rm_hh_1b \ --with_tracking \ - --push_to_hub + --push_to_hub ``` * Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/mlycj9qb @@ -181,7 +147,12 @@ accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ ### LEVEL 2.1: multi-gpu training using DS2 with the anthropic HH dataset ```bash -accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ +python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/pluto-cirrascale \ + --image 
costah/open_instruct_dev --pure_docker_mode \ + --priority normal \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --config_file configs/ds_configs/deepspeed_zero3.yaml \ open_instruct/reward_modeling.py \ --dataset_mixer '{"trl-internal-testing/hh-rlhf-trl-style": 1.0}' \ --dataset_train_splits train \ @@ -209,13 +180,19 @@ accelerate launch --config_file configs/ds_configs/deepspeed_zero2.yaml \ ### LEVEL 3: multi-gpu training using DS2 with the ultrafeedback dataset ```bash -accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \ - open_instruct/rm_zephyr.py \ - --dataset_mixer '{"HuggingFaceH4/ultrafeedback_binarized": 1.0}' \ +python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/pluto-cirrascale \ + --image costah/open_instruct_dev --pure_docker_mode \ + --priority normal \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/reward_modeling.py \ + --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ --dataset_train_splits train_prefs \ - --dataset_eval_mixer '{"HuggingFaceH4/ultrafeedback_binarized": 1.0}' \ + --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ --dataset_eval_splits test_prefs \ - --chat_template zephyr \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --chat_template tulu \ --learning_rate 3e-6 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ @@ -223,9 +200,12 @@ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml --max_token_length 1024 \ --max_prompt_token_lenth 1024 \ --num_train_epochs 1 \ - --bf16 \ - --output_dir models/rm/rm_zephyr_7b \ + --output_dir models/rm/rm_tulu_8b \ + --gradient_checkpointing \ + --push_to_hub \ + --with_tracking ``` + * Tracked experiment: https://wandb.ai/ai2-llm/open_instruct_internal/runs/di1f6p0b * Trained model: https://huggingface.co/vwxyzjn/reward_modeling__allenai_llama-3-tulu-2-8b/tree/reward_modeling__1__1725459452 diff --git a/mason.py b/mason.py index 39cf2c994..9669640fc 100644 --- a/mason.py +++ b/mason.py @@ -2,7 +2,8 @@ from typing import List import beaker import os - +import secrets +import string def parse_beaker_dataset(dataset_str): splt = dataset_str.split(":") @@ -67,10 +68,13 @@ def get_args(): "--pure_docker_mode", action="store_true", help="If given, run in pure docker mode" ) parser.add_argument( - "--no_hf_cache_env", action="store_true", help="If given, do not pass in `HF_DATASETS_CACHE`, `HF_HUB_CACHE`, and `HF_ASSETS_CACHE`" + "--no_hf_cache_env", action="store_true", help="Getting deprecated; it does nothing" + ) + parser.add_argument( + "--no_mount_nfs", action="store_true", help="Getting deprecated; it does nothing" ) parser.add_argument( - "--no_mount_nfs", action="store_true", help="If given, do not mount NFS" + "--resumable", action="store_true", help="If given, make the job resumable" ) @@ -80,6 +84,17 @@ def get_args(): return mason_args, commands +def generate_id(length: int = 8) -> str: + """Generate a random base-36 string of `length` digits.""" + # There are ~2.8T base-36 8-digit strings. If we generate 210k ids, + # we'll have a ~1% chance of collision. 
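+    # (birthday bound: 1 - exp(-n**2 / (2 * 36**8)) ≈ 0.0078 for n = 210_000)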
+ alphabet = string.ascii_lowercase + string.digits + return "".join(secrets.choice(alphabet) for _ in range(length)) + + +global_wandb_id = generate_id() + + def parse_commands(command_args: List[str]) -> List[List[str]]: """the inputs are ['--', 'which', 'python', '--', 'echo', 'hello'], and this function converts it into [['which', 'python'], ['echo', 'hello']]""" if command_args[0] != "--": @@ -103,7 +118,7 @@ def parse_commands(command_args: List[str]) -> List[List[str]]: return commands -def get_env_vars(pure_docker_mode, no_mount_hf_cache, beaker_secrets, whoami): +def get_env_vars(pure_docker_mode, cluster: List[str], beaker_secrets, whoami, resumable): env_vars = [] useful_secrets = [ "HF_TOKEN", @@ -134,35 +149,81 @@ def get_env_vars(pure_docker_mode, no_mount_hf_cache, beaker_secrets, whoami): value=os.getenv("PATH"), ), ]) - if not no_mount_hf_cache: + + # if we are not running on jupiter2, we try to mount the NFS + if "ai2/jupiter-cirrascale-2" not in cluster: env_vars.extend([ beaker.EnvVar( name="HF_DATASETS_CACHE", - value=os.getenv("HF_DATASETS_CACHE"), + value="/net/nfs.cirrascale/allennlp/.cache/huggingface", ), beaker.EnvVar( name="HF_HUB_CACHE", - value=os.getenv("HF_HUB_CACHE"), + value="/net/nfs.cirrascale/allennlp/.cache/hub", ), beaker.EnvVar( name="HF_ASSETS_CACHE", - value=os.getenv("HF_ASSETS_CACHE"), + value="/net/nfs.cirrascale/allennlp/.cache/assets", + ), + beaker.EnvVar( + name="CHECKPOINT_OUTPUT_DIR", + value=f"/net/nfs.cirrascale/allennlp/deletable_checkpoint_states/{global_wandb_id}", + ), + ]) + # if we only run on jupiter 2, we try to mount weka + elif len(cluster) == 1 and "ai2/jupiter-cirrascale-2" in cluster: + env_vars.extend([ + beaker.EnvVar( + name="HF_DATASETS_CACHE", + value="/weka/allennlp/.cache/huggingface", + ), + beaker.EnvVar( + name="HF_HUB_CACHE", + value="/weka/allennlp/.cache/hub", + ), + beaker.EnvVar( + name="CHECKPOINT_OUTPUT_DIR", + value=f"/weka/allennlp/deletable_checkpoint_states/{global_wandb_id}", + ), + ]) + # don't mount anything; assume no cache + else: + pass + + if resumable: + env_vars.extend([ + beaker.EnvVar( + name="WANDB_RUN_ID", + value=global_wandb_id, + ), + beaker.EnvVar( + name="WANDB_RESUME", + value="allow", ), ]) return env_vars -def get_datasets(beaker_datasets, no_mount_nfs): +def get_datasets(beaker_datasets, cluster: List[str]): """if pure docker mode we don't mount the NFS; so we can run it on jupiter2""" res = [] - if not no_mount_nfs: + # if we are not running on jupiter2, we try to mount the NFS + if "ai2/jupiter-cirrascale-2" not in cluster: res = [ beaker.DataMount( source=beaker.DataSource(host_path="/net/nfs.cirrascale"), mount_path="/net/nfs.cirrascale", ), ] + # if we only run on jupiter 2, we try to mount weka + elif len(cluster) == 1 and "ai2/jupiter-cirrascale-2" in cluster: + res = [ + beaker.DataMount( + source=beaker.DataSource(weka="oe-adapt-default"), + mount_path="/weka", + ), + ] for beaker_dataset in beaker_datasets: to_append = beaker.DataMount( source=beaker.DataSource(beaker=beaker_dataset["beaker"]), @@ -173,7 +234,7 @@ def get_datasets(beaker_datasets, no_mount_nfs): return res -def make_task_spec(args, command, i, beaker_secrets, whoami): +def make_task_spec(args, command, i, beaker_secrets, whoami, resumable: bool): # special logic to deal with escape like # python mason.py ... 
-- python x.py --dataset_mixer '{"trl-internal-testing/sentiment-trl-style": 1.0}' # we need to wrap the json string with single quote @@ -198,11 +259,11 @@ def make_task_spec(args, command, i, beaker_secrets, whoami): command=command, arguments=[fully_command], result=beaker.ResultSpec(path="/output"), - datasets=get_datasets(args.beaker_datasets, args.no_mount_nfs), + datasets=get_datasets(args.beaker_datasets, args.cluster), context=beaker.TaskContext(priority=beaker.Priority(args.priority), preemptible=args.preemptible), constraints=beaker.Constraints(cluster=args.cluster), - env_vars=get_env_vars(args.pure_docker_mode, args.no_hf_cache_env, beaker_secrets, whoami), + env_vars=get_env_vars(args.pure_docker_mode, args.cluster, beaker_secrets, whoami, resumable), resources=beaker.TaskResources(gpu_count=args.gpus), ) @@ -220,7 +281,7 @@ def main(): whoami = beaker_client.account.whoami().name experiment_spec = beaker.ExperimentSpec( description=args.description, - tasks=[make_task_spec(args, command, i, beaker_secrets, whoami) for i, command in enumerate(commands)], + tasks=[make_task_spec(args, command, i, beaker_secrets, whoami, args.resumable) for i, command in enumerate(commands)], budget=args.budget, ) diff --git a/open_instruct/dataset_processor.py b/open_instruct/dataset_processor.py index 191555cd1..692e5358e 100644 --- a/open_instruct/dataset_processor.py +++ b/open_instruct/dataset_processor.py @@ -471,10 +471,10 @@ def __call__(self, batch: List[Dict[str, int]]): class SimpleGenerateCollator: """Simple collator for generation task (always pad from the LEFT)""" - def __init__(self, pad_token_id): + def __init__(self, pad_token_id: int): self.pad_token_id = pad_token_id - def __call__(self, batch): + def __call__(self, batch: list[dict]): """the input will have input_ids_prompt""" # Find max length in the batch max_length = -1 diff --git a/open_instruct/dpo_tune.py b/open_instruct/dpo_tune.py index a64c61eec..4de0d4a7a 100644 --- a/open_instruct/dpo_tune.py +++ b/open_instruct/dpo_tune.py @@ -17,13 +17,13 @@ DPO tuning script. Adapted from our finetuning script. """ +import json import logging import math import os import random import subprocess import time -import json from copy import deepcopy from dataclasses import dataclass, field from datetime import timedelta @@ -1076,7 +1076,7 @@ def load_model(): if is_beaker_job() and accelerator.is_main_process: # dpo script only supports these two options right now for datasets if args.dataset_mixer: - dataset_list = args.dataset_mixer.keys() + dataset_list = list(args.dataset_mixer.keys()) elif args.dataset_mixer_list: dataset_list = args.dataset_mixer_list[::2] # even indices elif args.dataset_name: diff --git a/open_instruct/finetune.py b/open_instruct/finetune.py index a3b397c3d..5eac4a602 100644 --- a/open_instruct/finetune.py +++ b/open_instruct/finetune.py @@ -14,13 +14,13 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+import json import logging import math import os import random import subprocess import time -import json from dataclasses import dataclass, field from datetime import timedelta from functools import partial @@ -1042,7 +1042,7 @@ def main(args: FlatArguments): if is_beaker_job() and accelerator.is_main_process: # dpo script only supports these two options right now for datasets if args.dataset_mixer: - dataset_list = args.dataset_mixer.keys() + dataset_list = list(args.dataset_mixer.keys()) elif args.dataset_mixer_list: dataset_list = args.dataset_mixer_list[::2] # even indices elif args.dataset_name: diff --git a/open_instruct/model_utils.py b/open_instruct/model_utils.py index 9c0b18f26..c858c3e53 100644 --- a/open_instruct/model_utils.py +++ b/open_instruct/model_utils.py @@ -15,13 +15,11 @@ import itertools -from collections import defaultdict +from collections import OrderedDict, defaultdict from contextlib import contextmanager from dataclasses import dataclass from typing import List, Literal, Optional, Tuple, Union -from open_instruct.utils import retry_on_exception - try: import deepspeed from deepspeed.runtime.engine import DeepSpeedEngine @@ -40,12 +38,14 @@ from torch.nn.parallel.distributed import DistributedDataParallel from transformers import PreTrainedModel, PreTrainedTokenizer +from open_instruct.utils import retry_on_exception + @dataclass class ModelConfig: model_name_or_path: Optional[str] = None """The model checkpoint for weights initialization.""" - model_revision: str = "main" + model_revision: Optional[str] = None """The specific model version to use (can be a branch name, tag name or commit id).""" trust_remote_code: bool = False """Trust remote code when loading a model.""" @@ -323,7 +323,9 @@ def save_with_accelerate( tokenizer: PreTrainedTokenizer, output_dir: str, use_lora: bool = False, + model_attribute_to_save: Optional[str] = None, ) -> None: + """`model_attribute_to_save` is for used to save PPO's policy instead of the full model""" # set the generation config to an empty setting to be safe. # we usually do greedy decoding for generation, so this should be okay. # otherwise, we get an error thrown at save time. @@ -332,10 +334,24 @@ def save_with_accelerate( ) unwrapped_model: PreTrainedModel = accelerator.unwrap_model(model) + if model_attribute_to_save is not None: + unwrapped_model = getattr(unwrapped_model, model_attribute_to_save) # When doing multi-gpu training, we need to use accelerator.get_state_dict(model) to get the state_dict. # Otherwise, sometimes the model will be saved with only part of the parameters. # Also, accelerator needs to use the wrapped model to get the state_dict. state_dict = accelerator.get_state_dict(model) + + # if we are saving a specific attribute of the model, we need to filter the state_dict + # also the state_dict only lives in the main process; other processes just have state_dict = None + if model_attribute_to_save is not None and accelerator.is_main_process: + state_dict = OrderedDict( + { + k[len(f"{model_attribute_to_save}.") :]: v + for k, v in state_dict.items() + if k.startswith(f"{model_attribute_to_save}.") + } + ) + if use_lora: # When using lora, the unwrapped model is a PeftModel, which doesn't support the is_main_process # and has its own save_pretrained function for only saving lora modules. 
diff --git a/open_instruct/online_dpo_vllm_thread.py b/open_instruct/online_dpo_vllm_thread.py new file mode 100644 index 000000000..c8c7090e7 --- /dev/null +++ b/open_instruct/online_dpo_vllm_thread.py @@ -0,0 +1,1000 @@ +import gc +import json +import os +import random +import shutil +import signal +import subprocess +import threading +import time +from dataclasses import asdict, dataclass +from queue import Empty, Queue +from typing import List, Literal, Optional, Tuple + +import numpy as np +import pandas as pd +import torch +import torch.nn.functional as F +import torch.optim as optim +import torch.utils +import torch.utils.data +from accelerate import Accelerator +from accelerate.utils import broadcast, gather_object +from datasets import DatasetDict +from huggingface_hub import HfApi +from rich.pretty import pprint +from torch.utils.data import DataLoader +from torch.utils.tensorboard import SummaryWriter +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoModelForSequenceClassification, + AutoTokenizer, + PreTrainedModel, + get_scheduler, +) +from vllm import LLM, SamplingParams + +from open_instruct.dataset_processor import ( + CHAT_TEMPLATES, + INPUT_IDS_PROMPT_KEY, + DatasetConfig, + SFTDatasetProcessor, + SimpleGenerateCollator, + visualize_token, +) +from open_instruct.model_utils import ( + ModelConfig, + disable_dropout_in_model, + exact_div, + first_true_indices, + forward, + get_reward, + prepare_deepspeed, + print_rich_single_line_metrics, + print_rich_table, + push_folder_to_hub, + save_with_accelerate, + truncate_response, + unwrap_model_for_generation, +) +from open_instruct.utils import ( + ArgumentParserPlus, + combine_dataset, + get_wandb_tags, + is_beaker_job, + maybe_get_beaker_config, + maybe_use_ai2_wandb_entity, + upload_metadata_to_hf, +) +from open_instruct.vllm_utils import vllm_single_gpu_patch + +api = HfApi() +INVALID_LOGPROB = 1.0 + + +@dataclass +class Args: + # required dataset args + dataset_mixer: str = None + """A dictionary of datasets (local or HF) to sample from.""" + dataset_train_splits: List[str] = None + """The dataset splits to use for training""" + dataset_eval_mixer: Optional[str] = None + """A dictionary of datasets (local or HF) to sample from for evaluation""" + dataset_eval_splits: Optional[List[str]] = None + """The dataset splits to use for evaluation""" + dataset_mixer_dict: Optional[dict] = None + """The dataset mixer as a dictionary""" + dataset_eval_mixer_dict: Optional[dict] = None + """The dataset eval mixer as a dictionary""" + + # common args + exp_name: str = os.path.basename(__file__)[: -len(".py")] + """The name of this experiment""" + seed: int = 1 + """Seed of the experiment""" + run_name: Optional[str] = None + """A unique name of this run""" + + # optimizer args + eps: float = 1e-5 + """The epsilon value for the optimizer""" + learning_rate: float = 2e-5 + """The initial learning rate for AdamW optimizer.""" + lr_scheduler_type: Literal[ + "linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup" + ] = "linear" + """Which scheduler to use""" + warm_up_steps: int = 0 + """Number of warm up steps for the scheduler""" + + # various batch sizes + num_train_epochs: int = 1 + """Number of epochs to train""" + gradient_accumulation_steps: int = 8 + """The number of gradient accumulation steps""" + per_device_train_batch_size: Optional[int] = 1 + """The forward batch size per device (local_micro_batch_size)""" + per_device_eval_batch_size: Optional[int] = 1 + """The 
forward batch size per device for evaluation (local_micro_batch_size)""" + total_episodes: Optional[int] = 100000 + """The total number of episodes in the dataset""" + world_size: Optional[int] = None + """The number of processes (GPUs) to use""" + micro_batch_size: Optional[int] = None + """The micro batch size across devices (HF's `per_device_train_batch_size` * `world_size`)""" + local_batch_size: Optional[int] = None + """The batch size per GPU (HF's `per_device_train_batch_size` * `gradient_accumulation_steps`)""" + batch_size: Optional[int] = None + """The batch size across devices (HF's `per_device_train_batch_size` * `world_size` * `gradient_accumulation_steps`)""" + num_training_steps: Optional[int] = None + """The number of training_steps to train""" + num_evals: int = 4 + """The number of evaluations to run throughout training""" + eval_freq: Optional[int] = None + """The frequency of evaluation steps""" + local_dataloader_batch_size: Optional[int] = None + """The batch size per GPU for the dataloader""" + + # online settings + num_epochs: int = 4 + """the number of epochs to train""" + num_mini_batches: int = 1 + """Number of minibatches to split a batch into""" + local_mini_batch_size: Optional[int] = None + """the mini batch size per GPU""" + mini_batch_size: Optional[int] = None + """the mini batch size across GPUs""" + local_rollout_forward_batch_size: int = 64 + """per rank no grad forward pass in the rollout phase""" + reward_model_path: str = "EleutherAI/pythia-160m" + """the path to the reward model""" + reward_model_revision: Optional[str] = None + """the revision of the reward model""" + + # generation config + response_length: int = 53 + """the length of the response""" + stop_token: Optional[Literal["eos", "period"]] = None + """the stop token""" + stop_token_id: Optional[int] = None + """the truncation token id""" + min_response_length: int = 0 + """stop only after this many tokens""" + temperature: float = 0.7 + """the sampling temperature""" + penalty_reward_value: float = -1.0 + """the reward value for responses that do not contain `stop_token_id`""" + non_stop_penalty: bool = False + """whether to penalize responses that do not contain `stop_token_id`""" + + # online DPO specific args + beta: float = 0.05 + """the beta value of the RLHF objective (KL coefficient)""" + num_generation_per_prompt: int = 2 + """the number of generations per prompt (currently only support 2)""" + loss_type: Literal["sigmoid", "ipo"] = "sigmoid" + """the loss type for the DPO algorithm""" + + # vLLM settings. NOTE: currently we need to place the vLLM model on a separate GPU + # for generation to work properly because vLLM would pre-alocate the memory. + # To do so, we would need to do a moneky patch `vllm_single_gpu_patch` to make sure + # the vLLM model is placed on the correct GPU. 
+ vllm_device: str = "cuda:1" + """the device placement of the vllm model; typically we place the vllm model on a decicated GPU""" + vllm_gpu_memory_utilization: float = 0.8 + """the GPU memory utilization of the vllm model; passed to `gpu_memory_utilization` to the `vLLM` instance""" + # async setting + async_mode: bool = True + """Whether to run the generation in async mode which learns from the second latest policy like Cleanba (https://arxiv.org/abs/2310.00036)""" + + # wandb and HF tracking configs + with_tracking: bool = False + """If toggled, this experiment will be tracked with Weights and Biases""" + wandb_project_name: str = "open_instruct_internal" + """The wandb's project name""" + wandb_entity: Optional[str] = None + """The entity (team) of wandb's project""" + push_to_hub: bool = True + """Whether to upload the saved model to huggingface""" + hf_entity: Optional[str] = None + """The user or org name of the model repository from the Hugging Face Hub""" + hf_repo_id: Optional[str] = None + """The id of the saved model in the Hugging Face Hub (can be autoset if not given)""" + hf_repo_revision: Optional[str] = None + """The revision of the saved model in the Hugging Face Hub (can be autoset if not given)""" + hf_repo_url: Optional[str] = None + """The url of the saved model in the Hugging Face Hub (will be autoset)""" + output_dir: Optional[str] = None + """Where to save the model""" + checkpoint_output_dir: Optional[str] = None + """Where to save the model checkpoints in case of preemption""" + + # Ai2 specific settings + try_launch_beaker_eval_jobs: bool = True + """Whether to launch beaker evaluation jobs after training""" + hf_metadata_dataset: Optional[str] = "allenai/tulu-3-evals" + """What dataset to upload the metadata to. If unset, don't upload metadata""" + + def __post_init__(self): + self.dataset_mixer_dict, self.dataset_mixer = process_dataset_mixer(self.dataset_mixer) + if self.dataset_eval_mixer is not None: + self.dataset_eval_mixer_dict, self.dataset_eval_mixer = process_dataset_mixer(self.dataset_eval_mixer) + + +def process_dataset_mixer(value) -> Tuple[Optional[dict], Optional[str]]: + # if passed through cli: convert the dataset mixers to dictionaries + if isinstance(value, str): + return json.loads(value), value + # if passed through yaml: convert the dataset mixers to strings + elif isinstance(value, dict): + return value, json.dumps(value) + else: + raise ValueError("Input must be either a string or a dictionary") + + +def calculate_runtime_args_and_accelerator(args: Args, model_config: ModelConfig) -> Accelerator: + """calculate (in-place) runtime args such as the effective batch size, word size, etc.""" + accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps) + args.world_size = accelerator.num_processes + args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps * args.num_mini_batches + args.micro_batch_size = int(args.per_device_train_batch_size * args.world_size) + args.batch_size = int(args.local_batch_size * args.world_size) + time_tensor = torch.tensor(int(time.time()), device=accelerator.device) + # set a unique run name with the current timestamp + time_int = broadcast(time_tensor, 0).item() + args.run_name = f"{args.exp_name}__{args.seed}__{time_int}" + args.mini_batch_size = exact_div( + args.batch_size, args.num_mini_batches, "`batch_size` must be a multiple of `num_mini_batches`" + ) + args.local_mini_batch_size = exact_div( + args.local_batch_size, args.num_mini_batches, 
"`local_batch_size` must be a multiple of `num_mini_batches`" + ) + args.num_training_steps = args.total_episodes // args.batch_size + args.eval_freq = max(1, args.num_training_steps // args.num_evals) + # DPO logic: repeats the same prompt `num_generation_per_prompt` times + args.local_dataloader_batch_size = exact_div( + args.local_batch_size, + args.num_generation_per_prompt, + "`local_batch_size` must be a multiple of `num_generation_per_prompt`", + ) + if args.push_to_hub: + if args.hf_repo_id is None: # auto-generate one + args.hf_repo_id = f"{args.exp_name}__{model_config.model_name_or_path.replace('/', '_')}" + if args.hf_entity is None: + args.hf_entity = api.whoami()["name"] + args.hf_repo_id = f"{args.hf_entity}/{args.hf_repo_id}" + if args.hf_repo_revision is None: # auto-generate one + args.hf_repo_revision = args.run_name + args.hf_repo_url = f"https://huggingface.co/{args.hf_repo_id}/tree/{args.hf_repo_revision}" + + if args.with_tracking and accelerator.is_main_process: + if args.wandb_entity is None: + args.wandb_entity = maybe_use_ai2_wandb_entity() + return accelerator + + +def vllm_generate( + model_name_or_path: str, + model_revision: Optional[str], + max_model_len: int, + vllm_device: str, + vllm_gpu_memory_utilization: float, + generation_config: SamplingParams, + response_ids_Q: Queue, + param_prompt_Q: Queue, + num_training_steps: int, + sample_evaluation_prompt_token_ids: Optional[List[int]], + evaluation_Q: Queue, + eval_freq: int, + resume_training_step: int, +): + vllm_single_gpu_patch() + llm = LLM( + model=model_name_or_path, + revision=model_revision, + tokenizer_revision=model_revision, + tensor_parallel_size=1, + device=vllm_device, + gpu_memory_utilization=vllm_gpu_memory_utilization, + max_model_len=max_model_len, + ) + print("🔥🔥🔥 vllm loaded") + llmp = llm.llm_engine.model_executor.driver_worker.model_runner.model + for training_step in range(resume_training_step, num_training_steps + 1): + items = param_prompt_Q.get() + if items is None: + break + unwrapped_model, g_queries_list = items + if unwrapped_model is not None: + start_time = time.time() + llmp.load_weights(unwrapped_model.named_parameters()) + print( + f"🔥🔥🔥 Loading weights using shared memory; Time to load weights: {time.time() - start_time:.2f} seconds" + ) + generation_start_time = time.time() + outputs = llm.generate(prompt_token_ids=g_queries_list, sampling_params=generation_config) + response_ids = [list(output.outputs[0].token_ids) for output in outputs] + print(f"🔥🔥🔥 Generation time: {time.time() - generation_start_time:.2f} seconds") + response_ids_Q.put(response_ids) + + if sample_evaluation_prompt_token_ids is not None and (training_step - 1) % eval_freq == 0: + outputs = llm.generate( + prompt_token_ids=sample_evaluation_prompt_token_ids, sampling_params=generation_config + ) + response_ids = [list(output.outputs[0].token_ids) for output in outputs] + evaluation_Q.put(response_ids) + + +def send_queries(accelerator, unwrapped_model, tokenizer, param_prompt_Q, queries): + g_queries_list = gather_object(queries.tolist()) + if accelerator.is_main_process: + g_queries_list = [ + [inneritem for inneritem in item if inneritem != tokenizer.pad_token_id] for item in g_queries_list + ] # remove padding + param_prompt_Q.put((unwrapped_model, g_queries_list)) + + +def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): + accelerator = calculate_runtime_args_and_accelerator(args, model_config) + local_seed = args.seed + accelerator.process_index + + # set up experiment 
tracking and seeds + all_configs = {} + if is_beaker_job(): + args.checkpoint_output_dir = os.environ.get("CHECKPOINT_OUTPUT_DIR", args.output_dir) + beaker_config = maybe_get_beaker_config() + # try saving to the beaker `/output`, which will be uploaded to the beaker dataset + if len(beaker_config.beaker_dataset_id_urls) > 0: + args.output_dir = "/output" + all_configs.update(vars(beaker_config)) + all_configs.update(**asdict(args), **asdict(dataset_config), **asdict(model_config)) + if accelerator.is_main_process: + if args.with_tracking: + import wandb + + wandb.init( + project=args.wandb_project_name, + entity=args.wandb_entity, + sync_tensorboard=True, + config=all_configs, + name=args.run_name, + save_code=True, + tags=[args.exp_name] + get_wandb_tags(), + ) + writer = SummaryWriter(f"runs/{args.run_name}") + writer.add_text( + "hyperparameters", + "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()])), + ) + device = torch.device(f"cuda:{accelerator.local_process_index}") + random.seed(local_seed) + np.random.seed(local_seed) + torch.manual_seed(local_seed) + torch.backends.cudnn.deterministic = True + + # create a tokenizer (pad from right) + config = AutoConfig.from_pretrained(model_config.model_name_or_path, revision=model_config.model_revision) + tokenizer = AutoTokenizer.from_pretrained( + model_config.model_name_or_path, revision=model_config.model_revision, padding_side="right" + ) + if config.architectures == "LlamaForCausalLM" and config.bos_token_id == 128000: + tokenizer.pad_token_id = 128002 # <|reserved_special_token_0|> + else: + tokenizer.add_special_tokens({"pad_token": "[PAD]"}) # NOTE: we do not resize the embedding + tokenizer.chat_template = CHAT_TEMPLATES[dataset_config.chat_template] + + # create the dataset + dataset_dict = DatasetDict() + dataset_processor = SFTDatasetProcessor(tokenizer=tokenizer, config=dataset_config) + train_dataset = combine_dataset( + args.dataset_mixer_dict, + splits=args.dataset_train_splits, + columns_to_keep=[dataset_config.sft_messages_key], + ) + if dataset_config.sanity_check: + train_dataset = train_dataset.select( + range(0, min(len(train_dataset), dataset_config.sanity_check_max_samples)) + ) + with accelerator.main_process_first(): + train_dataset = dataset_processor.tokenize(train_dataset) + train_dataset = dataset_processor.filter(train_dataset) + dataset_dict["train"] = train_dataset + eval_dataset = None + if args.dataset_eval_mixer is not None: + eval_dataset = combine_dataset( + args.dataset_eval_mixer_dict, + splits=args.dataset_eval_splits, + columns_to_keep=[dataset_config.sft_messages_key], + ) + eval_dataset = eval_dataset.select(range(0, min(len(eval_dataset), dataset_config.sanity_check_max_samples))) + with accelerator.main_process_first(): + eval_dataset = dataset_processor.tokenize(eval_dataset) + eval_dataset = dataset_processor.filter(eval_dataset) + dataset_dict["eval"] = eval_dataset + + # some more runtime logging + if accelerator.is_main_process: + pprint([args, dataset_config, model_config]) + visualize_token(train_dataset[0][INPUT_IDS_PROMPT_KEY], tokenizer) + if args.with_tracking: + # upload the visualized token length + dataset_processor.get_token_length_visualization( + dataset_dict, save_path=f"runs/{args.run_name}/token_length.png" + ) + wandb.log({"token_length": wandb.Image(f"runs/{args.run_name}/token_length.png")}) + + # create the model and optimizer + policy: PreTrainedModel = AutoModelForCausalLM.from_pretrained( + 
model_config.model_name_or_path, + revision=model_config.model_revision, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + ref_model: PreTrainedModel = AutoModelForCausalLM.from_pretrained( + model_config.model_name_or_path, + revision=model_config.model_revision, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + reward_model: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained( + args.reward_model_path, + revision=args.reward_model_revision, + num_labels=1, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + if policy.config.vocab_size != reward_model.config.vocab_size: + raise ValueError( + "Policy and reward model must have the same vocab size. " + f"Policy: {policy.config.vocab_size}, Reward: {reward_model.config.vocab_size}. " + "If they don't have the same vocab size, the policy could generate tokens which " + "is going to cause index out of bound error in the reward model." + ) + model = policy + if model_config.gradient_checkpointing: + model.gradient_checkpointing_enable() + for module in [model, ref_model, reward_model]: + disable_dropout_in_model(module) + if args.stop_token: + if args.stop_token == "eos": + args.stop_token_id = tokenizer.eos_token_id + if args.stop_token == "period": + args.stop_token_id = tokenizer.encode(".")[0] + optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate, eps=args.eps) + scheduler = get_scheduler( + args.lr_scheduler_type, + optimizer=optimizer, + num_warmup_steps=args.warm_up_steps, + num_training_steps=args.num_training_steps * args.num_train_epochs, + ) + data_collator = SimpleGenerateCollator(pad_token_id=tokenizer.pad_token_id) + dataloader = DataLoader( + train_dataset, + batch_size=args.local_dataloader_batch_size, + shuffle=True, + collate_fn=data_collator, + drop_last=True, # needed; otherwise the last batch will be of ragged shape + ) + # sync random states for DataLoader(shuffle=True) before `accelerator.prepare` + # see https://gist.github.com/vwxyzjn/2581bff1e48e185e0b85b6dfe1def79c + torch.manual_seed(args.seed) + model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader) + torch.manual_seed(local_seed) + + # resume from preemption + resume_training_step = 1 + if os.path.exists(args.checkpoint_output_dir): + for item in os.listdir(args.checkpoint_output_dir): + print(item) + if "step_" in item: + old_checkpoint_path = os.path.join(args.checkpoint_output_dir, item) + # check if the directory is empty + if len(os.listdir(old_checkpoint_path)) == 0: + continue + accelerator.load_state(old_checkpoint_path) + resume_training_step = int(item.split("_")[-1]) + print("Resuming training from step", resume_training_step) + if accelerator.is_main_process: + shutil.rmtree(old_checkpoint_path) + break + resume_training_step > 1 + + # handle preemption + class PreemptionHandler: + preemptied = False + + def __init__(self): + signal.signal(signal.SIGTERM, self.exit_gracefully) + + def exit_gracefully(self, signum, frame): + output_dir = os.path.join(args.checkpoint_output_dir, f"step_{training_step - 1}") + print(f"SIGTERM received, saving to {output_dir} from {accelerator.local_process_index}") + accelerator.save_state(output_dir) + if accelerator.is_main_process and args.with_tracking: + wandb.log({"preempted": True}, commit=True) + wandb.mark_preempting() + if accelerator.is_main_process: + try: + param_prompt_Q.put(None, timeout=20) + 
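+                    # `None` is the shutdown sentinel for the vllm thread's generate loop;
+                    # draining `response_ids_Q` below unblocks the thread if it is stuck
+                    # putting a finished generation into the maxsize=1 queue, so it can
+                    # see the sentinel and exit.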
response_ids_Q.get(timeout=20) + print("vllm thread terminated") + except Exception as e: + print(e) + self.preemptied = True + + ph = PreemptionHandler() + + # deepspeed setup + is_deepspeed_enabled = getattr(accelerator.state, "deepspeed_plugin", None) is not None + mixed_precision = accelerator.state.mixed_precision + if is_deepspeed_enabled: + reward_model = prepare_deepspeed(reward_model, args.per_device_train_batch_size, mixed_precision) + ref_model = prepare_deepspeed(ref_model, args.per_device_train_batch_size, mixed_precision) + else: + reward_model = reward_model.to(device) + ref_model = ref_model.to(device) + + # online generation config + def repeat_generator(): + while True: + yield from dataloader + + iter_dataloader = iter(repeat_generator()) + generation_config = SamplingParams( + temperature=args.temperature, + top_p=1.0, + max_tokens=args.response_length, + include_stop_str_in_output=True, + ) + param_prompt_Q = None + response_ids_Q = None + evaluation_Q = None + if accelerator.is_main_process: + response_ids_Q = Queue(maxsize=1) + param_prompt_Q = Queue(maxsize=1) + evaluation_Q = Queue(maxsize=1) + LOCAL_NUM_EVAL_SAMPLES = 4 + num_eval_samples = LOCAL_NUM_EVAL_SAMPLES * accelerator.num_processes + sample_evaluation_prompt_token_ids = None + if eval_dataset is not None: + sample_evaluation_prompt_token_ids = eval_dataset[:num_eval_samples][INPUT_IDS_PROMPT_KEY] + thread = threading.Thread( + target=vllm_generate, + args=( + model_config.model_name_or_path, + model_config.model_revision, + dataset_config.max_prompt_token_lenth + args.response_length, + args.vllm_device, + args.vllm_gpu_memory_utilization, + generation_config, + response_ids_Q, + param_prompt_Q, + args.num_training_steps, + sample_evaluation_prompt_token_ids, + evaluation_Q, + args.eval_freq, + resume_training_step, + ), + ) + thread.start() + torch.cuda.set_device(device) + + g_vllm_responses = torch.zeros((args.batch_size, args.response_length), device=device, dtype=torch.long) + + # set up the metrics and initial states + stats_shape = (args.num_epochs, args.num_mini_batches, args.gradient_accumulation_steps) + loss_stats = torch.zeros(stats_shape, device=device) + chosen_rewards_stats = torch.zeros(stats_shape, device=device) + rejected_rewards_stats = torch.zeros(stats_shape, device=device) + chosen_logprobs_stats = torch.zeros(stats_shape, device=device) + rejected_logprobs_stats = torch.zeros(stats_shape, device=device) + local_metrics = torch.zeros((20,), device=device) + episode = args.batch_size * (resume_training_step - 1) + model.train() + + # training loop + start_time = time.time() + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + queries_next = queries_next.repeat(args.num_generation_per_prompt, 1) + send_queries(accelerator, None, tokenizer, param_prompt_Q, queries_next) + + for _ in range(1, resume_training_step): # we didn't store scheduler state + scheduler.step() + + for training_step in range(resume_training_step, args.num_training_steps + 1): + episode += args.batch_size + scheduler.step() + queries = queries_next + if ph.preemptied: + break + + if accelerator.is_main_process: + try: + evaluation_responses = evaluation_Q.get(timeout=0.01) + print("🔥🔥🔥 Evaluation responses received") + table = {} + table["prompt"] = tokenizer.batch_decode(sample_evaluation_prompt_token_ids) + table["response"] = tokenizer.batch_decode(evaluation_responses) + table["response"] = [item.replace(tokenizer.pad_token, "") for item in table["response"]] + df = 
pd.DataFrame(table) + print_rich_table(df) + if args.with_tracking: + wandb.log({"sample_completions": wandb.Table(dataframe=df)}) + else: + print_rich_table(df) + del table + except Empty: + print("🙈 Evaluation responses not received") + + with unwrap_model_for_generation(model, accelerator) as unwrapped_model: + # (optionally) evaluate the model + generation_model = unwrapped_model + if args.async_mode: + if training_step != 1: + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + queries_next = queries_next.repeat(args.num_generation_per_prompt, 1) + send_queries(accelerator, generation_model, tokenizer, param_prompt_Q, queries_next) + else: + if training_step != 1: + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + queries_next = queries_next.repeat(args.num_generation_per_prompt, 1) + # NOTE: important: the indent here is different for sync mode + send_queries(accelerator, generation_model, tokenizer, param_prompt_Q, queries_next) + + training_time_start = time.time() + with torch.no_grad(): + context_length = queries.shape[1] + responses = [] + postprocessed_responses = [] + logprobs = [] + ref_logprobs = [] + scores = [] + sequence_lengths = [] + if accelerator.is_main_process: + g_response_token_ids = response_ids_Q.get() + DUMMY_PAD_TOKEN = 0 # we can't use tokenizer.pad_token_id because it's outside vocab and `torch.gather(all_logprob, 2, response.unsqueeze(-1))` will error out + g_padded_response_ids = [ + response + [DUMMY_PAD_TOKEN] * (args.response_length - len(response)) + for response in g_response_token_ids + ] + for item in g_padded_response_ids: + assert len(item) == args.response_length + for inner_item in item: + if not inner_item < config.vocab_size: + assert inner_item < config.vocab_size, f"{inner_item=}, {tokenizer.vocab_size=}" + g_padded_response_ids = torch.tensor(g_padded_response_ids, device=device) + g_vllm_responses[:] = g_padded_response_ids + broadcast(g_vllm_responses, 0) + local_vllm_responses = g_vllm_responses[ + accelerator.local_process_index + * queries.shape[0] : (accelerator.local_process_index + 1) + * queries.shape[0] + ] + query_responses = torch.cat((queries, local_vllm_responses), 1) + for i in range(0, queries.shape[0], args.local_rollout_forward_batch_size): + query = queries[i : i + args.local_rollout_forward_batch_size] + query_response = query_responses[i : i + args.local_rollout_forward_batch_size] + response = query_response[:, context_length:] + output = forward(generation_model, query_response, tokenizer.pad_token_id) + logits = output.logits[:, context_length - 1 : -1] + logits /= args.temperature + 1e-7 + all_logprob = F.log_softmax(logits, dim=-1) + logprob = torch.gather(all_logprob, 2, response.unsqueeze(-1)).squeeze(-1) + del output, logits, all_logprob + torch.cuda.empty_cache() + + ref_output = forward(ref_model, query_response, tokenizer.pad_token_id) + ref_logits = ref_output.logits[:, context_length - 1 : -1] + ref_logits /= args.temperature + 1e-7 + ref_all_logprob = F.log_softmax(ref_logits, dim=-1) + ref_logprob = torch.gather(ref_all_logprob, 2, response.unsqueeze(-1)).squeeze(-1) + del ref_output, ref_logits, ref_all_logprob + torch.cuda.empty_cache() + + # Response Processing 1. 
truncate response after the first occurrence of `stop_token_id` + postprocessed_response = response + if args.stop_token_id is not None: # handle the edge case when stop_token_id exists but is 0 + postprocessed_response = truncate_response( + args.stop_token_id, tokenizer.pad_token_id, response + ) + + # Response Processing 2. run reward model on the truncated responses + postprocessed_query_response = torch.cat((query, postprocessed_response), 1) + sequence_length = first_true_indices(postprocessed_response == tokenizer.pad_token_id) - 1 + _, score, _ = get_reward( + reward_model, postprocessed_query_response, tokenizer.pad_token_id, context_length + ) + + responses.append(response) + postprocessed_responses.append(postprocessed_response) + logprobs.append(logprob) + ref_logprobs.append(ref_logprob) + sequence_lengths.append(sequence_length) + scores.append(score) + responses = torch.cat(responses, 0) + postprocessed_responses = torch.cat(postprocessed_responses, 0) + logprobs = torch.cat(logprobs, 0) + ref_logprobs = torch.cat(ref_logprobs, 0) + sequence_lengths = torch.cat(sequence_lengths, 0) + scores = torch.cat(scores, 0) + global_scores = accelerator.gather(scores) + accelerator.print(f"global_scores: {global_scores}, {global_scores.mean()}") + del (logprob, ref_logprob, score) + gc.collect() + torch.cuda.empty_cache() + + # Response Processing 3. filter response. Ensure that the sample contains stop_token_id + # responses not passing that filter will receive a low (fixed) score + # only query humans on responses that pass that filter + contain_stop_token = torch.any(postprocessed_responses == args.stop_token_id, dim=-1) + # NOTE: only apply the stop token filter if the response is long enough + # otherwise the model could learn to generate the first token as the stop token + contain_stop_token = contain_stop_token & (sequence_lengths >= args.min_response_length) + if args.non_stop_penalty: + scores = torch.where( + contain_stop_token, scores, torch.full_like(scores, args.penalty_reward_value) + ) + + # be very careful with `padding_mask_p1`; see https://excalidraw.com/#json=LWnzG4w2k5DjF_EOL_xPt,e2w3a-hFJ_gX5vOfeyXGTw + response_idxs = torch.arange(responses.shape[1], device=responses.device).repeat(responses.shape[0], 1) + padding_mask = response_idxs > sequence_lengths.unsqueeze(1) + logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB) + ref_logprobs = torch.masked_fill(ref_logprobs, padding_mask, INVALID_LOGPROB) + + # 4. 
compute rewards + kl = logprobs - ref_logprobs + print(f"{accelerator.local_process_index=}, {kl.sum(1)=}") + non_score_reward = -args.beta * kl + non_score_reward_sum = non_score_reward.sum(1) + rlhf_reward = scores + non_score_reward_sum + + # num_examples should be same as args.local_batch_size divided by 2 + num_examples = scores.size(0) // 2 + first_half = scores[:num_examples] + second_half = scores[num_examples:] + + num_examples_range = torch.arange(num_examples).to(scores.device) + chosen_indices = torch.where( + first_half >= second_half, num_examples_range.clone(), num_examples_range.clone() + num_examples + ) + rejected_indices = torch.where( + first_half < second_half, num_examples_range.clone(), num_examples_range.clone() + num_examples + ) + scores_margin = scores[chosen_indices] - scores[rejected_indices] + + # Do multiple epochs of training on on-policy data (PPO-style), with a fresh random shuffle in each epoch + for epoch_idx in range(args.num_epochs): + b_inds = np.random.permutation(args.local_batch_size // args.num_generation_per_prompt) + minibatch_idx = 0 + for mini_batch_start in range( + 0, + args.local_batch_size // args.num_generation_per_prompt, + args.local_mini_batch_size // args.num_generation_per_prompt, + ): + mini_batch_end = mini_batch_start + args.local_mini_batch_size // args.num_generation_per_prompt + mini_batch_inds = b_inds[mini_batch_start:mini_batch_end] + gradient_accumulation_idx = 0 + for micro_batch_start in range( + 0, + args.local_mini_batch_size // args.num_generation_per_prompt, + args.per_device_train_batch_size, + ): + with accelerator.accumulate(model): + micro_batch_end = micro_batch_start + args.per_device_train_batch_size + micro_batch_inds = mini_batch_inds[micro_batch_start:micro_batch_end] + chosen_mb_inds = chosen_indices[micro_batch_inds] + chosen_responses = responses[chosen_mb_inds] + rejected_mb_inds = rejected_indices[micro_batch_inds] + rejected_responses = responses[rejected_mb_inds] + + concat_mb_inds = torch.cat((chosen_mb_inds, rejected_mb_inds), dim=0) + concat_query_responses = query_responses[concat_mb_inds] + concat_output = forward(model, concat_query_responses, tokenizer.pad_token_id) + num_examples = chosen_mb_inds.shape[0] + chosen_logits = concat_output.logits[:num_examples] + rejected_logits = concat_output.logits[num_examples:] + + # chosen + chosen_logits = chosen_logits[:, context_length - 1 : -1] + chosen_logits /= args.temperature + 1e-7 + chosen_all_logprobs = F.log_softmax(chosen_logits, dim=-1) + chosen_logprobs = torch.gather(chosen_all_logprobs, 2, chosen_responses.unsqueeze(-1)).squeeze( + -1 + ) + chosen_logprobs = torch.masked_fill( + chosen_logprobs, padding_mask[chosen_mb_inds], INVALID_LOGPROB + ) + chosen_ref_logprobs = ref_logprobs[chosen_mb_inds] + chosen_logprobs_sum = (chosen_logprobs * ~padding_mask[chosen_mb_inds]).sum(1) + chosen_ref_logprobs_sum = (chosen_ref_logprobs * ~padding_mask[chosen_mb_inds]).sum(1) + + # rejected + rejected_logits = rejected_logits[:, context_length - 1 : -1] + rejected_logits /= args.temperature + 1e-7 + rejected_all_logprobs = F.log_softmax(rejected_logits, dim=-1) + rejected_logprobs = torch.gather( + rejected_all_logprobs, 2, rejected_responses.unsqueeze(-1) + ).squeeze(-1) + rejected_logprobs = torch.masked_fill( + rejected_logprobs, padding_mask[rejected_mb_inds], INVALID_LOGPROB + ) + rejected_ref_logprobs = ref_logprobs[rejected_mb_inds] + rejected_logprobs_sum = (rejected_logprobs * ~padding_mask[rejected_mb_inds]).sum(1) + rejected_ref_logprobs_sum 
= (rejected_ref_logprobs * ~padding_mask[rejected_mb_inds]).sum(1) + + pi_logratios = chosen_logprobs_sum - rejected_logprobs_sum + ref_logratios = chosen_ref_logprobs_sum - rejected_ref_logprobs_sum + + logits = pi_logratios - ref_logratios + + if args.loss_type == "sigmoid": + losses = -F.logsigmoid(args.beta * logits) + elif args.loss_type == "ipo": + losses = (logits - 1 / (2 * args.beta)) ** 2 + else: + raise NotImplementedError(f"invalid loss type {args.loss_type}") + + loss = losses.mean() + accelerator.backward(loss) + optimizer.step() + optimizer.zero_grad() + with torch.no_grad(): + chosen_rewards = args.beta * (chosen_logprobs_sum - chosen_ref_logprobs_sum) + rejected_rewards = args.beta * (rejected_logprobs_sum - rejected_ref_logprobs_sum) + loss_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = loss + chosen_rewards_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = ( + chosen_rewards.mean() + ) + rejected_rewards_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = ( + rejected_rewards.mean() + ) + chosen_logprobs_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = ( + chosen_logprobs_sum.mean() + ) + rejected_logprobs_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = ( + rejected_logprobs_sum.mean() + ) + gradient_accumulation_idx += 1 + minibatch_idx += 1 + # fmt: off + del ( + loss, logits, concat_output, concat_query_responses, + chosen_logits, rejected_logits, chosen_logprobs, rejected_logprobs, + chosen_responses, rejected_responses, + ) + # fmt: on + # del everything and empty cache + torch.cuda.empty_cache() + with torch.no_grad(): + local_metrics[0] = sequence_lengths.float().mean() + local_metrics[1] = (responses == args.stop_token_id).sum().float().mean() + local_metrics[2] = kl.sum(1).mean() + local_metrics[3] = (-logprobs).sum(1).mean() + local_metrics[4] = non_score_reward_sum.mean() + local_metrics[5] = rlhf_reward.mean() + local_metrics[6] = scores.mean() + local_metrics[7] = scores_margin.mean() + local_metrics[8] = loss_stats.mean() + local_metrics[9] = chosen_rewards_stats.mean() + local_metrics[10] = rejected_rewards_stats.mean() + local_metrics[11] = (chosen_rewards_stats > rejected_rewards_stats).float().mean() + local_metrics[12] = (chosen_rewards_stats - rejected_rewards_stats).mean() + local_metrics[13] = chosen_logprobs_stats.mean() + local_metrics[14] = rejected_logprobs_stats.mean() + local_metrics[15] = ((kl) ** 2 / 2).sum(1).mean() + local_metrics[16] = ((-kl).exp() - 1 + kl).sum(1).mean() + global_metrics = accelerator.reduce(local_metrics, reduction="mean").tolist() + metrics = { + "episode": episode, + "training_step": training_step, + "lr": scheduler.get_last_lr()[0], + "epoch": episode / len(train_dataset), + "time/from_scratch": time.time() - start_time, + "time/training": time.time() - training_time_start, + "val/sequence_lengths": global_metrics[0], + "val/num_stop_token_ids": global_metrics[1], + "objective/kl": global_metrics[2], + "objective/kl2": global_metrics[15], + "ojbective/kl3": global_metrics[16], + "objective/entropy": global_metrics[3], + "objective/non_score_reward": global_metrics[4], + "objective/rlhf_reward": global_metrics[5], + "objective/scores": global_metrics[6], + "objective/scores_margin": global_metrics[7], + "objective/loss": global_metrics[8], + "rewards/chosen": global_metrics[9], + "rewards/rejected": global_metrics[10], + "rewards/accuracies": global_metrics[11], + "rewards/margins": global_metrics[12], + "logps/chosen": global_metrics[13], + "logps/rejected": 
global_metrics[14], + } + if accelerator.is_main_process: + print_rich_single_line_metrics(metrics) + for key, value in metrics.items(): + writer.add_scalar(key, value, episode) + del (queries, responses, postprocessed_responses, logprobs, ref_logprobs, sequence_lengths, scores) + del (metrics, kl, non_score_reward, rlhf_reward) + gc.collect() + torch.cuda.empty_cache() + + if not ph.preemptied: + # save model + os.makedirs(os.path.dirname(args.output_dir), exist_ok=True) + original_tokenizer = AutoTokenizer.from_pretrained( + model_config.model_name_or_path, revision=model_config.model_revision + ) + save_with_accelerate( + accelerator, + model, + original_tokenizer, + args.output_dir, + ) + + # Ai2 specific logic + if is_beaker_job() and accelerator.is_main_process: + if args.hf_metadata_dataset: + dataset_list = list(args.dataset_mixer_dict.keys()) + # mainly just focussing here on what would be useful for the leaderboard. + # wandb will have even more useful information. + metadata_blob = { + "model_name": args.exp_name, + "model_type": "sft", + "datasets": dataset_list, + "base_model": model_config.model_name_or_path, + "wandb_path": wandb.run.get_url(), + "beaker_experiment": beaker_config.beaker_experiment_url, + "beaker_datasets": beaker_config.beaker_dataset_id_urls, + } + upload_metadata_to_hf( + metadata_blob, + "metadata.json", + args.hf_metadata_dataset, + "results/" + args.hf_repo_revision, # to match what the auto-evals name as. + ) + + if args.try_launch_beaker_eval_jobs and len(beaker_config.beaker_dataset_id_urls) > 0: + command = f"""\ + python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/general-cirrascale-a5000 ai2/general-cirrascale-a5000 ai2/s2-cirrascale ai2/general-cirrascale \ + --priority low \ + --preemptible \ + --budget ai2/allennlp \ + --workspace ai2/tulu-2-improvements \ + --image nathanl/open_instruct_auto \ + --pure_docker_mode \ + --gpus 0 -- python scripts/wait_beaker_dataset_model_upload_then_evaluate_model.py \ + --beaker_workload_id {beaker_config.beaker_workload_id} \ + --model_name {args.hf_repo_revision} + """ + process = subprocess.Popen(["bash", "-c", command], stdout=subprocess.PIPE, stderr=subprocess.PIPE) + stdout, stderr = process.communicate() + print(f"Submit jobs after model training is finished - Stdout:\n{stdout.decode()}") + print(f"Submit jobs after model training is finished - Stderr:\n{stderr.decode()}") + print(f"Submit jobs after model training is finished - process return code: {process.returncode}") + + if args.push_to_hub: + push_folder_to_hub( + accelerator, + args.output_dir, + args.hf_repo_id, + args.hf_repo_revision, + ) + + if accelerator.is_main_process: + # remove args.checkpoint_output_dir + if os.path.exists(args.checkpoint_output_dir): + shutil.rmtree(args.checkpoint_output_dir, ignore_errors=True) + + +if __name__ == "__main__": + parser = ArgumentParserPlus((Args, DatasetConfig, ModelConfig)) + main(*parser.parse()) diff --git a/open_instruct/ppo_vllm_thread.py b/open_instruct/ppo_vllm_thread.py new file mode 100644 index 000000000..14b6854ec --- /dev/null +++ b/open_instruct/ppo_vllm_thread.py @@ -0,0 +1,1077 @@ +import gc +import json +import os +import random +import shutil +import signal +import subprocess +import threading +import time +from dataclasses import asdict, dataclass +from queue import Empty, Queue +from typing import List, Literal, Optional, Tuple + +import numpy as np +import pandas as pd +import torch +import torch.nn.functional as F +import torch.optim as optim +import torch.utils +import 
torch.utils.data +from accelerate import Accelerator +from accelerate.utils import broadcast, gather_object +from datasets import DatasetDict +from huggingface_hub import HfApi +from rich.pretty import pprint +from torch.utils.data import DataLoader +from torch.utils.tensorboard import SummaryWriter +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoModelForSequenceClassification, + AutoTokenizer, + PreTrainedModel, + get_scheduler, +) +from vllm import LLM, SamplingParams + +from open_instruct.dataset_processor import ( + CHAT_TEMPLATES, + INPUT_IDS_PROMPT_KEY, + DatasetConfig, + SFTDatasetProcessor, + SimpleGenerateCollator, + visualize_token, +) +from open_instruct.model_utils import ( + ModelConfig, + disable_dropout_in_model, + exact_div, + first_true_indices, + forward, + get_reward, + prepare_deepspeed, + print_rich_single_line_metrics, + print_rich_table, + push_folder_to_hub, + save_with_accelerate, + truncate_response, + unwrap_model_for_generation, +) +from open_instruct.utils import ( + ArgumentParserPlus, + combine_dataset, + get_wandb_tags, + is_beaker_job, + maybe_get_beaker_config, + maybe_use_ai2_wandb_entity, + upload_metadata_to_hf, +) +from open_instruct.vllm_utils import vllm_single_gpu_patch + +api = HfApi() +INVALID_LOGPROB = 1.0 + + +@dataclass +class Args: + # required dataset args + dataset_mixer: str = None + """A dictionary of datasets (local or HF) to sample from.""" + dataset_train_splits: List[str] = None + """The dataset splits to use for training""" + dataset_eval_mixer: Optional[str] = None + """A dictionary of datasets (local or HF) to sample from for evaluation""" + dataset_eval_splits: Optional[List[str]] = None + """The dataset splits to use for evaluation""" + dataset_mixer_dict: Optional[dict] = None + """The dataset mixer as a dictionary""" + dataset_eval_mixer_dict: Optional[dict] = None + """The dataset eval mixer as a dictionary""" + + # common args + exp_name: str = os.path.basename(__file__)[: -len(".py")] + """The name of this experiment""" + seed: int = 1 + """Seed of the experiment""" + run_name: Optional[str] = None + """A unique name of this run""" + + # optimizer args + eps: float = 1e-5 + """The epsilon value for the optimizer""" + learning_rate: float = 2e-5 + """The initial learning rate for AdamW optimizer.""" + lr_scheduler_type: Literal[ + "linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup" + ] = "linear" + """Which scheduler to use""" + warm_up_steps: int = 0 + """Number of warm up steps for the scheduler""" + + # various batch sizes + num_train_epochs: int = 1 + """Number of epochs to train""" + gradient_accumulation_steps: int = 8 + """The number of gradient accumulation steps""" + per_device_train_batch_size: Optional[int] = 1 + """The forward batch size per device (local_micro_batch_size)""" + per_device_eval_batch_size: Optional[int] = 1 + """The forward batch size per device for evaluation (local_micro_batch_size)""" + total_episodes: Optional[int] = 100000 + """The total number of episodes in the dataset""" + world_size: Optional[int] = None + """The number of processes (GPUs) to use""" + micro_batch_size: Optional[int] = None + """The micro batch size across devices (HF's `per_device_train_batch_size` * `world_size`)""" + local_batch_size: Optional[int] = None + """The batch size per GPU (HF's `per_device_train_batch_size` * `gradient_accumulation_steps`)""" + batch_size: Optional[int] = None + """The batch size across devices (HF's `per_device_train_batch_size` 
* `world_size` * `gradient_accumulation_steps`)""" + num_training_steps: Optional[int] = None + """The number of training_steps to train""" + num_evals: int = 4 + """The number of evaluations to run throughout training""" + eval_freq: Optional[int] = None + """The frequency of evaluation steps""" + local_dataloader_batch_size: Optional[int] = None + """The batch size per GPU for the dataloader""" + + # online settings + num_epochs: int = 4 + """the number of epochs to train""" + num_mini_batches: int = 1 + """Number of minibatches to split a batch into""" + local_mini_batch_size: Optional[int] = None + """the mini batch size per GPU""" + mini_batch_size: Optional[int] = None + """the mini batch size across GPUs""" + local_rollout_forward_batch_size: int = 64 + """per rank no grad forward pass in the rollout phase""" + reward_model_path: str = "EleutherAI/pythia-160m" + """the path to the reward model""" + reward_model_revision: Optional[str] = None + """the revision of the reward model""" + + # generation config + response_length: int = 53 + """the length of the response""" + stop_token: Optional[Literal["eos", "period"]] = None + """the stop token""" + stop_token_id: Optional[int] = None + """the truncation token id""" + min_response_length: int = 0 + """stop only after this many tokens""" + temperature: float = 0.7 + """the sampling temperature""" + penalty_reward_value: float = -1.0 + """the reward value for responses that do not contain `stop_token_id`""" + non_stop_penalty: bool = False + """whether to penalize responses that do not contain `stop_token_id`""" + + # online PPO specific args + beta: float = 0.05 + """the beta value of the RLHF objective (KL coefficient)""" + whiten_rewards: bool = False + """whether to whiten the rewards""" + cliprange: float = 0.2 + """the clip range""" + vf_coef: float = 0.1 + """the value function coefficient""" + cliprange_value: float = 0.2 + """the clip range for the value function""" + gamma: float = 1 + """the discount factor""" + lam: float = 0.95 + """the lambda value for GAE""" + kl_estimator: Literal["kl1", "kl2", "kl3"] = "kl1" + + # vLLM settings. NOTE: currently we need to place the vLLM model on a separate GPU + # for generation to work properly because vLLM would pre-alocate the memory. + # To do so, we would need to do a moneky patch `vllm_single_gpu_patch` to make sure + # the vLLM model is placed on the correct GPU. 
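+    # The vLLM engine below is built with
+    # `max_model_len = dataset_config.max_prompt_token_lenth + response_length`,
+    # so prompts are expected to fit within the dataset's prompt-length limit.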
+ vllm_device: str = "cuda:1" + """the device placement of the vllm model; typically we place the vllm model on a decicated GPU""" + vllm_gpu_memory_utilization: float = 0.8 + """the GPU memory utilization of the vllm model; passed to `gpu_memory_utilization` to the `vLLM` instance""" + # async setting + async_mode: bool = True + """Whether to run the generation in async mode which learns from the second latest policy like Cleanba (https://arxiv.org/abs/2310.00036)""" + + # wandb and HF tracking configs + with_tracking: bool = False + """If toggled, this experiment will be tracked with Weights and Biases""" + wandb_project_name: str = "open_instruct_internal" + """The wandb's project name""" + wandb_entity: Optional[str] = None + """The entity (team) of wandb's project""" + push_to_hub: bool = True + """Whether to upload the saved model to huggingface""" + hf_entity: Optional[str] = None + """The user or org name of the model repository from the Hugging Face Hub""" + hf_repo_id: Optional[str] = None + """The id of the saved model in the Hugging Face Hub (can be autoset if not given)""" + hf_repo_revision: Optional[str] = None + """The revision of the saved model in the Hugging Face Hub (can be autoset if not given)""" + hf_repo_url: Optional[str] = None + """The url of the saved model in the Hugging Face Hub (will be autoset)""" + output_dir: Optional[str] = None + """Where to save the model""" + checkpoint_output_dir: Optional[str] = None + """Where to save the model checkpoints in case of preemption""" + + # Ai2 specific settings + try_launch_beaker_eval_jobs: bool = True + """Whether to launch beaker evaluation jobs after training""" + hf_metadata_dataset: Optional[str] = "allenai/tulu-3-evals" + """What dataset to upload the metadata to. If unset, don't upload metadata""" + + def __post_init__(self): + self.dataset_mixer_dict, self.dataset_mixer = process_dataset_mixer(self.dataset_mixer) + if self.dataset_eval_mixer is not None: + self.dataset_eval_mixer_dict, self.dataset_eval_mixer = process_dataset_mixer(self.dataset_eval_mixer) + + +def process_dataset_mixer(value) -> Tuple[Optional[dict], Optional[str]]: + # if passed through cli: convert the dataset mixers to dictionaries + if isinstance(value, str): + return json.loads(value), value + # if passed through yaml: convert the dataset mixers to strings + elif isinstance(value, dict): + return value, json.dumps(value) + else: + raise ValueError("Input must be either a string or a dictionary") + + +def calculate_runtime_args_and_accelerator(args: Args, model_config: ModelConfig) -> Accelerator: + """calculate (in-place) runtime args such as the effective batch size, word size, etc.""" + accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps) + args.world_size = accelerator.num_processes + args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps * args.num_mini_batches + args.micro_batch_size = int(args.per_device_train_batch_size * args.world_size) + args.batch_size = int(args.local_batch_size * args.world_size) + time_tensor = torch.tensor(int(time.time()), device=accelerator.device) + # set a unique run name with the current timestamp + time_int = broadcast(time_tensor, 0).item() + args.run_name = f"{args.exp_name}__{args.seed}__{time_int}" + args.mini_batch_size = exact_div( + args.batch_size, args.num_mini_batches, "`batch_size` must be a multiple of `num_mini_batches`" + ) + args.local_mini_batch_size = exact_div( + args.local_batch_size, args.num_mini_batches, 
"`local_batch_size` must be a multiple of `num_mini_batches`" + ) + args.num_training_steps = args.total_episodes // args.batch_size + args.eval_freq = max(1, args.num_training_steps // args.num_evals) + # PPO logic: do checks and set up dataloader batch size + if args.whiten_rewards: + assert ( + args.local_mini_batch_size >= 8 + ), f"Per-rank minibatch size {args.local_mini_batch_size} is insufficient for whitening" + args.local_dataloader_batch_size = args.local_batch_size + if args.push_to_hub: + if args.hf_repo_id is None: # auto-generate one + args.hf_repo_id = f"{args.exp_name}__{model_config.model_name_or_path.replace('/', '_')}" + if args.hf_entity is None: + args.hf_entity = api.whoami()["name"] + args.hf_repo_id = f"{args.hf_entity}/{args.hf_repo_id}" + if args.hf_repo_revision is None: # auto-generate one + args.hf_repo_revision = args.run_name + args.hf_repo_url = f"https://huggingface.co/{args.hf_repo_id}/tree/{args.hf_repo_revision}" + + if args.with_tracking and accelerator.is_main_process: + if args.wandb_entity is None: + args.wandb_entity = maybe_use_ai2_wandb_entity() + return accelerator + + +def vllm_generate( + model_name_or_path: str, + model_revision: Optional[str], + max_model_len: int, + vllm_device: str, + vllm_gpu_memory_utilization: float, + generation_config: SamplingParams, + response_ids_Q: Queue, + param_prompt_Q: Queue, + num_training_steps: int, + sample_evaluation_prompt_token_ids: Optional[List[int]], + evaluation_Q: Queue, + eval_freq: int, + resume_training_step: int, +): + vllm_single_gpu_patch() + llm = LLM( + model=model_name_or_path, + revision=model_revision, + tokenizer_revision=model_revision, + tensor_parallel_size=1, + device=vllm_device, + gpu_memory_utilization=vllm_gpu_memory_utilization, + max_model_len=max_model_len, + ) + print("🔥🔥🔥 vllm loaded") + llmp = llm.llm_engine.model_executor.driver_worker.model_runner.model + for training_step in range(resume_training_step, num_training_steps + 1): + items = param_prompt_Q.get() + if items is None: + break + unwrapped_model, g_queries_list = items + if unwrapped_model is not None: + start_time = time.time() + llmp.load_weights(unwrapped_model.named_parameters()) + print( + f"🔥🔥🔥 Loading weights using shared memory; Time to load weights: {time.time() - start_time:.2f} seconds" + ) + generation_start_time = time.time() + outputs = llm.generate(prompt_token_ids=g_queries_list, sampling_params=generation_config) + response_ids = [list(output.outputs[0].token_ids) for output in outputs] + print(f"🔥🔥🔥 Generation time: {time.time() - generation_start_time:.2f} seconds") + response_ids_Q.put(response_ids) + + if sample_evaluation_prompt_token_ids is not None and (training_step - 1) % eval_freq == 0: + outputs = llm.generate( + prompt_token_ids=sample_evaluation_prompt_token_ids, sampling_params=generation_config + ) + response_ids = [list(output.outputs[0].token_ids) for output in outputs] + evaluation_Q.put(response_ids) + + +def send_queries(accelerator, unwrapped_model, tokenizer, param_prompt_Q, queries): + g_queries_list = gather_object(queries.tolist()) + if accelerator.is_main_process: + g_queries_list = [ + [inneritem for inneritem in item if inneritem != tokenizer.pad_token_id] for item in g_queries_list + ] # remove padding + param_prompt_Q.put((unwrapped_model, g_queries_list)) + + +# taken from https://github.com/OpenLMLab/MOSS-RLHF/blob/40b91eb2f2b71b16919addede0341d2bef70825d/ppo/ppo_trainer.py#L29 +# we did this we can do a single `model = accelerator.prepare(model)` +class 
PolicyAndValueWrapper(torch.nn.Module): + def __init__(self, policy, value_model) -> None: + super().__init__() + self.policy = policy + self.value_model = value_model + self.critic_backbone = getattr(value_model, value_model.base_model_prefix) + + def forward(self, **kwargs): + output = self.critic_backbone( + **kwargs, + ) + logits = self.value_model.score(output.hidden_states[-1]) + return self.policy(**kwargs), logits + + def gradient_checkpointing_enable(self): + self.policy.gradient_checkpointing_enable() + self.value_model.gradient_checkpointing_enable() + + +def masked_mean(values: torch.Tensor, mask: torch.Tensor, axis: Optional[bool] = None) -> torch.Tensor: + """Compute mean of tensor with a masked values.""" + if axis is not None: + return (values * mask).sum(axis=axis) / mask.sum(axis=axis) + else: + return (values * mask).sum() / mask.sum() + + +def masked_var(values: torch.Tensor, mask: torch.Tensor, unbiased: bool = True) -> torch.Tensor: + """Compute variance of tensor with masked values.""" + mean = masked_mean(values, mask) + centered_values = values - mean + variance = masked_mean(centered_values**2, mask) + if unbiased: + mask_sum = mask.sum() + if mask_sum == 0: + raise ValueError( + "The sum of the mask is zero, which can happen when `mini_batch_size=1`;" + "try increase the `mini_batch_size` or `gradient_accumulation_steps`" + ) + # note that if mask_sum == 1, then there is a division by zero issue + # to avoid it you just need to use a larger minibatch_size + bessel_correction = mask_sum / (mask_sum - 1) + variance = variance * bessel_correction + return variance + + +def masked_whiten(values: torch.Tensor, mask: torch.Tensor, shift_mean: bool = True) -> torch.Tensor: + """Whiten values with masked values.""" + mean, var = masked_mean(values, mask), masked_var(values, mask) + whitened = (values - mean) * torch.rsqrt(var + 1e-8) + if not shift_mean: + whitened += mean + return whitened + + +def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): + accelerator = calculate_runtime_args_and_accelerator(args, model_config) + local_seed = args.seed + accelerator.process_index + + # set up experiment tracking and seeds + all_configs = {} + if is_beaker_job(): + args.checkpoint_output_dir = os.environ.get("CHECKPOINT_OUTPUT_DIR", args.output_dir) + beaker_config = maybe_get_beaker_config() + # try saving to the beaker `/output`, which will be uploaded to the beaker dataset + if len(beaker_config.beaker_dataset_id_urls) > 0: + args.output_dir = "/output" + all_configs.update(vars(beaker_config)) + all_configs.update(**asdict(args), **asdict(dataset_config), **asdict(model_config)) + if accelerator.is_main_process: + if args.with_tracking: + import wandb + + wandb.init( + project=args.wandb_project_name, + entity=args.wandb_entity, + sync_tensorboard=True, + config=all_configs, + name=args.run_name, + save_code=True, + tags=[args.exp_name] + get_wandb_tags(), + ) + writer = SummaryWriter(f"runs/{args.run_name}") + writer.add_text( + "hyperparameters", + "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()])), + ) + device = torch.device(f"cuda:{accelerator.local_process_index}") + random.seed(local_seed) + np.random.seed(local_seed) + torch.manual_seed(local_seed) + torch.backends.cudnn.deterministic = True + + # create a tokenizer (pad from right) + config = AutoConfig.from_pretrained(model_config.model_name_or_path, revision=model_config.model_revision) + tokenizer = AutoTokenizer.from_pretrained( + 
model_config.model_name_or_path, revision=model_config.model_revision, padding_side="right" + ) + if config.architectures == "LlamaForCausalLM" and config.bos_token_id == 128000: + tokenizer.pad_token_id = 128002 # <|reserved_special_token_0|> + else: + tokenizer.add_special_tokens({"pad_token": "[PAD]"}) # NOTE: we do not resize the embedding + tokenizer.chat_template = CHAT_TEMPLATES[dataset_config.chat_template] + + # create the dataset + dataset_dict = DatasetDict() + dataset_processor = SFTDatasetProcessor(tokenizer=tokenizer, config=dataset_config) + train_dataset = combine_dataset( + args.dataset_mixer_dict, + splits=args.dataset_train_splits, + columns_to_keep=[dataset_config.sft_messages_key], + ) + if dataset_config.sanity_check: + train_dataset = train_dataset.select( + range(0, min(len(train_dataset), dataset_config.sanity_check_max_samples)) + ) + with accelerator.main_process_first(): + train_dataset = dataset_processor.tokenize(train_dataset) + train_dataset = dataset_processor.filter(train_dataset) + dataset_dict["train"] = train_dataset + eval_dataset = None + if args.dataset_eval_mixer is not None: + eval_dataset = combine_dataset( + args.dataset_eval_mixer_dict, + splits=args.dataset_eval_splits, + columns_to_keep=[dataset_config.sft_messages_key], + ) + eval_dataset = eval_dataset.select(range(0, min(len(eval_dataset), dataset_config.sanity_check_max_samples))) + with accelerator.main_process_first(): + eval_dataset = dataset_processor.tokenize(eval_dataset) + eval_dataset = dataset_processor.filter(eval_dataset) + dataset_dict["eval"] = eval_dataset + + # some more runtime logging + if accelerator.is_main_process: + pprint([args, dataset_config, model_config]) + visualize_token(train_dataset[0][INPUT_IDS_PROMPT_KEY], tokenizer) + if args.with_tracking: + # upload the visualized token length + dataset_processor.get_token_length_visualization( + dataset_dict, save_path=f"runs/{args.run_name}/token_length.png" + ) + wandb.log({"token_length": wandb.Image(f"runs/{args.run_name}/token_length.png")}) + + # create the model and optimizer + policy: PreTrainedModel = AutoModelForCausalLM.from_pretrained( + model_config.model_name_or_path, + revision=model_config.model_revision, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + ref_model: PreTrainedModel = AutoModelForCausalLM.from_pretrained( + model_config.model_name_or_path, + revision=model_config.model_revision, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + value_model: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained( + args.reward_model_path, + revision=args.reward_model_revision, + num_labels=1, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + reward_model: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained( + args.reward_model_path, + revision=args.reward_model_revision, + num_labels=1, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) + if policy.config.vocab_size != reward_model.config.vocab_size: + raise ValueError( + "Policy and reward model must have the same vocab size. " + f"Policy: {policy.config.vocab_size}, Reward: {reward_model.config.vocab_size}. " + "If they don't have the same vocab size, the policy could generate tokens which " + "is going to cause index out of bound error in the reward model." 
+ ) + model = PolicyAndValueWrapper(policy, value_model) + if model_config.gradient_checkpointing: + model.gradient_checkpointing_enable() + for module in [model, ref_model, reward_model]: + disable_dropout_in_model(module) + if args.stop_token: + if args.stop_token == "eos": + args.stop_token_id = tokenizer.eos_token_id + if args.stop_token == "period": + args.stop_token_id = tokenizer.encode(".")[0] + optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate, eps=args.eps) + scheduler = get_scheduler( + args.lr_scheduler_type, + optimizer=optimizer, + num_warmup_steps=args.warm_up_steps, + num_training_steps=args.num_training_steps * args.num_train_epochs, + ) + data_collator = SimpleGenerateCollator(pad_token_id=tokenizer.pad_token_id) + dataloader = DataLoader( + train_dataset, + batch_size=args.local_dataloader_batch_size, + shuffle=True, + collate_fn=data_collator, + drop_last=True, # needed; otherwise the last batch will be of ragged shape + ) + # sync random states for DataLoader(shuffle=True) before `accelerator.prepare` + # see https://gist.github.com/vwxyzjn/2581bff1e48e185e0b85b6dfe1def79c + torch.manual_seed(args.seed) + model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader) + torch.manual_seed(local_seed) + + # resume from preemption + resume_training_step = 1 + if os.path.exists(args.checkpoint_output_dir): + for item in os.listdir(args.checkpoint_output_dir): + print(item) + if "step_" in item: + old_checkpoint_path = os.path.join(args.checkpoint_output_dir, item) + # check if the directory is empty + if len(os.listdir(old_checkpoint_path)) == 0: + continue + accelerator.load_state(old_checkpoint_path) + resume_training_step = int(item.split("_")[-1]) + print("Resuming training from step", resume_training_step) + if accelerator.is_main_process: + shutil.rmtree(old_checkpoint_path) + break + resume_training_step > 1 + + # handle preemption + class PreemptionHandler: + preemptied = False + + def __init__(self): + signal.signal(signal.SIGTERM, self.exit_gracefully) + + def exit_gracefully(self, signum, frame): + output_dir = os.path.join(args.checkpoint_output_dir, f"step_{training_step - 1}") + print(f"SIGTERM received, saving to {output_dir} from {accelerator.local_process_index}") + accelerator.save_state(output_dir) + if accelerator.is_main_process and args.with_tracking: + wandb.log({"preempted": True}, commit=True) + wandb.mark_preempting() + if accelerator.is_main_process: + try: + param_prompt_Q.put(None, timeout=20) + response_ids_Q.get(timeout=20) + print("vllm thread terminated") + except Exception as e: + print(e) + self.preemptied = True + + ph = PreemptionHandler() + + # deepspeed setup + is_deepspeed_enabled = getattr(accelerator.state, "deepspeed_plugin", None) is not None + mixed_precision = accelerator.state.mixed_precision + if is_deepspeed_enabled: + reward_model = prepare_deepspeed(reward_model, args.per_device_train_batch_size, mixed_precision) + ref_model = prepare_deepspeed(ref_model, args.per_device_train_batch_size, mixed_precision) + else: + reward_model = reward_model.to(device) + ref_model = ref_model.to(device) + + # online generation config + def repeat_generator(): + while True: + yield from dataloader + + iter_dataloader = iter(repeat_generator()) + generation_config = SamplingParams( + temperature=args.temperature, + top_p=1.0, + max_tokens=args.response_length, + include_stop_str_in_output=True, + ) + param_prompt_Q = None + response_ids_Q = None + evaluation_Q = None + if accelerator.is_main_process: + 
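+        # Only the main process owns the queues and the background vLLM generation
+        # thread; every rank later receives the generated tokens via
+        # `broadcast(g_vllm_responses, 0)` and slices out its own shard.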
response_ids_Q = Queue(maxsize=1) + param_prompt_Q = Queue(maxsize=1) + evaluation_Q = Queue(maxsize=1) + LOCAL_NUM_EVAL_SAMPLES = 4 + num_eval_samples = LOCAL_NUM_EVAL_SAMPLES * accelerator.num_processes + sample_evaluation_prompt_token_ids = None + if eval_dataset is not None: + sample_evaluation_prompt_token_ids = eval_dataset[:num_eval_samples][INPUT_IDS_PROMPT_KEY] + thread = threading.Thread( + target=vllm_generate, + args=( + model_config.model_name_or_path, + model_config.model_revision, + dataset_config.max_prompt_token_lenth + args.response_length, + args.vllm_device, + args.vllm_gpu_memory_utilization, + generation_config, + response_ids_Q, + param_prompt_Q, + args.num_training_steps, + sample_evaluation_prompt_token_ids, + evaluation_Q, + args.eval_freq, + resume_training_step, + ), + ) + thread.start() + torch.cuda.set_device(device) + + g_vllm_responses = torch.zeros((args.batch_size, args.response_length), device=device, dtype=torch.long) + + # set up the metrics and initial states + stats_shape = (args.num_epochs, args.num_mini_batches, args.gradient_accumulation_steps) + approxkl_stats = torch.zeros(stats_shape, device=device) + pg_clipfrac_stats = torch.zeros(stats_shape, device=device) + pg_loss_stats = torch.zeros(stats_shape, device=device) + vf_loss_stats = torch.zeros(stats_shape, device=device) + vf_clipfrac_stats = torch.zeros(stats_shape, device=device) + entropy_stats = torch.zeros(stats_shape, device=device) + ratio_stats = torch.zeros(stats_shape, device=device) + local_metrics = torch.zeros((20,), device=device) + episode = args.batch_size * (resume_training_step - 1) + model.train() + + # training loop + start_time = time.time() + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + send_queries(accelerator, None, tokenizer, param_prompt_Q, queries_next) + + for _ in range(1, resume_training_step): # we didn't store scheduler state + scheduler.step() + + for training_step in range(resume_training_step, args.num_training_steps + 1): + episode += args.batch_size + scheduler.step() + queries = queries_next + if ph.preemptied: + break + + if accelerator.is_main_process: + try: + evaluation_responses = evaluation_Q.get(timeout=0.01) + print("🔥🔥🔥 Evaluation responses received") + table = {} + table["prompt"] = tokenizer.batch_decode(sample_evaluation_prompt_token_ids) + table["response"] = tokenizer.batch_decode(evaluation_responses) + table["response"] = [item.replace(tokenizer.pad_token, "") for item in table["response"]] + df = pd.DataFrame(table) + print_rich_table(df) + if args.with_tracking: + wandb.log({"sample_completions": wandb.Table(dataframe=df)}) + else: + print_rich_table(df) + del table + except Empty: + print("🙈 Evaluation responses not received") + + with unwrap_model_for_generation(model, accelerator) as unwrapped_model: + # (optionally) evaluate the model + generation_model = unwrapped_model.policy + if args.async_mode: + if training_step != 1: + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + send_queries(accelerator, generation_model, tokenizer, param_prompt_Q, queries_next) + else: + if training_step != 1: + data = next(iter_dataloader) + queries_next = data[INPUT_IDS_PROMPT_KEY].to(device) + # NOTE: important: the indent here is different for sync mode + send_queries(accelerator, generation_model, tokenizer, param_prompt_Q, queries_next) + + training_time_start = time.time() + with torch.no_grad(): + context_length = queries.shape[1] + responses = [] + 
postprocessed_responses = [] + logprobs = [] + ref_logprobs = [] + scores = [] + sequence_lengths = [] + values = [] + if accelerator.is_main_process: + g_response_token_ids = response_ids_Q.get() + DUMMY_PAD_TOKEN = 0 # we can't use tokenizer.pad_token_id because it's outside vocab and `torch.gather(all_logprob, 2, response.unsqueeze(-1))` will error out + g_padded_response_ids = [ + response + [DUMMY_PAD_TOKEN] * (args.response_length - len(response)) + for response in g_response_token_ids + ] + for item in g_padded_response_ids: + assert len(item) == args.response_length + for inner_item in item: + if not inner_item < config.vocab_size: + assert inner_item < config.vocab_size, f"{inner_item=}, {tokenizer.vocab_size=}" + g_padded_response_ids = torch.tensor(g_padded_response_ids, device=device) + g_vllm_responses[:] = g_padded_response_ids + broadcast(g_vllm_responses, 0) + local_vllm_responses = g_vllm_responses[ + accelerator.local_process_index + * queries.shape[0] : (accelerator.local_process_index + 1) + * queries.shape[0] + ] + query_responses = torch.cat((queries, local_vllm_responses), 1) + for i in range(0, queries.shape[0], args.local_rollout_forward_batch_size): + query = queries[i : i + args.local_rollout_forward_batch_size] + query_response = query_responses[i : i + args.local_rollout_forward_batch_size] + response = query_response[:, context_length:] + output = forward(generation_model, query_response, tokenizer.pad_token_id) + logits = output.logits[:, context_length - 1 : -1] + logits /= args.temperature + 1e-7 + all_logprob = F.log_softmax(logits, dim=-1) + logprob = torch.gather(all_logprob, 2, response.unsqueeze(-1)).squeeze(-1) + del output, logits, all_logprob + torch.cuda.empty_cache() + + ref_output = forward(ref_model, query_response, tokenizer.pad_token_id) + ref_logits = ref_output.logits[:, context_length - 1 : -1] + ref_logits /= args.temperature + 1e-7 + ref_all_logprob = F.log_softmax(ref_logits, dim=-1) + ref_logprob = torch.gather(ref_all_logprob, 2, response.unsqueeze(-1)).squeeze(-1) + del ref_output, ref_logits, ref_all_logprob + torch.cuda.empty_cache() + + # Response Processing 1. truncate response after the first occurrence of `stop_token_id` + postprocessed_response = response + if args.stop_token_id is not None: # handle the edge case when stop_token_id exists but is 0 + postprocessed_response = truncate_response( + args.stop_token_id, tokenizer.pad_token_id, response + ) + + # Response Processing 2. 
run reward model on the truncated responses + postprocessed_query_response = torch.cat((query, postprocessed_response), 1) + sequence_length = first_true_indices(postprocessed_response == tokenizer.pad_token_id) - 1 + _, score, _ = get_reward( + reward_model, postprocessed_query_response, tokenizer.pad_token_id, context_length + ) + unwrapped_value_model = accelerator.unwrap_model(model).value_model + full_value, _, _ = get_reward( + unwrapped_value_model, query_response, tokenizer.pad_token_id, context_length + ) + value = full_value[:, context_length - 1 : -1].squeeze(-1) + + responses.append(response) + postprocessed_responses.append(postprocessed_response) + logprobs.append(logprob) + ref_logprobs.append(ref_logprob) + sequence_lengths.append(sequence_length) + scores.append(score) + values.append(value) + responses = torch.cat(responses, 0) + postprocessed_responses = torch.cat(postprocessed_responses, 0) + logprobs = torch.cat(logprobs, 0) + ref_logprobs = torch.cat(ref_logprobs, 0) + sequence_lengths = torch.cat(sequence_lengths, 0) + scores = torch.cat(scores, 0) + global_scores = accelerator.gather(scores) + accelerator.print(f"global_scores: {global_scores}, {global_scores.mean()}") + values = torch.cat(values, 0) + del (logprob, ref_logprob, full_value, value, score) + gc.collect() + torch.cuda.empty_cache() + + # Response Processing 3. filter response. Ensure that the sample contains stop_token_id + # responses not passing that filter will receive a low (fixed) score + # only query humans on responses that pass that filter + contain_stop_token = torch.any(postprocessed_responses == args.stop_token_id, dim=-1) + # NOTE: only apply the stop token filter if the response is long enough + # otherwise the model could learn to generate the first token as the stop token + contain_stop_token = contain_stop_token & (sequence_lengths >= args.min_response_length) + if args.non_stop_penalty: + scores = torch.where( + contain_stop_token, scores, torch.full_like(scores, args.penalty_reward_value) + ) + + # be very careful with `padding_mask_p1`; see https://excalidraw.com/#json=LWnzG4w2k5DjF_EOL_xPt,e2w3a-hFJ_gX5vOfeyXGTw + response_idxs = torch.arange(responses.shape[1], device=responses.device).repeat(responses.shape[0], 1) + padding_mask = response_idxs > sequence_lengths.unsqueeze(1) + logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB) + ref_logprobs = torch.masked_fill(ref_logprobs, padding_mask, INVALID_LOGPROB) + sequence_lengths_p1 = sequence_lengths + 1 + padding_mask_p1 = response_idxs > (sequence_lengths_p1.unsqueeze(1)) + values = torch.masked_fill(values, padding_mask_p1, 0) + + # 4. compute rewards + kl1 = logprobs - ref_logprobs + kl2 = (kl1) ** 2 / 2 + kl3 = (-kl1).exp() - 1 + kl1 + if args.kl_estimator == "kl1": + kl = kl1 + elif args.kl_estimator == "kl2": + kl = kl2 + elif args.kl_estimator == "kl3": + kl = kl3 + print(f"{accelerator.local_process_index=}, {kl.sum(1)=}") + non_score_reward = -args.beta * kl + non_score_reward_sum = non_score_reward.sum(1) + rlhf_reward = scores + non_score_reward_sum + rewards = non_score_reward.clone() + actual_start = torch.arange(rewards.size(0), device=rewards.device) + actual_end = torch.where(sequence_lengths_p1 < rewards.size(1), sequence_lengths_p1, sequence_lengths) + rewards[[actual_start, actual_end]] += scores + + # 5. whiten rewards + if args.whiten_rewards: + rewards = masked_whiten(rewards, mask=~padding_mask_p1, shift_mean=False) + rewards = torch.masked_fill(rewards, padding_mask_p1, 0) + + # 6. 
compute advantages and returns + lastgaelam = 0 + advantages_reversed = [] + gen_length = responses.shape[1] + for t in reversed(range(gen_length)): + nextvalues = values[:, t + 1] if t < gen_length - 1 else 0.0 + delta = rewards[:, t] + args.gamma * nextvalues - values[:, t] + lastgaelam = delta + args.gamma * args.lam * lastgaelam + advantages_reversed.append(lastgaelam) + advantages = torch.stack(advantages_reversed[::-1], axis=1) + returns = advantages + values + advantages = masked_whiten(advantages, ~padding_mask) + advantages = torch.masked_fill(advantages, padding_mask, 0) + torch.cuda.empty_cache() + + # Do multiple epochs of training on on-policy data (PPO-style), with a fresh random shuffle in each epoch + for epoch_idx in range(args.num_epochs): + b_inds = np.random.permutation(args.local_batch_size) + minibatch_idx = 0 + for mini_batch_start in range(0, args.local_batch_size, args.local_mini_batch_size): + mini_batch_end = mini_batch_start + args.local_mini_batch_size + mini_batch_inds = b_inds[mini_batch_start:mini_batch_end] + gradient_accumulation_idx = 0 + for micro_batch_start in range(0, args.local_mini_batch_size, args.per_device_train_batch_size): + with accelerator.accumulate(model): + micro_batch_end = micro_batch_start + args.per_device_train_batch_size + micro_batch_inds = mini_batch_inds[micro_batch_start:micro_batch_end] + mb_advantage = advantages[micro_batch_inds] + mb_responses = responses[micro_batch_inds] + mb_query_responses = query_responses[micro_batch_inds] + mb_logprobs = logprobs[micro_batch_inds] + mb_return = returns[micro_batch_inds] + mb_values = values[micro_batch_inds] + + output, vpred_temp = forward(model, mb_query_responses, tokenizer.pad_token_id) + logits = output.logits[:, context_length - 1 : -1] + logits /= args.temperature + 1e-7 + new_all_logprobs = F.log_softmax(logits, dim=-1) + new_logprobs = torch.gather(new_all_logprobs, 2, mb_responses.unsqueeze(-1)).squeeze(-1) + new_logprobs = torch.masked_fill(new_logprobs, padding_mask[micro_batch_inds], INVALID_LOGPROB) + vpred = vpred_temp[:, context_length - 1 : -1].squeeze(-1) + vpred = torch.masked_fill(vpred, padding_mask_p1[micro_batch_inds], 0) + vpredclipped = torch.clamp( + vpred, + mb_values - args.cliprange_value, + mb_values + args.cliprange_value, + ) + vf_losses1 = torch.square(vpred - mb_return) + vf_losses2 = torch.square(vpredclipped - mb_return) + vf_loss_max = torch.max(vf_losses1, vf_losses2) + vf_loss = 0.5 * masked_mean(vf_loss_max, ~padding_mask_p1[micro_batch_inds]) + logprobs_diff = new_logprobs - mb_logprobs + ratio = torch.exp(logprobs_diff) + pg_losses = -mb_advantage * ratio + pg_losses2 = -mb_advantage * torch.clamp(ratio, 1.0 - args.cliprange, 1.0 + args.cliprange) + pg_loss_max = torch.max(pg_losses, pg_losses2) + pg_loss = masked_mean(pg_loss_max, ~padding_mask[micro_batch_inds]) + loss = pg_loss + args.vf_coef * vf_loss + + accelerator.backward(loss) + optimizer.step() + optimizer.zero_grad() + with torch.no_grad(): + pg_clipfrac = masked_mean( + (pg_losses2 > pg_losses).float(), ~padding_mask[micro_batch_inds] + ) + vf_clipfrac = masked_mean( + (vf_losses2 > vf_losses1).float(), ~padding_mask_p1[micro_batch_inds] + ) + prob_dist = torch.nn.functional.softmax(logits, dim=-1) + entropy = torch.logsumexp(logits, dim=-1) - torch.sum(prob_dist * logits, dim=-1) + approxkl = 0.5 * (logprobs_diff**2).mean() + approxkl_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = approxkl + pg_clipfrac_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = 
pg_clipfrac + pg_loss_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = pg_loss + vf_loss_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = vf_loss + vf_clipfrac_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = vf_clipfrac + entropy_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = entropy.mean() + ratio_stats[epoch_idx, minibatch_idx, gradient_accumulation_idx] = ratio.mean() + + gradient_accumulation_idx += 1 + minibatch_idx += 1 + # fmt: off + del ( + output, vpred_temp, logits, new_all_logprobs, new_logprobs, vpred, vpredclipped, + vf_losses1, vf_losses2, vf_loss, vf_clipfrac, logprobs_diff, ratio, pg_losses, pg_losses2, pg_loss_max, + pg_loss, loss, pg_clipfrac, prob_dist, entropy, approxkl, mb_return, + mb_advantage, mb_values, mb_responses, mb_query_responses, mb_logprobs, + ) + # fmt: on + # del everything and empty cache + torch.cuda.empty_cache() + with torch.no_grad(): + local_metrics[0] = sequence_lengths.float().mean() + local_metrics[1] = (responses == args.stop_token_id).sum().float().mean() + local_metrics[2] = kl.sum(1).mean() + local_metrics[3] = (-logprobs).sum(1).mean() + local_metrics[4] = non_score_reward_sum.mean() + local_metrics[5] = rlhf_reward.mean() + local_metrics[6] = scores.mean() + local_metrics[7] = approxkl_stats.mean() + local_metrics[8] = pg_clipfrac_stats.mean() + local_metrics[9] = pg_loss_stats.mean() + local_metrics[10] = vf_loss_stats.mean() + local_metrics[11] = vf_clipfrac_stats.mean() + local_metrics[12] = entropy_stats.mean() + local_metrics[13] = ratio_stats.mean() + local_metrics[14] = ratio_stats.var() + local_metrics[15] = ((kl) ** 2 / 2).sum(1).mean() + local_metrics[16] = ((-kl).exp() - 1 + kl).sum(1).mean() + global_metrics = accelerator.reduce(local_metrics, reduction="mean").tolist() + metrics = { + "episode": episode, + "training_step": training_step, + "lr": scheduler.get_last_lr()[0], + "epoch": episode / len(train_dataset), + "time/from_scratch": time.time() - start_time, + "time/training": time.time() - training_time_start, + "val/sequence_lengths": global_metrics[0], + "val/num_stop_token_ids": global_metrics[1], + "objective/kl": global_metrics[2], + "objective/kl2": global_metrics[15], + "ojbective/kl3": global_metrics[16], + "objective/entropy": global_metrics[3], + "objective/non_score_reward": global_metrics[4], + "objective/rlhf_reward": global_metrics[5], + "objective/scores": global_metrics[6], + "policy/approxkl_avg": global_metrics[7], + "policy/clipfrac_avg": global_metrics[8], + "loss/policy_avg": global_metrics[9], + "loss/value_avg": global_metrics[10], + "val/clipfrac_avg": global_metrics[11], + "policy/entropy_avg": global_metrics[12], + "val/ratio": global_metrics[13], + "val/ratio_var": global_metrics[14], + } + if accelerator.is_main_process: + print_rich_single_line_metrics(metrics) + for key, value in metrics.items(): + writer.add_scalar(key, value, episode) + del (queries, responses, postprocessed_responses, logprobs, ref_logprobs, sequence_lengths, scores) + del (metrics, kl, non_score_reward, rlhf_reward) + gc.collect() + torch.cuda.empty_cache() + + if not ph.preemptied: + # save model + os.makedirs(os.path.dirname(args.output_dir), exist_ok=True) + original_tokenizer = AutoTokenizer.from_pretrained( + model_config.model_name_or_path, revision=model_config.model_revision + ) + save_with_accelerate( + accelerator, + model, + original_tokenizer, + args.output_dir, + model_attribute_to_save="policy", + ) + + # Ai2 specific logic + if is_beaker_job() and 
accelerator.is_main_process: + if args.hf_metadata_dataset: + dataset_list = list(args.dataset_mixer_dict.keys()) + # mainly just focussing here on what would be useful for the leaderboard. + # wandb will have even more useful information. + metadata_blob = { + "model_name": args.exp_name, + "model_type": "sft", + "datasets": dataset_list, + "base_model": model_config.model_name_or_path, + "wandb_path": wandb.run.get_url(), + "beaker_experiment": beaker_config.beaker_experiment_url, + "beaker_datasets": beaker_config.beaker_dataset_id_urls, + } + upload_metadata_to_hf( + metadata_blob, + "metadata.json", + args.hf_metadata_dataset, + "results/" + args.hf_repo_revision, # to match what the auto-evals name as. + ) + + if args.try_launch_beaker_eval_jobs and len(beaker_config.beaker_dataset_id_urls) > 0: + command = f"""\ + python mason.py \ + --cluster ai2/allennlp-cirrascale ai2/general-cirrascale-a5000 ai2/general-cirrascale-a5000 ai2/s2-cirrascale ai2/general-cirrascale \ + --priority low \ + --preemptible \ + --budget ai2/allennlp \ + --workspace ai2/tulu-2-improvements \ + --image nathanl/open_instruct_auto \ + --pure_docker_mode \ + --gpus 0 -- python scripts/wait_beaker_dataset_model_upload_then_evaluate_model.py \ + --beaker_workload_id {beaker_config.beaker_workload_id} \ + --model_name {args.hf_repo_revision} + """ + process = subprocess.Popen(["bash", "-c", command], stdout=subprocess.PIPE, stderr=subprocess.PIPE) + stdout, stderr = process.communicate() + print(f"Submit jobs after model training is finished - Stdout:\n{stdout.decode()}") + print(f"Submit jobs after model training is finished - Stderr:\n{stderr.decode()}") + print(f"Submit jobs after model training is finished - process return code: {process.returncode}") + + if args.push_to_hub: + push_folder_to_hub( + accelerator, + args.output_dir, + args.hf_repo_id, + args.hf_repo_revision, + ) + + if accelerator.is_main_process: + # remove args.checkpoint_output_dir + if os.path.exists(args.checkpoint_output_dir): + shutil.rmtree(args.checkpoint_output_dir, ignore_errors=True) + + +if __name__ == "__main__": + parser = ArgumentParserPlus((Args, DatasetConfig, ModelConfig)) + main(*parser.parse()) diff --git a/open_instruct/reward_modeling.py b/open_instruct/reward_modeling.py index bd9e3108e..633e80cb8 100644 --- a/open_instruct/reward_modeling.py +++ b/open_instruct/reward_modeling.py @@ -19,6 +19,7 @@ from torch.utils.data import DataLoader from torch.utils.tensorboard import SummaryWriter from transformers import ( + AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, PreTrainedModel, @@ -176,7 +177,7 @@ def calculate_runtime_args_and_accelerator(args: Args, model_config: ModelConfig args.hf_repo_revision = args.run_name args.hf_repo_url = f"https://huggingface.co/{args.hf_repo_id}/tree/{args.hf_repo_revision}" - if args.with_tracking: + if args.with_tracking and accelerator.is_main_process: if args.wandb_entity is None: args.wandb_entity = maybe_use_ai2_wandb_entity() return accelerator @@ -192,15 +193,16 @@ def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): local_seed = args.seed + accelerator.process_index # set up experiment tracking and seeds + all_configs = {} + if is_beaker_job(): + args.checkpoint_output_dir = os.environ.get("CHECKPOINT_OUTPUT_DIR", args.output_dir) + beaker_config = maybe_get_beaker_config() + # try saving to the beaker `/output`, which will be uploaded to the beaker dataset + if len(beaker_config.beaker_dataset_id_urls) > 0: + args.output_dir = "/output" + 
all_configs.update(vars(beaker_config)) + all_configs.update(**asdict(args), **asdict(dataset_config), **asdict(model_config)) if accelerator.is_main_process: - all_configs = {**asdict(args), **asdict(dataset_config), **asdict(model_config)} - if is_beaker_job(): - beaker_config = maybe_get_beaker_config() - # try saving to the beaker `/output`, which will be uploaded to the beaker dataset - if len(beaker_config.beaker_dataset_id_urls) > 0: - args.output_dir = "/output" - all_configs.update(vars(beaker_config)) - if args.with_tracking: import wandb @@ -225,8 +227,14 @@ def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): torch.backends.cudnn.deterministic = True # create a tokenizer (pad from right) - tokenizer = AutoTokenizer.from_pretrained(model_config.model_name_or_path, padding_side="right") - tokenizer.add_special_tokens({"pad_token": "[PAD]"}) # NOTE: we do not resize the embedding + config = AutoConfig.from_pretrained(model_config.model_name_or_path, revision=model_config.model_revision) + tokenizer = AutoTokenizer.from_pretrained( + model_config.model_name_or_path, revision=model_config.model_revision, padding_side="right" + ) + if config.architectures == "LlamaForCausalLM" and config.bos_token_id == 128000: + tokenizer.pad_token_id = 128002 # <|reserved_special_token_0|> + else: + tokenizer.add_special_tokens({"pad_token": "[PAD]"}) # NOTE: we do not resize the embedding tokenizer.chat_template = CHAT_TEMPLATES[dataset_config.chat_template] # create the dataset @@ -235,7 +243,7 @@ def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): train_dataset = combine_dataset( args.dataset_mixer_dict, splits=args.dataset_train_splits, - columns_to_keep=["chosen", "rejected"], + columns_to_keep=[dataset_config.preference_chosen_key, dataset_config.preference_rejected_key], ) if dataset_config.sanity_check: train_dataset = train_dataset.select( @@ -250,7 +258,7 @@ def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): eval_dataset = combine_dataset( args.dataset_eval_mixer_dict, splits=args.dataset_eval_splits, - columns_to_keep=["chosen", "rejected"], + columns_to_keep=[dataset_config.preference_chosen_key, dataset_config.preference_rejected_key], ) eval_dataset = eval_dataset.select(range(0, min(len(eval_dataset), dataset_config.sanity_check_max_samples))) with accelerator.main_process_first(): @@ -297,7 +305,6 @@ def main(args: Args, dataset_config: DatasetConfig, model_config: ModelConfig): shuffle=True, collate_fn=data_collator, ) - eval_dataloader = DataLoader( eval_dataset, batch_size=args.per_device_eval_batch_size, diff --git a/open_instruct/vllm_utils.py b/open_instruct/vllm_utils.py new file mode 100644 index 000000000..ecb50287c --- /dev/null +++ b/open_instruct/vllm_utils.py @@ -0,0 +1,157 @@ +# Taken and modified from https://github.com/huggingface/trl +# Copyright 2024 The AllenAI Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
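+# NOTE: everything below reaches into vLLM internals (vllm.distributed.parallel_state
+# and GPUExecutor), so this module is tied to the specific vLLM build it was written
+# against; test.sh installs the pinned fork vwxyzjn/vllm@costa-single-gpu-fix, and
+# other vLLM releases may require adjustments to these patches.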
+ +"""This file basically allows us to place vLLM's driver worker in a specified +GPU. For example. you can try the following. + +```python +from transformers import AutoTokenizer +from vllm import SamplingParams +from open_instruct.vllm_utils import SingleGPULLM + + +tok = AutoTokenizer.from_pretrained("facebook/opt-125m") +tok.chat_template = ( + "{% for message in messages %}" + "{{'\n\n' if not loop.first else ''}}" + "{{message['role']|capitalize + ': ' +message['content']}}" + "{% if loop.last and not add_generation_prompt %}{{ eos_token }}{% endif %}" + "{% endfor %}" +) +prompts = [ + {"role": "user", "content": "Compose a speech about the need for more affordable dental care."}, +] + +prompt_ids = tok.apply_chat_template(prompts, add_generation_prompt=True) +sampling_params = SamplingParams(temperature=0.001, top_p=1.0, max_tokens=1024, include_stop_str_in_output=True) + +llm = SingleGPULLM(model="facebook/opt-125m", tensor_parallel_size=1, device="cuda:1") +llmp = llm.llm_engine.model_executor.driver_worker.model_runner.model +print(f"🔥🔥🔥 vllm lives in {llmp.lm_head.weight.device}") +print("prepare to generate") +outputs = llm.generate(prompt_token_ids=[prompt_ids], sampling_params=sampling_params) +for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") +``` +""" + + +from typing import List, Optional + +import torch +import vllm +from vllm.distributed.parallel_state import ( + GroupCoordinator, + get_world_group, + init_model_parallel_group, +) +from vllm.executor.gpu_executor import GPUExecutor + + +def custom_initialize_model_parallel( + tensor_model_parallel_size: int = 1, + pipeline_model_parallel_size: int = 1, + backend: Optional[str] = None, +) -> None: + """ + Initialize model parallel groups. + + Arguments: + tensor_model_parallel_size: number of GPUs used for tensor model + parallelism. + pipeline_model_parallel_size: number of GPUs used for pipeline model + parallelism. + + Let's say we have a total of 8 GPUs denoted by g0 ... g7 and we + use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize + the model pipeline. The present function will + create 4 tensor model-parallel groups and 2 pipeline model-parallel groups: + 4 tensor model-parallel groups: + [g0, g1], [g2, g3], [g4, g5], [g6, g7] + 2 pipeline model-parallel groups: + [g0, g2, g4, g6], [g1, g3, g5, g7] + Note that for efficiency, the caller should make sure adjacent ranks + are on the same DGX box. For example if we are using 2 DGX-1 boxes + with a total of 16 GPUs, rank 0 to 7 belong to the first box and + ranks 8 to 15 belong to the second box. + """ + # Get world size and rank. Ensure some consistencies. + assert torch.distributed.is_initialized() + world_size: int = torch.distributed.get_world_size() + world_size: int = 1 # SingleGPULLM logic: only use a single GPU + backend = backend or torch.distributed.get_backend(get_world_group().device_group) + + if world_size != tensor_model_parallel_size * pipeline_model_parallel_size: + raise RuntimeError( + f"world_size ({world_size}) is not equal to " + f"tensor_model_parallel_size ({tensor_model_parallel_size}) x " + f"pipeline_model_parallel_size ({pipeline_model_parallel_size})" + ) + + # Build the tensor model-parallel groups. 
+ num_tensor_model_parallel_groups: int = world_size // tensor_model_parallel_size + # global _TP + assert vllm.distributed.parallel_state._TP is None, "tensor model parallel group is already initialized" + group_ranks = [] + for i in range(num_tensor_model_parallel_groups): + ranks = list(range(i * tensor_model_parallel_size, (i + 1) * tensor_model_parallel_size)) + group_ranks.append(ranks) + + # message queue broadcaster is only used in tensor model parallel group + vllm.distributed.parallel_state._TP = init_model_parallel_group( + group_ranks, get_world_group().local_rank, backend, use_message_queue_broadcaster=True + ) + + # Build the pipeline model-parallel groups. + num_pipeline_model_parallel_groups: int = world_size // pipeline_model_parallel_size + # global _PP + assert vllm.distributed.parallel_state._PP is None, "pipeline model parallel group is already initialized" + group_ranks = [] + for i in range(num_pipeline_model_parallel_groups): + ranks = list(range(i, world_size, num_pipeline_model_parallel_groups)) + group_ranks.append(ranks) + # pipeline parallel does not need custom allreduce + vllm.distributed.parallel_state._PP = init_model_parallel_group( + group_ranks, get_world_group().local_rank, backend, use_custom_allreduce=False + ) + + +def init_world_group(ranks: List[int], local_rank: int, backend: str) -> GroupCoordinator: + return GroupCoordinator( + group_ranks=[[0]], # SingleGPULLM logic: only use a single GPU + local_rank=local_rank, + torch_distributed_backend=backend, + use_pynccl=False, + use_custom_allreduce=False, + use_tpu_communicator=False, + ) + + +def _init_executor(self) -> None: + """Initialize the worker and load the model.""" + assert self.parallel_config.world_size == 1, "GPUExecutor only supports single GPU." 
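+    # Core of the single-GPU placement trick described in the module docstring: the
+    # driver worker is created on self.device_config.device.index (the device the
+    # caller requested, e.g. --vllm_device cuda:7 in the training scripts), so
+    # generation does not have to share a GPU with the training ranks.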
+ + self.driver_worker = self._create_worker(local_rank=self.device_config.device.index) + self.driver_worker.init_device() + self.driver_worker.load_model() + + +# monkey patch the function +def vllm_single_gpu_patch(): + vllm.distributed.parallel_state.init_world_group = init_world_group + vllm.distributed.parallel_state.initialize_model_parallel = custom_initialize_model_parallel + GPUExecutor._init_executor = _init_executor diff --git a/scripts/rejection_sampling_tulu_docker.bash b/scripts/rejection_sampling_tulu_docker.bash index 7ed96082b..99a1d2d49 100644 --- a/scripts/rejection_sampling_tulu_docker.bash +++ b/scripts/rejection_sampling_tulu_docker.bash @@ -76,7 +76,6 @@ if [ "$on_jupyter" = true ]; then --pure_docker_mode \ --priority low \ --preemptible \ - --no_mount_nfs --no_hf_cache_env \ --budget ai2/allennlp \ --gpus $num_gpus -- $command else diff --git a/test.sh b/test.sh new file mode 100644 index 000000000..f38fb8fb8 --- /dev/null +++ b/test.sh @@ -0,0 +1,680 @@ +beaker session create \ + --gpus 1 \ + --budget ai2/allennlp \ + --workdir $PWD \ + --image beaker://costah/open_instruct_onlinedpo1 \ + --priority normal \ + --workspace ai2/costah +beaker session create \ + --gpus 1 \ + --budget ai2/allennlp \ + --workdir $PWD \ + --image beaker://costah/open_instruct_dev_uv \ + --priority normal \ + --workspace ai2/costah + + +beaker session create \ + --gpus 3 \ + --budget ai2/allennlp \ + --workdir $PWD \ + --image beaker://ai2/cuda11.8-cudnn8-dev-ubuntu20.04 \ + --priority normal \ + --workspace ai2/costah + +beaker session create \ + --gpus 1 \ + --budget ai2/allennlp \ + --bare \ + --image beaker://costah/open_instruct_onlinedpo \ + --priority normal \ + --workspace ai2/costah + + +accelerate launch --num_processes 2 open_instruct/online_dpo_vllm.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 \ + --vllm_device cuda:2 --sanity_check + + +accelerate launch --num_processes 2 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tulu3 \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:2 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 100000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 512 \ + --with_tracking \ + --push_to_hub \ + + + +accelerate launch --num_processes 1 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + 
open_instruct/online_dpo_vllm.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tulu3 \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 4 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:1 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 100000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 512 \ + --with_tracking \ + --push_to_hub \ + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tulu3 \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 4 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 200000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 512 \ + --with_tracking \ + --push_to_hub \ + + + + + + +python open_instruct/online_dpo_vllm.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 8 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 + + +accelerate launch --num_processes 2 open_instruct/online_dpo_vllm_thread.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 8 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 --vllm_device cuda:2 --sanity_check + + +python open_instruct/online_dpo_vllm_thread.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split 
validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 + +python open_instruct/online_dpo_vllm.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 + + + + +pip install git+https://github.com/vwxyzjn/vllm.git@costa-single-gpu-fix + +docker build --build-arg CUDA=12.1.0 --build-arg TARGET=cudnn8-devel --build-arg DIST=ubuntu20.04 --build-arg REQUIRE=requirements.txt . -t open_instruct_onlinedpo2 +beaker image delete $(whoami)/open_instruct_onlinedpo2 +beaker image create open_instruct_onlinedpo2 -n open_instruct_onlinedpo2 -w ai2/$(whoami) + + +accelerate launch --num_processes 2 open_instruct/online_dpo_vllm.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_tldr \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 10000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 \ + --vllm_device cuda:2 --sanity_check --with_tracking + +accelerate launch --num_processes 2 open_instruct/ppo_vllm.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/ppo_tldr \ + --per_device_train_batch_size 16 \ + --gradient_accumulation_steps 2 \ + --local_rollout_forward_batch_size 16 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 10000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 \ + --vllm_device cuda:2 --sanity_check + + +accelerate launch --num_processes 2 open_instruct/online_dpo_vllm_thread.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --learning_rate 3e-6 \ + --output_dir models/minimal/ppo_tldr \ + --per_device_train_batch_size 16 \ + --gradient_accumulation_steps 2 \ + --local_rollout_forward_batch_size 16 \ + --num_epochs 
1 \ + --num_mini_batches 1 \ + --total_episodes 10000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --num_evals 10 \ + --response_length 53 \ + --vllm_device cuda:2 --sanity_check + +accelerate launch --num_processes 2 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir models/minimal/online_dpo_tulu2_llama333 \ + --chat_template tulu \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 4 \ + --vllm_device cuda:2 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 10 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub --sanity_check + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir models/minimal/online_dpo_tulu2_llama333 \ + --chat_template tulu \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 2000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 10 \ + --response_length 1024 \ + --gradient_checkpointing --sanity_check \ + +g = AutoModelForSequenceClassification.from_pretrained("allenai/llama-3-tulu-2-8b", num_labels=1) + +accelerate launch --num_processes 2 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir models/minimal/online_dpo_tulu2_llama333 \ + --chat_template tulu \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:2 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000 \ + --model_name_or_path vwxyzjn/btulu \ + --reward_model_path allenai/llama-3.1-tulu-2-8b-uf-mean-rm \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 10 \ + --response_length 1024 \ + --gradient_checkpointing --with_tracking + 
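+# Rough arithmetic for the launch above (illustrative): 2 training ranks
+# x per_device_train_batch_size 2 x gradient_accumulation_steps 32 = 128 prompts
+# per update, with cuda:2 kept out of training and dedicated to vLLM generation.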
+ +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 5e-7 \ + --output_dir models/minimal/online_dpo_tulu2_llama333 \ + --chat_template simple_concat_with_space \ + --per_device_train_batch_size 64 \ + --per_device_eval_batch_size 64 \ + --gradient_accumulation_steps 1 \ + --local_rollout_forward_batch_size 64 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 10000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 10 \ + --response_length 53 \ + --gradient_checkpointing --with_tracking + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/ppo_vllm_thread.py \ + --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \ + --dataset_train_split train \ + --dataset_eval_split validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 5e-7 \ + --output_dir models/minimal/online_dpo_tulu2_llama333 \ + --chat_template simple_concat_with_space \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.03 \ + --num_evals 10 \ + --response_length 53 \ + --gradient_checkpointing + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 2 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 300000 \ + --model_name_or_path vwxyzjn/btulu \ + --reward_model_path allenai/llama-3.1-tulu-2-8b-uf-mean-rm \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.04 \ + --num_evals 1 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml ds3.py + + + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + 
--output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 4 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 200000 \ + --model_name_or_path OLMoE/OLMoE-1B-7B-0824-SFT \ + --reward_model_path allenai/llama-3.1-tulu-2-8b-uf-mean-rm \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.05 \ + --num_evals 1 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub + + +accelerate launch --num_processes 8 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo.py \ + --dataset_name allenai/ultrafeedback_binarized_cleaned \ + --dataset_train_split train_prefs \ + --dataset_eval_split test_prefs \ + --max_token_length 512 \ + --max_prompt_token_lenth 256 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --gradient_accumulation_steps 16 \ + --local_rollout_forward_batch_size 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 200000 \ + --model_name_or_path OLMoE/OLMoE-1B-7B-0824-SFT \ + --reward_model_path allenai/llama-3.1-tulu-2-8b-uf-mean-rm \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.05 \ + --num_evals 1 \ + --response_length 512 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub + + +accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --exp_name "online_dpo_vllm_thread_beta_${beta}" \ + --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --dataset_train_splits train_prefs \ + --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \ + --dataset_eval_splits test_prefs \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --sft_messages_key chosen \ + --learning_rate 5e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --vllm_device cuda:7 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 300000 \ + --model_name_or_path allenai/llama-3-tulu-2-8b \ + --reward_model_path allenai/reward_modeling__allenai_llama-3-tulu-2-8b_ultrafeedback \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta $beta \ + --num_evals 1 \ + --response_length 1024 \ + --gradient_checkpointing \ + --with_tracking \ + --push_to_hub + + +python open_instruct/online_dpo_vllm_thread.py \ + --exp_name "online_dpo_vllm_thread_beta" \ + --dataset_mixer '{"HuggingFaceH4/no_robots": 1.0}' \ + --dataset_train_splits train \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 5e-7 \ + --output_dir /output/ \ + --chat_template tulu \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --no_async_mode \ + --gradient_accumulation_steps 32 \ + --local_rollout_forward_batch_size 1 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 300000 \ + --model_name_or_path allenai/open_instruct_dev \ + --model_revision costa_finetune_tulu3_8b_norobot__meta-llama_Meta-Llama-3.1-8B__42__1725559869 \ + --reward_model_path vwxyzjn/reward_modeling__allenai_llama-3-tulu-2-8b \ + 
--reward_model_revision reward_modeling__1__1725631368 \ + --non_stop_penalty \ + --stop_token eos \ + --penalty_reward_value -10.0 \ + --beta 0.05 \ + --num_evals 1 \ + --response_length 1024 \ + --gradient_checkpointing \ + --vllm_device cuda:1 \ + --with_tracking \ + + +python mason.py \ + --cluster ai2/pluto-cirrascale ai2/prior-cirrascale ai2/s2-cirrascale ai2/general-cirrascale \ + --priority normal \ + --resumable \ + --budget ai2/allennlp \ + --gpus 8 -- accelerate launch --num_processes 7 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_vllm_thread_tldr \ + --per_device_train_batch_size 2 \ + --local_rollout_forward_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-6.9b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-6.9b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --response_length 53 \ + --with_tracking \ + --push_to_hub \ + --vllm_device cuda:7 \ + + +accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \ + open_instruct/online_dpo_vllm_thread.py \ + --dataset_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_train_splits train \ + --dataset_eval_mixer '{"trl-internal-testing/tldr-preference-sft-trl-style": 1.0}' \ + --dataset_eval_splits validation \ + --max_token_length 1024 \ + --max_prompt_token_lenth 512 \ + --learning_rate 3e-6 \ + --output_dir models/minimal/online_dpo_vllm_thread_tldr \ + --per_device_train_batch_size 2 \ + --local_rollout_forward_batch_size 4 \ + --gradient_accumulation_steps 4 \ + --num_epochs 1 \ + --num_mini_batches 1 \ + --total_episodes 1000000 \ + --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \ + --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \ + --non_stop_penalty \ + --stop_token eos \ + --beta 0.1 \ + --response_length 53 \ + --with_tracking \ + --push_to_hub \ + --vllm_device cuda:3 \ \ No newline at end of file