[WIP] Add KTOLoss #1864
Conversation
@SalmanMohammadi Should I make a full run with KTOLoss?
Hey @krammnic. Thanks so much for opening this. This is a great start. I've been thinking about how to improve the contributor experience and the contribution workflow for PRs which involve design decisions, i.e. ones that touch our existing codebase in a non-trivial way, require refactoring, or generally need some care to make sure we're exposing new features to users in the best way possible. So, before we dive into the specific implementation details, I'd like to take a step back. To prevent thrash, minimize back-and-forth, and avoid discovering design reconsiderations late in the PR, let's de-risk things by better understanding what we're doing here and agreeing on some of the necessary steps we need to see this landed. I'm going to shamelessly plug my SimPO PR (#1223) as an example of this. Would you be willing to answer some questions for me by updating the PR description?
Generally, what we're looking for is to make sure we don't find any gotchas down the line and that we have an overview of the expected changes and outcomes for this contribution. A couple of disclaimers:
Thanks for the comment! I think most of your questions are already answered in the KTOLoss docstring and in the general RLHF issue; I can duplicate them here if required. The recipe should not change significantly, mostly just swapping the loss to KTOLoss. I will check about the dataset, but it's generally pretty much the same across all the contrastive losses.
The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance.
It looks like this has been copied from the SimPO loss?
Oh, I used the SimPO docs as a template and forgot to add my own docs in this part. The other comments are about KTO.
Fixed
We can infer the main idea of the method from its limitations: KTO >= DPO, and DPO should be used when you have high-quality data; in other cases KTO is a good fit. (Actually, there are two versions of this loss, which give more control over the situation with your dataset, but they differ only slightly, so they are usually included in one loss. I will do that as a follow-up.)
torchtune/rlhf/loss/dpo.py
kl = torch.zeros(1).to(policy_chosen_logps.device)  # KL term; zero placeholder in this draft

chosen_logratios = policy_chosen_logps - reference_chosen_logps
chosen_losses = 1 - F.sigmoid(self.beta * (chosen_logratios - kl))

rejected_logratios = policy_rejected_logps - reference_rejected_logps
rejected_losses = 1 - F.sigmoid(self.beta * (kl - rejected_logratios))

# Weight desirable (chosen) and undesirable (rejected) losses separately, then stack.
losses = torch.cat(
    (
        self.desirable_weight * chosen_losses,
        self.undesirable_weight * rejected_losses,
    ),
    0,
)

chosen_rewards = self.beta * (chosen_logratios).detach()
rejected_rewards = self.beta * (rejected_logratios).detach()
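For context, a hypothetical sketch of how this loss might be constructed and used. The class name comes from the PR title and the hyperparameter names from the snippet above; the exact constructor and forward signature are assumptions, not the final API.

# Hypothetical usage sketch; the real signature may differ in the final PR.
loss_fn = KTOLoss(beta=0.1, desirable_weight=1.0, undesirable_weight=1.0)
losses, chosen_rewards, rejected_rewards = loss_fn(
    policy_chosen_logps,       # log-probs of chosen responses under the policy, shape (batch,)
    policy_rejected_logps,     # log-probs of rejected responses under the policy, shape (batch,)
    reference_chosen_logps,    # same quantities under the frozen reference model
    reference_rejected_logps,
)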
I'm not convinced that the KL calculation is correct. In TRL, the KL calculation is done differently (https://github.com/huggingface/trl/blob/main/trl/trainer/kto_trainer.py#L1100) - you can see that self.calculate_KL is True for the KTO loss (https://github.com/huggingface/trl/blob/main/trl/trainer/kto_trainer.py#L545).
This is the same in the official KTO repo https://github.com/ContextualAI/HALOs/blob/61b9ee6787c9f52c2f4578533866e1cb42f314ec/trainers.py#L836
Yes, I left this mostly as drafted (a version without KL); I will bring it in.
Hmm, to actually get kl_logprobs we need to modify concatenated_forward
I don't want to pass cfg into concatenated_forward
We could actually ignore the KL divergence for now and use the APO variant of the loss:
rejected_losses = F.sigmoid(self.beta * rejected_logratios)
We would also need to change the structure of the batch, and it would require some processing on the dataset side. Do you approve of such changes?
That would be very helpful to see! I'm also not convinced about the dataset format; I've found in the official KTO repo (https://github.com/ContextualAI/HALOs?tab=readme-ov-file) that:
So we need to think a bit about how to make sure this implementation in the DPO recipe retains parity here. Do we need to use some minor modification of a preference dataset here?
Yes, it will probably require some special handling for data loading.
Looks like it won't be difficult; we could actually create a new dataset type for this. Basically, to convert a pair dataset to the KTO format we need something like the sketch below:
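A minimal sketch of that conversion (illustrative only, not the final torchtune API; it assumes the usual prompt/chosen/rejected preference rows and mirrors the unpairing logic in TRL's data_utils):

# Turn each preference pair into two unpaired KTO rows:
# the chosen response with label=True (desirable) and the
# rejected response with label=False (undesirable).
def unpair_preference_rows(rows):
    kto_rows = []
    for row in rows:
        kto_rows.append(
            {"prompt": row["prompt"], "completion": row["chosen"], "label": True}
        )
        kto_rows.append(
            {"prompt": row["prompt"], "completion": row["rejected"], "label": False}
        )
    return kto_rows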
And obviously remove the now-unneeded columns. I'm not sure where to put such preprocessing: a new dataset type, or a new argument on the current dataset?
Would you be able to outline some of the necessary steps we need to take on the dataset side?
APO is a separate (but comparable) technique to KTO, right? Is this the right paper? It looks like it came out less than 2 months ago, and I see very few models on the Hub mentioning APO. As I mentioned in the RLHF tracker, I'm not sure we should be implementing techniques which are new and haven't been adopted by the OSS community, so I'd rather we stick with the original proposal of KTO, and I'd love to keep supporting you on this. I don't fully follow your comments on the necessary modifications needed to implement KTO; you mentioned needing to modify concatenated_forward.
and map it to the dataset.
@SalmanMohammadi Can we continue from here?
Hey @krammnic - thanks for summarizing here. It seems like we can make things simpler if we don't try to add APO too? I need a bit of time to look into your other questions, which I may not get round to until later this week. If you have an idea of how you'd like to proceed here, I'd say go for it and we can iterate if we need to. I'm interested in your thoughts on whether you think we'll need to make any changes to the recipe itself - for example, does concatenated_forward need to change? Out of curiosity, do you have a reference for this code so I can follow along?
Thanks for the answer! Yes, some changes will be required, as we need to get batch["reference_KL_logps"] and policy_KL_logps to calculate the KL divergence (I will write up what we need to do here soon). The dataset processing implementation is from TRL: https://github.com/huggingface/trl/blob/main/trl/data_utils.py
Added the dataset processing; now we need to add the correct KL divergence calculation.
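For reference, here is a sketch of the KL estimate as it appears in the TRL and HALOs implementations linked above. It is an illustration reusing names from the earlier snippet, not the final torchtune code, and it assumes the batch provides log-probs for mismatched prompt/completion pairs used only as a KL baseline:

# policy_KL_logps / reference_KL_logps: log-probs of completions paired with
# mismatched prompts, used only to estimate the KL baseline.
kl = (policy_KL_logps - reference_KL_logps).mean().detach()
kl = kl.clamp(min=0)  # clamped at zero; in distributed training it is also averaged across ranks

chosen_losses = 1 - F.sigmoid(self.beta * (chosen_logratios - kl))
rejected_losses = 1 - F.sigmoid(self.beta * (kl - rejected_logratios))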
@SalmanMohammadi Are we supporting a different ref_model in the current DPO recipe? As far as I can see, we are not.
What do you mean by a different ref model? Because we're fine-tuning with LoRA, we can just disable the LoRA adapters to recover the original reference model without having to keep a separate copy of the model in memory. The reference model should always be the base model we're fine-tuning from.
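Concretely, this is the trick the LoRA DPO recipe relies on. A sketch under the assumption that the disable_adapter context manager from torchtune.modules.peft is used, with illustrative variable names:

import torch
from torchtune.modules.peft import disable_adapter

# With LoRA adapters active, the model acts as the policy.
policy_logits = model(concatenated_input_ids)

# With adapters disabled, the same weights act as the frozen reference model,
# so no second copy of the model needs to be kept in memory.
with torch.no_grad(), disable_adapter(model):
    reference_logits = model(concatenated_input_ids)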
Actually, in TRL there is the possibility to set something different!
I'm not sure why they give this possibility either, which is why I asked. Maybe there is some special use case?
Ah, good catch! I'm not 100% sure either. In their example scripts they always initialize with the same model (https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py).
Actually, it is probably pretty important for DPO; see, for example, https://arxiv.org/pdf/2407.13709
I'm pretty sure it will be relevant for classic DPO, for example.
I would consider this a follow-up, which we might need to add pretty soon.
About this PR: I added all the required KL stuff that doesn't affect the batch. The KTO test will fail right now.
Tasks:
Context
What is the purpose of this PR? Is it to
- add a new feature
- fix a bug
- update tests and/or documentation
- other
Please link to any issues this PR addresses.
#1850
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.