
Mistral doesn't join docs with the <|endoftext|> separator #90

Closed
nostalgebraist opened this issue Aug 31, 2021 · 2 comments


Unlike GPT-2 and other GPT-style LMs, the Mistral codebase and pretrained models do not make use of the special <|endoftext|> token.

Evidence that this is true:

  1. When prompted with this token, the pretrained models usually begin in the middle of a sentence.
  2. If I understand correctly, this line in get_auto_dataset concatenates tokenized documents without inserting anything in between them.
  • If this code was used to prepare data for the pretrained models, that would explain the behavior noted in point 1.

Was this a deliberate choice? Mistral follows GPT-2 carefully in other respects, so I'm surprised by this difference.

Also, concatenating documents without a separator token seems sub-optimal from a language-modeling perspective. At the boundaries between documents, it produces sudden discontinuities in style and content. The resulting dataset makes it look to the LM as if such discontinuities were a feature of natural text, which they aren't.
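
For concreteness, here's a minimal sketch of the difference I mean (not the Mistral code itself), assuming a standard Hugging Face GPT-2 tokenizer:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
docs = ["First document ends here.", "Second document starts here."]

# Roughly what the get_auto_dataset-style concatenation does now:
# token streams are chained with nothing in between.
no_separator = []
for doc in docs:
    no_separator.extend(tokenizer(doc)["input_ids"])

# GPT-2-style preparation: append <|endoftext|> (the EOS token) after each
# document so the model sees an explicit boundary.
with_separator = []
for doc in docs:
    with_separator.extend(tokenizer(doc)["input_ids"] + [tokenizer.eos_token_id])

print(tokenizer.decode(no_separator))
# First document ends here.Second document starts here.
print(tokenizer.decode(with_separator))
# First document ends here.<|endoftext|>Second document starts here.<|endoftext|>
```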

siddk (Contributor) commented Aug 31, 2021

Hi @nostalgebraist,

Thanks so much for bringing this to our attention. I think this may have been an oversight on our part, and we simply didn't include the <|endoftext|> token. We wrote this pre-processing code a while ago, consulting some of the Hugging Face examples, and I thought we had changed it since (it may be worth posting an issue about this in HF Transformers as well).

Agreed that this feels sub-optimal from an LM perspective. I've opened issue #91 to address this, and it'll get picked up by one of the development team members this week -- unless you're willing to try making the change! It'd be a great chance to join the Mistral community, and since you pointed it out, it would only be fitting for you to make the fix 🙂.

Regardless, we'll fix this prior to training new models. For the current models, we hope they're still usable! If you want to use them for unprompted generation, you could probably fine-tune them cheaply on smaller datasets with the EOS token in place (since it's already in the vocabulary), and the models should learn to incorporate it fairly quickly.
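
For example, here's a hypothetical sketch (not code from our repo) of preparing such a fine-tuning corpus with a standard Hugging Face GPT-2 tokenizer; `my_docs` is just a placeholder:

```python
from transformers import GPT2TokenizerFast

# Stand-in for the tokenizer shipped with the checkpoint you're fine-tuning.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

my_docs = ["First fine-tuning document.", "Second fine-tuning document."]

# End every document with <|endoftext|> so the model sees the separator during fine-tuning.
text = tokenizer.eos_token.join(my_docs) + tokenizer.eos_token
input_ids = tokenizer(text)["input_ids"]
# input_ids now contains tokenizer.eos_token_id (50256) at each document boundary,
# ready to be packed into fixed-length blocks for causal LM fine-tuning.
```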

Thank you again for raising this! I'll leave the issue open to resolve any remaining questions.

siddk (Contributor) commented Aug 31, 2021

Closing issue (saw that you reacted!) -- feel free to re-open if further questions arise.

@siddk siddk closed this as completed Aug 31, 2021