
Mistral doesn't join docs with the <|endoftext|> separator #90

Closed
nostalgebraist opened this issue Aug 31, 2021 · 2 comments


Unlike GPT-2 and other GPT-style LMs, the Mistral codebase and pretrained models do not make use of the special <|endoftext|> token.

Evidence that this is true:

  1. When prompted with this token, the pretrained models usually begin in the middle of a sentence.
  2. If I understand correctly, this line in get_auto_dataset concatenates tokenized documents without inserting anything in between them.
  • If this code was used to prepare data for the pretrained models, that would explain the behavior noted in point 1.

Was this a deliberate choice? Mistral follows GPT-2 carefully in other respects, so I'm surprised by this difference.

Also, concatenating documents without a separator token seems sub-optimal from a language-modeling perspective. At the boundaries between documents, it produces sudden discontinuities in style and content. The resulting dataset makes it look to the LM as if such discontinuities were a feature of natural text, which they aren't.
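
For concreteness, here's a minimal sketch of the difference I mean (not the Mistral code itself), assuming a standard Hugging Face GPT-2 tokenizer:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
docs = ["First document ends here.", "Second document starts here."]

# Roughly what the get_auto_dataset-style concatenation does now:
# token streams are chained with nothing in between.
no_separator = []
for doc in docs:
    no_separator.extend(tokenizer(doc)["input_ids"])

# GPT-2-style preparation: append <|endoftext|> (the EOS token) after each
# document so the model sees an explicit boundary.
with_separator = []
for doc in docs:
    with_separator.extend(tokenizer(doc)["input_ids"] + [tokenizer.eos_token_id])

print(tokenizer.decode(no_separator))
# First document ends here.Second document starts here.
print(tokenizer.decode(with_separator))
# First document ends here.<|endoftext|>Second document starts here.<|endoftext|>
```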

siddk (Contributor) commented Aug 31, 2021

Hi @nostalgebraist,

Thanks so much for bringing this to our attention. I think this may have been an oversight on our part, and we simply didn't include the <|endoftext|> token. We wrote this pre-processing code a while ago, consulting some of the Hugging Face examples, and I thought we had changed it since (it may be worth posting an issue about this in HF Transformers as well).

Agreed that this feels sub-optimal from an LM perspective. I've opened issue #91 to address this, and it'll get picked up by one of the development team members this week -- unless you're willing to try making the change! It'd be a great chance to join the Mistral community, and since you pointed it out, it would only be fitting for you to make the fix 🙂.

Regardless, we'll fix this prior to training new models. For the current models, we hope they're still usable! If you want to use them for unprompted generation, you could probably fine-tune them cheaply on smaller datasets with the EOS token in place (since it's already in the vocabulary), and the models should learn to incorporate it fairly quickly.
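
For example, here's a hypothetical sketch (not code from our repo) of preparing such a fine-tuning corpus with a standard Hugging Face GPT-2 tokenizer; `my_docs` is just a placeholder:

```python
from transformers import GPT2TokenizerFast

# Stand-in for the tokenizer shipped with the checkpoint you're fine-tuning.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

my_docs = ["First fine-tuning document.", "Second fine-tuning document."]

# End every document with <|endoftext|> so the model sees the separator during fine-tuning.
text = tokenizer.eos_token.join(my_docs) + tokenizer.eos_token
input_ids = tokenizer(text)["input_ids"]
# input_ids now contains tokenizer.eos_token_id (50256) at each document boundary,
# ready to be packed into fixed-length blocks for causal LM fine-tuning.
```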

Thank you again for raising this! I'll leave the issue open to resolve any remaining questions.

siddk (Contributor) commented Aug 31, 2021

Closing issue (saw that you reacted!) -- feel free to re-open if further questions arise.

@siddk siddk closed this as completed Aug 31, 2021