Replicating training / test split on models #191

Open
Uzay-G opened this issue Nov 21, 2022 · 4 comments

Comments

Uzay-G commented Nov 21, 2022

Hello,
We are running some experiments on the Mistral models, and it would be useful to know how the OpenWebText train/test split was done when the models were trained. That would allow us to replicate the split and evaluate the models on OpenWebText without leakage.
Thanks for your help.

J38 (Contributor) commented Nov 30, 2022

@siddk would know better, but my first guess is that the code in auto.py performed the split, so ultimately the HF Datasets method train_test_split with a validation ratio of 0.0005. I'm guessing this was done once and every random-seed experiment used the same split, but I'm not sure which random seed was used for the initial data processing.

Code in Mistral:

```python
if "validation" not in dataset:
```

Code in HF Datasets:

https://github.com/huggingface/datasets/blob/6d247bd4fd76b45998747ecc3367daab5f5e0b82/src/datasets/arrow_dataset.py#L3645
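
Putting those pieces together, the split presumably amounted to something like the minimal sketch below. The dataset identifier and the seed value are placeholders rather than confirmed details; only the 0.0005 validation ratio is documented:

```python
from datasets import load_dataset

# OpenWebText ships as a single "train" split on the HF hub.
dataset = load_dataset("openwebtext")

if "validation" not in dataset:
    # Carve a held-out set from the training data.
    # test_size matches the 0.0005 validation ratio mentioned above;
    # the seed is a placeholder -- which seed was actually used is
    # exactly the open question in the comments below.
    split = dataset["train"].train_test_split(test_size=0.0005, seed=42)
    dataset["train"] = split["train"]
    dataset["validation"] = split["test"]
```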

J38 (Contributor) commented Nov 30, 2022

If I had to guess, I would assume it was done with seed=42, but that could certainly be wrong. I just note that 42 is the default seed when no seed is specified.

J38 (Contributor) commented Nov 30, 2022

Honestly, I am really unclear on what random seed was used for the data preprocessing, which makes it difficult to perfectly replicate the data split.

J38 (Contributor) commented Nov 30, 2022

Here are some more details from @siddk:

- All OpenWebText data was processed once, via a call to get_auto_dataset (https://github.com/stanford-crfm/mistral/blob/main/src/corpora/auto.py#L94) using the first model's config (alias-gpt2-small) with seed = 21.
- This all happened on a single node; the remaining part of the config that matters is here: https://github.com/stanford-crfm/mistral/blob/main/conf/datasets/openwebtext.yaml.
- Basically: 64 workers for training, 4 workers for eval, and a validation ratio of 0.0005.
- If you just run train.py from a single process and point it at the mistral-small.yaml config, it should be equivalent (see the sketch below).
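
For anyone wanting to reproduce the split outside of Mistral's train.py, a minimal sketch using the values confirmed above (seed 21, validation ratio 0.0005) would look like this. The "openwebtext" hub identifier is my assumption; the authoritative logic is get_auto_dataset in src/corpora/auto.py:

```python
from datasets import load_dataset

# OpenWebText ships as a single "train" split on the HF hub.
dataset = load_dataset("openwebtext")

# Values confirmed above: seed = 21, validation ratio = 0.0005.
# (The 64/4 worker counts only parallelize preprocessing and should
# not affect how the split itself is drawn.)
split = dataset["train"].train_test_split(test_size=0.0005, seed=21)
train_set, validation_set = split["train"], split["test"]
```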
