Replicating training / test split on models #191

Open
Uzay-G opened this issue Nov 21, 2022 · 4 comments

Comments

Uzay-G commented Nov 21, 2022

Hello,
We are running some experiments on the Mistral models, and it would be useful to know how the OpenWebText train/test split was done when the models were trained. That would allow us to replicate the split and evaluate the models on OpenWebText without leakage.
Thanks for your help.

J38 (Contributor) commented Nov 30, 2022

@siddk would know better, but my first guess is that the code in auto.py performed the split, so ultimately the HF Datasets method train_test_split with a validation ratio of 0.0005. I'm guessing this was done once and every random-seed experiment used the same split, but I'm not sure which random seed was used for the initial data processing.

Code in Mistral:

```python
if "validation" not in dataset:
```

Code in HF Datasets:

https://github.com/huggingface/datasets/blob/6d247bd4fd76b45998747ecc3367daab5f5e0b82/src/datasets/arrow_dataset.py#L3645
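
Putting those pieces together, the split presumably amounted to something like the minimal sketch below. The dataset identifier and the seed value are placeholders rather than confirmed details; only the 0.0005 validation ratio is documented:

```python
from datasets import load_dataset

# OpenWebText ships as a single "train" split on the HF hub.
dataset = load_dataset("openwebtext")

if "validation" not in dataset:
    # Carve a held-out set from the training data.
    # test_size matches the 0.0005 validation ratio mentioned above;
    # the seed is a placeholder -- which seed was actually used is
    # exactly the open question in the comments below.
    split = dataset["train"].train_test_split(test_size=0.0005, seed=42)
    dataset["train"] = split["train"]
    dataset["validation"] = split["test"]
```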

J38 (Contributor) commented Nov 30, 2022

If I had to guess, I would assume it was done with seed=42, but that could certainly be wrong. I just note that 42 is the default seed when no seed is specified.

J38 (Contributor) commented Nov 30, 2022

Honestly, I am really unclear on what random seed was used for the data preprocessing, which makes it difficult to perfectly replicate the data split.

J38 (Contributor) commented Nov 30, 2022

Here are some more details from @siddk:

- All OpenWebText data was processed once, via a call to get_auto_dataset (https://github.com/stanford-crfm/mistral/blob/main/src/corpora/auto.py#L94) using the first model's config (alias-gpt2-small) with seed = 21.
- This all happened on a single node; the remaining part of the config that matters is here: https://github.com/stanford-crfm/mistral/blob/main/conf/datasets/openwebtext.yaml.
- Basically: 64 workers for training, 4 workers for eval, and a validation ratio of 0.0005.
- If you just run train.py from a single process and point it at the mistral-small.yaml config, it should be equivalent (see the sketch below).
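
For anyone wanting to reproduce the split outside of Mistral's train.py, a minimal sketch using the values confirmed above (seed 21, validation ratio 0.0005) would look like this. The "openwebtext" hub identifier is my assumption; the authoritative logic is get_auto_dataset in src/corpora/auto.py:

```python
from datasets import load_dataset

# OpenWebText ships as a single "train" split on the HF hub.
dataset = load_dataset("openwebtext")

# Values confirmed above: seed = 21, validation ratio = 0.0005.
# (The 64/4 worker counts only parallelize preprocessing and should
# not affect how the split itself is drawn.)
split = dataset["train"].train_test_split(test_size=0.0005, seed=21)
train_set, validation_set = split["train"], split["test"]
```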
