-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datamixer (for SFT) #187
Datamixer (for SFT) #187
Conversation
* add yaml files for safety data * update max_seq_length * update max_seq_length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, three questions/requests in addition to my handful of comments --
- Are we outputting a single jsonl file with the mix created by the data mixer yet? Imo this is important for consistency
- Is there an easy way to use this to create a mix, output a file, and not use it for training? Would be nice to be able to do that to create mixes for EasyLM/TPUs
- Have we tested to make sure we get identical mixes if a seed is set?
* add yaml files for safety data * update max_seq_length * update max_seq_length * update data paths --------- Co-authored-by: Nathan Lambert <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
/net/nfs.cirrascale/mosaic/oe-safety-datasets/wildchat_lmsys_sexual/gpt4_lmsys_wildchat_dedup_50ksampled.jsonl: 16888 | ||
allenai/tulu-v2-sft-mixture: 326154 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's helpful to specify the percentage instead of the absolute value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percentage also works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm in general. I would love to test it for training a real model, but maybe later.
and "response" in dataset.column_names | ||
and "messages" not in dataset.column_names | ||
): | ||
dataset = dataset.map(query_response_to_messages, num_proc=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably raise an unsupported error if none is matched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the time it'll error if it doesn't match.
Makes it so you can mix HF and local datasets by proportion of dataset or count of samples, configs like what we had:
Or fractional mixing:
Or count mixing:
Including local files: