Hello, I'm processing the RedPajama data and it's unacceptably slow, especially for the books domain. Any suggestions, please?
Or could you share a copy of your processed training data? Thanks a lot!
Hi, can you please share some details on what step is giving you trouble?
If you are running into slow speed with the tokenization, then I would recommend checking out the SentencePiece tokenizer instead of using the Transformers tokenizer (I talk about it here).
From my experience, the SentencePiece tokenizer is much faster with longer sequences (which matters a lot for the books domain), whereas the Transformers tokenizer is faster at large batches of shorter sequences.
It is easy to switch over to the SentencePiece tokenizer, simply by uncommenting this line.
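A minimal sketch of what the switch looks like, assuming you already have a trained SentencePiece model file (the path `tokenizer.model` and the Transformers model name below are just placeholders, not the exact ones the repo uses):

```python
import sentencepiece as spm

# Load the SentencePiece model (path is a placeholder for your tokenizer file).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize_document(text: str) -> list[int]:
    # SentencePiece encodes one long string efficiently, which helps a lot
    # with book-length documents.
    return sp.encode(text, out_type=int)

# Rough Transformers equivalent, for comparison (model name is an example):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# ids = tok(text)["input_ids"]
```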
You can also shard this work across multiple processes if you have the CPU cores for it. To do this, you can change this line and specify a large shard_size.
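A rough sketch of sharding the tokenization across CPU cores with Python's standard multiprocessing module; `tokenize_document` is the per-document function from the sketch above, and `shard_size` and the worker count are values you would tune to your machine, not the repo's defaults:

```python
from multiprocessing import Pool

def tokenize_shard(shard):
    # Each worker process tokenizes every document in its shard.
    return [tokenize_document(doc) for doc in shard]

def make_shards(docs, shard_size):
    # Split the full document list into contiguous shards of shard_size docs.
    return [docs[i:i + shard_size] for i in range(0, len(docs), shard_size)]

if __name__ == "__main__":
    docs = ["..."]                        # replace with your loaded documents
    shards = make_shards(docs, shard_size=10_000)
    with Pool(processes=16) as pool:      # set to the number of cores you have
        tokenized = pool.map(tokenize_shard, shards)
```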
If you are running into slow processing during the sampling step, you can try increasing the number of shards, as we discuss here. Let me know if you need help with anything else!
I would be happy to share the training data, though it totals about 5T, which can be very slow to transfer over a network. If you are still running into issues and want me to send the training data, please email me at [email protected] and we can figure out a way to do this :)