torchrun error when generating training split #24
Comments
Sorry, I'm not sure what the issue is and it might be related to your setup (e.g., disk space, RAM). Are there any additional error messages?
Thank you for your response. I increased the RAM to 50GB, and it can now generate the training split. However, when training starts, it raises a wandb-related error:
I have already installed wandb. Here are all the packages I installed, with their corresponding versions:
I've encountered this issue as well. It seems to be caused by insufficient system memory (RAM) on your machine, not by the GPU.
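If it helps anyone checking the same thing, here is a rough sketch for reading the available host RAM on Linux before launching the script. It is not part of this repository: the helper names (`parse_meminfo`, `available_gib`, `has_enough_ram`) are made up for illustration, and the 50 GiB threshold is only the figure reported above, not a documented requirement.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def available_gib(text):
    """Return the MemAvailable field converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


def has_enough_ram(text, required_gib):
    """True if the available memory is at least required_gib (hypothetical check)."""
    return available_gib(text) >= required_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    # 50 is an assumption taken from the report above, not an official number.
    print(f"Available RAM: {available_gib(meminfo):.1f} GiB")
    print("Enough for a 50 GiB budget:", has_enough_ram(meminfo, 50))
```

This only looks at host memory and says nothing about GPU memory, which matches the observation that the failure is not GPU-related.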
When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any error log.
I'm running on an NVIDIA A100 40GB PCIe. What could be the possible issue? Thank you.
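One way to narrow down a silent exit like this is to look at the exit status of the process. The sketch below is an assumption-laden illustration, not something from this repository: `describe_exit` is a hypothetical helper, and it relies on the POSIX convention that Python's `subprocess` reports a negative return code when a child is terminated by a signal. A SIGKILL with no traceback is often, though not always, the Linux out-of-memory killer. A launcher such as torchrun may report its workers' failures differently, so this applies to the process that was killed directly.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports -N when the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        hint = " (possibly the out-of-memory killer)" if name == "SIGKILL" else ""
        return f"killed by {name}{hint}"
    return f"exited with error code {returncode}"


if __name__ == "__main__":
    # Placeholder child process; substitute the real command when debugging.
    result = subprocess.run([sys.executable, "-c", "print('child ran')"])
    print(describe_exit(result.returncode))
```

If the status points to SIGKILL, the kernel log (for example via `dmesg`) usually records whether the out-of-memory killer was responsible.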