[BUG] Random slowdowns in tensor parallel. #630
Labels: bug (Something isn't working)

Comments
You're not rolling over the KV cache at ~4K context or something like that, right? I don't experience this with the latest dev pull, but I also usually don't have conversations longer than the pre-allocated sequence length. That said, paged attention is now pretty good at handling rolling cache too... so maybe not relevant.
Nope, it's a 32k model. Haven't seen it crop up lately, though.
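For anyone unfamiliar with the rollover the first comment asks about: when a conversation outgrows the pre-allocated sequence length, the cache starts evicting its oldest positions. This is only a conceptual sketch (a toy ring buffer, not ExLlamaV2's actual cache implementation), but it shows the behavior in question:

```python
from collections import deque

class RollingKVCache:
    """Toy illustration of a rolling KV cache: once the number of cached
    positions hits max_seq_len, the oldest entries are evicted.
    Conceptual only -- not ExLlamaV2's real data structure."""

    def __init__(self, max_seq_len=4096):
        self.max_seq_len = max_seq_len
        self.entries = deque(maxlen=max_seq_len)
        self.rollovers = 0  # count of evictions (the "rolling" the comment refers to)

    def append(self, kv):
        if len(self.entries) == self.entries.maxlen:
            self.rollovers += 1  # oldest position is overwritten
        self.entries.append(kv)

# A 4-slot cache fed 6 tokens rolls over twice, keeping only the newest 4.
cache = RollingKVCache(max_seq_len=4)
for tok in range(6):
    cache.append(tok)
print(len(cache.entries), cache.rollovers)  # → 4 2
print(list(cache.entries))                  # → [2, 3, 4, 5]
```

A model with a 32k pre-allocated length (as in the reply) would only hit this path on very long conversations, which is why the commenters rule it out here.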
OS
Linux
GPU Library
CUDA 11.8
Python version
3.10
Pytorch version
2.4
Model
Luminum 123b 4.0bpw H6
Describe the bug
I am happily genning at 11-16 t/s. Suddenly, random messages go really slow and the t/s drops; after some more messages it goes back up.
I checked GPU temps in nvtop, but they were all in the 60s. One GPU is cranking while the others sit at low utilization, as if it were processing sequentially.
I'm not sure whether it's related to my machine, but it's new behavior for me since 2.2 and dev. Mostly probing to see if anyone has experienced the same.
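A cheap way to make "random messages go really slow" concrete when reporting this kind of bug is to log tokens/s per message and flag outliers. This is a generic sketch (the helper name, sample numbers, and the 0.5x-median threshold are all made up for illustration), not part of ExLlamaV2:

```python
import statistics

def flag_slow_messages(token_counts, durations, ratio=0.5):
    """Return indices of generations whose tokens/s falls below
    `ratio` times the median rate across all messages.

    token_counts: tokens generated per message
    durations:    wall-clock seconds per message
    """
    rates = [n / t for n, t in zip(token_counts, durations)]
    median = statistics.median(rates)
    return [i for i, r in enumerate(rates) if r < ratio * median]

# Example: four messages at the usual 11-16 t/s, then one that stalls at ~3 t/s.
tokens = [220, 180, 240, 200, 210]
seconds = [16.0, 12.5, 15.0, 14.0, 70.0]
print(flag_slow_messages(tokens, seconds))  # → [4]
```

Correlating the flagged message indices with timestamps (and, say, `nvidia-smi dmon` output) would help show whether the single-GPU-cranking pattern lines up with the slow generations.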
Reproduction steps
Generate as normal. Some messages will be slow.
Expected behavior
Consistent speeds.
Logs
No response
Additional context
Acknowledgements