
[BUG] Random slowdowns in tensor parallel. #630

Open
Ph0rk0z opened this issue Sep 21, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

Ph0rk0z commented Sep 21, 2024

OS: Linux
GPU Library: CUDA 11.8
Python version: 3.10
PyTorch version: 2.4
Model: Luminum 123B 4.0bpw H6

Describe the bug

I am happily generating at 11-16 t/s. Suddenly, random messages go really slowly and the t/s drops; after some more messages it goes back up.

I went into nvtop to check GPU temperatures, but they were all in the 60s (°C). One GPU is pegged while the others sit at low utilization, as if it were processing sequentially.

I'm not sure whether it's related to my machine, but it's new behavior for me since 2.2 and dev. Mostly probing to see if anyone else has experienced the same.
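
Not part of the original report, but for reference: a minimal sketch of logging per-GPU utilization and temperature over time with pynvml (nvidia-ml-py), which makes the "one GPU busy, others mostly idle" pattern easy to capture alongside the t/s drop. The sampling interval and duration are arbitrary choices.

```python
# Illustrative only: sample per-GPU utilization/temperature once per second
# so a slow generation can be correlated with uneven GPU load.
import time
import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    for _ in range(60):  # ~60 seconds of samples
        util = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        temp = [pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
                for h in handles]
        print("util %:", util, "| temp C:", temp)
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```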

Reproduction steps

Generate as normal. Some messages will be slow.

Expected behavior

Consistent speeds.

Logs

No response

Additional context

(Attached screenshots: tp-fail, tp-fail2)

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
Ph0rk0z added the bug label Sep 21, 2024
@grimulkan

You're not rolling over the KV cache at ~4K context or something like that, right? I don't experience this with the latest dev pull, but I also usually don't have conversations longer than the pre-allocated sequence length. That said, paged attention is now pretty good at handling rolling cache too... so maybe not relevant.
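
(Not from the thread, just to illustrate the check suggested above: with exllamav2's Python API the KV cache can be pre-allocated to the full conversation length so it never has to roll over. The model path below is hypothetical, and this shows the plain autosplit loader rather than the tensor-parallel path, but max_seq_len is the relevant knob either way.)

```python
# Minimal sketch, not the reporter's setup: allocate the cache large enough
# for the whole conversation so no cache rollover occurs mid-chat.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config("/models/Luminum-123B-4.0bpw-h6")  # hypothetical path
model = ExLlamaV2(config)

# lazy=True defers allocation until the model is split across GPUs
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
# (In tensor-parallel mode the loader and cache class differ, but the same
# max_seq_len consideration applies.)
```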

Ph0rk0z commented Oct 19, 2024

Nope, it's a 32k model. Haven't seen it crop up lately, though.
