
[core][distributed] fix zmq hang #6759

Merged

4 commits merged into vllm-project:main from fix_zmq_hang on Jul 25, 2024
Conversation

youkaichao (Member) commented:

Fixes #6700

This is caused by incorrect usage of zmq, or by a bug in zmq, reported at zeromq/libzmq#4713.

By using an XPUB socket, we can make sure all subscribers have already subscribed before we publish (broadcast).

Locally tested: previously it hung about once in 20 runs; now it runs without any problem in 1000 runs.
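
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the XPUB handshake described above. It is not vLLM's actual code: the endpoint, subscriber count, and payload are made up for illustration; only the XPUB/XPUB_VERBOSE usage mirrors what this PR switches to.

import zmq

N_SUBSCRIBERS = 2                    # hypothetical number of reader processes
ENDPOINT = "tcp://127.0.0.1:5555"    # hypothetical endpoint

ctx = zmq.Context()
pub = ctx.socket(zmq.XPUB)
# XPUB_VERBOSE makes the socket report every subscription, even duplicates.
pub.setsockopt(zmq.XPUB_VERBOSE, 1)
pub.bind(ENDPOINT)

# Unlike a plain PUB socket, XPUB receives a message whenever a SUB
# (un)subscribes: b"\x01" + topic on subscribe, b"\x00" + topic on unsubscribe.
ready = 0
while ready < N_SUBSCRIBERS:
    msg = pub.recv()
    if msg and msg[0] == 1:
        ready += 1

# All expected subscribers are attached, so this broadcast cannot be dropped
# the way an early send() on a plain PUB socket can be.
pub.send(b"hello from the publisher")

The key point is the recv() loop: with a plain PUB socket there is no way to know when subscribers are attached, so an early broadcast is silently dropped and the readers hang waiting for it.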


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

@@ -40,7 +40,7 @@ def _validate_http_url(self, url: str):
raise ValueError("Invalid HTTP URL: A valid HTTP URL "
"must have scheme 'http' or 'https'.")

def _headers(self, **extras: str) -> Mapping[str, str]:
youkaichao (Member, Author) commented:

This is a lint error that I fixed in passing.

davidthomas426 left a comment:

LGTM, as long as XSUB isn't needed to make this semantically correct. I didn't dig into XPUB/XSUB deeply enough to know that for sure myself yet.

It's nice to be able to fix an issue AND simplify this code a lot :)

@@ -9,7 +9,7 @@
 import torch
 import torch.distributed as dist
 from torch.distributed import ProcessGroup
-from zmq import PUB, REP, REQ, SUB, SUBSCRIBE, Context # type: ignore
+from zmq import SUB, SUBSCRIBE, XPUB, XPUB_VERBOSE, Context # type: ignore

davidthomas426 commented:

Do we need XSUB, too?

youkaichao (Member, Author) replied:

No, you can check https://netmq.readthedocs.io/en/latest/xpub-xsub/.

XPUB connects to SUB sockets.

XSUB is used to connect to many PUB sockets, which is not our use case.
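
For completeness, here is the matching subscriber side of the sketch above (again illustrative, reusing the hypothetical endpoint): a plain SUB socket pairs directly with XPUB, which is why the new import list does not need XSUB.

import zmq

ENDPOINT = "tcp://127.0.0.1:5555"    # same hypothetical endpoint as above

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.connect(ENDPOINT)
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to all topics
print(sub.recv())                    # blocks until the publisher broadcasts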

youkaichao added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 24, 2024
youkaichao merged commit 740374d into vllm-project:main on Jul 25, 2024
82 of 84 checks passed
youkaichao deleted the fix_zmq_hang branch on July 25, 2024 at 00:37
cadedaniel pushed a commit to cadedaniel/vllm-public that referenced this pull request Jul 27, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
russellb pushed a commit to russellb/vllm that referenced this pull request Sep 18, 2024
n1hility added a commit to opendatahub-io/vllm that referenced this pull request Oct 2, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Labels
ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues:

[Bug]: vLLM 0.5.3 is getting stuck at LLAMA 3.1 405B FP8 model loading