
[core][distributed] fix zmq hang #6759

Merged

4 commits merged into vllm-project:main from fix_zmq_hang on Jul 25, 2024
Conversation

youkaichao (Member) commented:

Fixes #6700

This is caused by incorrect usage of zmq, or by a bug in zmq, reported at zeromq/libzmq#4713.

By using an XPUB socket, we can make sure all subscribers have already subscribed before we publish (broadcast).

Locally tested: previously it hung about once in 20 runs; now it runs without any problem in 1000 runs.
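
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the XPUB handshake described above. It is not vLLM's actual code: the endpoint, subscriber count, and payload are made up for illustration; only the XPUB/XPUB_VERBOSE usage mirrors what this PR switches to.

import zmq

N_SUBSCRIBERS = 2                    # hypothetical number of reader processes
ENDPOINT = "tcp://127.0.0.1:5555"    # hypothetical endpoint

ctx = zmq.Context()
pub = ctx.socket(zmq.XPUB)
# XPUB_VERBOSE makes the socket report every subscription, even duplicates.
pub.setsockopt(zmq.XPUB_VERBOSE, 1)
pub.bind(ENDPOINT)

# Unlike a plain PUB socket, XPUB receives a message whenever a SUB
# (un)subscribes: b"\x01" + topic on subscribe, b"\x00" + topic on unsubscribe.
ready = 0
while ready < N_SUBSCRIBERS:
    msg = pub.recv()
    if msg and msg[0] == 1:
        ready += 1

# All expected subscribers are attached, so this broadcast cannot be dropped
# the way an early send() on a plain PUB socket can be.
pub.send(b"hello from the publisher")

The key point is the recv() loop: with a plain PUB socket there is no way to know when subscribers are attached, so an early broadcast is silently dropped and the readers hang waiting for it.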


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

@@ -40,7 +40,7 @@ def _validate_http_url(self, url: str):
raise ValueError("Invalid HTTP URL: A valid HTTP URL "
"must have scheme 'http' or 'https'.")

def _headers(self, **extras: str) -> Mapping[str, str]:
youkaichao (Member, Author) commented:

This is a lint error that I fixed in passing.

davidthomas426 left a comment:

LGTM, as long as XSUB isn't needed to make this semantically correct. I didn't dig into XPUB/XSUB deeply enough to know that for sure myself yet.

It's nice to be able to fix an issue AND simplify this code a lot :)

@@ -9,7 +9,7 @@
 import torch
 import torch.distributed as dist
 from torch.distributed import ProcessGroup
-from zmq import PUB, REP, REQ, SUB, SUBSCRIBE, Context # type: ignore
+from zmq import SUB, SUBSCRIBE, XPUB, XPUB_VERBOSE, Context # type: ignore

davidthomas426 commented:

Do we need XSUB, too?

youkaichao (Member, Author) replied:

No, you can check https://netmq.readthedocs.io/en/latest/xpub-xsub/.

XPUB connects to SUB sockets.

XSUB is used to connect to many PUB sockets, which is not our use case.
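
For completeness, here is the matching subscriber side of the sketch above (again illustrative, reusing the hypothetical endpoint): a plain SUB socket pairs directly with XPUB, which is why the new import list does not need XSUB.

import zmq

ENDPOINT = "tcp://127.0.0.1:5555"    # same hypothetical endpoint as above

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.connect(ENDPOINT)
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to all topics
print(sub.recv())                    # blocks until the publisher broadcasts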

youkaichao added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 24, 2024
youkaichao merged commit 740374d into vllm-project:main on Jul 25, 2024
82 of 84 checks passed
youkaichao deleted the fix_zmq_hang branch on July 25, 2024 at 00:37
cadedaniel pushed a commit to cadedaniel/vllm-public that referenced this pull request Jul 27, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
russellb pushed a commit to russellb/vllm that referenced this pull request Sep 18, 2024
n1hility added a commit to opendatahub-io/vllm that referenced this pull request Oct 2, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Labels
ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues:

[Bug]: vLLM 0.5.3 is getting stuck at LLAMA 3.1 405B FP8 model loading