Scheduler deadlocked after stealing failed in move_task_confirm #8787

Open
hendrikmakait opened this issue Jul 22, 2024 · 0 comments
Labels
bug (Something is broken), deadlock (The cluster appears to not make any progress)

I've investigated a cluster that deadlocked after work-stealing failed in move_task_confirm with the following traceback:

distributed.stealing - ERROR - <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>
  Traceback (most recent call last):
    File "/opt/coiled/env/lib/python3.10/site-packages/distributed/stealing.py", line 380, in move_task_confirm
      victim.remove_from_processing(ts)
    File "/opt/coiled/env/lib/python3.10/site-packages/distributed/scheduler.py", line 771, in remove_from_processing
      self.processing.remove(ts)
  KeyError: <TaskState ('rechunk-getitem-getitem-getitem-1bd003f53f0630ef705ff016830a2c8f', 0, 1, 0) released>

From what I understand, stealing had already progressed far into confirmation when it failed: it had checked that the request was up to date, that the worker had indeed confirmed the request (by checking the worker status), and that the task was currently stealable.
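To make the failure mode in the traceback concrete, here is a minimal toy model. It is not the actual `distributed` code: `ToyTask`, `ToyWorker`, and `confirm_steal` are hypothetical names, and the model assumes only what the traceback shows, namely that `remove_from_processing` calls `self.processing.remove(ts)` and that this raises `KeyError` when the task is absent. It shows the mechanics of the exception, not why the scheduler state got out of sync in the first place.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so tasks can live in a set
class ToyTask:
    # Hypothetical stand-in for a scheduler task state.
    key: str
    state: str = "processing"


@dataclass
class ToyWorker:
    # Hypothetical stand-in for the scheduler-side worker state.
    processing: set = field(default_factory=set)

    def remove_from_processing(self, ts: ToyTask) -> None:
        # Same call shape as in the traceback: set.remove raises KeyError
        # if the element is not present.
        self.processing.remove(ts)


def confirm_steal(victim: ToyWorker, thief: ToyWorker, ts: ToyTask) -> bool:
    """Move ts from victim to thief; return False if the victim no longer holds it."""
    if ts not in victim.processing:
        # Assumed defensive check for illustration only.
        return False
    victim.remove_from_processing(ts)
    thief.processing.add(ts)
    return True


victim, thief = ToyWorker(), ToyWorker()
ts = ToyTask("rechunk-example")
victim.processing.add(ts)

# Simulate the task leaving the victim between the steal request and its confirmation.
victim.processing.discard(ts)
ts.state = "released"

try:
    victim.remove_from_processing(ts)
    raised = False
except KeyError:
    raised = True

moved = confirm_steal(victim, thief, ts)
```

In this sketch the unguarded call raises `KeyError`, while the guarded variant simply declines to move the task. Whether a guard like this would be the right fix in the real scheduler is an open question, since it could hide whatever inconsistency produced the released task.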

After looking into this for a while, I have not been able to identify the root cause, so I'm leaving this here in case it ever comes up again.

Environment:

  • Dask version: 2024.7.1
  • Python version: 3.10.12

hendrikmakait added the bug and deadlock labels on Jul 22, 2024
hendrikmakait changed the title from "Scheduler deadlocked with after stealing failed in move_task_confirm" to "Scheduler deadlocked after stealing failed in move_task_confirm" on Jul 23, 2024