
test_task_group_and_prefix_statistics is very flaky on mindeps-pandas CI #8795

Open
hendrikmakait opened this issue Jul 23, 2024 · 0 comments
Labels
flaky test Intermittent failures on CI.

Comments

@hendrikmakait
Member

CI run: https://github.com/dask/distributed/actions/runs/10053562726/job/27786569150#step:20:1479

[Screenshot 2024-07-23 at 16:13:32]

Traceback:

____________________ test_task_group_and_prefix_statistics _____________________

args = (), kwds = {'no_time_resync': None}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)

../../../miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
distributed/utils_test.py:1103: in test_func
    return _run_and_close_tornado(async_fn_outer)
distributed/utils_test.py:378: in _run_and_close_tornado
    return asyncio_run(inner_fn(), loop_factory=get_loop_factory())
distributed/compatibility.py:236: in asyncio_run
    return loop.run_until_complete(main)
../../../miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
    return future.result()
distributed/utils_test.py:375: in inner_fn
    return await async_fn(*args, **kwargs)
distributed/utils_test.py:1100: in async_fn_outer
    return await utils_wait_for(async_fn(), timeout=timeout * 2)
distributed/utils.py:1957: in wait_for
    return await asyncio.wait_for(fut, timeout)
../../../miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py:479: in wait_for
    return fut.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    async def async_fn():
        result = None
        with dask.config.set(config):
            async with (
                _cluster_factory() as (s, workers),
                _client_factory(s) as c,
            ):
                args = [s] + workers
                if c is not None:
                    args = [c] + args
                try:
                    coro = func(*args, *outer_args, **kwargs)
                    task = asyncio.create_task(coro)
                    coro2 = utils_wait_for(
                        asyncio.shield(task), timeout=deadline.remaining
                    )
                    result = await coro2
                    validate_state(s, *workers)
    
                except asyncio.TimeoutError:
                    assert task
                    elapsed = deadline.elapsed
                    buffer = io.StringIO()
                    # This stack indicates where the coro/test is suspended
                    task.print_stack(file=buffer)
    
                    if cluster_dump_directory:
                        await dump_cluster_state(
                            s=s,
                            ws=workers,
                            output_dir=cluster_dump_directory,
                            func_name=func.__name__,
                        )
    
                    task.cancel()
                    while not task.cancelled():
                        await asyncio.sleep(0.01)
    
                    # Hopefully, the hang has been caused by inconsistent
                    # state, which should be much more meaningful than the
                    # timeout
                    validate_state(s, *workers)
    
                    # Remove as much of the traceback as possible; it's
                    # uninteresting boilerplate from utils_test and asyncio
                    # and not from the code being tested.
>                   raise asyncio.TimeoutError(
                        f"Test timeout ({timeout}) hit after {elapsed}s.\n"
                        "========== Test stack trace starts here ==========\n"
                        f"{buffer.getvalue()}"
                    ) from None
E                   asyncio.exceptions.TimeoutError: Test timeout (30) hit after 30.000372628999912s.
E                   ========== Test stack trace starts here ==========
E                   Stack for <Task pending name='Task-173399' coro=<test_task_group_and_prefix_statistics() running at /home/runner/work/distributed/distributed/distributed/tests/test_scheduler.py:2879> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f05150b8580>()]>> (most recent call last):
E                     File "/home/runner/work/distributed/distributed/distributed/tests/test_scheduler.py", line 2879, in test_task_group_and_prefix_statistics
E                       await asyncio.sleep(0.01)

distributed/utils_test.py:1042: TimeoutError
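
For context, the stack trace above shows the test coroutine suspended on an await asyncio.sleep(0.01) at test_scheduler.py:2879, which is the usual shape of a polling loop that waits for scheduler state to settle and never observes the expected value. A minimal, hypothetical sketch of that pattern (the helper name and predicate are illustrative, not the actual code at line 2879):

    import asyncio

    async def poll_until(predicate, interval=0.01):
        # Illustrative polling loop: re-check the condition until it holds.
        # If the expected scheduler state never materializes, this loop spins
        # until the 30s gen_cluster timeout fires and raises the TimeoutError
        # captured in the traceback above.
        while not predicate():
            await asyncio.sleep(interval)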

Logs:

----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_task_group_and_prefix_statistics.yaml
----------------------------- Captured stderr call -----------------------------
2024-07-23 06:17:03,713 - distributed.scheduler - INFO - State start
2024-07-23 06:17:03,718 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:40205
2024-07-23 06:17:03,720 - distributed.scheduler - INFO -   dashboard at:  http://127.0.0.1:44235/status
2024-07-23 06:17:03,722 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-07-23 06:17:03,729 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:44723
2024-07-23 06:17:03,731 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:44723
2024-07-23 06:17:03,732 - distributed.worker - INFO -           Worker name:                          0
2024-07-23 06:17:03,733 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34313
2024-07-23 06:17:03,735 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,736 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,737 - distributed.worker - INFO -               Threads:                          1
2024-07-23 06:17:03,739 - distributed.worker - INFO -                Memory:                  15.61 GiB
2024-07-23 06:17:03,740 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/noschedulerdelayworker-z9yoj0xu
2024-07-23 06:17:03,741 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,743 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34909
2024-07-23 06:17:03,744 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34909
2024-07-23 06:17:03,746 - distributed.worker - INFO -           Worker name:                          1
2024-07-23 06:17:03,747 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34479
2024-07-23 06:17:03,748 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,749 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,751 - distributed.worker - INFO -               Threads:                          2
2024-07-23 06:17:03,752 - distributed.worker - INFO -                Memory:                  15.61 GiB
2024-07-23 06:17:03,753 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/noschedulerdelayworker-dm_v4rwu
2024-07-23 06:17:03,755 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,789 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:44723', name: 0, status: init, memory: 0, processing: 0>
2024-07-23 06:17:03,805 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:44723
2024-07-23 06:17:03,807 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59288
2024-07-23 06:17:03,808 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:34909', name: 1, status: init, memory: 0, processing: 0>
2024-07-23 06:17:03,823 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:34909
2024-07-23 06:17:03,825 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59304
2024-07-23 06:17:03,826 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-07-23 06:17:03,828 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-07-23 06:17:03,830 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,832 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,833 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,834 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,836 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:40205
2024-07-23 06:17:03,836 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:40205
2024-07-23 06:17:03,856 - distributed.scheduler - INFO - Receive client connection: Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:03,872 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59320
2024-07-23 06:17:34,230 - distributed.scheduler - INFO - Remove client Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,233 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59320; closing.
2024-07-23 06:17:34,233 - distributed.scheduler - INFO - Remove client Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,236 - distributed.scheduler - INFO - Close client connection: Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,240 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:44723. Reason: worker-close
2024-07-23 06:17:34,243 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:34909. Reason: worker-close
2024-07-23 06:17:34,248 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59288; closing.
2024-07-23 06:17:34,248 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59304; closing.
2024-07-23 06:17:34,248 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:44723', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1721715454.248713')
2024-07-23 06:17:34,250 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34909', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1721715454.2504525')
2024-07-23 06:17:34,251 - distributed.scheduler - INFO - Lost all workers
2024-07-23 06:17:34,253 - distributed.core - INFO - Connection to tcp://127.0.0.1:40205 has been closed.
2024-07-23 06:17:34,254 - distributed.core - INFO - Connection to tcp://127.0.0.1:40205 has been closed.
2024-07-23 06:17:34,256 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-07-23 06:17:34,257 - distributed.scheduler - INFO - Scheduler closing all comms
@hendrikmakait added the "flaky test" (Intermittent failures on CI) label on Jul 23, 2024