
test_task_group_and_prefix_statistics is very flaky on mindeps-pandas CI #8795

Open
hendrikmakait opened this issue Jul 23, 2024 · 0 comments
Labels
flaky test Intermittent failures on CI.

Comments

@hendrikmakait
Member

CI run: https://github.com/dask/distributed/actions/runs/10053562726/job/27786569150#step:20:1479

[Screenshot 2024-07-23 at 16:13:32]

Traceback:

____________________ test_task_group_and_prefix_statistics _____________________

args = (), kwds = {'no_time_resync': None}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)

../../../miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../miniconda3/envs/dask-distributed/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
distributed/utils_test.py:1103: in test_func
    return _run_and_close_tornado(async_fn_outer)
distributed/utils_test.py:378: in _run_and_close_tornado
    return asyncio_run(inner_fn(), loop_factory=get_loop_factory())
distributed/compatibility.py:236: in asyncio_run
    return loop.run_until_complete(main)
../../../miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
    return future.result()
distributed/utils_test.py:375: in inner_fn
    return await async_fn(*args, **kwargs)
distributed/utils_test.py:1100: in async_fn_outer
    return await utils_wait_for(async_fn(), timeout=timeout * 2)
distributed/utils.py:1957: in wait_for
    return await asyncio.wait_for(fut, timeout)
../../../miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py:479: in wait_for
    return fut.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    async def async_fn():
        result = None
        with dask.config.set(config):
            async with (
                _cluster_factory() as (s, workers),
                _client_factory(s) as c,
            ):
                args = [s] + workers
                if c is not None:
                    args = [c] + args
                try:
                    coro = func(*args, *outer_args, **kwargs)
                    task = asyncio.create_task(coro)
                    coro2 = utils_wait_for(
                        asyncio.shield(task), timeout=deadline.remaining
                    )
                    result = await coro2
                    validate_state(s, *workers)
    
                except asyncio.TimeoutError:
                    assert task
                    elapsed = deadline.elapsed
                    buffer = io.StringIO()
                    # This stack indicates where the coro/test is suspended
                    task.print_stack(file=buffer)
    
                    if cluster_dump_directory:
                        await dump_cluster_state(
                            s=s,
                            ws=workers,
                            output_dir=cluster_dump_directory,
                            func_name=func.__name__,
                        )
    
                    task.cancel()
                    while not task.cancelled():
                        await asyncio.sleep(0.01)
    
                    # Hopefully, the hang has been caused by inconsistent
                    # state, which should be much more meaningful than the
                    # timeout
                    validate_state(s, *workers)
    
                    # Remove as much of the traceback as possible; it's
                    # uninteresting boilerplate from utils_test and asyncio
                    # and not from the code being tested.
>                   raise asyncio.TimeoutError(
                        f"Test timeout ({timeout}) hit after {elapsed}s.\n"
                        "========== Test stack trace starts here ==========\n"
                        f"{buffer.getvalue()}"
                    ) from None
E                   asyncio.exceptions.TimeoutError: Test timeout (30) hit after 30.000372628999912s.
E                   ========== Test stack trace starts here ==========
E                   Stack for <Task pending name='Task-173399' coro=<test_task_group_and_prefix_statistics() running at /home/runner/work/distributed/distributed/distributed/tests/test_scheduler.py:2879> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f05150b8580>()]>> (most recent call last):
E                     File "/home/runner/work/distributed/distributed/distributed/tests/test_scheduler.py", line 2879, in test_task_group_and_prefix_statistics
E                       await asyncio.sleep(0.01)

distributed/utils_test.py:1042: TimeoutError
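
For context, the stack trace above shows the test coroutine suspended on an await asyncio.sleep(0.01) at test_scheduler.py:2879, which is the usual shape of a polling loop that waits for scheduler state to settle and never observes the expected value. A minimal, hypothetical sketch of that pattern (the helper name and predicate are illustrative, not the actual code at line 2879):

    import asyncio

    async def poll_until(predicate, interval=0.01):
        # Illustrative polling loop: re-check the condition until it holds.
        # If the expected scheduler state never materializes, this loop spins
        # until the 30s gen_cluster timeout fires and raises the TimeoutError
        # captured in the traceback above.
        while not predicate():
            await asyncio.sleep(interval)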

Logs:

----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_task_group_and_prefix_statistics.yaml
----------------------------- Captured stderr call -----------------------------
2024-07-23 06:17:03,713 - distributed.scheduler - INFO - State start
2024-07-23 06:17:03,718 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:40205
2024-07-23 06:17:03,720 - distributed.scheduler - INFO -   dashboard at:  http://127.0.0.1:44235/status
2024-07-23 06:17:03,722 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-07-23 06:17:03,729 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:44723
2024-07-23 06:17:03,731 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:44723
2024-07-23 06:17:03,732 - distributed.worker - INFO -           Worker name:                          0
2024-07-23 06:17:03,733 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34313
2024-07-23 06:17:03,735 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,736 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,737 - distributed.worker - INFO -               Threads:                          1
2024-07-23 06:17:03,739 - distributed.worker - INFO -                Memory:                  15.61 GiB
2024-07-23 06:17:03,740 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/noschedulerdelayworker-z9yoj0xu
2024-07-23 06:17:03,741 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,743 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34909
2024-07-23 06:17:03,744 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34909
2024-07-23 06:17:03,746 - distributed.worker - INFO -           Worker name:                          1
2024-07-23 06:17:03,747 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34479
2024-07-23 06:17:03,748 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,749 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,751 - distributed.worker - INFO -               Threads:                          2
2024-07-23 06:17:03,752 - distributed.worker - INFO -                Memory:                  15.61 GiB
2024-07-23 06:17:03,753 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/noschedulerdelayworker-dm_v4rwu
2024-07-23 06:17:03,755 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,789 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:44723', name: 0, status: init, memory: 0, processing: 0>
2024-07-23 06:17:03,805 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:44723
2024-07-23 06:17:03,807 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59288
2024-07-23 06:17:03,808 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:34909', name: 1, status: init, memory: 0, processing: 0>
2024-07-23 06:17:03,823 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:34909
2024-07-23 06:17:03,825 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59304
2024-07-23 06:17:03,826 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-07-23 06:17:03,828 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-07-23 06:17:03,830 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,832 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,833 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40205
2024-07-23 06:17:03,834 - distributed.worker - INFO - -------------------------------------------------
2024-07-23 06:17:03,836 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:40205
2024-07-23 06:17:03,836 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:40205
2024-07-23 06:17:03,856 - distributed.scheduler - INFO - Receive client connection: Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:03,872 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:59320
2024-07-23 06:17:34,230 - distributed.scheduler - INFO - Remove client Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,233 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59320; closing.
2024-07-23 06:17:34,233 - distributed.scheduler - INFO - Remove client Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,236 - distributed.scheduler - INFO - Close client connection: Client-2ded1ac6-48bb-11ef-895b-000d3a30a29f
2024-07-23 06:17:34,240 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:44723. Reason: worker-close
2024-07-23 06:17:34,243 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:34909. Reason: worker-close
2024-07-23 06:17:34,248 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59288; closing.
2024-07-23 06:17:34,248 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:59304; closing.
2024-07-23 06:17:34,248 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:44723', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1721715454.248713')
2024-07-23 06:17:34,250 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:34909', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1721715454.2504525')
2024-07-23 06:17:34,251 - distributed.scheduler - INFO - Lost all workers
2024-07-23 06:17:34,253 - distributed.core - INFO - Connection to tcp://127.0.0.1:40205 has been closed.
2024-07-23 06:17:34,254 - distributed.core - INFO - Connection to tcp://127.0.0.1:40205 has been closed.
2024-07-23 06:17:34,256 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-07-23 06:17:34,257 - distributed.scheduler - INFO - Scheduler closing all comms
@hendrikmakait added the "flaky test" (Intermittent failures on CI) label on Jul 23, 2024