Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encountered there is no reactor running, must be called from the context of a Madsim runtime while spawning task in drop impl #228

Closed
kwannoel opened this issue Oct 7, 2024 · 8 comments

Comments

@kwannoel
Copy link
Contributor

kwannoel commented Oct 7, 2024

I encountered the following error message in this branch: risingwavelabs/risingwave#18795, in its main-cron ci workflow.

recovery-streaming-4.log

This happens when running a forked version of sqlx with madsim-tokio 0.2.30 as its dependency, to override tokio.

thread '<unnamed>' panicked at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/context.rs:27:44:
there is no reactor running, must be called from the context of a Madsim runtime
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/panicking.rs:74:14
   2: core::panicking::panic_display
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/panicking.rs:264:5
   3: core::option::expect_failed
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/option.rs:2030:5
   4: expect<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/option.rs:933:21
   5: {closure#0}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/context.rs:27:44
   6: try_with<core::cell::RefCell<core::option::Option<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>>, madsim::sim::runtime::context::current_task::{closure_env#0}, alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/std/src/thread/local.rs:283:12
   7: with<core::cell::RefCell<core::option::Option<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>>, madsim::sim::runtime::context::current_task::{closure_env#0}, alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/std/src/thread/local.rs:260:9
   8: current_task
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/context.rs:27:10
   9: current
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:586:20
  10: spawn<sqlx_core::pool::connection::{impl#5}::return_to_pool::{async_block_env#1}<sqlx_sqlite::database::Sqlite>>
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:664:5
  11: spawn<sqlx_core::pool::connection::{impl#5}::return_to_pool::{async_block_env#1}<sqlx_sqlite::database::Sqlite>>
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-tokio-0.2.30/src/sim/runtime.rs:152:9
  12: spawn<sqlx_core::pool::connection::{impl#5}::return_to_pool::{async_block_env#1}<sqlx_sqlite::database::Sqlite>>
             at ./.cargo/git/checkouts/sqlx-89fb48fb6611e438-shallow/15ee4ba/sqlx-core/src/rt/mod.rs:68:41
  13: drop<sqlx_sqlite::database::Sqlite>
             at ./.cargo/git/checkouts/sqlx-89fb48fb6611e438-shallow/15ee4ba/sqlx-core/src/pool/connection.rs:169:13
  14: drop_in_place<sqlx_core::pool::connection::PoolConnection<sqlx_sqlite::database::Sqlite>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs:542:1
  15: drop_in_place<sea_orm::database::db_connection::InnerConnection>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs:542:1
  16: drop_in_place<core::cell::UnsafeCell<sea_orm::database::db_connection::InnerConnection>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs:542:1
  17: drop_in_place<futures_util::lock::mutex::Mutex<sea_orm::database::db_connection::InnerConnection>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs:542:1
  18: drop_slow<futures_util::lock::mutex::Mutex<sea_orm::database::db_connection::InnerConnection>, alloc::alloc::Global>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/alloc/src/sync.rs:1844:18
  19: drop<futures_util::lock::mutex::Mutex<sea_orm::database::db_connection::InnerConnection>, alloc::alloc::Global>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/alloc/src/sync.rs:2531:13
  20: drop_in_place<alloc::sync::Arc<futures_util::lock::mutex::Mutex<sea_orm::database::db_connection::InnerConnection>, alloc::alloc::Global>>
             at /rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs:542:1
...

The origin of the panic is here: https://github.com/kwannoel/sqlx/blob/15ee4ba0eedc269540e42c6b4437c10de7325ed7/sqlx-core/src/pool/connection.rs#L166-L171:

impl<DB: Database> Drop for PoolConnection<DB> {
    fn drop(&mut self) {
        // We still need to spawn a task to maintain `min_connections`.
        if self.live.is_some() || self.pool.options.min_connections > 0 {
            crate::rt::spawn(self.return_to_pool());
        }
    }
}

This calls rt::spawn https://github.com/kwannoel/sqlx/blob/15ee4ba0eedc269540e42c6b4437c10de7325ed7/sqlx-core/src/rt/mod.rs#L61-L78:

pub fn spawn<F>(fut: F) -> JoinHandle<F::Output>
where
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    #[cfg(feature = "_rt-tokio")]
    if let Ok(handle) = tokio::runtime::Handle::try_current() {
        return JoinHandle::Tokio(handle.spawn(fut)); // <----------- Panics here.
    }


    #[cfg(feature = "_rt-async-std")]
    {
        return JoinHandle::AsyncStd(async_std::task::spawn(fut));
    }


    #[cfg(not(feature = "_rt-async-std"))]
    missing_rt(fut)
}

That calls:

#[track_caller]
pub fn spawn<F>(future: F) -> JoinHandle<F::Output>
where
F: Future + Send + 'static,
F::Output: Send + 'static,
{
Spawner::current().spawn(future)
}

Which calls:

fn current() -> Self {
let info = crate::context::current_task();
let sender = crate::context::current(|h| h.task.sender.clone());
Spawner {
sender,
info: info.node.clone(),
}
}

And finally calls:

pub(crate) fn current_task() -> Arc<TaskInfo> {
TASK.with(|task| task.borrow().clone().expect(MSG))
}

It then panics, since there's no current task.

My modification of try_current did not manage to fix it #227. Even if we modified try_current to properly check the task context, it will still be missing and just return an error instead of panicking. So we will be unable to spawn a thread to run the future still.

I realize it's quite similar to existing issues in madsim, specifically when spawn is called during a drop, similar to #182. So I'm trying to understand if this is a known problem, and if so, what is the exact root cause.

Reproducible example: #229

A list of related issues I found:

Thanks in advance for your advice 🙏

@fuyufjh
Copy link

fuyufjh commented Oct 9, 2024

This seems to be the most related explanation: #182 (comment)

The reason of this panic is that madsim needs to know which node is spawning the task. But when a task is dropping, the global context is empty. In other words, madsim doesn't run any task at that time, so it can't know who is spawning the task. This seems to be a bug since this behavior is normal on tokio. But I don't have a clear solution yet.

As the bug only happens in drop(), I guess the Handler or some context is incorrectly dropped before our drop(). Need further investigation.

@fuyufjh
Copy link

fuyufjh commented Oct 9, 2024

Kindly ask for assistance from the project author @wangrunji0408

@kwannoel
Copy link
Contributor Author

kwannoel commented Oct 10, 2024

2022-04-16T11:16:45.000000Z  INFO madsim::sim::runtime::context: enter task with system thread: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO node{id=0 name="main"}:task{id=0}: madsim::sim::task: system thread of run_all_ready begin: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO node{id=0 name="main"}:task{id=0}: madsim_tokio::sim::runtime::tests: system thread of t1: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO node{id=0 name="main"}:task{id=0}: madsim::sim::runtime::context: system thread of current_task: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO node{id=0 name="main"}:task{id=0}: madsim::sim::runtime::context: drop task: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO madsim::sim::task: system thread of run_all_ready done: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO madsim_tokio::sim::runtime::tests: system thread of t3 drop: ThreadId(2)
2022-04-16T11:16:45.000000Z  INFO madsim::sim::runtime::context: system thread of current_task: ThreadId(2)

After doing some tracing, here's some further finding. The dropped task does not run on any executor node, so we don't see the attached span. Therefore it does not enter the task context, and we get "no runtime" error. Will continue investigating.

(A side note, we can make the error messages more specific, rather than no runtime, we can say "task context not found", but perhaps this is inconsistent with tokio's errors.)

@kwannoel
Copy link
Contributor Author

kwannoel commented Oct 10, 2024

In relation to this specific testcase:

    fn runtime_drop_and_spawn_struct_fail_with_try_current_task() {
        init_logger();

        #[derive(Debug)]
        struct A {}
        impl Drop for A {
            fn drop(&mut self) {
                let bt = Backtrace::force_capture();
                tracing::info!("backtrace of t3 drop: {:#?}", bt);
                tracing::info!("system thread of t3 drop: {:?}", std::thread::current().id());
                if let Some(handle) = Handle::try_current().ok() {
                    handle.spawn(async {
                        std::future::pending::<()>().await
                    });
                }
            }
        }

        let a = A {};
        let runtime = madsim::runtime::Runtime::new();
        runtime.block_on(async move {
            tracing::info!("system thread of t1: {:?}", std::thread::current().id());
            let rt = Runtime::new().unwrap();
            let join_handle = rt.spawn(async move {
                tracing::info!("system thread of t2: {:?}", std::thread::current().id());
                loop { &a; }
            });
            drop(rt);

            let err = join_handle.await.unwrap_err();
            assert!(err.is_cancelled());
        })
    }

The major difference is that we enter the runtime context per task in madsim.

let _guard = crate::context::enter_task(info.clone());
std::panic::catch_unwind(move || runnable.run())

In tokio however, we always enter the runtime context when we block_on the runtime, so it will always be accessible by spawned tasks. The runtime context is not on a task level.
https://github.com/tokio-rs/tokio/blob/679d7657dc267a4d6472597c91cdda8792cf9e97/tokio/src/runtime/scheduler/multi_thread/mod.rs#L86-L88

Now consider that destructors for Runnable (and hence the drop impl for A in our testcase) are ran after it goes out of scope. If we look at the backtrace for the Drop test case, we can indeed see it goes out of scope in the next iteration of the loop:

Backtrace:

...
    { fn: "<madsim_tokio::sim::runtime::tests::runtime_drop_and_spawn_struct_fail_with_try_current_task::A as core::ops::drop::Drop>::drop", file: "./src/sim/runtime.rs", line: 205 },
    { fn: "core::ptr::drop_in_place<madsim_tokio::sim::runtime::tests::runtime_drop_and_spawn_struct_fail_with_try_current_task::A>", file: "/rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs", line: 542 },
...
    { fn: "<async_task::runnable::Runnable<M> as core::ops::drop::Drop>::drop", file: "/Users/noelkwan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.7.1/src/runnable.rs", line: 920 },
    { fn: "core::ptr::drop_in_place<async_task::runnable::Runnable<alloc::sync::Weak<madsim::sim::task::TaskInfo>>>", file: "/rustc/5affbb17153bc69a9d5d8d2faa4e399a014a211e/library/core/src/ptr/mod.rs", line: 542 },
    { fn: "madsim::sim::task::Executor::run_all_ready",
file: "/Users/noelkwan/projects/madsim/madsim/src/sim/task/mod.rs",
 line: 319 },
^^^^^^^^^^^^  this line (ignore 319, there's some local changes, it coresponds to L315 above)
    { fn: "madsim::sim::task::Executor::block_on", file: "/Users/noelkwan/projects/madsim/madsim/src/sim/task/mod.rs", line: 242 },
    { fn: "madsim::sim::runtime::Runtime::block_on", file: "/Users/noelkwan/projects/madsim/madsim/src/sim/runtime/mod.rs", line: 129 },

At this specific part, it is obvious that there is no task context entered, as it is only entered in L279-L280! So it clearly explains why there is no task context.

Now the root cause is known, we need to decide how to solve it.

@BugenZhao
Copy link
Contributor

By explicitly adding a drop(runnable), we can see that the future is dropped here:

if info.cancelled.load(Ordering::Relaxed) || info.node.killed.load(Ordering::Relaxed) {
// cancelled task or killed node: drop the future
continue;

Also we don't need a fake madsim_tokio runtime to trigger this issue. Simply aborting the task will lead to panic:

fn runtime_drop_bugen() {
    #[derive(Debug)]
    struct A;
    impl Drop for A {
        fn drop(&mut self) {
            Handle::current().spawn(std::future::pending::<()>());
        }
    }

    let a = A;
    let runtime = madsim::runtime::Runtime::new();
    runtime.block_on(async move {
        let join_handle = madsim::task::spawn(async move { drop(a) });
        join_handle.abort();

        let err = join_handle.await.unwrap_err();
        assert!(err.is_cancelled());
    })
}

@BugenZhao
Copy link
Contributor

Can we simply call enter_task with the TaskInfo of the task to be dropped while dropping it? Is there anything blocking us from doing this?

@kwannoel

This comment was marked as resolved.

@kwannoel
Copy link
Contributor Author

kwannoel commented Oct 10, 2024

See: #231 for a solution by @BugenZhao.

Bugen took #228 (comment) further, and found that it's actually triggered inside the block which handles cancelled / killed nodes (#231 (comment)). We can abort the future since the node was killed or the task itself was cancelled.

Importantly, it also ensures that Runnable will not be dropped without a task context, if it has a future, because the future may contain some struct, which has a drop with spawn impl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants