Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subclass + GC is flaky on free-threaded build #4627

Open
davidhewitt opened this issue Oct 18, 2024 · 3 comments
Open

subclass + GC is flaky on free-threaded build #4627

davidhewitt opened this issue Oct 18, 2024 · 3 comments

Comments

@davidhewitt
Copy link
Member

We have some flaky issue in CI on the freethreaded Python with test_traverse_subclass and test_traverse_subclass_override_clear:

# exact line number can vary depending on source changes
thread 'test_traverse_subclass' panicked at tests/test_gc.rs:640:9:
assertion failed: drop_called.load(Ordering::Relaxed)

I can reproduce this locally (with a freethreaded Python):

# eventually this fails and the while loop stops
while UNSAFE_PYO3_BUILD_FREE_THREADED=1 cargo test --test test_gc; do :; done

It seems to be the case that the combination of PyO3 inheritance and freethreaded Python and GC has a subtle edge case. Needs further investigation. I spent some time looking at this today, but I didn't come to a clear conclusion.

I think it's a rare case, it only affects subclassing on the freethreaded build, and I think the resulting issue is just a memory leak. I think therefore it's fine to not block the 0.23 release on this and to instead investigate later.

@ngoldbaum
Copy link
Contributor

ngoldbaum commented Oct 22, 2024

Here's a weird thing. If I make the following modification:

diff --git a/tests/test_gc.rs b/tests/test_gc.rs
index 9483819c..e78755d3 100644
--- a/tests/test_gc.rs
+++ b/tests/test_gc.rs
@@ -66,7 +66,7 @@ impl DropCheck {
     #[track_caller]
     fn assert_drops_with_gc(&self, object: *mut pyo3::ffi::PyObject) {
         // running the GC might take a few cycles to collect an object
-        for _ in 0..100 {
+        for i in 0..100 {
             if self.0.is_completed() {
                 return;
             }
@@ -77,6 +77,9 @@ impl DropCheck {
             });
             #[cfg(Py_GIL_DISABLED)]
             {
+                if i == 99 {
+                    dbg!("hello!");
+                }
                 // on the free-threaded build, the GC might be running in a se>
                 // some time for this
                 std::thread::sleep(std::time::Duration::from_millis(5));
@@ -627,12 +630,9 @@ fn test_traverse_subclass() {
             check.assert_not_dropped();
         }

-        #[cfg(not(Py_GIL_DISABLED))]
-        {
-            // FIXME: seems like a bug that this is flaky on the free-threaded>
-            // https://github.com/PyO3/pyo3/issues/4627
-            check.assert_drops_with_gc(ptr);
-        }
+        // FIXME: seems like a bug that this is flaky on the free-threaded bui>
+        // https://github.com/PyO3/pyo3/issues/4627
+        check.assert_drops_with_gc(ptr);

And then if I put a breakpoint in the dbg!() call, it looks like all the other worker threads have completed, and for whatever reason drop just never gets called, even when the GC has an opportunity to drop it.

Here are the stack traces, the main thread is parked and the only worker thread is stuck waiting for drop to be called:

(lldb) bt all
  thread #1, name = 'main', queue = 'com.apple.main-thread'
    frame #0: 0x0000000183b24d68 libsystem_kernel.dylib`semaphore_timedwait_trap + 8
    frame #1: 0x00000001839b5a00 libdispatch.dylib`_dispatch_sema4_timedwait + 64
    frame #2: 0x00000001839b5ffc libdispatch.dylib`_dispatch_semaphore_wait_slow + 76
    frame #3: 0x00000001000a8974 test_gc-66d106ef29e58418`std::thread::park_timeout::h3094a03c38ba3b52 [inlined] std::sys::sync::thread_parking::darwin::Parker::park_timeout::h96aa039ec5b0172e at darwin.rs:95:23 [opt]
    frame #4: 0x00000001000a8920 test_gc-66d106ef29e58418`std::thread::park_timeout::h3094a03c38ba3b52 at mod.rs:1193:9 [opt]
    frame #5: 0x000000010002df28 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::_$u7b$$u7b$closure$u7d$$u7d$::hc33f0b5eb0ca8722 [inlined] std::sync::mpmc::context::Context::wait_until::hcdc22f8b88ad8315 at context.rs:130:21 [opt]
    frame #6: 0x000000010002dee0 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::_$u7b$$u7b$closure$u7d$$u7d$::hc33f0b5eb0ca8722 at list.rs:448:27 [opt]
    frame #7: 0x000000010002db40 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::h91c339d4d40124d9 [inlined] std::sync::mpmc::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h7f4807ad92eb3c83 at context.rs:50:13 [opt]
    frame #8: 0x000000010002db30 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::h91c339d4d40124d9 [inlined] std::sync::mpmc::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h86c2d1298e1347fc at context.rs:58:31 [opt]
    frame #9: 0x000000010002db0c test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::h91c339d4d40124d9 [inlined] std::thread::local::LocalKey$LT$T$GT$::try_with::h68753fd7897fe468 at local.rs:283:12 [opt]
    frame #10: 0x000000010002daf8 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::h91c339d4d40124d9 [inlined] std::sync::mpmc::context::Context::with::hfc199ee25f643040 at context.rs:53:9 [opt]
    frame #11: 0x000000010002daf8 test_gc-66d106ef29e58418`std::sync::mpmc::list::Channel$LT$T$GT$::recv::h91c339d4d40124d9 at list.rs:438:13 [opt]
    frame #12: 0x00000001000417ac test_gc-66d106ef29e58418`test::console::run_tests_console::h5a9ae9f7721cc08b [inlined] std::sync::mpmc::Receiver$LT$T$GT$::recv_deadline::h6644a5598e1ffbaa at mod.rs:340:43 [opt]
    frame #13: 0x00000001000417a0 test_gc-66d106ef29e58418`test::console::run_tests_console::h5a9ae9f7721cc08b [inlined] std::sync::mpmc::Receiver$LT$T$GT$::recv_timeout::h45357c38d9f774de at mod.rs:323:31 [opt]
    frame #14: 0x000000010004179c test_gc-66d106ef29e58418`test::console::run_tests_console::h5a9ae9f7721cc08b [inlined] std::sync::mpsc::Receiver$LT$T$GT$::recv_timeout::hb06ff65321947cd4 at mod.rs:909:9 [opt]
    frame #15: 0x000000010004179c test_gc-66d106ef29e58418`test::console::run_tests_console::h5a9ae9f7721cc08b [inlined] test::run_tests::hbf9ceb03867fb34f at lib.rs:417:30 [opt]
    frame #16: 0x0000000100040e44 test_gc-66d106ef29e58418`test::console::run_tests_console::h5a9ae9f7721cc08b at console.rs:322:5 [opt]
    frame #17: 0x0000000100056ad8 test_gc-66d106ef29e58418`test::test_main::hdc33dc7d1f92ce28 at lib.rs:141:19 [opt]
    frame #18: 0x00000001000576c0 test_gc-66d106ef29e58418`test::test_main_static::hdd8bc03d990864c2 at lib.rs:163:5 [opt]
    frame #19: 0x00000001000225e4 test_gc-66d106ef29e58418`test_gc::main::he50a1009f849a57a at test_gc.rs:1:1
    frame #20: 0x000000010000c088 test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once::h5a687af0df0cfd93((null)=(test_gc-66d106ef29e58418`test_gc::main::he50a1009f849a57a at test_gc.rs:1), (null)=<unavailable>) at function.rs:250:5
    frame #21: 0x00000001000169b0 test_gc-66d106ef29e58418`std::sys::backtrace::__rust_begin_short_backtrace::hd16621ff12f246dc(f=(test_gc-66d106ef29e58418`test_gc::main::he50a1009f849a57a at test_gc.rs:1)) at backtrace.rs:152:18
    frame #22: 0x00000001000178f0 test_gc-66d106ef29e58418`std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hecd052d19fcfe11c at rt.rs:162:18
    frame #23: 0x00000001000a7f90 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::h4f74490c6170ea16 at function.rs:284:13 [opt]
    frame #24: 0x00000001000a7f88 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panicking::try::do_call::h2f36d2f1f1af8d28 at panicking.rs:557:40 [opt]
    frame #25: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panicking::try::ha6af9029a7d93c94 at panicking.rs:521:19 [opt]
    frame #26: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panic::catch_unwind::ha4f738ae2ba7c3a4 at panic.rs:350:14 [opt]
    frame #27: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::hf216622dc2c733e3 at rt.rs:141:48 [opt]
    frame #28: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panicking::try::do_call::haa691957db1dd55f at panicking.rs:557:40 [opt]
    frame #29: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panicking::try::ha0c1a49b9fabc98e at panicking.rs:521:19 [opt]
    frame #30: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 [inlined] std::panic::catch_unwind::h68ad032c646bb0ab at panic.rs:350:14 [opt]
    frame #31: 0x00000001000a7f84 test_gc-66d106ef29e58418`std::rt::lang_start_internal::hdd117cb81a316264 at rt.rs:141:20 [opt]
    frame #32: 0x00000001000178bc test_gc-66d106ef29e58418`std::rt::lang_start::hc72436f0cd48603c(main=(test_gc-66d106ef29e58418`test_gc::main::he50a1009f849a57a at test_gc.rs:1), argc=1, argv=0x000000016fdff118, sigpipe='\0') at rt.rs:161:17
    frame #33: 0x0000000100022610 test_gc-66d106ef29e58418`main + 36
    frame #34: 0x00000001837db154 dyld`start + 2476
* thread #2, name = 'test_traverse_subclass', stop reason = breakpoint 1.1
  * frame #0: 0x000000010002018c test_gc-66d106ef29e58418`test_gc::DropCheck::assert_drops_with_gc::h8772b9cd98bdc362(self=0x00000001710667a0, object=0x00000200040501c0) at test_gc.rs:81:21
    frame #1: 0x0000000100007ac0 test_gc-66d106ef29e58418`test_gc::test_traverse_subclass::_$u7b$$u7b$closure$u7d$$u7d$::h0cce1a156679a28d((null)=(__0 = core::marker::PhantomData<(&pyo3::gil::GILGuard, pyo3::impl_::not_send::NotSend)> @ 0x00000001710666b7)) at test_gc.rs:635:9
    frame #2: 0x00000001000253e8 test_gc-66d106ef29e58418`pyo3::marker::Python::with_gil::h2fecf6d10d403869(f={closure_env#0} @ 0x0000000171066740) at marker.rs:410:9
    frame #3: 0x0000000100022214 test_gc-66d106ef29e58418`test_gc::test_traverse_subclass::hd90c145a15d6a426 at test_gc.rs:612:5
    frame #4: 0x0000000100007818 test_gc-66d106ef29e58418`test_gc::test_traverse_subclass::_$u7b$$u7b$closure$u7d$$u7d$::h426445ac18904e7e((null)=0x00000001710667fe) at test_gc.rs:597:28
    frame #5: 0x000000010000c4f4 test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once::headc3e6da91f44ef((null)={closure_env#0} @ 0x00000001710667fe, (null)=<unavailable>) at function.rs:250:5
    frame #6: 0x000000010005af90 test_gc-66d106ef29e58418`test::__rust_begin_short_backtrace::hc730144174a2f2b8 [inlined] core::ops::function::FnOnce::call_once::h0ba6e3d0adcb0fb8 at function.rs:250:5 [opt]
    frame #7: 0x000000010005af88 test_gc-66d106ef29e58418`test::__rust_begin_short_backtrace::hc730144174a2f2b8 at lib.rs:624:18 [opt]
    frame #8: 0x000000010005a8e4 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] test::run_test_in_process::_$u7b$$u7b$closure$u7d$$u7d$::h9e544e0587f41be2 at lib.rs:647:60 [opt]
    frame #9: 0x000000010005a8d8 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h8389036b3da9abb4 at unwind_safe.rs:272:9 [opt]
    frame #10: 0x000000010005a8d8 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] std::panicking::try::do_call::h6c15d214f8b0efc9 at panicking.rs:557:40 [opt]
    frame #11: 0x000000010005a8d4 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] std::panicking::try::h4dec6d151c5d7c52 at panicking.rs:521:19 [opt]
    frame #12: 0x000000010005a8d4 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] std::panic::catch_unwind::h08246350d14b78e5 at panic.rs:350:14 [opt]
    frame #13: 0x000000010005a8d4 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 [inlined] test::run_test_in_process::h5176f0d7330017af at lib.rs:647:27 [opt]
    frame #14: 0x000000010005a850 test_gc-66d106ef29e58418`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h3dade545f948edf5 at lib.rs:568:43 [opt]
    frame #15: 0x000000010002ca10 test_gc-66d106ef29e58418`std::sys::backtrace::__rust_begin_short_backtrace::h927c17d76a6dfbc6 [inlined] test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h331e167c3eb94f21 at lib.rs:598:41 [opt]
    frame #16: 0x000000010002c988 test_gc-66d106ef29e58418`std::sys::backtrace::__rust_begin_short_backtrace::h927c17d76a6dfbc6 at backtrace.rs:152:18 [opt]
    frame #17: 0x000000010002fb74 test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h9a08c487421e7042 at mod.rs:538:17 [opt]
    frame #18: 0x000000010002fb6c test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hf6a2344f0bd24956 at unwind_safe.rs:272:9 [opt]
    frame #19: 0x000000010002fb6c test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] std::panicking::try::do_call::hfdb1a93845faf3ef at panicking.rs:557:40 [opt]
    frame #20: 0x000000010002fb6c test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] std::panicking::try::h4b35ce5ad8a162fd at panicking.rs:521:19 [opt]
    frame #21: 0x000000010002fb6c test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] std::panic::catch_unwind::h3bb4d3ee2986e761 at panic.rs:350:14 [opt]
    frame #22: 0x000000010002fb6c test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 [inlined] std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h5e36dd32c0d26255 at mod.rs:537:30 [opt]
    frame #23: 0x000000010002faf8 test_gc-66d106ef29e58418`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4b1e8564c6e52dc1 at function.rs:250:5 [opt]
    frame #24: 0x00000001000b4400 test_gc-66d106ef29e58418`std::sys::pal::unix::thread::Thread::new::thread_start::h1bd1b9c95010bf71 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h01276ebbe54a8110 at boxed.rs:2070:9 [opt]
    frame #25: 0x00000001000b43f4 test_gc-66d106ef29e58418`std::sys::pal::unix::thread::Thread::new::thread_start::h1bd1b9c95010bf71 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::habad1faa89d23086 at boxed.rs:2070:9 [opt]
    frame #26: 0x00000001000b43f0 test_gc-66d106ef29e58418`std::sys::pal::unix::thread::Thread::new::thread_start::h1bd1b9c95010bf71 at thread.rs:108:17 [opt]
    frame #27: 0x0000000183b65f94 libsystem_pthread.dylib`_pthread_start + 136

If I break on the first or second sleep, I get a much more reasonable stack trace where multiple worker threads are still running.

@ngoldbaum
Copy link
Contributor

Created an upstream bug, hopefully one of the CPython devs can help us figure out if this is a PyO3 or CPython bug.

python/cpython#125853

@ngoldbaum
Copy link
Contributor

The upstream issue is fixed so we can close this and remove the fixme comments once 3.13.1 comes out. We should also maybe look closely at the other skipped asserts since they might have also been caused by GC corruption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants