syzkaller: crashes caused by the programs finished long ago? #5297

a-nogikh · 2024-09-11T12:31:30Z

With a small hacky patch, one can see that in quite a number of cases kernel panics mention Comm: syz.PROC.ID of the programs executed minutes before the crash.

On my local syzkaller instance, most of such cases are INFO: task hung, but there are also rcu stalls, WARNING and even KASAN reports.

Currently, we include the last 6 executed programs per proc in the crash log, while the IDs mentioned in the Comm: field are hundreds of programs older than those last executed IDs.

Does this happen because we lose track of some forked syz-executor child processes? Or were these processes actually killed and these are just some residual pieces of information in the kernel?
In theory, we could keep track of the last few hundred executed programs per VM and then append the serialized program named in Comm: to the crash log. That should (hopefully) increase the bug reproduction rate, but it will also cost more memory. Is it worth it?
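A minimal sketch of such a bounded per-VM history, to make the memory trade-off concrete. Everything here is hypothetical and not part of syzkaller: `ProgramHistory`, `ExtractSyzComm`, and the choice to key entries by the full `syz.PROC.ID` string are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical bounded history of executed programs for one VM.
// Entries are keyed by the task name ("syz.PROC.ID") the program ran under.
class ProgramHistory {
public:
	explicit ProgramHistory(size_t capacity) : capacity_(capacity) {}

	// Remember a serialized program; evict the oldest entry once full.
	void Add(const std::string& comm, std::string serialized)
	{
		if (capacity_ == 0)
			return;
		if (programs_.count(comm) == 0) {
			if (order_.size() == capacity_) {
				programs_.erase(order_.front());
				order_.pop_front();
			}
			order_.push_back(comm);
		}
		programs_[comm] = std::move(serialized);
	}

	// Look up the program that ran under the given task name, if still kept.
	std::optional<std::string> Find(const std::string& comm) const
	{
		auto it = programs_.find(comm);
		if (it == programs_.end())
			return std::nullopt;
		return it->second;
	}

private:
	size_t capacity_;
	std::deque<std::string> order_; // insertion order, oldest first
	std::unordered_map<std::string, std::string> programs_;
};

// Extract a "syz.*" task name from a kernel log line containing "Comm: ".
std::optional<std::string> ExtractSyzComm(const std::string& line)
{
	const std::string marker = "Comm: ";
	size_t pos = line.find(marker);
	if (pos == std::string::npos)
		return std::nullopt;
	pos += marker.size();
	size_t end = line.find(' ', pos);
	size_t len = end == std::string::npos ? std::string::npos : end - pos;
	std::string comm = line.substr(pos, len);
	if (comm.rfind("syz.", 0) != 0)
		return std::nullopt; // not a syzkaller program task
	return comm;
}
```

With a capacity of a few hundred entries per VM, memory use is bounded by the capacity times the typical serialized program size, which is the cost to weigh against the expected gain in reproduction rate.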


a-nogikh · 2024-09-13T16:39:51Z

From a discussion with @dvyukov:

This can be due to the parent syz-executor process killing the runner sub-process

syzkaller/executor/executor_runner.h

Lines 102 to 105 in 158f485

 if (now > exec_start_ + timeout) {
     Restart();
     return;
 }

before the sub-process has finished killing and waiting for its fork that was actually executing the program (probably because that fork was stuck in syscall context).

syzkaller/executor/common.h

Lines 715 to 717 in 158f485

 debug("killing hanging pid %d\n", pid);
 kill_and_wait(pid, &status);
 break;

We should try to stop killing the child runner process if it's already begun to execute a program:

syzkaller/executor/executor_runner.h

Line 88 in 158f485

if (state_ == State::Handshaking || state_ == State::Executing) {

But we need to add some monitoring/stats collection to ensure it has not caused any regressions.
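The idea can be modeled in isolation as follows. This is a simplified, hypothetical sketch and not the actual `executor_runner.h` code: `RunnerModel`, `State::Idle`, `Begin`, and the counters are invented names, and time is a plain millisecond integer. A timed-out runner is restarted only while handshaking; a runner that is already executing is left alone (on the assumption that it will kill its own hanging fork), and the overrun is only counted so that regressions can be monitored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state set; only Handshaking and Executing appear in the quoted code.
enum class State { Idle, Handshaking, Executing };

struct RunnerModel {
	State state = State::Idle;
	int64_t exec_start_ms = 0;
	int64_t timeout_ms = 0;
	uint64_t restarts = 0;           // timed-out handshakes that caused a restart
	uint64_t overdue_executions = 0; // executions that outlived the timeout
	bool overdue_counted = false;

	// Enter a timed state (Handshaking or Executing).
	void Begin(State s, int64_t now_ms, int64_t timeout)
	{
		state = s;
		exec_start_ms = now_ms;
		timeout_ms = timeout;
		overdue_counted = false;
	}

	// Periodic check. Returns true if the runner was restarted.
	bool CheckTimeout(int64_t now_ms)
	{
		if (state != State::Handshaking && state != State::Executing)
			return false;
		if (now_ms <= exec_start_ms + timeout_ms)
			return false;
		if (state == State::Executing) {
			// Do not kill the runner: record the overrun once per execution.
			if (!overdue_counted) {
				overdue_executions++;
				overdue_counted = true;
			}
			return false;
		}
		// A handshake timeout is still handled by a restart.
		restarts++;
		state = State::Idle;
		return true;
	}
};
```

Exporting a counter like `overdue_executions` as a stat would show how often, and for how long, procs stay blocked once the parent stops restarting them.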

a-nogikh · 2024-09-17T22:04:57Z

Local experiment (3 days uptime as of now). Two instances, 12 VMs each, 3 procs per VM.

Upstream syzkaller: 3.6M execs, 176 crash types, 58 C repros and 8 syz repros (66 total, 66/176=37%)
Patched (*): 2.4M execs, 145 crash types, 71 C repros and 8 syz repros (79 total, 79/145=54%)

(*) Make timed out runners Restart() only for state_ == State::Handshaking.

So it does improve the bug reproduction rate by a lot (especially noticeable for INFO: task hung bugs), but it has slowed down fuzzing by 1/3. Could it mean that, on each VM, at least one proc (of the three total) was hung or blocked most of the time?
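A rough back-of-the-envelope check of that hypothesis, under the assumption (not verified here) that all three procs contribute equally to throughput and that one proc per VM is blocked for a fraction $f$ of the time while the other two run unaffected:

$$
\frac{E_{\mathrm{patched}}}{E_{\mathrm{upstream}}} = \frac{2.4\,\mathrm{M}}{3.6\,\mathrm{M}} = \frac{2}{3},
\qquad
\frac{3 - f}{3} = \frac{2}{3} \;\Rightarrow\; f = 1
$$

Under this simple model, the observed slowdown corresponds to the equivalent of one proc per VM being blocked essentially all of the time, on average.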

a-nogikh added the bug label Sep 11, 2024
