syzkaller: crashes caused by the programs finished long ago? #5297

Open
a-nogikh opened this issue Sep 11, 2024 · 2 comments
@a-nogikh
Collaborator

With a small hacky patch, one can see that in quite a number of cases the Comm: field of a kernel crash report names syz.PROC.ID for a program that was executed minutes before the crash.

On my local syzkaller instance, most such cases are INFO: task hung, but there are also RCU stalls, WARNINGs and even KASAN reports.

Currently, we include the last 6 executed programs for each proc in the crash log, while the IDs mentioned in the Comm: field are hundreds of programs older than those last executed IDs.

  1. Does this happen because we lose track of some forked syz-executor child processes? Or were these processes actually killed and these are just some residual pieces of information in the kernel?
  2. In theory, we could keep track of the last few hundred executed programs for each VM and then append the serialized program named in Comm: to the crash log. That should (hopefully) increase the bug reproduction rate, but it will also cost more memory. Is it worth it?
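
The idea in item 2 could be sketched roughly as a bounded per-VM history that is queried with the proc and ID parsed from the Comm: field. This is only a minimal sketch under stated assumptions: ProgHistory, Record, Find and ParseComm are hypothetical names and not existing syzkaller code, and the syz.PROC.ID layout is taken from the description above. The capacity bounds the extra memory, which is the trade-off the question is about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <string>
#include <utility>

// Hypothetical bounded history of the programs executed on one VM.
class ProgHistory {
public:
    explicit ProgHistory(size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Remember a serialized program; the oldest entry is dropped when full.
    void Record(int proc, uint64_t id, std::string serialized) {
        if (entries_.size() == capacity_)
            entries_.pop_front();
        entries_.push_back({proc, id, std::move(serialized)});
    }

    // Return the program for (proc, id), or nullptr if it was already evicted.
    // The pointer is only valid until the next call to Record().
    const std::string* Find(int proc, uint64_t id) const {
        for (auto it = entries_.rbegin(); it != entries_.rend(); ++it) {
            if (it->proc == proc && it->id == id)
                return &it->serialized;
        }
        return nullptr;
    }

private:
    struct Entry {
        int proc;
        uint64_t id;
        std::string serialized;
    };
    size_t capacity_;
    std::deque<Entry> entries_;
};

// Parse a "syz.PROC.ID" task name. Returns false for any other name.
bool ParseComm(const std::string& comm, int* proc, uint64_t* id) {
    unsigned long long p = 0, i = 0;
    int consumed = 0;
    if (std::sscanf(comm.c_str(), "syz.%llu.%llu%n", &p, &i, &consumed) != 2)
        return false;
    if (static_cast<size_t>(consumed) != comm.size())
        return false; // trailing characters after the ID
    *proc = static_cast<int>(p);
    *id = i;
    return true;
}
```

On a crash, the reporting side would call ParseComm on the task name from the report and, if Find returns a program, append it to the crash log next to the last 6 programs per proc.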
@a-nogikh a-nogikh added the bug label Sep 11, 2024
@a-nogikh
Collaborator Author

From a discussion with @dvyukov:

This can be due to the parent syz-executor process killing the runner sub-process

if (now > exec_start_ + timeout) {
    Restart();
    return;
}

before the sub-process has finished killing and waiting for its fork that was actually executing the program (probably because that fork was stuck in syscall context).

syzkaller/executor/common.h

Lines 715 to 717 in 158f485

debug("killing hanging pid %d\n", pid);
kill_and_wait(pid, &status);
break;

We should try not to kill the child runner process once it has begun executing a program:

if (state_ == State::Handshaking || state_ == State::Executing) {

But we need to add some monitoring/stats collection to ensure this does not cause any regressions.
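
A minimal sketch of what the changed timeout check plus stats collection might look like. It is not the actual executor code: State here contains only the two values quoted above, and Action, TimeoutStats and OnTimeoutCheck are hypothetical names introduced for illustration. The assumption is that a timed-out runner is restarted only while handshaking, and that timeouts during execution are merely counted so that regressions become visible.

```cpp
#include <cassert>
#include <cstdint>

// Only the two states mentioned in the snippet above; the real enum has more.
enum class State { Handshaking, Executing };

// Hypothetical outcome of one timeout check.
enum class Action { None, Restart, KeepWaiting };

// Hypothetical counters that could be exported as monitoring stats.
struct TimeoutStats {
    uint64_t restarts = 0; // runner was killed and restarted
    uint64_t deferred = 0; // timeout hit while executing, runner left alone
};

// Decide what the parent should do with a runner at time `now`.
Action OnTimeoutCheck(State state, uint64_t now, uint64_t exec_start,
                      uint64_t timeout, TimeoutStats* stats) {
    assert(stats != nullptr);
    if (now <= exec_start + timeout)
        return Action::None; // not timed out yet
    if (state == State::Handshaking) {
        stats->restarts++;
        return Action::Restart;
    }
    // Executing: let the runner finish killing and waiting for its fork.
    stats->deferred++;
    return Action::KeepWaiting;
}
```

Comparing the deferred counter against the total number of executions would show how often a runner stays blocked past its timeout, which is the kind of regression signal the change needs.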

@a-nogikh
Collaborator Author

Local experiment (3 days uptime as of now). Two instances, 12 VMs each, 3 procs per VM.

  • Upstream syzkaller: 3.6M execs, 176 crash types, 58 C repros and 8 syz repros (66 total, 66/176=37.5%)
  • Patched (*): 2.4M execs, 145 crash types, 71 C repros and 8 syz repros (79 total, 79/145≈54.5%)

(*) Make timed-out runners Restart() only for state_ == State::Handshaking.

So it does improve the bug reproduction rate a lot (especially for INFO: task hung bugs), but it has slowed fuzzing down by 1/3. Could this mean that on each VM at least one proc (of the three) was blocked on a hung program most of the time?
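
The numbers above can be checked with a few lines of arithmetic. This is a deliberately simplistic model, not a measurement: it assumes all procs on a VM contribute equally to throughput and that a blocked proc contributes nothing. ReproRate and ThroughputFraction are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Share of crash types for which a reproducer (C or syz) was found.
double ReproRate(int repros, int crash_types) {
    assert(crash_types > 0);
    return static_cast<double>(repros) / crash_types;
}

// Fraction of the original throughput that remains when `blocked` of
// `procs` equally fast procs are stuck all the time.
double ThroughputFraction(int procs, int blocked) {
    assert(procs > 0 && blocked >= 0 && blocked <= procs);
    return static_cast<double>(procs - blocked) / procs;
}
```

Under that model, ThroughputFraction(3, 1) is 2/3, which matches the observed 2.4M / 3.6M execs, so one permanently blocked proc per VM would be consistent with the slowdown, although it does not prove that this is the cause.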
