
I2C should not treat a mux as having been reset if it NACKs or otherwise fails. #1822

Open
cbiffle opened this issue Jul 12, 2024 · 0 comments

cbiffle commented Jul 12, 2024

This is a follow-on to #1821, continuing my hunt for the Gimlet Disk Wumpus.

In that bug I pointed out that we're generating bus resets in the middle of perfectly good transactions for no apparent reason, and doing it in a way that scrambles the I2C controller's state. We should stop doing that, obvs.

However, while working on that fix, I noticed a different bug that's compounding the issue.

tl;dr: we treat a mux as having been reset if it NACKs.

Clues!

I posted this on #1821 but let's look at it anew:

[Screenshot: Saleae Logic capture (2024-07-12-133909_3466x1534_scrot)]

This is at the tail end of some Very Bad Things happening to the I2C neighborhood, and the Saleae protocol decoder is badly out of sync. What it reports as a weird post-NACK 0xC1 byte is actually us attempting to address the mux at 0x70 to reset it. (The wire byte for a write to 0x70 is 0xE0; shift it left one bit because the decoder is desync'd, append the NACK (1), and you get 0xC1.)
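The decoder arithmetic checks out; here's a quick sketch of it (values taken from the capture above):

```rust
// Sanity-check of the desync arithmetic: the wire byte for a write to
// address 0x70 is 0xE0 (7-bit address shifted left, R/W = 0). Slipping
// the decoder by one bit and absorbing the NACK bit (1) yields 0xC1.
fn main() {
    let addr: u8 = 0x70;
    let wire_byte = addr << 1; // 0xE0: address + write bit
    let desynced = (((wire_byte as u16) << 1) | 1) as u8; // one-bit slip + NACK
    assert_eq!(desynced, 0xC1);
    println!("{:#04x}", desynced); // prints 0xc1
}
```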

We don't generate a start condition because shrug emoji. We've totally scrambled the controller at this point, all bets are off. But without a start condition, the mux is certainly not going to ack its address -- it doesn't even know we're speaking to it!

Normally this would be fine -- the transaction fails, we try again -- but this is a mux we're talking to, so things get weird.

We're handling mux errors wrong

The problem is what happens in the driver when a mux nacks. This was being indicated in the ringbuf output during my debugging session by this repeating pattern:

49  262       12        1 ResetMux(0x70)
50  262       12        1 ResetMux(0x71)
51  262       12        1 ResetMux(0x72)
52  178       12        1 MuxMissing(0x70)
53  190       12        1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
54  525       12        1 Error(0x6a, BusReset)

From inspecting the code, I can clarify what this means:

  • ResetMux does not mean the mux has been reset. It means that, if the mux has a GPIO reset line (which it does not on this rev of Gimlet), we have toggled it. It also means we have recorded the mux's state as Unknown.
  • Next time we try to talk to anything that involves mux 0x70 (which means anything on the same segment, IIRC), we see the Unknown state and attempt to put the mux back into a known state, signalling success with MuxUnknownRecover.
  • ...but there's a MuxMissing between the two, which is weird.
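To make the bookkeeping above concrete, here's a minimal sketch of the state machine those ringbuf entries imply. These are hypothetical names modeled on the issue text, not the actual Hubris types:

```rust
// Hypothetical sketch of the mux bookkeeping implied by the ringbuf
// entries; not the real driver types, just the shape of the logic.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MuxState {
    Known,   // we believe we know which segment (if any) is enabled
    Unknown, // set by ResetMux; should only be cleared by a confirmed reconfigure
}

struct Mux {
    address: u8,
    state: MuxState,
}

impl Mux {
    // ResetMux: toggle the GPIO reset line if the board has one (this
    // Gimlet rev does not), and record the state as Unknown.
    fn reset(&mut self) {
        self.state = MuxState::Unknown;
    }
}

fn main() {
    let mut mux = Mux { address: 0x70, state: MuxState::Known };
    mux.reset();
    assert_eq!(mux.state, MuxState::Unknown);
    println!("mux {:#04x} is now {:?}", mux.address, mux.state);
}
```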

The I2C error codes in the driver have names that bear no relation to the I2C standard, which makes them hard to follow. I eventually worked out the data flow, which is as follows:

  • At the lowest level, when the driver observes NACKF set in the status register, it produces an error code NoDevice, which means that the device nacked once.
  • In the stm32xx_i2c crate's I2cMux::error_code function, this gets translated -- for muxes specifically -- into something called MuxMissing. So MuxMissing also means that the device nacked once.
  • If this should happen during a call to configure_mux in the stm32xx_i2c_server crate, the code special-cases MuxMissing and treats it as success. (A comment says this should not happen in production. I'm on a bench, so technically the comment isn't false yet.)
  • Because the mux reconfiguration "succeeded," the driver removes the Unknown state from the mux (preventing further recovery attempts) and records MuxUnknownRecover to trumpet its success.
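The data flow above can be sketched as follows. This is a hypothetical reconstruction with names modeled on the issue text (NoDevice, MuxMissing, configure_mux), not the actual Hubris signatures:

```rust
// Hypothetical reconstruction of the error-code flow described above.
#[derive(Debug, PartialEq)]
enum I2cError {
    NoDevice,   // NACKF observed in the status register: the device nacked once
    MuxMissing, // NoDevice, as translated by I2cMux::error_code for muxes
}

// Sketch of the translation step: a NACK from a mux becomes MuxMissing.
fn mux_error_code(e: I2cError) -> I2cError {
    match e {
        I2cError::NoDevice => I2cError::MuxMissing,
        other => other,
    }
}

// Sketch of the buggy special case in configure_mux: a NACK from the
// mux is treated as success, so the Unknown state gets cleared even
// though the mux never acknowledged the write.
fn configure_mux(result: Result<(), I2cError>) -> bool {
    match result {
        Ok(()) => true,
        Err(I2cError::MuxMissing) => true, // BUG: "should not happen in production"
        Err(_) => false,
    }
}

fn main() {
    let nack = mux_error_code(I2cError::NoDevice);
    assert_eq!(nack, I2cError::MuxMissing);
    // A nacked mux reconfiguration still reports success:
    assert!(configure_mux(Err(nack)));
    println!("nacked reconfigure treated as success: true");
}
```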

In this case, the net effect is that mux 0x70 is left with port 0 enabled (to a disk), and we then go and enable a port on the mux at 0x72, and now we have two hotplug controllers and two disks with the same addresses, wired together. Needless to say, the next transaction fails.

This appears to be happening in practice once every two minutes or so. It's not every time, because the controller-stomping behavior makes it delightfully non-deterministic. I've written a tool that reads Saleae Logic captures, models the mux state, and reports any violations of our "one mux channel active" invariant, and it's finding a lot of meat.
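The invariant that tool checks is simple to state; a minimal sketch (the real tool also parses the Saleae capture and replays mux writes, which is elided here):

```rust
// Sketch of the "one mux channel active" invariant: across all muxes
// sharing a bus, at most one channel may be enabled at a time.
use std::collections::HashMap;

// enabled: mux address -> currently enabled channel, if any
fn violates_invariant(enabled: &HashMap<u8, Option<u8>>) -> bool {
    enabled.values().filter(|c| c.is_some()).count() > 1
}

fn main() {
    let mut enabled = HashMap::new();
    enabled.insert(0x70u8, Some(0u8)); // stale: port 0 left enabled
    enabled.insert(0x72u8, Some(1u8)); // newly enabled port
    assert!(violates_invariant(&enabled));
    println!("violation: {}", violates_invariant(&enabled));
}
```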

What to do?

This section of the code is complex, with a lot of indirection and functions that take closures and the like, so I haven't totally wrapped my head around the fix. But the goal is: don't take a mux out of the Unknown state unless we get acknowledgement that we've actually been able to update it.
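As a sketch of that goal, under the same hypothetical names as above (not the actual driver API): only an acknowledged control-register write should clear the Unknown state.

```rust
// Sketch of the proposed behavior: Unknown is cleared only when the
// mux acks the reconfiguration write. Hypothetical names throughout.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MuxState { Known, Unknown }

#[derive(Debug, PartialEq)]
enum I2cError { MuxMissing }

fn configure_mux(
    state: &mut MuxState,
    write_result: Result<(), I2cError>,
) -> Result<(), I2cError> {
    match write_result {
        Ok(()) => {
            // The mux acked the write; we now know its configuration.
            *state = MuxState::Known;
            Ok(())
        }
        Err(e) => {
            // A NACK (MuxMissing) means we did NOT update the mux:
            // keep it Unknown so recovery is retried, and report failure.
            *state = MuxState::Unknown;
            Err(e)
        }
    }
}

fn main() {
    let mut state = MuxState::Unknown;
    assert!(configure_mux(&mut state, Err(I2cError::MuxMissing)).is_err());
    assert_eq!(state, MuxState::Unknown); // recovery will be retried
    assert!(configure_mux(&mut state, Ok(())).is_ok());
    assert_eq!(state, MuxState::Known);
    println!("Unknown cleared only on ack");
}
```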
