host: Fix RFNoC graph action queue lockup on action exceptions #730

hannodewind · 2024-02-19T10:01:23Z

Pull Request Details

Description

Processing of the action queue locks up when any action executed in the send_action call throws an exception. Exceptions are not caught in the loop that handles the action queue, so the handling_ongoing flag that guards the queue is never released. Any subsequent call to enqueue_action then returns at the early exit on the assumption that actions are already being handled, even though the previous handler exited with an exception.

This fix replaces the manually claimed and released atomic flag with an RAII wrapper, which ensures that handling_ongoing is released even under exceptional conditions.
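To illustrate the RAII idea described above, here is a minimal sketch. It is not the class from this PR: `flag_guard`, `handle_actions`, and the use of `std::atomic<bool>` are assumptions made for the example, and the real graph code is more involved.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical guard (not the class from this PR): claims the flag on
// construction and releases it on destruction, but only if the claim succeeded.
class flag_guard
{
public:
    explicit flag_guard(std::atomic<bool>& flag)
        : _flag(flag), _owns(!flag.exchange(true))
    {
    }

    ~flag_guard()
    {
        if (_owns) {
            _flag.store(false);
        }
    }

    flag_guard(const flag_guard&)            = delete;
    flag_guard& operator=(const flag_guard&) = delete;

    // True if this guard claimed the flag, false if it was already set.
    bool owns() const
    {
        return _owns;
    }

private:
    std::atomic<bool>& _flag;
    const bool _owns;
};

// Stand-in for the flag that guards the action queue.
std::atomic<bool> handling_ongoing{false};

// Stand-in for the queue-handling loop. Returns false on the early exit
// (another caller is already handling actions), true after normal handling.
bool handle_actions(bool action_throws)
{
    flag_guard guard(handling_ongoing);
    if (!guard.owns()) {
        return false;
    }
    if (action_throws) {
        // The guard's destructor still runs during stack unwinding.
        throw std::runtime_error("simulated ACK timeout");
    }
    return true;
}
```

Because the release happens in a destructor, a throwing action leaves the flag cleared, and a later call to `handle_actions(false)` proceeds instead of taking the early exit.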

Related Issue

Relates to issue #611

Which devices/areas does this affect?

UHD hosts using RFNoC graph

Testing Done

X310 with dual 10GbE links to server, running both RF inputs at 200MHz sample rate using 2x RX streamers.
Stress the server with CPU load (can use stress-ng), inducing UDP packet drops. (Also relates to #611, which stressed the link using iperf, probably also causing UDP packet drops).
At some point (difficult to reproduce, but it does happen every so often), one of the RX streamers will experience an overrun. This invokes the _overrun_handler, which leads to ACTION_KEY_RX_RESTART_REQ, whose handling calls get_time_now() and does a peek64 to the device. This peek64 then throws an exception due to an ACK timeout.

This exception is caught all the way up in the thread that called recv on the RX streamer, but the stream is unrecoverable because the graph action queue is locked up.
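The lockup mechanism can be shown without hardware. The sketch below is a toy model and not UHD code: `toy_graph`, its members, and the use of `std::atomic<bool>` are assumptions for illustration. It mimics a queue whose guard flag is claimed and released manually, so an action that throws skips the release.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <stdexcept>
#include <utility>

// Toy model of the pre-fix behavior (hypothetical, not the UHD graph class).
struct toy_graph
{
    std::atomic<bool> handling_ongoing{false};
    std::deque<std::function<void()>> queue;
    int handled = 0;

    void enqueue_action(std::function<void()> action)
    {
        queue.push_back(std::move(action));
        // Early exit: assume another caller is already draining the queue.
        if (handling_ongoing.exchange(true)) {
            return;
        }
        while (!queue.empty()) {
            auto next = std::move(queue.front());
            queue.pop_front();
            next(); // if this throws, the store(false) below is never reached
            ++handled;
        }
        handling_ongoing.store(false);
    }
};
```

After one throwing action, every later `enqueue_action` call takes the early exit: actions pile up in `queue` and `handled` never increases, which matches the unrecoverable stream described above.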

Checklist

I have read the CONTRIBUTING document.
My code follows the code style of this project. See CODING.md.
I have updated the documentation accordingly.
I have added tests to cover my changes, and all previous tests pass.
I have checked all compat numbers if they need updating (FPGA compat, MPM compat, noc_shell, specific RFNoC block, ...)

github-actions · 2024-02-19T10:01:36Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

hannodewind · 2024-02-19T10:02:09Z

I have read the CLA Document and I hereby sign the CLA

hannodewind · 2024-02-19T10:02:29Z

recheck

hannodewind · 2024-02-19T10:25:33Z

recheck

mbr0wn · 2024-02-29T13:52:09Z

@hannodewind Don't worry about the CLA checker bot, it's a misconfig on our end (I think).

This is all we need for now:

mbr0wn

@hannodewind your analysis is correct, and the solution is fine, too. I do want to double-check if we can have the atomic lock without adding this special-case class, but overall, this is an excellent solution.

mbr0wn · 2024-04-17T15:13:53Z

@hannodewind I think I will modify this to use UHD's scope_exit instead of your bespoke class, but otherwise, this is a fantastic change. Kudos for figuring out the issue, and for modifying the code so well this deep inside its guts!
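For readers unfamiliar with the scope-exit pattern, a minimal sketch follows. It does not use UHD's scope_exit header: `scope_exit_sketch`, `busy`, and `process` are hypothetical stand-ins written for this example, and the merged change may look different.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a scope-exit helper: runs a callback when the
// owning pointer is destroyed, including during exception unwinding.
class scope_exit_sketch
{
public:
    using uptr = std::unique_ptr<scope_exit_sketch>;

    static uptr make(std::function<void()> cb)
    {
        return uptr(new scope_exit_sketch(std::move(cb)));
    }

    ~scope_exit_sketch()
    {
        _cb();
    }

private:
    explicit scope_exit_sketch(std::function<void()> cb) : _cb(std::move(cb)) {}

    std::function<void()> _cb;
};

// Stand-in for the flag guarding the action queue.
std::atomic<bool> busy{false};

// Runs one action under the flag. Returns false if the flag was already set.
bool process(const std::function<void()>& action)
{
    if (busy.exchange(true)) {
        return false;
    }
    // Registered only after the claim succeeded, so the early exit above
    // never clears a flag that another caller owns.
    auto release = scope_exit_sketch::make([] { busy.store(false); });
    action();
    return true;
}
```

Compared with a dedicated guard class, a generic scope-exit helper keeps the claim (`exchange`) and the release (the callback) visible at the call site, at the cost of a `std::function` and a heap allocation in this sketch.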

hannodewind · 2024-04-18T07:14:55Z

@mbr0wn Thank you for the feedback, I am keen to see the scope_exit implementation. Let me know if there is anything else I can/should do on this PR, always happy to assist!

joergho · 2024-04-29T07:32:57Z

The change is now in master: 0f2007f

hannodewind force-pushed the bugfix/host/rfnoc_graph_action_lockup branch from 42a1756 to 0d68db5 on February 19, 2024 10:24

mbr0wn reviewed Apr 17, 2024

joergho closed this Apr 29, 2024

github-actions bot locked and limited conversation to collaborators Apr 29, 2024
