Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for rosbag2_transport::Recorder failures due to the unhandled exceptions #1382

Conversation

MichaelOrlov
Copy link
Contributor

When Record exits by signal interrupt we are invalidating context by calling rclcpp::shutdown() it could potentially cause raising exceptions in inquiries related to the node's operations and node_graph which we are not handling.

Unhandled exceptions from the underlying node could potentially lead to the unclean exit of the Recorder i.e. abnormal termination with a non-zero exit code.

@emersonknapp
Copy link
Collaborator

I don't fully understand the motivation for this change. Is the problem that we are calling APIs after rclcpp::ok() == false?

As a pattern, I don't think it's a good practice to put catch-all catch (std::exception) blocks in a lot of places. If we are going to handle literally any possible exception, then it should probably be in one place up high, in the main control function or something.

One place I see you catching exceptions is in the event_publisher_thread - because an unhandled exception in the thread can halt the whole program. Maybe this means that it should be an std::async that provides us an std::future instead - that way it can have an exception safely without crashing the program, and we can handle it up top in close()

@MichaelOrlov
Copy link
Contributor Author

MichaelOrlov commented Jun 9, 2023

Yes, the motivation was that we are calling node's APIs after rclcpp::shutdown() i.e. when rclcpp::ok() == false?

On one hand, I agree that this is not a good practice to put catch-all catch (std::exception) blocks in a lot of places, but from another hand, I have to do this since 1) our rclcpp API very often doesn't specify what exact exception could be thrown and 2) we don't have the main control function or something like that to be able to handle all of them in one place.
And here is the really matter to consider in each concrete case if an exception from the underlying node is critical for us and we should abnormally finish execution or not so critical and we can continue.

As regards to the catching exceptions is in the event_publisher_thread I don't think that makes sense to rewrite it with std::async. The only difference is that future::get() will rethrow the exception that happened inside the async task.
But do we really want this behavior?
I don't think so. At least for a while I do not see reasons for that.
And we can achieve the same result if pass exception_ptr to the event_publisher_thread and process it after the join() command.

@emersonknapp In general I understand your concern that catching all exceptions in multiple places is a bad practice.
I will try to reconsider my changes if I can combine or get rid of some of them or make handlers more specific.
If you have a concrete proposals or suggestions - I would be glad to consider them.

@MichaelOrlov MichaelOrlov marked this pull request as draft June 10, 2023 20:11
@MichaelOrlov MichaelOrlov force-pushed the morlov/reduce_failures_in_recorder_due_to_unhandled_exceptions branch from 767d440 to 1cd4dd2 Compare June 11, 2023 02:56
… exceptions"

This reverts commit 767d440.

Signed-off-by: Michael Orlov <[email protected]>
- Exceptions potentially may come from the underlying node operations
when we are invalidating context via rclcpp::shutdown(). We need to have
possibility to correct finish recording in this case.

Signed-off-by: Michael Orlov <[email protected]>
@MichaelOrlov MichaelOrlov force-pushed the morlov/reduce_failures_in_recorder_due_to_unhandled_exceptions branch from 1cd4dd2 to 1c79e8a Compare June 11, 2023 02:57
@MichaelOrlov MichaelOrlov marked this pull request as ready for review June 11, 2023 03:28
@MichaelOrlov
Copy link
Contributor Author

@emersonknapp I've rewritten exception handling from the underlying node in my new commit 1c79e8a. Please review it again.
Now I am handling exceptions only in two places in event publisher and discovery threads which are acting like our "main" function analogs. Since during recording, we are running only two of those threads.

I've tried to consider rethrowing exceptions or not handling them in relevant threads and processing them inside close() method. Although it doesn't look like the right approach IMO.
The questions arise for instance:

  • What if we got failure in discovery for some reason?
  • What if we occasionally failed to subscribe to one of the topics?
  • Shall we terminate the recording or discovery thread in these cases?
    I think it would be more reasonable to not interrupt the recording or discovery thread in case of an error on discovery or an attempt to subscribe to one topic. We may run discovery and fail on some topic occasionally which is \rosout or some stray topic that might be not very interesting for us. However, it would be valuable to continue recording on already subscribed topics or try to subscribe to other available topics.
  • As regards the event publisher thread, the situation is similar. We likely don't want to interrupt the recording if we failed to publish a notification about the bag split event.

If considering handling deferred exceptions from discovery and event publisher threads in recorder::close() method.
Let's assume we will get them there - but what we can do with them?
We can't really do much with them since errors in close() usually will be happening during exit and we are just before or already in the destructor.

@MichaelOrlov
Copy link
Contributor Author

@emersonknapp friendly ping for review.

Copy link
Collaborator

@emersonknapp emersonknapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@MichaelOrlov
Copy link
Contributor Author

Gist: https://gist.githubusercontent.com/MichaelOrlov/4b2e711c40c326977285ea111ede124c/raw/3e4519e1f68ff0a741131a0f145c9e62ec617050/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_transport ros2bag
TEST args: --packages-above rosbag2_transport ros2bag
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/12258

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov MichaelOrlov merged commit bb8d2a5 into rolling Jun 17, 2023
@delete-merged-branch delete-merged-branch bot deleted the morlov/reduce_failures_in_recorder_due_to_unhandled_exceptions branch June 17, 2023 01:37
@MichaelOrlov
Copy link
Contributor Author

https://github.com/Mergifyio backport iron

@mergify
Copy link

mergify bot commented Jun 17, 2023

backport iron

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jun 17, 2023
…eptions (#1382)

* Fix for rosbag2_transport::Recorder failures due to unhandled exceptions

Signed-off-by: Michael Orlov <[email protected]>

* Revert "Fix for rosbag2_transport::Recorder failures due to unhandled exceptions"

This reverts commit 767d440.

Signed-off-by: Michael Orlov <[email protected]>

* Handle exceptions in event publisher and discovery threads

- Exceptions potentially may come from the underlying node operations
when we are invalidating context via rclcpp::shutdown(). We need to have
possibility to correct finish recording in this case.

Signed-off-by: Michael Orlov <[email protected]>

---------

Signed-off-by: Michael Orlov <[email protected]>
(cherry picked from commit bb8d2a5)
@MichaelOrlov
Copy link
Contributor Author

https://github.com/Mergifyio backport humble

@mergify
Copy link

mergify bot commented Jun 17, 2023

backport humble

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jun 17, 2023
…eptions (#1382)

* Fix for rosbag2_transport::Recorder failures due to unhandled exceptions

Signed-off-by: Michael Orlov <[email protected]>

* Revert "Fix for rosbag2_transport::Recorder failures due to unhandled exceptions"

This reverts commit 767d440.

Signed-off-by: Michael Orlov <[email protected]>

* Handle exceptions in event publisher and discovery threads

- Exceptions potentially may come from the underlying node operations
when we are invalidating context via rclcpp::shutdown(). We need to have
possibility to correct finish recording in this case.

Signed-off-by: Michael Orlov <[email protected]>

---------

Signed-off-by: Michael Orlov <[email protected]>
(cherry picked from commit bb8d2a5)

# Conflicts:
#	rosbag2_transport/src/rosbag2_transport/recorder.cpp
MichaelOrlov added a commit that referenced this pull request Jun 17, 2023
…eptions (#1382) (#1402)

* Fix for rosbag2_transport::Recorder failures due to unhandled exceptions

Signed-off-by: Michael Orlov <[email protected]>

* Revert "Fix for rosbag2_transport::Recorder failures due to unhandled exceptions"

This reverts commit 767d440.

Signed-off-by: Michael Orlov <[email protected]>

* Handle exceptions in event publisher and discovery threads

- Exceptions potentially may come from the underlying node operations
when we are invalidating context via rclcpp::shutdown(). We need to have
possibility to correct finish recording in this case.

Signed-off-by: Michael Orlov <[email protected]>

---------

Signed-off-by: Michael Orlov <[email protected]>
(cherry picked from commit bb8d2a5)

Co-authored-by: Michael Orlov <[email protected]>
MichaelOrlov added a commit that referenced this pull request Jun 18, 2023
…ndled exceptions (backport #1382) (#1403)

* Fix for rosbag2_transport::Recorder failures due to the unhandled exceptions (#1382)

* Fix for rosbag2_transport::Recorder failures due to unhandled exceptions

Signed-off-by: Michael Orlov <[email protected]>

* Revert "Fix for rosbag2_transport::Recorder failures due to unhandled exceptions"

This reverts commit 767d440.

Signed-off-by: Michael Orlov <[email protected]>

* Handle exceptions in event publisher and discovery threads

- Exceptions potentially may come from the underlying node operations
when we are invalidating context via rclcpp::shutdown(). We need to have
possibility to correct finish recording in this case.

Signed-off-by: Michael Orlov <[email protected]>

---------

Signed-off-by: Michael Orlov <[email protected]>
(cherry picked from commit bb8d2a5)

# Conflicts:
#	rosbag2_transport/src/rosbag2_transport/recorder.cpp

* Fix for merge conflicts

Signed-off-by: Michael Orlov <[email protected]>

---------

Signed-off-by: Michael Orlov <[email protected]>
Co-authored-by: Michael Orlov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants