Overhaul event publication lifecycle #796
Comments
Does not being able to tell that an event is being processed also mean that currently multi-instance apps are not an option? I’m not a database expert, but I believe at least PostgreSQL supports row-level locking, which would allow concurrent processing of events by multiple instances, unlike some leader election method.
@breun The whole processing (sending event -> handler -> mark as finished) is not done by "submitting" the event to the table and having a "worker" pick it up. The processing always happens on the instance where the event was sent in the first place. The publication_log is just there to keep track of which events have been processed. The only information you currently have is whether there is a completion_date on the event and when it got published.

We've built our own retry mechanism around the log, in which we retry events that are at least n minutes old. But because it is a "publication_log", events sometimes get processed multiple times when an event takes a long time to be processed (either because one of the steps takes a long time, or because a lot of events get sent and they can't be processed because all threads in our thread pool are already busy).

And that's where this misconception comes into play. If you overlook the fact that the table is not used for processing at all, then you may run into these issues when you build your own retry mechanism.

My thoughts around this topic

We would really wish for a way to distinguish between events that are currently being processed and events that have failed, but in all implementations there are edge cases which you may or may not support in spring-modulith.

If you have a dedicated status field (e.g. SUBMITTED, PROCESSING, FINISHED, FAILED) you can easily find out which events to retry, based on the FAILED status, and can skip the PROCESSING ones - unless you have events that are stuck because the instance went down while processing them. In order to identify them in a multi-instance setup you would have to keep track of which instance is currently active.
If you handle it via a failedDate column you have to identify the ones currently being processed via an offset (as described in the issue description) - but here you have to be careful with longer-running tasks, because, as I mentioned, it can take a few minutes until an event is picked up (because all threads are in use). In that case it could make sense to also have a column for when the event got picked up and the handler was triggered....

Conclusion

Thinking more about it, a big problem with the event_publication table is the misconception I mentioned. For example, I expected that event_publication could be seen as a "light" version of event externalization, but it definitely does not work that way (and probably shouldn't be used that way). From my gut feeling (and from talking with colleagues about it), I am not the only one who stepped into that.

Maybe I am drifting to a different topic that this issue isn't about, but I think the docs should make clearer that the current event_publication mechanism should not be seen as an externalized event processing mechanism, and should point out its limitations. And regarding what the users are expecting (and what @breun even mentioned)

Edit:
I just took a look at the docs: event externalization means that you only publish events to other systems so that other applications can receive them - not that you consume events from those systems.
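To make the dedicated status field idea from the comment above more concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical: the `Status` values are the ones named in the comment, while `Publication`, `pickedUpAt`, `selectForRetry` and `stuckAfter` are made-up names for illustration and are not part of Spring Modulith. It assumes that a PROCESSING entry that was picked up too long ago belongs to an instance that went down, which is only a heuristic.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch; none of these types exist in Spring Modulith.
final class StatusRetrySketch {

    // Status values as suggested in the comment.
    enum Status { SUBMITTED, PROCESSING, FINISHED, FAILED }

    // Minimal stand-in for a row in the publication table.
    // pickedUpAt is an assumed column: when a handler started working on the event.
    record Publication(String id, Status status, Instant pickedUpAt) {}

    // FAILED entries are always selected. PROCESSING entries are selected only
    // if they were picked up before (now - stuckAfter), on the assumption that
    // the owning instance went down. SUBMITTED and FINISHED are skipped.
    static List<Publication> selectForRetry(List<Publication> all, Instant now, Duration stuckAfter) {
        Instant cutoff = now.minus(stuckAfter);
        return all.stream()
                .filter(p -> p.status() == Status.FAILED
                        || (p.status() == Status.PROCESSING
                                && p.pickedUpAt() != null
                                && p.pickedUpAt().isBefore(cutoff)))
                .toList();
    }
}
```

A long-running but healthy handler would still be selected once it exceeds `stuckAfter`, so this does not remove the duplicate-processing risk described above; it only narrows it to entries that were actually picked up.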
It is unclear to me what Modulith's responsibility should be: ensuring the delivery of events only or also dealing with problems. To ensure the delivery of events, For handling problems, Therefore, I think Modulith should focus on the delivery side of things, for example, by better tracking the status (event queued, event processing, …) and providing better docs, and perhaps some callbacks, on how to deal with event delivery problems, like event classes that have been removed or event listeners that no longer exist.
The persistent structure of an event publication currently effectively represents two states. The default state captures the fact that a transactional event listener will have to be invoked eventually. The publication also stays in that state while the listener processes the event. Once the listener succeeds, the event publication is marked as completed. If the listener fails, the event publication stays in its original state.
This basic lifecycle is easy to work with, but has a couple of downsides. First and foremost, we cannot differentiate between publications that are about to be processed, ones that are being processed and ones that have failed. Especially the latter is problematic, as a primary use case supported by the registry is to be able to recover from erroneous situations by resubmitting failed event publications. Developers usually resort to rather fuzzy approaches like considering events that have not been completed within a given time frame to have failed.
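The fuzzy approach mentioned above can be sketched roughly as follows. This is plain Java for illustration only, not Spring Modulith code: `Publication`, `presumedFailed` and `olderThan` are hypothetical names, and the record models just the two pieces of information the current structure offers (a publication date and an optional completion date).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of the time-frame heuristic; not a Spring Modulith API.
final class FuzzyRetrySketch {

    // completionDate == null means "not completed", which today covers
    // pending, in-progress and failed publications alike.
    record Publication(String id, Instant publicationDate, Instant completionDate) {}

    // Treats every uncompleted publication published before (now - olderThan)
    // as failed. A slow but still running listener is indistinguishable from
    // a failed one here.
    static List<Publication> presumedFailed(List<Publication> all, Instant now, Duration olderThan) {
        Instant cutoff = now.minus(olderThan);
        return all.stream()
                .filter(p -> p.completionDate() == null)
                .filter(p -> p.publicationDate().isBefore(cutoff))
                .toList();
    }
}
```

Picking `olderThan` is the weak point: too short and publications that are still being processed get resubmitted, too long and genuinely failed ones wait unnecessarily.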
To improve on this, we’d like to move to a more sophisticated event publication lifecycle that makes failed publications easier to detect. One possible way to achieve this would be to introduce a dedicated status field or, consistent with the current approach of setting a completion date, a failed date field that would be set when an event listener fails. That step, however, might fail as well, for example because of the same erroneous situation that caused the event listener to fail in the first place. That’s why it might make sense to introduce a duration configuration property, after which incomplete event publications would be considered failed as well.
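One way to read the combination of a failed date and a duration fallback is sketched below, in plain Java and under several assumptions. The `State` enum, the `classify` method and the `failAfter` parameter are hypothetical names, `Publication` is a stand-in record rather than the real persistent structure, and the `getFailedPublications` signature is invented for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of a failedDate-based lifecycle; not the actual design.
final class FailedDateSketch {

    enum State { COMPLETED, FAILED, IN_FLIGHT }

    // failedDate is the proposed new field; it stays null if marking the
    // publication as failed did not succeed either.
    record Publication(String id, Instant publicationDate, Instant completionDate, Instant failedDate) {}

    static State classify(Publication p, Instant now, Duration failAfter) {
        if (p.completionDate() != null) {
            return State.COMPLETED;
        }
        if (p.failedDate() != null) {
            return State.FAILED; // explicitly marked as failed
        }
        // Fallback: neither completed nor marked, but older than the configured duration.
        return p.publicationDate().isBefore(now.minus(failAfter)) ? State.FAILED : State.IN_FLIGHT;
    }

    static List<Publication> getFailedPublications(List<Publication> all, Instant now, Duration failAfter) {
        return all.stream()
                .filter(p -> classify(p, now, failAfter) == State.FAILED)
                .toList();
    }
}
```

The explicit marker covers the common case precisely, and the duration only acts as a safety net for publications where the marker could not be written.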
The feature carries a bit of risk, as we will have to think about the upgrade process of Spring Modulith applications. Existing apps might still contain entries for incomplete event publications in the database.
Ideas / Action Items
- failedDate column.
- publishedDate before now() minus duration.
- CompletionRegisteringMethodInterceptor would need to issue the marking as failed on exception.
- IncompleteEventPublications would have to get a getFailedPublications().
Related tickets