Pro-active dataflow termination #306
benesch added a commit to benesch/timely-dataflow that referenced this issue on Mar 31, 2023:
"Between 2673e8c and ff516c8, drop_dataflow appears to be working well, based on testing in Materialize (MaterializeInc/materialize#18442). So remove the 'public beta' warning from the docstring. Fix TimelyDataflow#306."
At the moment, dataflows only shut down through clean completion of the work within them: one releases all capabilities at the inputs and awaits the draining of work within the dataflow. One can imagine situations where ill-conceived dataflows start up and create more work than the system can be expected to handle, and we should have a way to terminate these dataflows without awaiting their successful completion.
It seems like most of the dataflows are instantiated by trees of owned data, with connection points back to the communication infrastructure. In principle we could drop these dataflows, and a fair amount of collection will happen automatically.
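To make the "tree of owned data" view concrete, here is a minimal sketch in Rust. All names (`Operator`, `Dataflow`, `drop_dataflow`) are hypothetical, not timely's actual types: dropping the root handle reclaims operator state automatically via ownership, while the channel identifiers are what still needs to be reported elsewhere.

```rust
struct Operator {
    buffered: Vec<Vec<u8>>, // stand-in for buffered messages and other state
}

struct Dataflow {
    operators: Vec<Operator>,  // the tree of owned operator state
    channel_ids: Vec<usize>,   // channels registered with the communication layer
}

fn drop_dataflow(df: Dataflow) -> Vec<usize> {
    let Dataflow { operators, channel_ids } = df;
    // Rust's ownership reclaims the operator buffers here automatically...
    drop(operators);
    // ...but the communication layer still holds per-channel state on the
    // dataflow's behalf, so the identifiers are returned for the caller to
    // report. That reporting step is the gap the rest of this issue discusses.
    channel_ids
}
```

The point of the sketch is that the per-operator collection really is "free" in Rust; the hard part is entirely on the communication side.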
This is probably an idealized view of shutdown, and there are several issues one can already expect.
For example, if the channels associated with a dataflow go away, on the receipt of messages for these channels the system cannot currently distinguish between an old dataflow that has been shut down, and a new dataflow that has not yet started up. The system will reconstruct these channels, stash the messages, and await the dataflow construction to collect them. This won't happen, and the effect would be a memory leak.
We also have existed in the pleasant world where dataflows are only shut down after the system is certain there will not be additional messages, and so there may be cases where post-shutdown message receipt causes more egregious errors, up through panicking. At the moment we just don't know.
This issue also relates to the use of capability multisets, in that a motivating use case is messages with empty capability sets, which have the same problem of outliving a dataflow: one could learn that the dataflow can be terminated (all capabilities are gone) and yet still receive messages without capabilities, currently resulting in memory leaks or worse. A fix here might remove that blocker for multiset capabilities.
One candidate solution is to inform the communication plane of shut down channels, and then to discard messages for these channels. The current channel allocation mechanism uses sequentially assigned identifiers, and the system should be able to distinguish between "yet to be created" and "created and terminated" channels just by looking at the identifiers and comparing them with the next-to-be-assigned identifier. Messages for old identifiers without an associated channel should simply be discarded.
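The identifier comparison described above can be sketched as follows. This is an illustrative model, not timely's allocator API: `next_id` stands for the next-to-be-assigned identifier, and the three-way classification decides whether an incoming message is delivered, stashed, or discarded.

```rust
use std::collections::HashMap;

/// Hypothetical channel table: `next_id` is the next identifier to be
/// assigned; `live` holds queues for channels that currently exist.
struct Channels {
    next_id: usize,
    live: HashMap<usize, Vec<String>>,
}

#[derive(Debug, PartialEq)]
enum Routed {
    Delivered, // channel exists; message handed over
    Stashed,   // channel not yet created; message held for later
    Discarded, // channel created and terminated; message dropped
}

impl Channels {
    fn route(&mut self, id: usize, msg: String) -> Routed {
        if let Some(queue) = self.live.get_mut(&id) {
            queue.push(msg);
            Routed::Delivered
        } else if id >= self.next_id {
            // Identifier not yet assigned: the dataflow is still being
            // constructed, so stash the message until the channel appears.
            self.live.entry(id).or_default().push(msg);
            Routed::Stashed
        } else {
            // Identifier was assigned in the past but the channel is gone:
            // the dataflow was shut down, so the message is safely discarded.
            Routed::Discarded
        }
    }
}
```

Because identifiers are sequential, the single comparison `id >= next_id` is enough to separate "yet to be created" from "created and terminated", which is what makes the discard decision safe.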
Dataflow destruction should also happen on all workers, but it seems like there isn't the same urgency that it happens in exactly the same order on all workers. Dataflows shut down on a subset of workers should (ideally) just block dataflow execution on other workers rather than result in unrecoverable errors.