is it worth discussing rolling back orchestrations? #1828

vigouredelaruse · 2021-05-07T17:15:21Z

vigouredelaruse
May 7, 2021

assuming a suitably idempotent orchestration and associated activities

what's to stop somebody from reaching a point in the orchestration and deciding they want to roll back the orchestration to some point, and begin as new?

perhaps the orchestration gets out of synch with some external feature (like the source branch for instance), and you want to have a circuit breaker that treats some subset of the durable functions replay mechanism history as discarded

not trying to rewrite math on a distributed transaction manager here, but is this worth discussing?

cgillum · 2021-05-07T18:47:25Z

cgillum
May 7, 2021
Maintainer

It's an interesting idea, and certainly possible given the event-sourcing nature of Durable Functions. In fact, we have a narrower implementation of this pattern for rolling back orchestrations that have failed due to some external issue. More info on that here: Rewind instances.

As you say, we could theoretically generalize this to support non-failed orchestrations as well. The challenge, however, becomes figuring out what the API is for deciding how far back to rewind. Would an API that gives you a list of timestamps to rewind back to be easy enough to use (I'm guessing not)? What about an API that gives you a list of tasks and allows users to choose where in that list to rewind to (I have worries about this as well)? Another option would be to add a "checkpoint" feature to the Durable Functions programming model and allow users to rewind back to specific checkpoints. That feels like it might be the easiest option to understand and give the best UX.
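To make the checkpoint option concrete, here is a minimal sketch in plain Java. It is not a Durable Functions API: the class `CheckpointedHistory` and its methods are hypothetical names chosen for illustration, and the history is just an in-memory list of event names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: an event history in which some entries are named checkpoints.
final class CheckpointedHistory {
    private static final class Event {
        final String name;
        final boolean isCheckpoint;

        Event(String name, boolean isCheckpoint) {
            this.name = name;
            this.isCheckpoint = isCheckpoint;
        }
    }

    private final List<Event> events = new ArrayList<>();

    // Records an ordinary event (for example, a completed activity).
    void append(String name) {
        events.add(new Event(name, false));
    }

    // Records a named checkpoint that can later be used as a rewind target.
    void checkpoint(String name) {
        events.add(new Event(name, true));
    }

    // Drops everything recorded after the most recent checkpoint with this name,
    // keeping the checkpoint itself. Returns the number of events dropped.
    int rewindTo(String checkpointName) {
        for (int i = events.size() - 1; i >= 0; i--) {
            Event e = events.get(i);
            if (e.isCheckpoint && e.name.equals(checkpointName)) {
                int dropped = events.size() - (i + 1);
                events.subList(i + 1, events.size()).clear();
                return dropped;
            }
        }
        throw new IllegalArgumentException("unknown checkpoint: " + checkpointName);
    }

    // Returns the event names in order, for inspection.
    List<String> names() {
        List<String> out = new ArrayList<>();
        for (Event e : events) {
            out.add(e.name);
        }
        return out;
    }
}
```

With this shape the caller only has to remember a checkpoint name, not a timestamp or a position in a task list, which is the usability argument made above.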

However, the part that gets really tricky is what to do with durable timers. These timers are scheduled for specific points in time, so it's not really something you can "rewind" (if the timer already expired before the rewind, it will be forced to expire immediately after the rewind). It's part of the reason why we haven't spent more time on the rewind scenario in general. I'm open to hearing if folks have thoughts or opinions on how to deal with this.
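The timer problem can be illustrated with a small sketch, assuming a timer is stored as an absolute fire time. `DurableTimerSketch` and `remaining` are hypothetical names, not part of Durable Functions.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: a durable timer is modelled as an absolute instant.
final class DurableTimerSketch {
    // Time left before a timer scheduled for fireAt expires, as observed at now.
    // A deadline that is already in the past yields zero, so the timer fires at once.
    static Duration remaining(Instant fireAt, Instant now) {
        Duration d = Duration.between(now, fireAt);
        return d.isNegative() ? Duration.ZERO : d;
    }
}
```

If the orchestration is rewound to a point before the timer was created but the wall clock is already past `fireAt`, replaying the timer creation yields a zero delay. That is the "forced to expire immediately" behaviour described above.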

7 replies

cgillum Sep 21, 2021
Maintainer

Suspend / resume is definitely something I'd like to have at some point. One concrete example use-case I had in mind was with "at-most-once" delivery guarantees. For example, if we think we may possibly be executing a message/task/activity twice, rather than executing it, suspend the orchestration and wait for an admin to confirm whether it's safe to resume or not.
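A rough sketch of that use-case, under the assumption that the runtime can tell whether a task id has already been marked as started. `AtMostOnceDispatcher` and `Outcome` are hypothetical names; a real implementation would need to persist the "started" mark durably before running the task.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical dispatcher: never runs a task whose id was already marked as started.
final class AtMostOnceDispatcher {
    enum Outcome { EXECUTED, SUSPENDED_FOR_REVIEW }

    // In-memory stand-in for a durable record of started task ids.
    private final Set<String> started = new HashSet<>();

    Outcome dispatch(String taskId, Runnable task) {
        // Set.add returns false when the id is already present, meaning the task
        // may have run before; suspend instead of risking a second execution.
        if (!started.add(taskId)) {
            return Outcome.SUSPENDED_FOR_REVIEW;
        }
        task.run();
        return Outcome.EXECUTED;
    }
}
```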

Your point about Failed being a permanent terminal state is an interesting one that admittedly I hadn't considered before. I can see how some automated action may have been taken in response to an orchestration failure, and how that external system could get confused if you revived a failed orchestration. I suppose a separate Suspended state could help alleviate this (though I think having a Rewind operation as an emergency lever is still valuable).

However, I feel a little weird about the suggestion to rewrite history when an orchestration is "resumed". I tend to think of suspend/resume as being operations that aren't represented in the history. Essentially, a suspended orchestration is just prevented from executing until someone resumes it. Otherwise I'm not really sure why Suspended is different from Failed beyond the nominal distinction. I would also expect that history should be append-only except in truly exceptional cases. Running out of retries doesn't feel like an exceptional case, thus it doesn't seem right to introduce history editing in such a case unless it's explicit (e.g. "rewind").

There's also the question of whether or how you transition an orchestration from Suspended to Failed.

olitomlinson Sep 22, 2021

I think my intention got a little lost in my words, this visual might help somewhat :

sebastianburckhardt Sep 22, 2021
Collaborator

I like the proposal. One thing to keep in mind is that the state of an orchestration should be a function of the history only (this is a design invariant we should maintain I think). With that in mind, we can define "suspended" by appending a special "Suspended" event to the history. This also means that in your picture, "rewind" and "resume" can be the same, since resuming simply means removing the "Suspended" event from the history.
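The invariant can be sketched as follows, assuming the history is a simple list of event names. `HistoryStatus` is a hypothetical class, and the "Suspended" event is the one proposed above rather than an existing event type.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: status is derived from the event list and never stored separately.
final class HistoryStatus {
    static final String SUSPENDED = "Suspended"; // hypothetical marker event

    // Computes the status from the history alone.
    static String statusOf(List<String> history) {
        if (history.isEmpty()) {
            return "Pending";
        }
        String last = history.get(history.size() - 1);
        if (last.equals(SUSPENDED)) {
            return "Suspended";
        }
        return last.equals("ExecutionCompleted") ? "Completed" : "Running";
    }

    // Suspending appends the marker event; the input list is left untouched.
    static List<String> suspend(List<String> history) {
        List<String> next = new ArrayList<>(history);
        next.add(SUSPENDED);
        return next;
    }

    // Resuming removes a trailing marker event, restoring the previous history.
    static List<String> resume(List<String> history) {
        List<String> next = new ArrayList<>(history);
        if (!next.isEmpty() && next.get(next.size() - 1).equals(SUSPENDED)) {
            next.remove(next.size() - 1);
        }
        return next;
    }
}
```

Because `resume` is just the removal of the last event, it has the same shape as a one-step rewind, which is why the two operations can coincide in this model.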

One thing that needs some thought is how a suspended orchestration should react to messages (e.g. TimerFired, TaskCompleted, EventRaised). Perhaps we don't want to throw away those messages (as we do for failed orchestrations), but process them when the orchestration resumes.

Another thing to keep in mind is that if we plan to rely heavily on "rewind" for our solution, we may need to revisit the current "rewind" semantics and implementation to make it more robust. For example, the ability to "edit histories" is something that is completely invisible to DTFx, but it's done in storage directly. This makes it backend-specific, and can introduce problems with consistency, e.g. histories may be modified concurrently in memory and in storage, and histories cached in memory (e.g. for extended sessions) can be stale. This could perhaps be solved by adding some kind of support for history-editing in DTFx, which may deserve its own discussion.

olitomlinson Sep 22, 2021

You're quite right - I'm getting too involved in the implementation specifics by suggesting options like dropping history :) I'll stick to providing my views on the kind of API surface I would like to consume.

> This also means that in your picture, "rewind" and "resume" can be the same, since resuming simply means removing the "Suspended" event from the history.

How would removing just the suspended event from the history promote the Last Good State (Before Failure) to the current state? Wouldn't you still have to deal with the fact that there would be 3 iterations of a failing Activity embedded in the history? Or would there be some other mechanism for promoting the Last Good State (Before Failure) to current state?

If an activity has 3 retry attempts and then arrives in the suspended state because those attempts are exhausted, I would expect the current attempt count to be reset to 0 when the rewind / resume command is given.

> One thing that needs some thought is how a suspended orchestration should react to messages (e.g. TimerFired, TaskCompleted, EventRaised). Perhaps we don't want to throw away those messages (as we do for failed orchestrations), but process them when the orchestration resumes.

I think there might be a bunch of opposing paths which may all have justifications.

1. A fired timer can override a suspended state and advance the orchestration forward into Running.
2. A fired timer is queued. Only when Orchestration is restarted/resumed is the timer fired event realised by the Orchestration.
3. Drop all timer fired events.

In my use-cases they would align closely to option 1.

    Task activityTask = context.CallActivityAsync("GetQuote");
    Task timeoutTask = context.CreateTimer(deadline, cts.Token);

    Task winner = await Task.WhenAny(activityTask, timeoutTask);
    if (winner == activityTask)
    {
        // success case
        cts.Cancel();
        return true;
    }
    else
    {
        // escalation path
        return false;
    }

In the above example, I would be in favour of retaining and acting upon any timers that have fired while in a suspended state. Timers typically represent a business requirement that some other escalation path must be taken if an activity hasn't completed in time. I would say that if an Orchestration is hanging around in a suspended state due to a Failed Activity, it is fair game for the timer to fire, override the 'Suspended' state and advance the orchestration forward into a running state. Whatever happens downstream of this can't be helped.

In a way I see a timer as a form of compensating action for an inconclusive Activity. Does it matter why the Activity is inconclusive (i.e. the activity didn't return a result in time, or failed multiple times and reached its attempt limit)? IMO it's the same.

There are probably arguments for 2) and 3), but I just don't have them in my use-cases.
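The three timer-handling options listed in this comment could be modelled as a policy on a suspended orchestration. This is a sketch only: `SuspendedOrchestration` and `TimerPolicy` are hypothetical names, and "delivering" a timer is reduced to incrementing a counter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of an orchestration that starts out suspended.
final class SuspendedOrchestration {
    enum TimerPolicy { OVERRIDE_SUSPEND, QUEUE_UNTIL_RESUME, DROP }

    private final TimerPolicy policy;
    private final Deque<String> pending = new ArrayDeque<>();
    private boolean suspended = true;
    private int delivered = 0;

    SuspendedOrchestration(TimerPolicy policy) {
        this.policy = policy;
    }

    void onTimerFired(String timerId) {
        if (!suspended) {
            deliver(timerId);
            return;
        }
        switch (policy) {
            case OVERRIDE_SUSPEND:   // option 1: the timer wins and the orchestration runs again
                suspended = false;
                deliver(timerId);
                break;
            case QUEUE_UNTIL_RESUME: // option 2: hold the event until someone resumes
                pending.addLast(timerId);
                break;
            case DROP:               // option 3: discard the event
                break;
        }
    }

    // Leaves the suspended state and delivers any queued timer events in arrival order.
    void resume() {
        suspended = false;
        while (!pending.isEmpty()) {
            deliver(pending.removeFirst());
        }
    }

    private void deliver(String timerId) {
        delivered++; // stand-in for handing the event to the orchestrator code
    }

    boolean isSuspended() { return suspended; }

    int deliveredCount() { return delivered; }
}
```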

vigouredelaruse Sep 22, 2021
Author

just noticing that the nasa people would call this state FAULTED as per https://www.nasa.gov/pdf/636372main_NASA-HDBK-1002_Draft.pdf and it seems they would treat fault events as a valid though disappointing system state transition to safe mode, a state of degraded goals

in safe mode personally i'd be happy for the main loop to stop advancing and listening to signals while i execute or maybe deploy the compensation and human validate the mitigations on a separate signalling channel

so rather than semantically acting as undo for orchestrations it's an available feature because systems may enter states where misfired timers (system sync error at scope boundary) cause tribulation

sebastianburckhardt · 2021-09-17T18:37:49Z

sebastianburckhardt
Sep 17, 2021
Collaborator

There are some versions of this story that I think could make sense.

For example, the saga pattern is quite popular for workflows. The idea is that one first defines some 'compensation' for every component of a workflow. For example, for an activity that makes a flight reservation one can define a compensation that cancels the reservation. The framework can then take care of calling compensations in reverse order when unrolling an orchestration, and this can be taken care of in a recursive way.

As usual the devil is in the detail though. For example, it is not clear what should be done if a compensation fails and the workflow gets stuck in a half-rolled-back state.
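As a library-level starting point, the reverse-order unrolling and the failed-compensation problem might look like this. `MiniSaga` is a hypothetical class written for this illustration; it is unrelated to any existing saga implementation, and it handles a failing compensation by recording it and continuing, which is only one of several possible choices.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical minimal saga helper: compensations run most-recent-first.
final class MiniSaga {
    private static final class Step {
        final String name;
        final Runnable undo;

        Step(String name, Runnable undo) {
            this.name = name;
            this.undo = undo;
        }
    }

    // Used as a stack so the last registered compensation runs first.
    private final Deque<Step> steps = new ArrayDeque<>();

    void addCompensation(String name, Runnable undo) {
        steps.push(new Step(name, undo));
    }

    // Runs every compensation in reverse registration order. A compensation that
    // throws is recorded and skipped so the remaining ones still run.
    // Returns the names of the compensations that failed.
    List<String> compensate() {
        List<String> failed = new ArrayList<>();
        while (!steps.isEmpty()) {
            Step s = steps.pop();
            try {
                s.undo.run();
            } catch (RuntimeException e) {
                failed.add(s.name);
            }
        }
        return failed;
    }
}
```

A non-empty result from `compensate()` is exactly the half-rolled-back state mentioned above; what to do with it (retry, suspend, alert an operator) is left open here.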

Still, I think it may be worth considering some solution for this. Entities are also interesting in this context because it is always possible to roll back the entity state when needed. Perhaps experimenting with something like this, but implemented as a library, would be a good starting point, so we can iterate on the design a bit.

1 reply

cgillum Sep 21, 2021
Maintainer

The saga pattern is definitely an interesting one and could be built directly on top of the existing Durable Functions abstraction. In fact, Temporal (almost the same thing as Durable) has a Java sample that does exactly this: https://github.com/temporalio/samples-java/tree/master/src/main/java/io/temporal/samples/bookingsaga.

  public void bookTrip(String name) {
    // Configure SAGA to run compensation activities in parallel
    Saga.Options sagaOptions = new Saga.Options.Builder().setParallelCompensation(true).build();
    Saga saga = new Saga(sagaOptions);
    try {
      String carReservationID = activities.reserveCar(name);
      saga.addCompensation(activities::cancelCar, carReservationID, name);

      String hotelReservationID = activities.bookHotel(name);
      saga.addCompensation(activities::cancelHotel, hotelReservationID, name);

      String flightReservationID = activities.bookFlight(name);
      saga.addCompensation(activities::cancelFlight, flightReservationID, name);
    } catch (ActivityFailure e) {
      saga.compensate();
      throw e;
    }
  }
