Note that to start with a fresh history, as the slot-based zero-downtime deployment approach requires, you just need to change either the TaskHub name or the storage account. In general, changing the TaskHub name is the substantially smaller operational burden: it just changes the prefix on all of the storage resources, which effectively gives you a fresh history. There is still some burden in keeping track of which TaskHub/deployment your instance IDs are associated with, but TaskHubs are generally lighter weight than managing a storage account for each release. In theory, you could even make a composite IDurableClient that queries all of your TaskHubs and combines the results. This wouldn't be the most performant option, as it would make a storage request for each TaskHub, but it would mean you don't have to keep track of which instances live on which TaskHubs. I hope some of these suggestions were helpful.
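The composite-client idea above can be sketched as follows. This is a minimal, hypothetical Python sketch with in-memory stubs: `TaskHubClient` and its instance data stand in for a real Durable client bound to one TaskHub (each hub queried is what would cost one storage request).

```python
# Hypothetical sketch: fan a status query out across several TaskHub
# clients and return the first hit. "TaskHubClient" is a stand-in for
# a real client scoped to one TaskHub (e.g. an IDurableClient); it is
# stubbed here with an in-memory dict so the control flow is visible.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InstanceStatus:
    instance_id: str
    runtime_status: str
    task_hub: str  # recorded so callers learn where the instance lives


class TaskHubClient:
    """Stub for a client scoped to one TaskHub."""

    def __init__(self, hub_name: str, instances: Dict[str, str]):
        self.hub_name = hub_name
        self._instances = instances  # instance_id -> runtime status

    def get_status(self, instance_id: str) -> Optional[InstanceStatus]:
        status = self._instances.get(instance_id)
        if status is None:
            return None
        return InstanceStatus(instance_id, status, self.hub_name)


class CompositeClient:
    """Queries every TaskHub in turn; one storage round-trip per hub."""

    def __init__(self, clients: List[TaskHubClient]):
        self._clients = clients

    def get_status(self, instance_id: str) -> Optional[InstanceStatus]:
        for client in self._clients:
            result = client.get_status(instance_id)
            if result is not None:
                return result
        return None


# Hub names and instance IDs below are made up for illustration.
hubs = CompositeClient([
    TaskHubClient("OrdersBlue", {"order-1": "Completed"}),
    TaskHubClient("OrdersGreen", {"order-2": "Running"}),
])
found = hubs.get_status("order-2")  # found.task_hub == "OrdersGreen"
```

As noted above, this trades extra storage requests for not having to track instance-to-TaskHub ownership yourself.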
I'm looking for some thoughts/guidance on dealing with orchestration/instance history in slot-based zero-downtime deployments. (BTW, calling out that I am overthinking/overcomplicating this is welcome too.)
Background
We are currently using Durable Functions to process transactional orders between a CPQ system and our ERP system. Function chaining (workflow) and reusability were some of the reasons we went with Durable. We are using Table storage, and we track each orchestration instance ID and use it to clear the history, as opposed to using a time-based purge. This is due to two factors:
Enter Slot Based Zero-Downtime deployments
We started looking at Slot-Based Zero-Downtime deployments to improve our reliability and deployment processes, and we think overall this method works best for us. The only uncertainty we have with it is the instance history management detailed below. We also suspect that the upcoming rewind functionality may have some considerations here too (but we have not really dug into this feature yet).
Our analysis so far
The issue we saw was managing instance history with this method. At first glance, it seems that if we have a high number of deployments in a short time, we will need to store something with our instance history record to know which storage account the history is in so we can remove it. We see a few paths to doing this, but fear they will be overly complicated. We also pondered an n-slot type of configuration, but we are limited by the slot maximum and, again, feel this is too complicated.
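The record-keeping described above could be sketched roughly like this. It's a hypothetical sketch: the registry remembers which TaskHub (or storage account) each instance's history lives in, and `purge_history` is a stand-in for whatever purge API the real client exposes (e.g. `PurgeInstanceHistoryAsync` on the .NET Durable client).

```python
# Hypothetical sketch: record, for each orchestration instance, which
# TaskHub its history lives in, so a later cleanup job can purge it
# from the right place. The purge_history callable is a stand-in for
# the real purge API; nothing here talks to actual storage.
from collections import defaultdict
from typing import Callable, Dict, List, Set


class InstanceRegistry:
    def __init__(self) -> None:
        # TaskHub name -> instance IDs whose history lives there
        self._by_hub: Dict[str, Set[str]] = defaultdict(set)

    def record(self, instance_id: str, task_hub: str) -> None:
        self._by_hub[task_hub].add(instance_id)

    def purge(self, purge_history: Callable[[str, str], None]) -> List[str]:
        """Purge every tracked instance from its own hub; return purged IDs."""
        purged: List[str] = []
        for hub, ids in self._by_hub.items():
            for instance_id in sorted(ids):
                purge_history(hub, instance_id)
                purged.append(instance_id)
        self._by_hub.clear()
        return purged


# Hub names and instance IDs below are made up for illustration.
registry = InstanceRegistry()
registry.record("order-1", "OrdersBlue")
registry.record("order-2", "OrdersGreen")
purged = registry.purge(lambda hub, iid: None)  # plug in the real purge call
```

This is the bookkeeping cost the reply above alludes to: the registry itself is cheap, but it has to be durable across deployments, which is part of what makes the per-deployment-storage-account path feel heavy.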
We also looked at clearing ALL instance history against the staging slot by augmenting the suggested function gate (status slot) prior to deployment. In this case we would lose some history before the business items expire, but we are split on whether that is actually a problem for us.
Thank you for your time. We are hoping to find out whether a) we are overthinking/over-engineering this, or b) there has been any thought and/or guidance on this so far. Any thoughts/discussions are greatly appreciated, and questions are very welcome.
Side note, I just wanted to say we are really excited for Distributed Tracing support!
Thanks!
Tim