-
400s (6-7 minutes) for a sub-orchestration seems really excessive. The only reason for performance this bad is that something is crashing. I would definitely look into the memory usage of your app. I also recommend going through our troubleshooting guide, if you haven't already. Note the part about using Azure Function app diagnostics to help diagnose common problems. What size payloads are you passing between activities and sub-orchestrations? Sending large payloads can have a dramatic impact on performance. Lastly, you should not be running any production apps on Functions v3, since that version is out of support. Please upgrade your runtime to Functions v4 ASAP.
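As a sketch of the large-payload mitigation mentioned above: persist big results to blob storage and pass only the blob name between activities and orchestrations, so the Durable runtime only checkpoints a small string. The names (`BuildDataActivity`, `BuildLargeResult`, `LargeResult`, the `payloads` container) are illustrative assumptions, not from this thread:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class BuildDataActivityFunction
{
    [FunctionName("BuildDataActivity")]
    public static async Task<string> BuildData([ActivityTrigger] string input)
    {
        // Hypothetical heavy work; the large object graph never leaves this activity.
        LargeResult result = BuildLargeResult(input);

        var container = new BlobContainerClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "payloads");
        string blobName = $"{Guid.NewGuid():N}.json";
        await container.UploadBlobAsync(blobName, BinaryData.FromObjectAsJson(result));

        // The orchestrator checkpoints this small string instead of megabytes of JSON.
        return blobName;
    }
}
```

Downstream activities would take the blob name as input and download the payload themselves, keeping every message that flows through the Durable task hub small.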
-
Hello @cgillum, is there anything else we should look into?
-
Is there any update on this thread? I'm interested since I'm facing similarly huge CPU usage.
-
My understanding is that hitting the CPU hard via CPU-intensive loops and such causes some sort of storage resource to crash and retry after a delay. #1687 In my durable function, I read a ton of data via API, and this works great. Then when I hit a loop to build 10k large objects in memory with this data, the activity or something supporting it crashes dozens of times before finally getting lucky and succeeding, usually 40-60 minutes later. I've been trying to throttle the CPU usage with Thread.Sleep(), but it only seems to make it worse in the cloud.
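One alternative to throttling with Thread.Sleep() (which blocks a worker thread without reducing the total CPU work) is to split the 10k-object build across many smaller activity calls, so each invocation has a short CPU burst and the runtime can checkpoint between them. This is a sketch under assumed names (`BuildBatchActivity`, the batch size of 500), not the poster's actual code:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class BuildOrchestratorFunction
{
    [FunctionName("BuildOrchestrator")]
    public static async Task<List<string>> Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        const int total = 10_000, batchSize = 500;
        var tasks = new List<Task<string>>();
        for (int start = 0; start < total; start += batchSize)
        {
            // Each activity builds one batch and returns a small reference
            // (e.g. a blob name), keeping per-call CPU and payload sizes modest.
            tasks.Add(context.CallActivityAsync<string>(
                "BuildBatchActivity", new { start, batchSize }));
        }
        var refs = await Task.WhenAll(tasks);
        return refs.ToList();
    }
}
```

Combined with the `maxConcurrentActivityFunctions` throttle in host.json, this bounds how many batches run at once per worker instead of relying on sleeps.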
-
Hello team,
We have a function app running on a dedicated App Service plan (I1V2 with 3 instances). We recently added a durable function that fans out to ~100 sub-orchestration functions, which in turn run ~200 activity functions (each sub-orchestration calls 2 activities). We tried to concurrently 'enqueue' 800 such durable functions and observed dramatic performance degradation and excessive CPU usage.
Each activity function seems to execute relatively fast (~30-155 ms), but the sub-orchestrations seem to take 400s to complete on average, which slows down the completion of the top-level orchestrations (they took 950s to complete). The CPU usage on each worker spikes to >80% when the durable functions are running.
We took a profiler trace on the worker, and most of the CPU time seems to be spent in durable-storage-related modules (>50%), while the CPU time spent in our own function code was relatively small (~7%).
Given our scale (800 concurrent durables * 100 sub-orchestrations * 2 activity functions each) and 3 I1V2 workers, is this expected (perf- and resource-utilization-wise)? Is there anything we can do to improve it (besides scaling the ASP up and out)?
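The fan-out shape described above can be sketched like this (function names and inputs are illustrative, not our actual code):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class FanOutFunctions
{
    [FunctionName("TopOrchestrator")]
    public static async Task RunTop(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Fan out to ~100 sub-orchestrations, then fan in.
        var tasks = new List<Task>();
        for (int i = 0; i < 100; i++)
        {
            tasks.Add(context.CallSubOrchestratorAsync("SubOrchestrator", i));
        }
        await Task.WhenAll(tasks);
    }

    [FunctionName("SubOrchestrator")]
    public static async Task RunSub(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        int input = context.GetInput<int>();
        // Each sub-orchestration runs 2 activities concurrently.
        await Task.WhenAll(
            context.CallActivityAsync("Activity1", input),
            context.CallActivityAsync("Activity2", input));
    }
}
```

With 800 top-level instances this produces ~80,000 sub-orchestration histories and ~160,000 activity messages through the storage provider, which is consistent with most of the profiled CPU showing up in the storage layer.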
Some other information that might help:
We are on .NET 6, Azure Functions v4, in-process model, with Microsoft.Azure.WebJobs.Extensions.DurableTask 2.11.0.
We are using the default host.json; we did try tweaking it a bit, but the default still seems to be the most performant setting.
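For reference, the host.json knobs most relevant to this kind of load are the Durable Task concurrency throttles and the task hub partition count. The values below are illustrative starting points for experimentation, not settings recommended anywhere in this thread:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 10,
      "maxConcurrentOrchestratorFunctions": 5,
      "storageProvider": {
        "partitionCount": 8
      }
    }
  }
}
```

Lowering the two concurrency settings trades throughput for lower per-worker CPU; raising `partitionCount` (up to 16) spreads orchestration control-queue load across more partitions, which only helps when there are enough workers to own them.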