Need help finding the cause of a performance drop after updating to the Azure Functions durable extension v2.6.1 (from v2.5.1) #2358
Replies: 1 comment 1 reply
-
It's hard to say what the source of the slowness is, or whether it's even related to the specific version update. Both versions you're referring to are quite old, though, so I wonder if it would be better to try upgrading to v2.9.0 (the latest as of right now) instead of v2.6.1.

That said, split-brain errors are pretty much never expected, so something appears to be off. What configuration values do you have in your host.json file? Also, are you doing any auto-scaling of your environment? Split-brain was more common back when we used an older partition management scheme and the number of instances hosting an app changed (e.g. scaling from 1 to 10), but we've since defaulted to a cooperative partition management strategy that should get rid of these types of problems.

The other time I've seen split-brain is when multiple instances of the app are running on the same machine, i.e. the machine name environment variable is the same. This confuses our lease mechanism, though it's typically something you see when trying to run multiple instances on a local machine. I assume a containerized app wouldn't have this problem.

Lastly, is this problem persistent, or does it eventually go away? If it eventually goes away, that might suggest it is related to some change in the update, though I didn't see anything obvious when scanning the release notes for v2.6.0 and v2.6.1.
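For reference, here is a minimal host.json sketch (not the asker's actual configuration) showing the durableTask settings most relevant to the points above. The task hub name and numeric values are illustrative assumptions rather than defaults or recommendations, and `useLegacyPartitionManagement` is only meaningful on extension versions that ship the newer partition manager, so verify against the host.json reference for the version in use:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "maxConcurrentOrchestratorFunctions": 10,
      "maxConcurrentActivityFunctions": 10,
      "storageProvider": {
        "partitionCount": 4,
        "maxQueuePollingInterval": "00:00:30",
        "useLegacyPartitionManagement": false
      }
    }
  }
}
```

If the AKS deployment auto-scales, it is also worth confirming that each replica reports a distinct machine name, since identical names can confuse the lease mechanism as described above.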
-
Hi,
I would like to ask for some help or guidance in finding out why there is a drop in performance after updating my app (running inside Azure Kubernetes) to durable extension v2.6.1 (from v2.5.1). There are no code or configuration changes apart from the durable extension library version change. (I didn't update to the latest durable extension because of the issue filed in discussion #2335.)
Below is my app info:
Azure Region: West US
App info: C# .NET 6.0 on Azure Kubernetes (1 orchestrator function with a 3-activity chaining pattern; see the sketch after this list)
ApplicationInsights Name: ncppfc01aiseus01
Storage Account Name: ncppfc01xenstrwus02
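For context, here is a minimal sketch of the orchestration shape described above (one orchestrator chaining three activities) using the in-process Durable Functions 2.x programming model. The function names and payloads are hypothetical placeholders, not the actual app code:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ChainedOrchestration
{
    // Orchestrator: calls three activities sequentially (function-chaining pattern).
    [FunctionName("RunChain")]
    public static async Task<string> RunChain(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        string input = context.GetInput<string>();
        string step1 = await context.CallActivityAsync<string>("Step1", input);
        string step2 = await context.CallActivityAsync<string>("Step2", step1);
        return await context.CallActivityAsync<string>("Step3", step2);
    }

    // Activities: trivial placeholders standing in for the real work.
    [FunctionName("Step1")]
    public static string Step1([ActivityTrigger] string input) => input + ":step1";

    [FunctionName("Step2")]
    public static string Step2([ActivityTrigger] string input) => input + ":step2";

    [FunctionName("Step3")]
    public static string Step3([ActivityTrigger] string input) => input + ":step3";
}
```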
The load test is done with JMeter, firing 8000 requests per minute at the app for 15 minutes. The result and (timing) response of each request is captured and benchmarked.
Load test session (with durable extension v2.6.1): 14-Jan-2023, 09:38:00 to 10:07:00 UTC
Sample orchestration instance id: b921a973-8728-4c32-b37b-ac2b3513e721 (AppInsights link)

Load test session (with durable extension v2.5.1): 14-Jan-2023, 08:06:00 to 08:35:00 UTC
Sample orchestration instance id: e9c62ccc-de90-4954-913a-e34645e40ec8 (AppInsights link)

Here is the AppInsights E2E timeline visualisation. I would like to understand why the orchestrator is slow when the app is updated with durable extension v2.6.1.
I can see some high-delay, PendingOrchestratorMessageLimitReached, and split-brain traces in the trace messages. The occurrence of such traces is much higher with durable extension v2.6.1 than with v2.5.1, under the same configuration and the same load test conditions. Is there any parameter we need to update after upgrading to durable extension v2.6.1?
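As far as I understand, the PendingOrchestratorMessageLimitReached warning comes from the Azure Storage backend when the number of buffered orchestrator messages exceeds the control-queue buffer threshold, after which the worker temporarily stops dequeuing from that control queue. If that warning dominates the traces, the host.json settings below are the most directly related ones; the values shown are illustrative assumptions, not tuned recommendations:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 32,
      "extendedSessionsEnabled": true,
      "storageProvider": {
        "controlQueueBufferThreshold": 256,
        "controlQueueBatchSize": 32
      }
    }
  }
}
```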
Thank you