
Allow concurrent broker restarts within same broker rack #1001

Merged: 11 commits from banzaicloud/parallel-restarts into banzaicloud:master on Aug 4, 2023

Conversation

@ctrlaltluc (Contributor) commented Jun 22, 2023

Description

Cluster restarts (during rolling upgrades) are time consuming for big clusters.

Given that clients might be using the Kafka broker.rack config to spread replicas across N AZs, we can speed up cluster restarts by allowing concurrent broker restarts within the same AZ. Restricting concurrency to a single AZ guarantees that at most 1/N of the replicas of any topic-partition are offline at a time, while the remaining (N-1)/N replicas stay available.
For example, in a common 3-AZ setup, at most 1/3 of the replicas of any topic-partition are offline at a time, while the other 2/3 remain in sync.

Key changes

  • add a configuration parameter, RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount, which defaults to 1
  • allow concurrent broker restarts only if there are no failures or restarts in other AZs, and if the failures or restarts in the same AZ do not exceed the limits set by RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount and RollingUpgradeConfig.FailureThreshold
  • keep the semantics of RollingUpgradeConfig.FailureThreshold unchanged: multiple failures are still allowed (same as before) even if they are not in the same AZ
  • if broker.rack is not properly configured for all brokers, treat each broker as being in its own AZ, i.e. err on the side of caution and restart at most 1 broker at a time
  • concurrent restarts are triggered only if Cruise Control has the RackAwareDistributionGoal configured and it has not been recently violated (either as fixable or unfixable)

Notes

  • to enable multiple concurrent restarts, both RollingUpgradeConfig.ConcurrentBrokerRestartsPerRackCount and RollingUpgradeConfig.FailureThreshold must be set to a value above 1 (illustrated in the sketch below)
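
For illustration, here is a minimal sketch of how the two limits could be combined when deciding whether one more broker may be restarted. This is not the PR's actual implementation; the function and parameter names (canRestartAnotherBroker, impactedRacks, and so on) are hypothetical.

```go
package kafka

// canRestartAnotherBroker is a hypothetical illustration of the rule above:
// a further broker may only be restarted if every already impacted broker
// (failed, restarting or pending) is in the candidate's rack, and neither the
// per-rack concurrency limit nor the overall failure threshold is reached yet.
func canRestartAnotherBroker(candidateRack string, impactedRacks []string,
	concurrentPerRack, failureThreshold int) bool {
	// Brokers without a known rack are treated as each sitting in their own AZ,
	// so an unknown candidate rack falls back to the one-at-a-time behaviour.
	if candidateRack == "" {
		return len(impactedRacks) == 0
	}
	for _, rack := range impactedRacks {
		if rack == "" || rack != candidateRack {
			// Impact outside the candidate's rack (or in an unknown rack):
			// do not add more concurrent restarts.
			return false
		}
	}
	// Both limits must still leave room for one more impacted broker.
	return len(impactedRacks) < concurrentPerRack && len(impactedRacks) < failureThreshold
}
```

With the defaults (ConcurrentBrokerRestartsPerRackCount = 1 and FailureThreshold = 1), this reduces to the previous behaviour of restarting at most one broker at a time.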

Type of Change

  • New Feature

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@panyuenlau (Member) left a comment

Thx for the contribution! Took a quick look at the implementation and it looks good. Will come back for the tests when I get some more time tmr

@bartam1 (Contributor) commented Jun 27, 2023

Hello @ctrlaltluc!
Thank you for the contribution. We are reviewing it.
I can see a problematic case when rack awareness is only enabled after some time. In that case there can be topics without rack awareness, and restarting multiple brokers in the same rack, if that rack holds all of those topics' partitions, can be a problem (com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal is turned off or violated).

What if, instead of restarting based on racks, we implement a check that maps broker-topic-partitions in the Kafka cluster and verifies whether the broker in question holds any partition that does not appear (in sync) on another healthy broker? If there is no such partition, we can restart the broker. It would be even better if, before the broker restart, a new leader election were triggered for the partitions hosted on the broker we want to restart, so consuming and producing would be completely seamless (no need to wait for the controller to elect new leaders). We could introduce a new field specifying how many in-sync replicas should remain during a rolling upgrade (that way, if another broker goes down unintentionally during the rolling upgrade, the cluster can still stay healthy). After the rolling upgrade we should check that everything is in sync and, if so, execute the kafka-preferred-replica-election command so that every leader is the preferred one.

The https://github.com/linkedin/cruise-control/wiki/REST-APIs#demote-a-list-of-brokers-from-the-kafka-cluster endpoint can also be used before we restart a broker. (It can also be used before remove_broker; in that case the remove_broker operation could finish sooner, because there would be no leaders left on the broker being removed, which is problematic under heavy load.)

Executing a demote_broker operation during the rolling upgrade would mean: once the broker is demoted, do the restart and move on to the next broker. If we do this, it is possible that the remaining healthy brokers get too much load.

pkg/resources/kafka/kafka.go
func getBrokerAzMap(cluster *v1beta1.KafkaCluster) map[int32]string {
brokerAzMap := make(map[int32]string)
for _, broker := range cluster.Spec.Brokers {
brokerRack := getBrokerRack(broker.ReadOnlyConfig)
Contributor review comment:
We should take care of the case when the broker config comes from the brokerConfigGroup. We should use (and extend) this function here:

func (b *Broker) GetBrokerConfig(kafkaClusterSpec KafkaClusterSpec) (*BrokerConfig, error) {

@ctrlaltluc (Author) replied:
Can the broker rack be specified via broker config?

I assumed the source of truth for the broker rack is the read-only config: in broker config groups you can define it more freely, not necessarily with rack semantics, and broker.rack needs to be set in the read-only configs for it to take effect.

Please lmk if I am mistaken.
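
For context, a sketch of what a getBrokerRack helper over the read-only config could look like, assuming the read-only config is a newline-separated key=value properties string; this is an illustration only, not necessarily the PR's code.

```go
package kafka

import "strings"

// getBrokerRack scans a broker's read-only properties (one key=value pair per
// line) for the broker.rack key and returns its value, or "" when the property
// is not set. Illustrative only; the PR's actual helper may differ.
func getBrokerRack(readOnlyConfig string) string {
	for _, line := range strings.Split(readOnlyConfig, "\n") {
		key, value, found := strings.Cut(line, "=")
		if found && strings.TrimSpace(key) == "broker.rack" {
			return strings.TrimSpace(value)
		}
	}
	return ""
}
```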

@ctrlaltluc (Author) commented Jul 17, 2023

> Hello @ctrlaltluc! Thank you for the contribution. We are reviewing it. I can see a problematic case when rack awareness is only enabled after some time. In that case there can be topics without rack awareness, and restarting multiple brokers in the same rack, if that rack holds all of those topics' partitions, can be a problem (com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal is turned off or violated).
>
> What if, instead of restarting based on racks, we implement a check that maps broker-topic-partitions in the Kafka cluster and verifies whether the broker in question holds any partition that does not appear (in sync) on another healthy broker? If there is no such partition, we can restart the broker. It would be even better if, before the broker restart, a new leader election were triggered for the partitions hosted on the broker we want to restart, so consuming and producing would be completely seamless (no need to wait for the controller to elect new leaders). We could introduce a new field specifying how many in-sync replicas should remain during a rolling upgrade (that way, if another broker goes down unintentionally during the rolling upgrade, the cluster can still stay healthy). After the rolling upgrade we should check that everything is in sync and, if so, execute the kafka-preferred-replica-election command so that every leader is the preferred one.
>
> The https://github.com/linkedin/cruise-control/wiki/REST-APIs#demote-a-list-of-brokers-from-the-kafka-cluster endpoint can also be used before we restart a broker. (It can also be used before remove_broker; in that case the remove_broker operation could finish sooner, because there would be no leaders left on the broker being removed, which is problematic under heavy load.)
>
> Executing a demote_broker operation during the rolling upgrade would mean: once the broker is demoted, do the restart and move on to the next broker. If we do this, it is possible that the remaining healthy brokers get too much load.

Hi Marton @bartam1! I would like to reach a conclusion here before addressing the other PR comments.
Please note that, as I mentioned in the CRD field description, enabling concurrent restarts requires the CC com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal to be enabled, so that we ensure uniform replica distribution across AZs. The reason for this is not only to ensure that replicas are spread across more than 1 AZ, but also to keep the load balanced in the other AZs while some replicas in a certain AZ are down.

Regarding the suggestion to keep a map of topic-partitions to brokers, I believe this would not be easy to implement in a way that ensures consistency. Topic-partitions might change between queries, and we would have to keep that map consistent with the actual cluster metadata. I would keep this option as a last resort.

Regarding the suggestion to use demote, I agree it would be safe, but it would also make the restart much slower and might not be necessary. I first want to understand whether it is necessary. Our foremost motivation for this feature was to increase the speed of cluster restarts during rolling upgrades, as 60-broker clusters currently take a long time to restart (3h+).

Regarding the problem you raised, I'll try to summarize to make sure I understand that what you mention is not covered and needs addressing. Please correct me if I am wrong.

Problematic use case:

  • Kafka clusters with multiple brokers, broker.rack disabled
  • client sets proper broker.rack read-only config for all brokers
  • Kafka starts enforcing rack distribution
  • before this finishes, the client enables concurrent broker restarts
    Expected behavior: brokers are not yet restarted concurrently
    Actual behavior: brokers are restarted concurrently and might take down all replicas of a topic, as rack distribution has not yet been finalized by Kafka and CC

Is this what you mean is not covered?

If this is accurate, then I would argue that:

  • the prerequisite of having the CC com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal enabled was not met
  • if the above is enabled and CC starts a replica rebalance, the rolling restart will be prevented anyway as long as the CC task is running

Wdyt?

@bartam1 (Contributor) commented Jul 20, 2023

> if the above is enabled and CC starts a replica rebalance, the rolling restart will be prevented anyway as long as the CC task is running

Hello @ctrlaltluc,
Yes, the problematic use case is the one you described. I'm not sure about the quoted part above: I think the rolling restart will not be prevented in that case, and we also cannot be sure that the RackAwareDistributionGoal self-healing operation is started at all.

This is an edge case.

If we want to be sure, we can check two things before restarting multiple brokers:

  1. Is the RackAwareDistributionGoal set in the CC configuration?
  2. Is there a RackAwareDistributionGoal violation (CC state / AnomalyDetectorState / recentGoalViolations)?

What do you think?

Regarding demoting: it happens instantly when the target broker is in sync. I tested this with a continuous producer and consumer, and there was no communication interruption during demotion. In contrast, when there is an active producer to the leader and a rolling upgrade happens without demotion, there is an interruption until the new leader is elected. I'm not sure whether the mapping and demotion approach I described is worth the effort to implement; I don't have much experience with production Kafka clusters. If it is worth it, it can be done. I saw it as a possible improvement and was curious about your opinion.

@ctrlaltluc (Author) replied:

> I was curious about your opinion.

Thanks @bartam1! Your point makes perfect sense.

I will add these additional checks before the decision to restart more than 1 broker:

  • RackAwareDistributionGoal is set in the CC configuration
  • RackAwareDistributionGoal is not violated

I will have a look into demoting too, as it would make things smoother for clients (no interruptions). Will come back on this with either code changes or a comment.
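
As an illustration of the second check, here is a rough sketch of querying Cruise Control's state endpoint for recent goal violations. It assumes CC's /kafkacruisecontrol/state endpoint with the ANOMALY_DETECTOR substate and JSON output; the AnomalyDetectorState/recentGoalViolations path follows the discussion above, while the exact JSON keys of the violation entries and the service URL are assumptions, not the PR's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// ccState models only the part of the CC state response needed here. The
// AnomalyDetectorState/recentGoalViolations path follows the discussion above;
// the per-violation field names are assumptions and may differ between CC versions.
type ccState struct {
	AnomalyDetectorState struct {
		RecentGoalViolations []struct {
			FixableViolatedGoals   []string `json:"fixableViolatedGoals"`
			UnfixableViolatedGoals []string `json:"unfixableViolatedGoals"`
		} `json:"recentGoalViolations"`
	} `json:"AnomalyDetectorState"`
}

// rackAwareDistributionGoalRecentlyViolated queries the Cruise Control state
// endpoint and reports whether RackAwareDistributionGoal appears among the
// recent goal violations (fixable or unfixable).
func rackAwareDistributionGoalRecentlyViolated(ccBaseURL string) (bool, error) {
	resp, err := http.Get(ccBaseURL + "/kafkacruisecontrol/state?substates=ANOMALY_DETECTOR&json=true")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var state ccState
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		return false, err
	}
	for _, violation := range state.AnomalyDetectorState.RecentGoalViolations {
		// CC may report simple or fully qualified goal class names.
		for _, goal := range violation.FixableViolatedGoals {
			if strings.Contains(goal, "RackAwareDistributionGoal") {
				return true, nil
			}
		}
		for _, goal := range violation.UnfixableViolatedGoals {
			if strings.Contains(goal, "RackAwareDistributionGoal") {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	// Hypothetical in-cluster Cruise Control address.
	violated, err := rackAwareDistributionGoalRecentlyViolated("http://kafka-cruisecontrol-svc:8090")
	if err != nil {
		fmt.Println("could not query Cruise Control state:", err)
		return
	}
	fmt.Println("RackAwareDistributionGoal recently violated:", violated)
}
```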

@ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from 3398dc9 to 43a982a on July 21, 2023 14:30
@bartam1 (Contributor) commented Jul 22, 2023

> > I was curious about your opinion.
>
> Thanks @bartam1! Your point makes perfect sense.
>
> I will add these additional checks before the decision to restart more than 1 broker:
>
> * RackAwareDistributionGoal is set in the CC configuration
> * RackAwareDistributionGoal is not violated
>
> I will have a look into demoting too, as it would make things smoother for clients (no interruptions). Will come back on this with either code changes or a comment.

Thank you for doing this!

@ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from 43a982a to 69eee7b on July 26, 2023 10:19
@ctrlaltluc (Author) left a comment

@bartam1 @panyuenlau I have addressed all comments on the PR, including the check for RackAwareDistributionGoal. Please have another look, thanks in advance for your time!

@bartam1 I have not implemented broker demotion, as it is not necessary. By default, Kafka brokers have controlled.shutdown.enable set to true, which means any graceful Kafka shutdown (i.e. anything except SIGKILL) already ensures that leaders are moved off the broker.
Later edit: I also tested demote myself, and the end behavior is the same as a normal broker restart (leaders are moved). However, demote has some downsides: 1/ it takes a long time (~10 min per broker for brokers with ~4500 replicas), as demoting moves the demoted broker to the end of every in-sync replica set and this takes time even though there is no data movement; 2/ we would later need to add the demoted broker back, as it would otherwise remain in the recently demoted brokers list, which complicates the code without added benefit.

Please note that the tests will still be failing, until #1002 is merged and a new API module release is done.

@ctrlaltluc marked this pull request as ready for review on July 26, 2023 10:27
@ctrlaltluc requested a review from a team as a code owner on July 26, 2023 10:27
panyuenlau previously approved these changes Jul 26, 2023
@panyuenlau (Member) left a comment

Apologies - was zoned out doing my own implementation. Thx for the great work! Only left a couple of quick comments (more like nits).

return brokerAzMap
}

func getBrokerRack(readOnlyConfig string) string {
@panyuenlau (Member) commented:
Question: It looks like the latest implementation doesn't require this func; can we remove it since it's not being used?

@ctrlaltluc (Author) replied:
Oh, good catch! Forgot about it. Will address in another commit, thanks for paying attention!

}
}

func getBrokerAzMap(cluster *v1beta1.KafkaCluster) map[int32]string {
@panyuenlau (Member) commented:
Suggestion: It'd be appreciated if we could add some quick unit tests for funcs like this one, but since the caller handleRollingUpgrade is tested, I am not going to ask for any more changes.

@ctrlaltluc (Author) replied:
Let's do this properly. I added unit tests for it in cae78f5, please have a look. Thanks!

P.S. I saw you were doing heavy lifting with kraft, so no worries and thanks for the review. :)

@panyuenlau (Member) commented:

FYI - The new API tag https://github.com/banzaicloud/koperator/tree/api/v0.28.7 should be available soon

@ctrlaltluc force-pushed the banzaicloud/parallel-restarts branch from cae78f5 to 997f46e on July 28, 2023 07:45
@ctrlaltluc (Author) commented:

@panyuenlau @bartam1 the PR is now final and ready for review. CI fails because of the flaky nodeports test (tried once to retrigger it with an empty commit, gave up), otherwise all tests are passing. Thanks for your patience!

@bartam1 (Contributor) commented Jul 30, 2023

I think #1023 can have an effect on this PR.
Maybe we should not do multiple broker restarts when the brokers are controllers: when multiple brokers (controllers) are restarting, there is a bigger chance that there are not enough controllers left to form a quorum.

@ctrlaltluc (Author) commented:

> I think #1023 can have an effect on this PR. Maybe we should not do multiple broker restarts when the brokers are controllers: when multiple brokers (controllers) are restarting, there is a bigger chance that there are not enough controllers left to form a quorum.

@bartam1 I had a look at #1023 and indeed, if KRaft is enabled and the controllers are not spread across AZs (racks), then there are edge cases where multiple restarts within the same rack/AZ might be an issue.

I see 2 options to move forward:

  1. Merge this PR as it is now; then, when the Initial KRaft support PR #1023 is merged, we will have spec.kraft available and can add another check: if concurrent restarts > 1, terminating or pending pods > 1, and any controller in the same AZ/rack is among the terminating or pending pods, then do not restart another pod (to ensure at most 1 KRaft controller is restarted at a time); see the sketch after this comment.
  2. Wait until Initial KRaft support #1023 is merged, then add the check above in this PR.

@bartam1 @panyuenlau lmk your thoughts.
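
Purely to illustrate option 1, a hedged sketch of the extra guard is shown below; spec.kraft was not available at the time of this discussion, so the kraftEnabled flag, the podInfo shape, and the helper name are all hypothetical.

```go
package kafka

// podInfo is a hypothetical minimal shape carrying only what the check needs.
type podInfo struct {
	isController bool
	rack         string
}

// shouldBlockConcurrentRestart sketches the extra guard from option 1: when
// KRaft mode is enabled and a controller in the candidate's rack is already
// terminating or pending, do not restart another pod, so that at most one
// KRaft controller is down at a time.
func shouldBlockConcurrentRestart(kraftEnabled bool, impactedPods []podInfo, candidateRack string) bool {
	if !kraftEnabled || len(impactedPods) == 0 {
		return false
	}
	for _, pod := range impactedPods {
		if pod.isController && pod.rack == candidateRack {
			return true // an impacted controller already exists in this rack
		}
	}
	return false
}
```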

@bartam1 (Contributor) commented Aug 3, 2023

> > I think #1023 can have an effect on this PR. Maybe we should not do multiple broker restarts when the brokers are controllers: when multiple brokers (controllers) are restarting, there is a bigger chance that there are not enough controllers left to form a quorum.
>
> @bartam1 I had a look at #1023 and indeed, if KRaft is enabled and the controllers are not spread across AZs (racks), then there are edge cases where multiple restarts within the same rack/AZ might be an issue.
>
> I see 2 options to move forward:
>
> 1. Merge this PR as it is now; then, when the [Initial KRaft support #1023](https://github.com/banzaicloud/koperator/pull/1023) PR is merged, we will have `spec.kraft` available and can add another check: if concurrent restarts > 1, terminating or pending pods > 1, and any controller in the same AZ/rack is among the terminating or pending pods, then do not restart another pod (to ensure at most 1 KRaft controller is restarted at a time).
> 2. Wait until [Initial KRaft support #1023](https://github.com/banzaicloud/koperator/pull/1023) is merged, then add the check above in this PR.
>
> @bartam1 @panyuenlau lmk your thoughts.

I agree. We should merge this PR. We can add modifications later if needed.

@ctrlaltluc requested a review from bartam1 on August 3, 2023 14:29
@bartam1 (Contributor) left a comment

LGTM!
Thank you for implementing it with consideration of the edge cases.

@panyuenlau (Member) left a comment

Thx for considering the potential impact on KRaft support @bartam1 @ctrlaltluc,
will keep this new feature in mind when we get there during the KRaft implementation.

@ctrlaltluc (Author) commented:

Thanks for your thorough reviews as well! Looking forward to getting this merged.

@bartam1 merged commit 4329189 into banzaicloud:master on Aug 4, 2023
4 checks passed
@ctrlaltluc deleted the banzaicloud/parallel-restarts branch on August 7, 2023 09:12