Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ConcurrentBrokerRestartCountPerRack to RollingUpgradeConfig #1002

Merged

Conversation

ctrlaltluc
Copy link
Contributor

@ctrlaltluc ctrlaltluc commented Jun 22, 2023

Description

Add ConcurrentBrokerRestartCountPerRack property to RollingUpgradeConfig, in order to allow multiple brokers to be restarted in parallel during a rolling upgrade.

This PR is a prerequisite for #1001.

Type of Change

  • New Feature

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

Copy link
Contributor

@bartam1 bartam1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just few comments. Otherwise LGTM.
Thank you for the contribution!

api/v1beta1/kafkacluster_types.go Outdated Show resolved Hide resolved
@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch 2 times, most recently from 529bf71 to 6367359 Compare July 14, 2023 10:03
@ctrlaltluc
Copy link
Contributor Author

ctrlaltluc commented Jul 14, 2023

@bartam1 Hi Marton, thanks a lot for the thorough reviews! Was on holiday until now. All your comments make sense, will have a look on your suggestions on the other PR too. Addressed all comments on this one.

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch from 6367359 to d50401a Compare July 17, 2023 07:06
@ctrlaltluc
Copy link
Contributor Author

Rebased again with master, ready for final review.

@ctrlaltluc ctrlaltluc requested a review from bartam1 July 17, 2023 07:35
@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch from d50401a to dab5a01 Compare July 21, 2023 06:23
@ctrlaltluc ctrlaltluc changed the title Add ConcurrentBrokerRestartsAllowed to RollingUpgradeConfig Add ConcurrentBrokerRestartCountPerRack to RollingUpgradeConfig Jul 21, 2023
@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch from d9d3e15 to 9daa903 Compare July 21, 2023 10:52
Copy link
Contributor

@bartam1 bartam1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch from 9daa903 to 7cbaea8 Compare July 26, 2023 06:37
@ctrlaltluc
Copy link
Contributor Author

@panyuenlau this PR requires one more review to be merged. Can you help please? Asking you as you already had a look on #1001 and this one is a prerequisite for that. Thanks! :)

Copy link
Member

@panyuenlau panyuenlau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed the main implementation - looks great

@ctrlaltluc ctrlaltluc force-pushed the banzaicloud/parallel-restarts-api branch from 7cbaea8 to 777a971 Compare July 27, 2023 13:26
@ctrlaltluc
Copy link
Contributor Author

Rebased again with master (no other changes) in hopes to have it merged and there is an integration test failing. First time I see it failing here.
Probably a flaky one? Is there any way to re-run it?

[FAILED] [120.019 seconds]
KafkaClusterNodeportExternalAccess [JustBeforeEach] when NodePortExternalIP is configured reconciles the service successfully
  [JustBeforeEach] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:78
  [It] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:147

@panyuenlau
Copy link
Member

Rebased again with master (no other changes) in hopes to have it merged and there is an integration test failing. First time I see it failing here. Probably a flaky one? Is there any way to re-run it?

[FAILED] [120.019 seconds]
KafkaClusterNodeportExternalAccess [JustBeforeEach] when NodePortExternalIP is configured reconciles the service successfully
  [JustBeforeEach] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:78
  [It] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:147

Context: The controller test failure "spec.ports[0].nodePort: Invalid value: 31125: provided port is already allocated" is seen from time to time in GHA

My guess: this is probably because the underlying node that GHA uses to run the tests happens to be the same one that was used in previous run, and GHA's infrastructure didn't get to clean up the allocated port in time

@panyuenlau panyuenlau merged commit a384f4b into banzaicloud:master Jul 27, 2023
4 checks passed
panyuenlau pushed a commit that referenced this pull request Jul 27, 2023
* Add ConcurrentBrokerRestartCountPerRack to RollingUpgradeConfig
@ctrlaltluc ctrlaltluc deleted the banzaicloud/parallel-restarts-api branch July 28, 2023 07:58
@ctrlaltluc
Copy link
Contributor Author

Rebased again with master (no other changes) in hopes to have it merged and there is an integration test failing. First time I see it failing here. Probably a flaky one? Is there any way to re-run it?

[FAILED] [120.019 seconds]
KafkaClusterNodeportExternalAccess [JustBeforeEach] when NodePortExternalIP is configured reconciles the service successfully
  [JustBeforeEach] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:78
  [It] /home/runner/work/koperator/koperator/controllers/tests/kafkacluster_controller_externalnodeport_test.go:147

Context: The controller test failure "spec.ports[0].nodePort: Invalid value: 31125: provided port is already allocated" is seen from time to time in GHA

My guess: this is probably because the underlying node that GHA uses to run the tests happens to be the same one that was used in previous run, and GHA's infrastructure didn't get to clean up the allocated port in time

@panyuenlau understood.

We thought it only happens in the Adobe fork.
We attempted to fix this in adobe#67, by explicitly deleting the nodeports. I thought it was related to envtest not doing garbage collection (when KafkaCluster is deleted, even if it is a parent of those nodeports, the nodeports are not deleted):

One common example is garbage collection; because there are no controllers monitoring built-in resources, objects do not get deleted, even if an OwnerReference is set up; as a consequence, usually test implements code for cleaning up created objects.

But it seems explicitly deleting the nodeports does not fix it either. We still saw it afterwards, on occasion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants