
doesn't clean up invalid target groups #475

Closed
universam1 opened this issue Feb 8, 2022 · 10 comments · Fixed by #476

Comments

@universam1
Contributor

universam1 commented Feb 8, 2022

Invalid registrations of non-existent target groups on autoscaling groups are not removed, causing errors such as blocked scaling.

Invalid registrations can appear when CFN replaces a target group which happens due to various reasons.

Looks like a bug: the controller keeps non-existent target groups registered because it cannot determine ownership, given that the TG is gone and thus not enumerable.
This logic could probably be simplified to remove any invalid registration, even when ownership can no longer be determined.

@AlexanderYastrebov @szuecs
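The proposed cleanup boils down to a set difference. The following is a minimal sketch with a hypothetical helper, not the controller's actual code: any ARN registered on the ASG that no longer exists in ELBv2 is stale and safe to detach.

```go
package main

import "fmt"

// staleTargetGroups returns the registered ARNs that no longer exist.
// Hypothetical helper illustrating the proposed cleanup; the real
// controller would enumerate both sets via the AWS SDK.
func staleTargetGroups(registered []string, existing map[string]bool) []string {
	var stale []string
	for _, arn := range registered {
		if !existing[arn] {
			stale = append(stale, arn)
		}
	}
	return stale
}

func main() {
	registered := []string{"arn:tg/replaced-by-cfn", "arn:tg/current"}
	existing := map[string]bool{"arn:tg/current": true}
	fmt.Println(staleTargetGroups(registered, existing)) // [arn:tg/replaced-by-cfn]
}
```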

@szuecs
Member

szuecs commented Feb 8, 2022

The problem arises if you mix manually created Load Balancers with our CF stack related ones. If you clean up the ones that don't have our tag, you clean up things you do not own.

@universam1
Contributor Author

universam1 commented Feb 9, 2022

The problem arises if you mix manually created Load Balancers with our CF stack related ones. If you clean up the ones that don't have our tag, you clean up things you do not own.

This issue is only about cleaning up non-existent, thus invalid, target groups. I cannot think of a reason why it should not remove such broken links in any case. It could be a simple check whether the linked TG ARN exists, deregistering it otherwise.

@szuecs
Member

szuecs commented Feb 9, 2022

@universam1 but you link to a code path which tests for ownership. If you want to add some cleanup code here you would need to check the tag anyway, because we should only care about our resources. Other controllers need to keep track of theirs, and manually created ones need to be manually maintained, or you have a custom control loop that cleans up "broken things".
I don't see that this controller cleans up resources created by others, and I do not believe right now that Cloudformation will create such a state. So right now it would be a won't fix, IMO.

@mikkeloscar
Collaborator

@universam1 @szuecs We do try to cleanup non-existing target groups:

// find non-existing target groups which should be detached

If there are cases not handled by the current logic then it makes sense to handle them, especially if it is caused by the controller. We have a disconnect between target group creation/deletion (Cloudformation) and attaching to the ASG, which is done in the referenced controller code. As already stated, checking for the ownership of target groups and checking for non-existing target groups is not the same.

@universam1
Contributor Author

universam1 commented Feb 9, 2022

@szuecs Maybe I was too deep into the solution to explain clearly enough that this is a bug created by this controller.
I believe there are other cases, such as #446, that will cause a replacement of the TG CFN resource, but at least this one here you can reproduce consistently:

  1. change access from HostPort to AWSCNI
  2. CFN creates a new TG
  3. CFN deletes the TG linked to the ASG by this controller
  4. The ASG is now in error state

@universam1
Contributor Author

@szuecs @mikkeloscar I've run some more debug sessions and found the actual bug: the cleanup would work as expected, but it is simply not called. Thanks for the discussion.

Given that there is only one TG and it is of type IP created by the stack, we are not calling `updateTargetGroupsForAutoScalingGroup` -> `detachTargetGroupsFromAutoScalingGroup`

// don't do anything if there are no target groups
if len(targetGroupARNs) == 0 {
	return
}
ownerTags := map[string]string{
	clusterIDTagPrefix + a.ClusterID(): resourceLifecycleOwned,
	kubernetesCreatorTag:               a.controllerID,
}
for _, asg := range a.TargetedAutoScalingGroups {
	// This call is idempotent and safe to execute every time
	if err := updateTargetGroupsForAutoScalingGroup(a.autoscaling, a.elbv2, targetGroupARNs, asg.name, ownerTags); err != nil {

That means this condition is oversimplified
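The effect of that early return can be illustrated with a small self-contained sketch (names and the `guardOutside` flag are illustrative, not the controller's code): once the desired set of target groups becomes empty, the stale attachments on the ASG are never reconciled.

```go
package main

import "fmt"

// reconcile computes which attached ARNs should be detached. With
// guardOutside=true it mimics the early return before the update call:
// an empty desired set short-circuits and stale attachments survive.
func reconcile(desired, attached []string, guardOutside bool) (detached []string) {
	if guardOutside && len(desired) == 0 {
		return nil // early return: stale attachments are never detached
	}
	want := map[string]bool{}
	for _, arn := range desired {
		want[arn] = true
	}
	for _, arn := range attached {
		if !want[arn] {
			detached = append(detached, arn)
		}
	}
	return detached
}

func main() {
	attached := []string{"arn:tg/deleted-by-cfn"}
	fmt.Println(reconcile(nil, attached, true))  // [] — the bug: nothing is detached
	fmt.Println(reconcile(nil, attached, false)) // [arn:tg/deleted-by-cfn]
}
```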

@universam1
Contributor Author

universam1 commented Feb 9, 2022

From my tests removing this condition does solve the problem just fine!

if len(targetGroupARNs) == 0 {
	return
}

However, this practically reverts #141 and might open up #139 again, so I'm wondering if we can cover #141 differently @mikkeloscar, such as by filtering calls where `targetGroupARNs` is `nil`?

@mikkeloscar
Collaborator

@universam1 We should just move the `if len(targetGroupARNs) == 0 {` logic inside `updateTargetGroupsForAutoScalingGroup` so it has the same effect as #141 but can also clean up non-existing target groups from the ASGs.
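That suggestion can be sketched as follows. This is a hedged illustration with illustrative names, not the controller's actual implementation: the empty check only guards the attach path, so detaching non-existing target groups always runs.

```go
package main

import "fmt"

// updateTargetGroupsForASG sketches moving the guard inside the update
// function: detach always runs, attach is skipped when desired is empty.
func updateTargetGroupsForASG(desired, attached []string) (attach, detach []string) {
	want := map[string]bool{}
	for _, arn := range desired {
		want[arn] = true
	}
	for _, arn := range attached {
		if !want[arn] {
			detach = append(detach, arn) // cleanup runs even when desired is empty
		}
	}
	if len(desired) == 0 {
		return nil, detach // same effect as the old guard, but only for attaching
	}
	have := map[string]bool{}
	for _, arn := range attached {
		have[arn] = true
	}
	for _, arn := range desired {
		if !have[arn] {
			attach = append(attach, arn)
		}
	}
	return attach, detach
}

func main() {
	attach, detach := updateTargetGroupsForASG(nil, []string{"arn:tg/gone"})
	fmt.Println(attach, detach) // [] [arn:tg/gone]
}
```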

@universam1
Contributor Author

Thanks - sounds great!
Would you agree, @mikkeloscar, that this is already implemented in the meantime via #436?

attachARNs := make([]string, 0, len(targetGroupARNs))
for _, tgARN := range targetGroupARNs {
	if _, ok := allTGs[tgARN]; ok {
		attachARNs = append(attachARNs, tgARN)
	} else {
		// TODO: it is better to validate stack's target groups earlier to identify owning stack
		log.Errorf("Target group %q does not exist, will not attach", tgARN)
	}
}
if len(attachARNs) > 0 {

universam1 added a commit to o11n/kube-ingress-aws-controller that referenced this issue Feb 9, 2022
fixes: zalando-incubator#475

Given that there is only one TG and it is of type IP created by the stack, we are not calling `updateTargetGroupsForAutoScalingGroup` -> `detachTargetGroupsFromAutoScalingGroup`

A possible solution is to move this guard inside `updateTargetGroupsForAutoScalingGroup` - see zalando-incubator#475 (comment)

However, the guard introduced in zalando-incubator#141 is practically present now inside `updateTargetGroupsForAutoScalingGroup`, introduced by zalando-incubator#436, and therefore we can simply remove it.

Tested in lab clusters.

Signed-off-by: Samuel Lang <[email protected]>
@szuecs
Member

szuecs commented Feb 9, 2022

Yes, I agree that this is already implemented in the meantime via #436.

mikkeloscar pushed a commit that referenced this issue Feb 10, 2022
fixes: #475
