
doesn't clean up invalid target groups #475

Closed
universam1 opened this issue Feb 8, 2022 · 10 comments · Fixed by #476

Comments

@universam1
Contributor

universam1 commented Feb 8, 2022

Invalid registrations of non-existent target groups on autoscaling groups are not removed, causing errors such as blocked scaling.

Invalid registrations can appear when CFN replaces a target group which happens due to various reasons.

Looks like a bug: the controller keeps non-existent target groups registered because it cannot determine ownership, given that the TG is gone and thus not enumerable.
This logic could probably be simplified to remove any invalid registration, even when ownership can no longer be determined.

@AlexanderYastrebov @szuecs
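The proposed cleanup boils down to a set difference. The following is a minimal sketch with a hypothetical helper, not the controller's actual code: any ARN registered on the ASG that no longer exists in ELBv2 is stale and safe to detach.

```go
package main

import "fmt"

// staleTargetGroups returns the registered ARNs that no longer exist.
// Hypothetical helper illustrating the proposed cleanup; the real
// controller would enumerate both sets via the AWS SDK.
func staleTargetGroups(registered []string, existing map[string]bool) []string {
	var stale []string
	for _, arn := range registered {
		if !existing[arn] {
			stale = append(stale, arn)
		}
	}
	return stale
}

func main() {
	registered := []string{"arn:tg/replaced-by-cfn", "arn:tg/current"}
	existing := map[string]bool{"arn:tg/current": true}
	fmt.Println(staleTargetGroups(registered, existing)) // [arn:tg/replaced-by-cfn]
}
```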

@szuecs
Member

szuecs commented Feb 8, 2022

The problem arises if you mix manually created Load Balancers with our CF stack related ones. If you clean up the ones that don't have our tag, you clean up things you do not own.

@universam1
Contributor Author

universam1 commented Feb 9, 2022

The problem arises if you mix manually created Load Balancers with our CF stack related ones. If you clean up the ones that don't have our tag, you clean up things you do not own.

This issue is only about cleaning up non-existent, thus invalid, target groups. I cannot think of a reason why it should not remove such broken links in any case. It could be a simple check whether the linked TG ARN exists, deregistering it otherwise.

@szuecs
Member

szuecs commented Feb 9, 2022

@universam1 but you link to a code path which tests for ownership. If you want to add some cleanup code here you would need to check the tag anyway, because we should only care about our resources. Other controllers need to keep track of theirs, and manually created ones need to be manually maintained, or you have a custom control loop that cleans up "broken things".
I don't see that this controller cleans up resources created by others, and I do not believe right now that Cloudformation will create such a state. So right now it would be a won't fix, IMO.

@mikkeloscar
Collaborator

@universam1 @szuecs We do try to cleanup non-existing target groups:

// find non-existing target groups which should be detached

If there are cases not handled by the current logic then it makes sense to handle them, especially if it is caused by the controller. We have a disconnect between target group creation/deletion (Cloudformation) and attaching to the ASG, which is done in the referenced controller code. As already stated, checking for the ownership of target groups and checking for non-existing target groups is not the same.

@universam1
Contributor Author

universam1 commented Feb 9, 2022

@szuecs Maybe I was too deep into the solution to explain clearly enough that this is a bug created by this controller.
I believe there are other cases, such as #446, that will cause a replacement of the TG CFN resource, but at least this one here you can reproduce consistently:

  1. change access from HostPort to AWSCNI
  2. CFN creates a new TG
  3. CFN deletes the TG linked to the ASG by this controller
  4. The ASG is now in error state

@universam1
Contributor Author

@szuecs @mikkeloscar I've run some more debug sessions and found the actual bug: the cleanup would work as expected, but it is simply not called. Thanks for the discussion.

Given that there is only one TG and it is of type IP created by the stack, we are not calling `updateTargetGroupsForAutoScalingGroup` -> `detachTargetGroupsFromAutoScalingGroup`

// don't do anything if there are no target groups
if len(targetGroupARNs) == 0 {
	return
}
ownerTags := map[string]string{
	clusterIDTagPrefix + a.ClusterID(): resourceLifecycleOwned,
	kubernetesCreatorTag:               a.controllerID,
}
for _, asg := range a.TargetedAutoScalingGroups {
	// This call is idempotent and safe to execute every time
	if err := updateTargetGroupsForAutoScalingGroup(a.autoscaling, a.elbv2, targetGroupARNs, asg.name, ownerTags); err != nil {

That means this condition is oversimplified
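The effect of that early return can be illustrated with a small self-contained sketch (names and the `guardOutside` flag are illustrative, not the controller's code): once the desired set of target groups becomes empty, the stale attachments on the ASG are never reconciled.

```go
package main

import "fmt"

// reconcile computes which attached ARNs should be detached. With
// guardOutside=true it mimics the early return before the update call:
// an empty desired set short-circuits and stale attachments survive.
func reconcile(desired, attached []string, guardOutside bool) (detached []string) {
	if guardOutside && len(desired) == 0 {
		return nil // early return: stale attachments are never detached
	}
	want := map[string]bool{}
	for _, arn := range desired {
		want[arn] = true
	}
	for _, arn := range attached {
		if !want[arn] {
			detached = append(detached, arn)
		}
	}
	return detached
}

func main() {
	attached := []string{"arn:tg/deleted-by-cfn"}
	fmt.Println(reconcile(nil, attached, true))  // [] — the bug: nothing is detached
	fmt.Println(reconcile(nil, attached, false)) // [arn:tg/deleted-by-cfn]
}
```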

@universam1
Contributor Author

universam1 commented Feb 9, 2022

From my tests removing this condition does solve the problem just fine!

if len(targetGroupARNs) == 0 {
	return
}

However, this practically reverts #141 and might open up #139 again, so I'm wondering if we can cover #141 differently @mikkeloscar, such as by filtering calls where `targetGroupARNs` is `nil`?

@mikkeloscar
Collaborator

@universam1 We should just move the `if len(targetGroupARNs) == 0 {` logic inside `updateTargetGroupsForAutoScalingGroup` so it has the same effect as #141 but can also clean up non-existing target groups from the ASGs.
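That suggestion can be sketched as follows. This is a hedged illustration with illustrative names, not the controller's actual implementation: the empty check only guards the attach path, so detaching non-existing target groups always runs.

```go
package main

import "fmt"

// updateTargetGroupsForASG sketches moving the guard inside the update
// function: detach always runs, attach is skipped when desired is empty.
func updateTargetGroupsForASG(desired, attached []string) (attach, detach []string) {
	want := map[string]bool{}
	for _, arn := range desired {
		want[arn] = true
	}
	for _, arn := range attached {
		if !want[arn] {
			detach = append(detach, arn) // cleanup runs even when desired is empty
		}
	}
	if len(desired) == 0 {
		return nil, detach // same effect as the old guard, but only for attaching
	}
	have := map[string]bool{}
	for _, arn := range attached {
		have[arn] = true
	}
	for _, arn := range desired {
		if !have[arn] {
			attach = append(attach, arn)
		}
	}
	return attach, detach
}

func main() {
	attach, detach := updateTargetGroupsForASG(nil, []string{"arn:tg/gone"})
	fmt.Println(attach, detach) // [] [arn:tg/gone]
}
```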

@universam1
Contributor Author

Thanks - sounds great!
Would you agree, @mikkeloscar, that this is already implemented in the meantime via #436?

attachARNs := make([]string, 0, len(targetGroupARNs))
for _, tgARN := range targetGroupARNs {
	if _, ok := allTGs[tgARN]; ok {
		attachARNs = append(attachARNs, tgARN)
	} else {
		// TODO: it is better to validate stack's target groups earlier to identify owning stack
		log.Errorf("Target group %q does not exist, will not attach", tgARN)
	}
}
if len(attachARNs) > 0 {

universam1 added a commit to o11n/kube-ingress-aws-controller that referenced this issue Feb 9, 2022
fixes: zalando-incubator#475

Given that there is only one TG and it is of type IP created by the stack, we are not calling `updateTargetGroupsForAutoScalingGroup` -> `detachTargetGroupsFromAutoScalingGroup`

A possible solution is to move this guard inside `updateTargetGroupsForAutoScalingGroup` - see zalando-incubator#475 (comment)

However, the guard introduced in zalando-incubator#141 is practically present now inside `updateTargetGroupsForAutoScalingGroup`, introduced by zalando-incubator#436, and therefore we can simply remove it.

Tested in lab clusters.

Signed-off-by: Samuel Lang <[email protected]>
@szuecs
Member

szuecs commented Feb 9, 2022

Yes, I agree that this is already implemented in the meantime via #436.

mikkeloscar pushed a commit that referenced this issue Feb 10, 2022
fixes: #475
