Delay the initialization of ARP/NDP responders #6700

Open
@xliuxu wants to merge 1 commit into main from xliuxu/delay-initialize-responders

@xliuxu (Contributor) commented Sep 29, 2024

In secondary-network scenarios, the transport interface can change after the agent has started. The ARP/NDP responders should therefore be started after secondary-network initialization, so that they bind to the transport interface under its new index.

This change also addresses the following issues:

  • The NDP responder may fail to bind to the new interface because of the Duplicate Address Detection (DAD) process.
  • Go caches the zone index for the interface, which may cause the NDP responder to bind to the stale interface.

Fixes: #6623

Comment on lines 995 to 1010
if egressController != nil {
	go egressController.Run(stopCh)
}

if externalIPController != nil {
	go externalIPController.Run(stopCh)
}

Contributor

My concern with changing the order is that it is somewhat arbitrary, it can introduce new unexpected issues, and it limits what we can do in the future. For example, we may later want to make flowRestoreCompleteWait depend on the realization of Egress policies. That would make sense: delaying the removal of flow-restore-wait until Egress policy flows have been installed would provide a more consistent datapath on (re)start. See #6342 for more context.

However, we know that SecondaryNetwork initialization already depends on flowRestoreCompleteWait. This dependency is important and AFAIK cannot be broken. So with the change described above, we would end up with a circular dependency:
EgressController before flowRestoreCompleteWait before SecondaryNetwork initialization before EgressController.

I would rather avoid "introducing" this new dependency (or rather enforcing this new dependency).
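
As a rough illustration of the ordering problem (not code from the PR), the three constraints can be modelled as "must finish before" edges and checked for a cycle with a depth-first search. The function `findCycle` and the node names are hypothetical; they only mirror the components named in this comment.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// findCycle searches the "must finish before" graph depth-first and returns
// one cycle as a path (first element repeated at the end), or nil if the
// ordering constraints are satisfiable.
func findCycle(before map[string][]string) []string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var stack, cycle []string

	var visit func(n string) bool
	visit = func(n string) bool {
		state[n] = inProgress
		stack = append(stack, n)
		for _, next := range before[n] {
			switch state[next] {
			case inProgress:
				// Back edge: the cycle is the stack suffix starting at next.
				for i, s := range stack {
					if s == next {
						cycle = append(append([]string{}, stack[i:]...), next)
						return true
					}
				}
			case unvisited:
				if visit(next) {
					return true
				}
			}
		}
		stack = stack[:len(stack)-1]
		state[n] = done
		return false
	}

	// Sort the keys so the result does not depend on map iteration order.
	keys := make([]string, 0, len(before))
	for k := range before {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if state[k] == unvisited && visit(k) {
			return cycle
		}
	}
	return nil
}

func main() {
	before := map[string][]string{
		"EgressController":        {"flowRestoreCompleteWait"}, // possible future dependency
		"flowRestoreCompleteWait": {"SecondaryNetwork"},        // existing dependency
		"SecondaryNetwork":        {"EgressController"},        // ordering enforced by this change
	}
	fmt.Println(strings.Join(findCycle(before), " -> "))
}
```

Dropping the `SecondaryNetwork` to `EgressController` edge makes `findCycle` return nil, which corresponds to not enforcing the new ordering.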

cc @tnqn for his opinion

Contributor Author

I think this is a valid concern. We can check or watch for interface changes in the responders to avoid the hard dependency. Waiting for Quan's insights.

Member

Not introducing the dependency makes sense to me. Actually, I'm considering something similar (checking or watching for interface changes) to support #6547, for which we might add an externalInterface configuration. It could happen that not all nodes have the interface, and that is a valid case: a user can select certain nodes as egress nodes, so raising an error because the interface doesn't exist on non-egress nodes doesn't make sense. If we can handle interface changes in the egress controller, it would solve both problems.

@luolanzone luolanzone added the kind/bug Categorizes issue or PR as related to a bug. label Oct 17, 2024
@luolanzone luolanzone added this to the Antrea v2.2 release milestone Oct 17, 2024
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch 2 times, most recently from 0b59d67 to 1945b5b Compare October 23, 2024 09:16
For secondary-network scenarios, the transport interface can be
changed after the agent is started. The ARP/NDP responders should
be started after the initialization of secondary-network to bind
to the transport interface of the new index.

Besides, this change also addresses the following issues:
- NDP responder may fail to bind to the new interface due to the
Duplicate Address Detection process.
- Golang caches the zone index for the interface, which may result
in NDP responder binding on the stale interface

Fixes: antrea-io#6623

Signed-off-by: Xu Liu <[email protected]>
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch from 1945b5b to 586807b Compare October 23, 2024 09:56
Comment on lines +19 to +31
type Interface interface {
	LinkExists(linkName string) bool

	// Run starts the detector.
	Run(stopCh <-chan struct{})

	// AddEventHandler registers an eventHandler of link updates. It's not thread-safe and should be called before
	// starting the detector.
	AddEventHandler(handler LinkEventHandler, linkName ...string)

	// HasSynced returns true if the cache has been initialized with the existing links.
	HasSynced() bool
}
Contributor Author

@tnqn, do you think this interface is appropriate to address the issue and #6547 as well? I did not cache the netlink.Link structs in the detector for simplicity.
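
To make the proposed contract concrete, here is a small in-memory sketch that satisfies the same method set. It is not the detector from this PR. The `LinkEventHandler` signature is an assumption because its definition is not shown in the excerpt. Treating a handler registered with no link names as subscribed to every link is also an assumption. `fakeDetector` and `setLink` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkEventHandler is an assumed signature; the excerpt does not define it.
type LinkEventHandler func(linkName string)

// Interface repeats the method set quoted in the review comment.
type Interface interface {
	LinkExists(linkName string) bool
	Run(stopCh <-chan struct{})
	AddEventHandler(handler LinkEventHandler, linkName ...string)
	HasSynced() bool
}

// fakeDetector is a hypothetical in-memory detector that stores only link
// names, in the spirit of not caching netlink.Link structs.
type fakeDetector struct {
	mu     sync.RWMutex
	links  map[string]struct{}
	synced bool
	all    []LinkEventHandler            // handlers for every link (assumed semantics)
	byName map[string][]LinkEventHandler // handlers for specific links
}

var _ Interface = (*fakeDetector)(nil)

func newFakeDetector(existing ...string) *fakeDetector {
	d := &fakeDetector{links: map[string]struct{}{}, byName: map[string][]LinkEventHandler{}}
	for _, name := range existing {
		d.links[name] = struct{}{}
	}
	return d
}

func (d *fakeDetector) LinkExists(linkName string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	_, ok := d.links[linkName]
	return ok
}

// AddEventHandler is not thread-safe, matching the documented contract.
func (d *fakeDetector) AddEventHandler(handler LinkEventHandler, linkName ...string) {
	if len(linkName) == 0 {
		d.all = append(d.all, handler)
		return
	}
	for _, name := range linkName {
		d.byName[name] = append(d.byName[name], handler)
	}
}

func (d *fakeDetector) HasSynced() bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.synced
}

// Run marks the cache as synced and blocks until stopCh is closed.
func (d *fakeDetector) Run(stopCh <-chan struct{}) {
	d.mu.Lock()
	d.synced = true
	d.mu.Unlock()
	<-stopCh
}

// setLink simulates a link being added or removed and notifies handlers.
func (d *fakeDetector) setLink(name string, present bool) {
	d.mu.Lock()
	if present {
		d.links[name] = struct{}{}
	} else {
		delete(d.links, name)
	}
	d.mu.Unlock()
	for _, h := range d.all {
		h(name)
	}
	for _, h := range d.byName[name] {
		h(name)
	}
}

func main() {
	d := newFakeDetector("eth0")
	var events []string
	d.AddEventHandler(func(name string) { events = append(events, name) }, "eth1")
	d.setLink("eth1", true) // delivered to the handler
	d.setLink("eth2", true) // filtered out: handler only watches eth1
	fmt.Println(d.LinkExists("eth1"), events)
}
```

A consumer such as a responder could register a handler for its transport interface name and rebind whenever the handler fires, instead of depending on start-up order.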

Successfully merging this pull request may close these issues.

SecondaryNetwork breaks ServiceExternalIP feature