Waku Peer and Connection Management #6

jm-clius · 2022-11-17T11:54:34Z

Background

This forms part of a series of issues raised as output of the recent Status client offsite.

Proper peer and connection management is crucial to ensure reliable message delivery in a Waku network, due to the nature of relay. A strong relayer network is the most basic underpinning of any Waku v2 network. All other Waku services build on the assumption that the underlying message routing works well and is scalable. For example, each peer in the network must be able to establish and maintain at least six good connections to other peers. Read this summary for what would constitute a "good" connection.

Under peer and connection management I group the wide range of considerations that any successful p2p application must incorporate, e.g.:

suitable discovery methods to find new peers
some mechanism to score peers/be biased towards more reliable connections
connection management to establish, maintain and re-establish a large, healthy set of connections to peers
decisions re peer persistence
strategies and measurements around peer connectivity restrictions (e.g. NAT traversal, uni- vs bidirectionality of connections, etc.)

What has been done

Basic discovery, peer and connection management methods are in place in nwaku. There is an ongoing effort to extend nwaku peer management according to principles established in similar p2p clients (e.g. nimbus, prysm, etc.).

Problem

Peer and connection management is often an environment-specific choice. Different clients and applications may have different approaches to discovering and maintaining healthy connections, persisting peers (if at all), reattempting connections, scoring peers, etc. However, since the Status application is built on a go-waku client and relies on a mostly nwaku-based infrastructure, it's important to evaluate how effectively the clients find and maintain enough healthy connections for a functioning mesh with reliable message delivery.

As part of the ongoing peer and connection management work in nwaku, some basic recommended principles will be extracted into a general RFC. This should be implemented in go-waku as a first step to improved peer management.

However, this task also extends to Status applications themselves being able to:

monitor how well discovery methods are working within different client environments
implement more suitable discovery methods (e.g. Waku Peer Exchange) where necessary
apply application-level peer scoring

This issue tracks the overall effort to coordinate and improve peer and connection management across different clients.

jm-clius · 2022-11-17T11:56:28Z

cc @alrevuelta

alrevuelta · 2022-11-23T17:22:09Z

Thanks for this great summary @jm-clius. With the goal of establishing a roadmap for peer and connection management, I think we can split the efforts into different verticals:

Research: Some of the topics discussed here would involve some research imho. For example, everything related to peer scoring would involve some simulations at medium/large scale, to see that our scoring strategy with some given parameters works as expected. We can start with a naive scoring mechanism, but in the medium term, I see this as an important research area.
RFC related: As I see it, all/most of the work will impact 27/WAKU2-PEERS, and some of the items that should be specified are: peering strategy (criteria for selecting peers: capabilities, score, up to a max of n, inbound/outbound ratio, ...), both for new connections and for dropping existing connections.
Implementation: I can only speak from the nwaku pov, but I see some areas where we can start taking action. We can discuss nwaku-specific tasks in issues/1353.

That said, I think we should establish a roadmap to focus on the most immediate needs, mostly implementation related, that will allow Waku to scale to a few hundred nodes without problems.

Action item: I propose that each implementation lists their pain points related to peer and connection management, with ideas to improve it.

I'll start with the nwaku ones I've seen over the last week (@jm-clius please correct me if I'm wrong).

Related to metrics, there is no easy way for a node operator to know whether their node is reachable. Ofc you can check with an external tool if the port is open, but we lack some observability here. nwaku will address this with feat(networking): new metric showing if node is reachable nwaku#1206. A simple approach is to add the number of inbound/outbound peers to the logs/rpc/metrics.
There are no criteria for selecting which peers to connect to. We only check that the peer did not error when connecting in the past. This makes us blacklist a peer that errored once.
No criteria to trigger disconnections. We may want to force a disconnection from a given peer. Some ideas: wrong messages being sent (meaning they can't be parsed to a WakuMessage), using gossipsub scores, etc. Unsure if we can prevent some spamming at this level before RLN is ready, by forcing a disconnection from a peer that is sending tons of messages. Perhaps dangerous, since these messages could be just relayed from a malicious peer.
A peer may want to keep at least x connections with peers containing y capabilities. In our peering strategy we should take this into account. Any ideas or suggestions on the criteria?
In general and perhaps common to other implementations, I don't see an easy way of testing these features. Some stuff can be unit tested, but until we hit a more realistic environment, I find it difficult to validate that some of these features indeed contribute to a healthier network.
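
To make the selection points above concrete, here is a minimal sketch of one possible peering strategy. It is not taken from nwaku or any RFC: `PeerInfo`, `select_peers`, the capability labels and all thresholds are hypothetical names and values chosen for illustration. It tolerates a few failed dials instead of blacklisting after a single error, and it only picks peers that help reach a minimum count per capability, up to a maximum number of connections.

```python
from dataclasses import dataclass


@dataclass
class PeerInfo:
    peer_id: str
    capabilities: frozenset  # e.g. {"relay", "store"}; labels are illustrative
    failed_dials: int = 0    # consecutive failed connection attempts
    score: float = 0.0       # placeholder for any scoring scheme


def select_peers(candidates, connected, min_per_capability,
                 max_failed_dials=3, max_peers=50):
    """Pick candidates to dial so each capability reaches its minimum count."""
    # Count how many connected peers already offer each required capability.
    counts = {
        cap: sum(1 for p in connected if cap in p.capabilities)
        for cap in min_per_capability
    }
    connected_ids = {p.peer_id for p in connected}

    # Tolerate a few failures instead of excluding a peer after one error.
    eligible = sorted(
        (p for p in candidates
         if p.peer_id not in connected_ids
         and p.failed_dials <= max_failed_dials),
        key=lambda p: (-p.score, p.failed_dials, p.peer_id),
    )

    chosen = []
    for peer in eligible:
        if len(connected) + len(chosen) >= max_peers:
            break
        # Only dial peers that fill a capability still below its minimum.
        helps = any(
            cap in peer.capabilities and counts[cap] < minimum
            for cap, minimum in min_per_capability.items()
        )
        if not helps:
            continue
        chosen.append(peer)
        for cap in peer.capabilities:
            if cap in counts:
                counts[cap] += 1
    return chosen
```

For example, calling `select_peers(candidates, connected, {"relay": 2, "store": 1})` returns only the candidates needed to reach two relay peers and one store peer. Whether inbound/outbound ratio or gossipsub scores should feed into the ordering is left open here, as it is in the discussion.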

richard-ramos · 2022-11-23T19:30:50Z

go-waku:

No criteria to select peers. Currently we can either choose a peer from the peerstore randomly or choose the peer closest to you by ping reply time, but there's no way to evaluate whether a peer should be chosen or not.
There is no peer persistence. Nodes must be bootstrapped from scratch. (If bootstrap nodes are down, nodes will need to be manually added, otherwise the node will not have any peer to connect to.)
Maybe more related to protocols, but there are no defined criteria to evaluate filter / lightpush node behavior. (Is a lightpush node really well connected? Is it really publishing the messages? Is a filter node you're subscribed to really pushing the messages it receives to your node?)
Currently there's no control over the bandwidth allocated to connected peers as well as no way to blacklist nodes.
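
As an illustration of the peer persistence point above, the following is a rough sketch of saving known peers to disk and reloading them on startup, so that a node does not depend solely on bootstrap nodes being up. It does not reflect go-waku's code: `save_peers`, `load_peers`, the JSON layout and the address strings are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_peers(path, peers):
    """Atomically write a {peer_id: [address, ...]} mapping to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash cannot leave a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(peers, f)
    os.replace(tmp_path, path)


def load_peers(path, bootstrap_peers):
    """Return bootstrap peers merged with any peers persisted earlier."""
    try:
        with open(path) as f:
            stored = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # No usable file yet: fall back to the bootstrap list alone.
        stored = {}
    merged = dict(bootstrap_peers)
    merged.update(stored)
    return merged
```

A node could call `load_peers` at startup and `save_peers` periodically or on shutdown. Which peers are worth persisting (for example only those that were successfully connected) is a policy decision that this sketch leaves out.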

alrevuelta · 2022-11-25T08:08:28Z

@richard-ramos Thanks!

Adding some more points:

How about peer disconnections? GossipSub has a built-in scoring mechanism, and in nim-libp2p there is a flag named disconnectBadPeers that, if enabled, disconnects from bad peers. I assume none of the implementations are using this? In nwaku it's disabled and all peers have a score of 0, which is neutral.
Not exactly blacklisting, but is there a way to avoid constantly dialing peers that are not reachable? If the network size grows, it would be nice to have some kind of exponential backoff, so that we try to dial unreachable peers less often. Otherwise, we can end up wasting a lot of time on this. This issue was already raised in chore(networking): too many failed dials, improve strategy nwaku#1414, and I would say it's a nice-to-have in all implementations. Not sure if it should be part of the spec though.
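
The exponential backoff idea above could look roughly like the sketch below. It is not the strategy adopted in nwaku#1414 or in any implementation: the `DialBackoff` class, its field names and the default delays are assumptions for illustration. Each consecutive failed dial doubles the wait before the next attempt, up to a cap, and a successful dial resets the state.

```python
from dataclasses import dataclass


@dataclass
class DialBackoff:
    base_seconds: float = 30.0    # delay after the first failure (illustrative)
    max_seconds: float = 3600.0   # upper bound on the delay (illustrative)
    failures: int = 0             # consecutive failed dials
    next_attempt: float = 0.0     # earliest time at which to dial again

    def can_dial(self, now):
        """Return True once the backoff window has elapsed."""
        return now >= self.next_attempt

    def record_failure(self, now):
        """Double the delay for each consecutive failure, capped at max_seconds."""
        self.failures += 1
        delay = min(self.base_seconds * 2 ** (self.failures - 1),
                    self.max_seconds)
        self.next_attempt = now + delay
        return delay

    def record_success(self):
        """A successful dial clears the backoff state."""
        self.failures = 0
        self.next_attempt = 0.0
```

A connection manager could keep one `DialBackoff` per peer ID and skip peers for which `can_dial(now)` is false. Adding random jitter to the delay is a common refinement that is omitted here for brevity.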

jm-clius · 2023-08-02T17:10:28Z

Closing this issue, as most of the peer management MVP work has been done and outstanding items are tracked in more specific sub-issues, such as #33. The latter tracks work necessary for the public Waku Network.

fryorcraken added the RAID label Nov 18, 2022

jm-clius mentioned this issue Dec 6, 2022

Status MVP: Status Core Contributors use Status Mobile #8

Closed

20 tasks

jm-clius assigned alrevuelta Dec 12, 2022

fryorcraken removed the RAID label Jul 31, 2023

jm-clius closed this as completed Aug 2, 2023
