network: wsnet with p2p backup meshing strategy #6391
Conversation
Force-pushed from 127b66b to 4f7a7ca.
libp2p pubsub can connect to peers learned via PX (peer exchange), so we could end up with more than GossipFanout outgoing peers. This PR marks explicitly dialed peers using peerstore metadata so that each connection can be checked and vetted.
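For reference, a minimal sketch of the dialed-peer marking this describes, using the libp2p peerstore metadata API; the helper names and the exact key string are illustrative, not necessarily the PR's:

```go
package p2psketch

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// psmdkDialed is the peerstore metadata key for explicitly dialed peers;
// the exact key string used in the PR may differ.
const psmdkDialed = "dialed"

// markDialed records that we dialed this peer ourselves.
func markDialed(h host.Host, p peer.ID) error {
	return h.Peerstore().Put(p, psmdkDialed, true)
}

// wasDialed reports whether the peer was explicitly dialed, so outgoing
// streams opened by sub-components (e.g. pubsub PX) can be vetted.
func wasDialed(h host.Host, p peer.ID) bool {
	v, err := h.Peerstore().Get(p, psmdkDialed)
	if err != nil {
		return false // never marked; the peerstore errors on missing keys
	}
	dialed, ok := v.(bool)
	return ok && dialed
}
```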
Force-pushed from 4f7a7ca to dd1293d.
Force-pushed from c47fd05 to 7187dfe.
Pull Request Overview
This PR implements a WS-network-with-P2P-backup meshing strategy for hybrid topology networks, along with various network improvements and test fixes.
- Updated hybrid relay mesh creator to prioritize WebSocket connections and use P2P as backup to reach target connection counts
- Added validation for non-dialed outgoing P2P connections to prevent interference from sub-components like pubsub
- Updated libp2p-pubsub parameters to align with GossipFanout configuration
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| network/mesh.go | Implements hybrid WS-priority meshing strategy with P2P backup and updates mesh function signatures |
| network/wsNetwork.go | Updates mesh functions to accept a target connection count parameter and pass it through |
| network/p2pNetwork.go | Updates P2P mesh functions and adds dialed peer tracking functionality |
| network/p2p/streams.go | Adds validation to ignore non-dialed outgoing connections in the stream manager |
| network/p2p/pubsub.go | Derives GossipSub parameters based on the GossipFanout configuration |
| network/p2p/peerstore/peerstore.go | Renames the addressData key constant to use the standardized psmdk prefix |
| *_test.go files | Adds test coverage and fixes flaky tests by setting DNSBootstrapID = "" |
network/mesh.go (outdated)
```go
if wsConnections <= targetConnCount {
	// note "less or equal". Even if p2pTarget is zero it makes sense to call
	// p2p meshThreadInner to fetch DHT peers
	p2pTarget := targetConnCount - wsConnections
```
Thinking at a high level: how quickly do you want the "backup" to kick in? For example, if you lose one of your outgoing connections to a WS node and it is quickly replaced, do we want to go from 0 to 1 here and run all the p2p machinery just for it to soon go back to 0 again, or do we want to wait a little to see if the issue is persistent? Or should "backup" mean it only kicks in if you have 0 wsConnections, for example?
I interpreted backup as "kick in if we have 0 wsConnections" - we should probably pick some period for that to remain true (and some grace period for handling when it turns back to false).
A minimum period of time before P2P kicks in if WS connection count is > 0 (or some min connection count) but < target connection count is fine too, but it adds complexity.
How about the following:
- Skip p2p the very first time.
- Implement the timeout/delay as use of `wsConnections` from the previous iteration?
```diff
@@ -239,12 +239,14 @@ func (c hybridRelayMeshCreator) create(opts ...meshOption) (mesher, error) {
 	out := make(chan meshRequest, 5)
 	var wg sync.WaitGroup
+	var prevWsConnections int = -1
 	meshFn := func(targetConnCount int) int {
 		wsConnections := cfg.wsnet.meshThreadInner(targetConnCount)
 		var p2pConnections int
-		if wsConnections <= targetConnCount {
+		// skip p2p mesh the first time to give wsnet time to establish connections
+		if prevWsConnections != -1 && prevWsConnections <= targetConnCount {
 			// note "less or equal". Even if p2pTarget is zero it makes sense to call
 			// p2p meshThreadInner to fetch DHT peers
 			p2pTarget := targetConnCount - wsConnections
@@ -255,6 +257,7 @@ func (c hybridRelayMeshCreator) create(opts ...meshOption) (mesher, error) {
 			wsConnections, p2pConnections, targetConnCount)
 		}
 	}
+	prevWsConnections = wsConnections
 	return wsConnections + p2pConnections
 }
```
pushed
Okay, this is a good idea/implementation - I'll redo it.
Updated (see the sketch below):
- ws mesher now takes (target - p2pConns) as its target
- p2p mesher uses (target - wsConns), as before, as its target

No delay added, for the following reasons:
- the wsConns count includes active + pending conns, so it can be a bit inflated
- the mesher is called every minute, and since ws net attempts first, ws net conns are always preferred
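A minimal sketch of this updated scheme, with a stand-in function type for the ws and p2p meshers (the PR's actual meshThreadInner methods live on the respective network objects):

```go
package meshsketch

// mesherFn stands in for the ws/p2p meshThreadInner methods.
type mesherFn func(target int) int

// hybridMeshFn sketches the updated scheme: wsnet meshes first toward
// (target - p2pConns), then p2p fills (target - wsConns). Because wsnet
// always runs first each round, WS connections are preferred.
func hybridMeshFn(wsMesh, p2pMesh mesherFn) func(targetConnCount int) int {
	p2pConnections := 0 // carried across rounds via the closure
	return func(targetConnCount int) int {
		wsConnections := wsMesh(targetConnCount - p2pConnections)
		// even a zero p2p target is worth a call: it still fetches DHT peers
		p2pConnections = p2pMesh(targetConnCount - wsConnections)
		return wsConnections + p2pConnections
	}
}
```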
I think this looks good. In practice, it would be good to run a node off of Mainnet and watch who it maintains connections to over time for additional confidence.
```go
// subset of config.Local needed here
type nodeSubConfig interface {
	IsHybridServer() bool
}
```
That's kind of neat; we should add more accessors to config.Local and do this more often, rather than passing a giant config around, so we know which flags each package uses.
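A sketch of that pattern: nodeSubConfig is the PR's interface quoted above, while this makeStreamManager signature is simplified for illustration (the real constructor takes more parameters):

```go
package netsketch

// nodeSubConfig mirrors the PR's narrow view of config.Local: packages
// declare only the accessors they actually read.
type nodeSubConfig interface {
	IsHybridServer() bool
}

// makeStreamManager is a simplified stand-in for the PR's constructor; it
// compiles against the interface, so which config flags it uses is explicit.
func makeStreamManager(cfg nodeSubConfig) {
	if cfg.IsHybridServer() {
		// hybrid-only setup, e.g. marking explicitly dialed peers
	}
}
```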
In this case, can we be even simpler, and have makeStreamManager take a boolean isHybrid parameter? Done this way, we're still passing the config.Local by value, so the hybridness can't change, and that's all we care about.
```diff
-	if err := s.host.Peerstore().Put(peer.ID, psmdkDialed, true); err != nil { // mark this peer as explicitly dialed
-		return err
+	if s.subcfg.IsHybridServer() {
+		if err := s.host.Peerstore().Put(peer.ID, psmdkDialed, true); err != nil { // mark this peer as explicitly dialed
```
so when not in hybrid mode with just plain p2p, we don't set this because later we don't care about Outbound/Inbound for streamManager.Connected?
I'll run extra evaluations for non-hybrid; maybe we'll end up with similar logic, but with GossipFanout matching the default GossipSubD.
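For illustration, a sketch of deriving GossipSub mesh-degree parameters from GossipFanout using go-libp2p-pubsub; how Dlo/Dhi are scaled here is an assumption, not necessarily what the PR does:

```go
package pubsubsketch

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// newGossipSub derives the GossipSub mesh degree from the node's
// GossipFanout; the Dlo/Dhi scaling is illustrative.
func newGossipSub(ctx context.Context, h host.Host, gossipFanout int) (*pubsub.PubSub, error) {
	params := pubsub.DefaultGossipSubParams()
	params.D = gossipFanout       // target mesh degree, aligned with GossipFanout
	params.Dlo = gossipFanout - 1 // graft more peers below this
	params.Dhi = gossipFanout * 2 // prune peers above this
	return pubsub.NewGossipSub(ctx, h, pubsub.WithGossipSubParams(params))
}
```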
```go
wsPeer := stat.peer.(*wsPeer)
wsPeer.peerMessageDelay = stat.peerDelay
cc.log.Infof("network performance monitor - peer '%s' delay %d first message portion %d%%", wsPeer.GetAddress(), stat.peerDelay, int(stat.peerFirstMessage*100))
if wsPeer.throttledOutgoingConnection && leastPerformingPeer == nil {
```
I went down a rabbit hole trying to understand what "throttled connection" means and why a peer would be throttled, and it was really helpful to read the PR description here: #668
Basically, it is a "best of both worlds" approach to avoid clique formation while still allowing some peers to be performance-ranked: relays keep half of their peers as random, stable connections (not disconnected unless an error occurs), while the other half are performance-ranked and the least-performing peer is kicked out periodically.
It's not clear whether we need to cargo-cult all of this over to the P2PNetwork use case, but we can get rid of it once we are fully using pubsub for gossip.
A better name might be monitoredOutgoingConnection or perhaps performanceMonitoredOutgoingConnection.
For posterity, reading PR #668 helps you understand the motivation behind the code that is being made more generic here.
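A toy sketch of the peer-slot policy described above; the type and field names are illustrative, not the PR's:

```go
package perfsketch

// rankedPeer is an illustrative stand-in for wsPeer; only the fields the
// policy needs are shown.
type rankedPeer struct {
	throttled bool  // true for the performance-ranked half of the peer slots
	delay     int64 // measured message delay; higher is worse
}

// leastPerforming picks the worst performance-ranked peer to disconnect;
// stable (non-throttled) peers are never kicked for performance reasons.
func leastPerforming(peers []*rankedPeer) *rankedPeer {
	var worst *rankedPeer
	for _, p := range peers {
		if !p.throttled {
			continue
		}
		if worst == nil || p.delay > worst.delay {
			worst = p
		}
	}
	return worst // nil when there are no ranked peers
}
```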
```go
		case req := <-cfg.wsnet.meshUpdateRequests:
			out <- req
		}
	}()
```
Is this supposed to run in a for loop, or is it intended to just forward a single message from wsnet.meshUpdateRequests to out, after which the goroutine ends?
Originally I thought of the channel as a way to initiate a mesh creator, so we needed only a single message from either source. Later I actually wired RequestConnectOutgoing, so the channel has to be a constant source of events and the forwarding has to be a loop. Updated.
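A minimal sketch of the fixed forwarder as a for/select loop; the done channel, function name, and placeholder meshRequest type are illustrative:

```go
package forwardsketch

// meshRequest is a placeholder for the PR's request type.
type meshRequest struct{}

// forwardMeshRequests relays requests from wsnet's channel to the mesher's
// input until done is closed; the loop (rather than a single receive) is
// what the fix above added.
func forwardMeshRequests(in <-chan meshRequest, out chan<- meshRequest, done <-chan struct{}) {
	go func() {
		for {
			select {
			case req := <-in:
				out <- req
			case <-done:
				return
			}
		}
	}()
}
```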
Summary
This PR contains the following changes:
- hybridRelayMeshCreator for relays: use WS as a primary source of peers and fall back to p2p nodes.
- Ignore non-dialNode outgoing connects (such as those made by pubsub to peers learned from PX).
- Derive GossipSub parameters from GossipFanout.
- Fixed flaky tests (by setting DNSBootstrapID = ""): lookups of non-existing DNS records most likely caused timeouts - if a lookup takes longer than the Eventually timeout for connectivity assertions, tests fail.
- addressData -> psmdkAddressData variable rename to standardize with other keys by using the psmdk prefix.

Related: #6045
Test Plan
Added unit tests
Performance cluster tests summary: 8k TPS (as expected for hybrid), 7+1 or 6+2 (ws+p2p) outgoing connections (as expected as well).