Conversation

@algorandskiy
Contributor

@algorandskiy algorandskiy commented Jul 23, 2025

Summary

This PR contains the following changes:

  1. Updated hybridRelayMeshCreator for relays: use WS as the primary source of peers and fall back to p2p nodes.
  2. Vetting of non-dialNode outgoing connections (such as those made by pubsub to peers learned from PX).
  3. libp2p-pubsub default D values adjusted according to GossipFanout.
  4. Fixes for flaky unit tests (set DNSBootstrapID = ""), most likely caused by lookup timeouts on non-existent DNS records: if a lookup takes longer than the Eventually timeout for the connectivity assertions, the tests fail.
  5. Renamed the addressData variable to psmdkAddressData to standardize with the other keys via the psmdk prefix.

Related: #6045

Test Plan

Added unit tests
Performance cluster tests summary: 8k TPS (as expected for hybrid), 7+1 or 6+2 (ws+p2p) outgoing connections (as expected as well).

@codecov

codecov bot commented Jul 23, 2025

❌ 1 Tests Failed:

Tests completed: 105368 · Failed: 1 · Passed: 105367 · Skipped: 2546
View the top 1 failed test(s) by shortest run time
github.com/algorand/go-algorand/ledger::TestCatchpointTrackerWaitNotBlocking
Stack Traces | 0.21s run time
=== RUN   TestCatchpointTrackerWaitNotBlocking
    catchpointtracker_test.go:1103: 
        	Error Trace:	.../go-algorand/ledger/catchpointtracker_test.go:1103
        	Error:      	Wait(18) is blocked
        	Test:       	TestCatchpointTrackerWaitNotBlocking
--- FAIL: TestCatchpointTrackerWaitNotBlocking (0.21s)


@algorandskiy algorandskiy added the p2p Work related to the p2p project label Jul 25, 2025
@algorandskiy algorandskiy force-pushed the pavel/ws-relays-p2p-backup branch from 127b66b to 4f7a7ca Compare August 15, 2025 19:48
@algorandskiy algorandskiy changed the title WIP: network: wsnet with p2p backup meshing strategy network: wsnet with p2p backup meshing strategy Aug 15, 2025
@algorandskiy algorandskiy requested review from cce and jannotti August 15, 2025 19:52
libp2p pubsub can connect to PX streams, so we could end up with
more than GossipFanout outgoing peers.
This PR explicitly marks dialed peers using peerstore metadata
so that each connection can be checked and vetted.
@algorandskiy algorandskiy force-pushed the pavel/ws-relays-p2p-backup branch from 4f7a7ca to dd1293d Compare August 19, 2025 17:58
@algorandskiy algorandskiy marked this pull request as ready for review August 25, 2025 16:21
@algorandskiy algorandskiy requested a review from gmalouf August 25, 2025 16:21
@algorandskiy algorandskiy force-pushed the pavel/ws-relays-p2p-backup branch from c47fd05 to 7187dfe Compare August 25, 2025 22:29
@cce cce requested a review from Copilot September 11, 2025 13:14
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a WS-network-with-P2P-backup meshing strategy for hybrid topology networks, along with various network improvements and test fixes.

  • Updated hybrid relay mesh creator to prioritize WebSocket connections and use P2P as backup to reach target connection counts
  • Added validation for non-dialed outgoing P2P connections to prevent interference from sub-components like pubsub
  • Updated libp2p-pubsub parameters to align with GossipFanout configuration

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
network/mesh.go Implements hybrid WS-priority meshing strategy with P2P backup and updates mesh function signatures
network/wsNetwork.go Updates mesh functions to accept target connection count parameter and pass it through
network/p2pNetwork.go Updates P2P mesh functions and adds dialed peer tracking functionality
network/p2p/streams.go Adds validation to ignore non-dialed outgoing connections in stream manager
network/p2p/pubsub.go Derives GossipSub parameters based on GossipFanout configuration
network/p2p/peerstore/peerstore.go Renames addressData key constant to use standardized psmdk prefix
*_test.go files Adds test coverage and fixes flaky tests by setting DNSBootstrapID = ""


network/mesh.go Outdated
if wsConnections <= targetConnCount {
// note "less or equal". Even if p2pTarget is zero it makes sense to call
// p2p meshThreadInner to fetch DHT peers
p2pTarget := targetConnCount - wsConnections
Contributor

@cce cce Sep 11, 2025

Thinking at a high level: how quickly do you want the "backup" to kick in? For example, if you lose one of your outgoing connections to a WS node and it is quickly replaced, do we want to go from 0 to 1 here and run all the p2p machinery, just for it to soon go back to 0 again? Or do we want to wait a bit to see whether the issue is persistent? Or should "backup" mean it only kicks in when you have 0 wsConnections?

Contributor

@gmalouf gmalouf Sep 16, 2025


I interpreted backup to mean "kick in if we have 0 wsConnections"; we should probably pick some period for that to remain true (and some grace period for when it turns back to false).

A minimum period of time before P2P kicks in when the WS connection count is > 0 (or above some minimum) but below the target is fine too, but it adds complexity.

Contributor Author

How about the following:

  1. Skip p2p the very first time.
  2. Implement the timeout/delay by using wsConnections from the previous iteration?
@@ -239,12 +239,14 @@ func (c hybridRelayMeshCreator) create(opts ...meshOption) (mesher, error) {

        out := make(chan meshRequest, 5)
        var wg sync.WaitGroup
+       var prevWsConnections int = -1

        meshFn := func(targetConnCount int) int {
                wsConnections := cfg.wsnet.meshThreadInner(targetConnCount)

                var p2pConnections int
-               if wsConnections <= targetConnCount {
+       // skip p2p mesh the first time, to give wsnet time to establish connections
+               if prevWsConnections != -1 && prevWsConnections <= targetConnCount {
                        // note "less or equal". Even if p2pTarget is zero it makes sense to call
                        // p2p meshThreadInner to fetch DHT peers
                        p2pTarget := targetConnCount - wsConnections
@@ -255,6 +257,7 @@ func (c hybridRelayMeshCreator) create(opts ...meshOption) (mesher, error) {
                                        wsConnections, p2pConnections, targetConnCount)
                        }
                }
+               prevWsConnections = wsConnections
                return wsConnections + p2pConnections
        }

Contributor Author

pushed

Contributor Author

okay, this is a good idea/implementation, I'll redo

Contributor Author

updated:

  1. The ws mesher now takes (target - p2pConns) as its target.
  2. The p2p mesher uses (target - wsConns), as before.

No delay was added, for the following reasons:

  • the wsConns count includes active + pending conns, so it can be a bit inflated
  • the mesher is called every minute and, since the ws net attempts first, it will always prefer ws net conns

gmalouf
gmalouf previously approved these changes Sep 29, 2025
Contributor

@gmalouf gmalouf left a comment

I think this looks good. In practice, it would be good to run a node off of Mainnet and watch who it maintains connections to over time for additional confidence.

// subset of config.Local needed here
type nodeSubConfig interface {
IsHybridServer() bool
}
Contributor

that's kind of neat, we should add more accessors to config.Local and do this more often rather than passing a giant config around, so we know what flags are being used by packages

Contributor

In this case, can we be even simpler, and have makeStreamManager take a boolean isHybrid parameter? Done this way, we're still passing the config.Local by value, so the hybridness can't change, and that's all we care about.

if s.subcfg.IsHybridServer() {
	if err := s.host.Peerstore().Put(peer.ID, psmdkDialed, true); err != nil { // mark this peer as explicitly dialed
		return err
Contributor

so when not in hybrid mode with just plain p2p, we don't set this because later we don't care about Outbound/Inbound for streamManager.Connected?

Contributor Author

I'll run extra evaluations for non-hybrid mode; maybe we'll end up with similar logic, but with GossipFanout matching the default GossipSubD.

gmalouf
gmalouf previously approved these changes Oct 9, 2025
@algorandskiy algorandskiy requested a review from cce October 10, 2025 21:24
wsPeer := stat.peer.(*wsPeer)
wsPeer.peerMessageDelay = stat.peerDelay
cc.log.Infof("network performance monitor - peer '%s' delay %d first message portion %d%%", wsPeer.GetAddress(), stat.peerDelay, int(stat.peerFirstMessage*100))
if wsPeer.throttledOutgoingConnection && leastPerformingPeer == nil {
Contributor

I went down a rabbit hole trying to understand what "throttled connection" means and why a peer would be throttled, and it was really helpful to read the PR description here: #668

Basically, it is a "best of both worlds" approach to avoid clique formation while still allowing some peers to be performance-ranked: relays keep half of their peers as random, stable connections (never disconnected unless an error occurs), while the other half are performance-ranked and the least-performing peer is kicked out periodically.

It's not clear whether we need to cargo-cult all of this over to the P2PNetwork use case, but we can get rid of it once we are fully using pubsub for gossip.

A better name might be monitoredOutgoingConnection or performanceMonitoredOutgoingConnection, perhaps.

@cce
Contributor

cce commented Oct 16, 2025

for posterity, reading this PR #668 helps you understand the motivation behind the code that is being made more generic here

@algorandskiy algorandskiy requested a review from cce October 17, 2025 13:05
case req := <-cfg.wsnet.meshUpdateRequests:
out <- req
}
}()
Contributor

@cce cce Oct 21, 2025


is this supposed to run in a for loop, or is it intended to just forward a single message from wsnet.meshUpdateRequests to out, then the goroutine ends?

Contributor Author

Originally I thought of the channel as a way to initiate the mesh creator, so we only needed a single message from either source. Later I actually wired up RequestConnectOutgoing, so the channel has to be a constant source of events and needs to run in a loop. Updated.

@gmalouf gmalouf requested review from cce and gmalouf October 24, 2025 14:33

Labels

Enhancement p2p Work related to the p2p project

4 participants