-
Notifications
You must be signed in to change notification settings - Fork 191
health monitoring #915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
health monitoring #915
Conversation
|
Hi @seetadev this ci keeps failing but has nothing to do with the implementation i made, and also i had to create a new pr because there was a lot of changes that affected me when i git pulled |
|
Hello @sumanjeet0012 @lla-dane @sukhman-sukh please can i get a review on this, thanks |
|
@bomanaps Could you please share the link to the specification file or a reference to this feature in other libp2p implementations? It would be easier for me to understand what exactly we want to implement through this PR. |
|
|
@acul71 i had to create a new pr because there was a lot of changes that affected me when i git pulled |
|
Hello @acul71 @sumanjeet0012 @lla-dane @sukhman-sukh please can i get a review on this, thanks |
|
@bomanaps This is a fairly substantial PR, and there have already been extensive discussions surrounding it. I’ll need some additional time to review it thoroughly. I’m planning to go through it over the weekend. |
sumanjeet0012
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bomanaps Would you be able to share a basic screencast demonstrating the execution of health_monitoring_example.py? I'm currently facing some issues with dependency installation and haven't been able to run the example locally.
|
@bomanaps 🎉 Conclusion |
|
Hello @acul71 @sumanjeet0012 @seetadev am done addressing this Screen.Recording.2025-09-17.at.07.53.50.mov |
Hey. Good try, but It's difficult to understands what's going on, without audio that comments over or subs or another format like presentation with slides Isn't there something like a panel with real time info or something like that Kidding, just something more clear at a glance |
|
@bomanaps Are the failing test related or not ? do they fail on your local box with |
If we can re run the ci that will be good the logs iis not from my end make pr passes too, but the ci fail is showing test_provide_and_find_providers is checking that envelope_a |
|
@sumanjeet0012 LGTM see if you dig it. |
|
@bomanaps : Great, thanks Mercy. This PR is indeed ready for final review + merge. CCing @acul71 and @sumanjeet0012 for their feedback and pointers. Will merge this soon once they give a thumbs up to this comment. @bomanaps : In the meantime, please try and complete the other PRs that you are currently contributing before the maintainer's call. We plan to do a py-libp2p release once web sockets PR is complete. Would like your opened PRs to be reviewed, merged and landed in the new release too. |
|
@bomanaps : Hi Mercy. Wish if you could resolve the merge conflicts and ping me once completed. Will do a final review + merge. |
|
@seetadev The PR looks good to me, |
|
@sumanjeet0012 : I spend some considerable time trying to understand the CI/CD issue. There is only one failure, and prevalent across a majority of PRs. So, spent sometime understanding the root cause. Wish to share that out of 700 tests, only 1 test failed: tests/core/kad_dht/test_kad_dht.py::test_find_node The failure is an AssertionError. The real issue is in the Kademlia DHT test, specifically test_find_node. That test usually:
An AssertionError here typically means one of:
@sumanjeet0012 , @yashksaini-coder and @acul71 : Alongside, I investigated how kad-dht should behave while running in CI where timing/network scheduling is tighter. Probably, we are having race around conditions when using kad-dht over CI. I also attended nim-libp2p calls today where we discussed on this aspect today. I'll share some solutions available for fixing it on CI/CD at a discussion page today. I'll re-share the link here. |
|
@sumanjeet0012 , @yashksaini-coder , @bomanaps and @acul71 : Please refer to #954 @bomanaps : Please also resolve the merge conflicts. |
| # If both connection_config and quic_transport_opt are provided, | ||
| # merge health monitoring settings | ||
| if connection_config is not None: | ||
| # Merge health monitoring settings from connection_config | ||
| # into quic_transport_opt | ||
| if hasattr(connection_config, "enable_health_monitoring"): | ||
| quic_transport_opt.enable_health_monitoring = ( | ||
| connection_config.enable_health_monitoring | ||
| ) | ||
| if hasattr(connection_config, "health_check_interval"): | ||
| quic_transport_opt.health_check_interval = ( | ||
| connection_config.health_check_interval | ||
| ) | ||
| if hasattr(connection_config, "load_balancing_strategy"): | ||
| quic_transport_opt.load_balancing_strategy = ( | ||
| connection_config.load_balancing_strategy | ||
| ) | ||
| if hasattr(connection_config, "max_connections_per_peer"): | ||
| quic_transport_opt.max_connections_per_peer = ( | ||
| connection_config.max_connections_per_peer | ||
| ) | ||
| logger.info( | ||
| "Merged health monitoring settings from " | ||
| "connection_config into QUIC config" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we only copying over these 4 attributes when a ConnectdionConfig has so many more?
|
We need a newsfragment. Overall the code looks good, but there are 0 tests included here. We need automated testing to verify it works as expected today and with any future code added. |
Am working on the test |
@bomanaps could you add me to this project please |
No need I will resolve this, thanks |
|
The job failed due to a documentation build error caused by an "Unexpected indentation" in the docstring for the method Pubsub.write_msg in libp2p/pubsub/pubsub.py. The error is detected by docutils, which is used by Sphinx to build docs, and warnings are treated as errors in CI. Solution: Review and correct the indentation in the docstring for the write_msg method at the following location: libp2p/pubsub/pubsub.py. |
I have fixed this and CI pass now |
|
@bomanaps : Great, thank you so much for fixing the CI/CD issues and for incorporating our feedback points. Appreciate it. Reviewing the PR in detail. Will share feedback soon. CCing @acul71 for his thoughts and pointers too. Also, inviting feedback and thoughts from @yashksaini-coder, @sukhman-sukh, @Winter-Soren and @sumanjeet0012. |
|
@bomanaps : From a release notes perspective, wish if we could change the title of the PR to: Enhance Connection Reliability with Health Metrics, Monitoring in Py-libp2p. Just a suggestion. |
PR #915 Review: Health Monitoring ImplementationOverviewPR #915 implements connection health monitoring and load balancing for py-libp2p. Addresses the 1-1 connection mapping limitation from Discussion #717. Stats: 4,373 additions, 29 deletions across 80 files, 25 commits Core ImplementationHealth Monitoring Module# libp2p/network/health/data_structures.py
@dataclass
class ConnectionHealth:
latency_ms: float
ping_success_rate: float
bandwidth_usage: float
error_history: list[ConnectionError]
# ... 80+ fields totalMonitoring Service# libp2p/network/health/monitor.py
class ConnectionHealthMonitor(Service):
async def _replace_unhealthy_connection(self, peer_id: ID, conn: INetConn):
# Replaces unhealthy connections while maintaining minimum countAPI Fix# libp2p/__init__.py
def new_host(connection_config: ConnectionConfig = None):
# Now accepts connection_config parameter (was missing before)@pacrob's Critical Issues1. Connection Replacement Logic BugIssue: # libp2p/network/health/monitor.py
async def _replace_unhealthy_connection(self, peer_id: ID, conn: INetConn):
# Close unhealthy connection
await self.swarm.close_connection(conn)
# ❌ PROBLEM: This returns existing connections, doesn't create new ones
new_conn = await self.swarm.dial_peer(peer_id)Impact: Unhealthy connections closed but not replaced. 2. Dead Code in Warmup LogicIssue: Code checks for non-existent # libp2p/network/health/monitor.py
if health.established_at and (time.time() - health.established_at) < warmup:
# ❌ PROBLEM: established_at doesn't exist on INetConn or SwarmConn
returnImpact: Warmup window logic is non-functional. 3. Race Condition in Connection ManagementIssue: Edge case handling when at minimum connections. # libp2p/network/health/monitor.py
if len(connections) <= self.config.min_connections_per_peer:
# ❌ PROBLEM: Won't replace bad connection, just leaves it
return FalseImpact: Could leave unhealthy connections indefinitely. 4. Missing @AbstractMethod DecoratorsIssue: Methods in abc.py should be abstract. # libp2p/abc.py
def get_peer_health_summary(self, peer_id: ID) -> dict[str, Any]:
# ❌ PROBLEM: Should be @abstractmethod
pass5. Unnecessary Type IgnoresIssue: Type ignores may not be needed. # tests/core/transport/test_websocket.py
ExceptionGroup = builtins.ExceptionGroup # type: ignore[misc]
# ❌ QUESTION: Are these still necessary?Review Status✅ Approved by:
Resolution Status@bomanaps addressed:
Still needs:
RecommendationCONDITIONAL APPROVAL - Core implementation is solid but critical issues from @pacrob must be resolved. Priority Actions:
The feature provides significant value but needs these fixes for production readiness. |
8a5d898 to
b52bcdc
Compare
Hello @acul71 I have addresssed this please can I get a review on it again, thank you. |
|
@bomanaps : Wish if you could resolve merge conflicts listed above. |
…_manager The health_data attribute was only being initialized in set_resource_manager, but it needs to be available immediately after Swarm creation for tests and other code that accesses it before set_resource_manager is called.
19585c3 to
9e1976b
Compare
|
@bomanaps I've merged with origin main, see if everything is in order. This is a detailed review, let's talk about that, and write what you'll address.. AI Pull Request Review: PR #915 - Health MonitoringReview Date: 2025-11-18 1. Summary of ChangesThis PR implements comprehensive connection health monitoring and intelligent load balancing for py-libp2p. The implementation addresses the 1-1 connection mapping limitation discussed in Discussion #717. Issues Addressed
Modules Affected
Breaking ChangesNone. The implementation maintains full backward compatibility. Health monitoring is opt-in via configuration. Statistics
2. StrengthsArchitecture & Design
Code Quality
API Design
Documentation
Testing
3. Issues FoundCriticalNone. The critical issues mentioned in previous reviews have been addressed:
Major3.1 Race Condition in Minimum Connection Handling
# Current code:
if remaining_after_removal < self.config.min_connections_per_peer:
logger.warning(...)
return3.2 Missing @AbstractMethod Decorators (Design Question)
3.3 Connection Replacement Not Added to Swarm Connections
Minor3.4 Inconsistent Error Handling in Ping
3.5 Potential Memory Leak in Error History
3.6 Missing Validation in Health Score Calculation
4. Security ReviewSecurity Assessment: Low RiskNo significant security vulnerabilities identified. The implementation follows secure coding practices:
Minor Security Considerations
5. Documentation and ExamplesDocumentation Quality: Excellent
Documentation Suggestions
6. Newsfragment RequirementStatus: ✅ PASS
The newsfragment is well-written and describes the feature from a user perspective, covering:
7. Tests and ValidationTest Results: ✅ ALL PASSING
Test CoverageNew Test Files:
Test Quality:
Test Recommendations
8. Recommendations for ImprovementHigh Priority
Medium Priority
Low Priority
9. Questions for the Author
10. Overall AssessmentQuality Rating: GoodThe implementation is well-structured, comprehensive, and addresses the core requirements. The code quality is high with good test coverage and documentation. Most critical issues from previous reviews have been addressed. Security Impact: NoneNo security vulnerabilities identified. The implementation follows secure coding practices. Merge Readiness: Needs Minor FixesThe PR is close to merge-ready but should address:
These are relatively minor issues that can be addressed quickly. Confidence: HighThe implementation is solid and well-tested. The remaining issues are edge cases and design clarifications rather than fundamental problems. Additional NotesPrevious Review Feedback StatusFrom @pacrob's review, the following issues have been RESOLVED:
The following issues remain as DESIGN DECISIONS or MINOR IMPROVEMENTS:
Commendations
|

What was wrong?
Issue #
How was it fixed?
Comprehensive Health Metrics: Implemented ConnectionHealth dataclass with latency tracking, success rates, bandwidth monitoring, and weighted health scoring.
Proactive Monitoring Service: Built ConnectionHealthMonitor with periodic health checks, real-time ping measurements, and automatic unhealthy connection replacement.
Health-Aware Load Balancing: Added four intelligent connection selection strategies, including health_based and latency_based, for optimal traffic routing.
API Consistency Fix: Extended new_host() to accept connection_config, resolving previous API inconsistency while maintaining full backward compatibility.
Summary of approach.
To-Do
Cute Animal Picture