-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[DO NOT MERGE] Revert "[TT-9234] regression fixes for failing mdcb readiness check (… #7231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
💔 The detected issue is not in one of the allowed statuses 💔
Please ensure your jira story is in one of the allowed statuses |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
API Changes no api changes detected |
Dependency Impact Analysis: RPC Emergency Mode in Health ChecksThis PR reverts a change to how RPC (Remote Procedure Call) failures are evaluated in Tyk Gateway's health check system. The change specifically affects MDCB (Multi Data Center Bridge) deployments and their readiness checks. Key Changes
Technical Impact AnalysisEmergency Mode ContextIn Tyk Gateway, "emergency mode" is a resilience mechanism that activates when the RPC connection to MDCB fails. When in emergency mode, the system falls back to using cached data instead of trying to fetch fresh data from the RPC server, allowing the gateway to continue functioning during RPC outages. Health Check Behavior ChangeThe change modifies how health checks report the system status when RPC is down:
MDCB Readiness Check ImpactThis change specifically addresses a regression issue with MDCB readiness checks (TT-9234). The readiness endpoint ( By removing the emergency mode condition, the readiness check will more accurately reflect the actual state of the RPC connection, which is important for proper cluster management in MDCB deployments. This ensures that systems relying on these health checks get accurate information about the gateway's connectivity status. Dependency ImpactThis change affects systems that rely on Tyk Gateway's health/readiness checks:
ConclusionThis change represents a trade-off between nuanced health reporting and accurate connection status reporting. While the gateway can continue to function in emergency mode with cached data, the health check will now report the actual RPC connection status more accurately, which is important for proper orchestration in MDCB deployments. The simplification makes the health check behavior more straightforward and consistent, at the cost of potentially reporting the gateway as unhealthy even when it's operating in a valid fallback state. Tip: Mention me again using |
PR Analysis: Revert RPC Emergency Mode Check in Health CheckThis PR reverts changes related to how the Tyk gateway handles RPC critical failures in health checks. The changes are focused on fixing regression issues with MDCB (Multi Data Center Bridge) readiness checks. Changes Overview
Technical ContextEmergency ModeEmergency mode is a flag that is set when the RPC connection is down and the gateway is configured to use the local cache. The MDCB (Multi Data Center Bridge)MDCB is a component of Tyk that allows for multi-data center deployments. It relies on RPC for communication between data centers. Readiness CheckThe readiness check is a dedicated endpoint that verifies if the gateway is ready to serve requests by checking:
Unlike the liveness check, the readiness check returns a 503 Service Unavailable status if the gateway is not ready to serve requests. Impact AnalysisThis change reverts to a simpler behavior where RPC is considered a critical component whenever the policy source is set to "rpc", regardless of whether the system is in emergency mode or not. Problem Being FixedThe previous implementation was causing issues with MDCB readiness checks. When the system entered emergency mode (due to RPC connection issues), the health check would not consider RPC as critical, which could lead to misleading health status reporting. With the emergency mode check in place, if RPC was down but the system was in emergency mode, the health check would report that everything was fine (since RPC wasn't considered critical in emergency mode). This could mask actual issues with the RPC connection that needed attention. New BehaviorWith this change:
This ensures that the health check accurately reflects the state of the RPC connection, which is particularly important for MDCB deployments where RPC is a critical component for communication between data centers. ConclusionThis PR simplifies the RPC critical failure logic by removing the emergency mode condition, which should fix the regression issues with MDCB readiness checks. The change is well-tested with the removal of test cases that are no longer relevant after this change. The PR title indicates this is a reversion of a previous change that introduced the emergency mode check, suggesting that the original behavior (without the emergency mode check) was more reliable for MDCB readiness checks. Tip: Mention me again using |
Analysis of PR #7231: Revert RPC Emergency Mode Check in Health ChecksOverviewThis PR reverts changes related to RPC emergency mode handling in the health check system. The key changes are:
Detailed Changes1. Modification to
|
Analysis of PR #7231: Revert RPC Emergency Mode Check in Health CheckThis PR reverts changes related to how the Tyk gateway determines if RPC is a critical component during health checks. The changes specifically affect the MDCB (Multi Data Center Bridge) readiness check functionality. Key Changes1. Simplified RPC Critical Failure LogicThe PR removes the emergency mode check from the RPC critical failure determination: Before: // Consider RPC critical only if using RPC and gw not in emergency mode
if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
return true
} After: // Consider RPC critical only if using RPC
if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
return true
} 2. Simplified Test CoverageThe PR significantly simplifies the test code:
3. Technical ImpactEmergency Mode BehaviorThe previous implementation would not consider RPC as critical when in emergency mode, even if the policy source was set to "rpc". This PR changes that behavior so that RPC is always considered critical when the policy source is "rpc", regardless of emergency mode. MDCB Readiness CheckThe PR title mentions "regression fixes for failing mdcb readiness check", indicating that the previous behavior was causing issues with MDCB readiness checks. By reverting to the simpler logic, the PR aims to fix these regression issues. AnalysisRationaleThe emergency mode in Tyk is a fallback mechanism that activates when RPC connections fail, allowing the gateway to operate with locally cached data. The previous implementation likely aimed to prevent the readiness check from failing when in emergency mode, as RPC would naturally be unavailable in that state. However, this appears to have caused regression issues with MDCB readiness checks. The PR reverts to a simpler approach where RPC is always considered critical when configured as the policy source, which may be more appropriate for ensuring system integrity in distributed deployments. Potential ConcernsBy removing the emergency mode check, the system will now treat RPC as critical even during emergency mode. This could potentially lead to readiness checks failing when RPC is unavailable, even though the system is designed to operate in emergency mode with local backups. Benefits
ConclusionThis PR represents a pragmatic approach to fixing regression issues by reverting to a simpler implementation. While it might seem counterintuitive to consider RPC critical even during emergency mode (when RPC is known to be unavailable), this approach likely provides more consistent behavior for MDCB deployments. The PR title's "DO NOT MERGE" prefix suggests this might be a temporary fix or under evaluation, possibly to be replaced with a more comprehensive solution in the future. Tip: Mention me again using |
|
PR Type
Bug fix, Tests
Description
Reverts logic for RPC critical failure to previous behavior
Removes emergency mode check for RPC in health check
Deletes related test cases for RPC emergency mode scenarios
Simplifies
isCriticalFailure
and its test coverageChanges diagram
Changes walkthrough 📝
health_check.go
Remove emergency mode check from RPC critical failure logic
gateway/health_check.go
isCriticalFailure
health_check_test.go
Remove and simplify tests for RPC emergency mode logic
gateway/health_check_test.go
isCriticalFailure
rpc
package in tests