Skip to content

Conversation

lghiur
Copy link
Collaborator

@lghiur lghiur commented Jul 11, 2025

PR Type

Bug fix, Tests


Description

  • Reverts logic for RPC critical failure to previous behavior

  • Removes emergency mode check for RPC in health check

  • Deletes related test cases for RPC emergency mode scenarios

  • Simplifies isCriticalFailure and its test coverage


Changes diagram

flowchart LR
  A["isCriticalFailure logic"] -- "Remove emergency mode check for RPC" --> B["Simplified RPC critical check"]
  B -- "Update tests" --> C["Remove emergency mode test cases"]
Loading

Changes walkthrough 📝

Relevant files
Bug fix
health_check.go
Remove emergency mode check from RPC critical failure logic

gateway/health_check.go

  • Removes emergency mode check for RPC in isCriticalFailure
  • Restores previous logic for determining RPC as critical
  • +2/-2     
    Tests
    health_check_test.go
    Remove and simplify tests for RPC emergency mode logic     

    gateway/health_check_test.go

  • Deletes test cases for RPC emergency mode scenarios
  • Simplifies test table for isCriticalFailure
  • Removes dependency on rpc package in tests
  • +0/-38   

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @buger
    Copy link
    Member

    buger commented Jul 11, 2025

    💔 The detected issue is not in one of the allowed statuses 💔

    Detected Status DoD Check
    Allowed Statuses In Dev,In Code Review,Ready for Testing,In Test,In Progress,In Review ✔️

    Please ensure your jira story is in one of the allowed statuses

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Logic Change

    The logic for determining if the RPC component is a critical failure has been reverted to ignore the emergency mode check. Reviewers should validate that this change aligns with the intended system behavior and does not reintroduce previously fixed issues.

    // Consider RPC critical only if using RPC
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
    	return true
    }
    Test Coverage Reduction

    Test cases for RPC emergency mode scenarios have been removed. Ensure that the remaining tests still provide adequate coverage for the critical failure logic, especially for edge cases involving the RPC component.

    			Status:        Warn,
    			ComponentType: System,
    		},
    		setupConfig: func(conf *config.Config) {
    			conf.UseDBAppConfigs = true
    		},
    		expectedResult: true, // Critical based on component and config, not status
    	},
    }
    
    for _, tt := range tests {
    	t.Run(tt.name, func(t *testing.T) {
    		// Create a new gateway instance for each test
    		conf := config.Config{}
    		tt.setupConfig(&conf)
    		gw := NewGateway(conf, nil)
    
    		// Call the function under test
    		result := gw.isCriticalFailure(tt.component)

    Copy link
    Contributor

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Restore emergency mode check for RPC

    Removing the emergency mode check may cause the system to treat the rpc component as
    critical even during emergency mode, potentially leading to unnecessary failures.
    Restore the emergency mode condition to avoid marking rpc as critical when in
    emergency mode.

    gateway/health_check.go [304-307]

    -// Consider RPC critical only if using RPC
    -if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
    +// Consider RPC critical only if using RPC and not in emergency mode
    +if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
     	return true
     }
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion correctly identifies that removing the emergency mode check could cause the system to treat the rpc component as critical even during emergency mode, which may not be the intended behavior. Restoring this check is important for correct system failover logic and prevents unnecessary failures.

    Medium

    Copy link
    Contributor

    API Changes

    no api changes detected

    Copy link
    Contributor

    Dependency Impact Analysis: RPC Emergency Mode in Health Checks

    This PR reverts a change to how RPC (Remote Procedure Call) failures are evaluated in Tyk Gateway's health check system. The change specifically affects MDCB (Multi Data Center Bridge) deployments and their readiness checks.

    Key Changes

    1. Simplified RPC Critical Failure Logic:

      • Before: RPC was considered a critical failure only if the policy source was RPC AND the system was NOT in emergency mode
      • After: RPC is considered a critical failure if the policy source is RPC, regardless of emergency mode
    2. Test Simplification:

      • Removed test cases specifically for RPC emergency mode scenarios
      • Removed dependency on the rpc package in the health check tests
      • Simplified the test table for isCriticalFailure

    Technical Impact Analysis

    Emergency Mode Context

    In Tyk Gateway, "emergency mode" is a resilience mechanism that activates when the RPC connection to MDCB fails. When in emergency mode, the system falls back to using cached data instead of trying to fetch fresh data from the RPC server, allowing the gateway to continue functioning during RPC outages.

    Health Check Behavior Change

    The change modifies how health checks report the system status when RPC is down:

    • Previous behavior: When in emergency mode, RPC failures would NOT be considered critical, so health checks would still pass (return 200 OK)
    • New behavior: RPC failures are ALWAYS considered critical if RPC is the policy source, regardless of emergency mode (will return 503 Service Unavailable)

    MDCB Readiness Check Impact

    This change specifically addresses a regression issue with MDCB readiness checks (TT-9234). The readiness endpoint (/ready) is used by orchestration systems like Kubernetes to determine if a pod is ready to receive traffic.

    By removing the emergency mode condition, the readiness check will more accurately reflect the actual state of the RPC connection, which is important for proper cluster management in MDCB deployments. This ensures that systems relying on these health checks get accurate information about the gateway's connectivity status.

    Dependency Impact

    This change affects systems that rely on Tyk Gateway's health/readiness checks:

    1. Kubernetes/Orchestration Systems: Services that route traffic based on health check results will now see RPC failures as critical, even during emergency mode. This may affect pod availability reporting.

    2. Monitoring Systems: Alerts may trigger for RPC failures even when the system is operating in its designed fallback state.

    3. MDCB Deployments: The PR specifically fixes issues with readiness checks in MDCB scenarios, ensuring more accurate reporting of the gateway's connection status.

    Conclusion

    This change represents a trade-off between nuanced health reporting and accurate connection status reporting. While the gateway can continue to function in emergency mode with cached data, the health check will now report the actual RPC connection status more accurately, which is important for proper orchestration in MDCB deployments.

    The simplification makes the health check behavior more straightforward and consistent, at the cost of potentially reporting the gateway as unhealthy even when it's operating in a valid fallback state.


    Tip: Mention me again using /dependency <request>.
    Powered by Probe AI\nTyk Gateway Dependency Impact Reviewer

    Copy link
    Contributor

    PR Analysis: Revert RPC Emergency Mode Check in Health Check

    This PR reverts changes related to how the Tyk gateway handles RPC critical failures in health checks. The changes are focused on fixing regression issues with MDCB (Multi Data Center Bridge) readiness checks.

    Changes Overview

    1. In gateway/health_check.go:

      • Removes the emergency mode check from the RPC critical failure logic
      • The condition is simplified from:
        // Consider RPC critical only if using RPC and gw not in emergency mode
        if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
        to:
        // Consider RPC critical only if using RPC
        if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
    2. In gateway/health_check_test.go:

      • Removes the import of the rpc package
      • Removes the setupFunc field from the test struct which was used to set up emergency mode
      • Deletes two test cases that were specifically testing RPC emergency mode scenarios:
        • "rpc component is NOT critical when PolicySource is rpc but in emergency mode"
        • "rpc component is critical when PolicySource is rpc and NOT in emergency mode"
      • Removes the code that would call setupFunc in the test

    Technical Context

    Emergency Mode

    Emergency mode is a flag that is set when the RPC connection is down and the gateway is configured to use the local cache. The IsEmergencyMode() function in the rpc package returns this state.

    MDCB (Multi Data Center Bridge)

    MDCB is a component of Tyk that allows for multi-data center deployments. It relies on RPC for communication between data centers.

    Readiness Check

    The readiness check is a dedicated endpoint that verifies if the gateway is ready to serve requests by checking:

    • Redis connection status
    • Successful API reload
    • Other critical components

    Unlike the liveness check, the readiness check returns a 503 Service Unavailable status if the gateway is not ready to serve requests.

    Impact Analysis

    This change reverts to a simpler behavior where RPC is considered a critical component whenever the policy source is set to "rpc", regardless of whether the system is in emergency mode or not.

    Problem Being Fixed

    The previous implementation was causing issues with MDCB readiness checks. When the system entered emergency mode (due to RPC connection issues), the health check would not consider RPC as critical, which could lead to misleading health status reporting.

    With the emergency mode check in place, if RPC was down but the system was in emergency mode, the health check would report that everything was fine (since RPC wasn't considered critical in emergency mode). This could mask actual issues with the RPC connection that needed attention.

    New Behavior

    With this change:

    1. If RPC is configured as the policy source, it will always be considered critical
    2. If RPC fails, it will always be considered a critical failure when RPC is the policy source
    3. The readiness check will now fail appropriately when RPC is down, regardless of emergency mode

    This ensures that the health check accurately reflects the state of the RPC connection, which is particularly important for MDCB deployments where RPC is a critical component for communication between data centers.

    Conclusion

    This PR simplifies the RPC critical failure logic by removing the emergency mode condition, which should fix the regression issues with MDCB readiness checks. The change is well-tested with the removal of test cases that are no longer relevant after this change.

    The PR title indicates this is a reversion of a previous change that introduced the emergency mode check, suggesting that the original behavior (without the emergency mode check) was more reliable for MDCB readiness checks.


    Tip: Mention me again using /performance <request>.
    Powered by Probe AI\nPerformance Impact Reviewer Prompt

    Copy link
    Contributor

    Analysis of PR #7231: Revert RPC Emergency Mode Check in Health Checks

    Overview

    This PR reverts changes related to RPC emergency mode handling in the health check system. The key changes are:

    1. Removal of the emergency mode check for RPC in the isCriticalFailure function
    2. Simplification of related test cases

    Detailed Changes

    1. Modification to isCriticalFailure Function

    The PR reverts the RPC critical failure logic to its previous behavior by removing the emergency mode check:

    Before the PR (current implementation):

    // Consider RPC critical only if using RPC and not in emergency mode
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
        return true
    }

    After the PR (reverted to previous behavior):

    // Consider RPC critical only if using RPC
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
        return true
    }

    This change means that RPC will be considered a critical failure if it's being used as the policy source, regardless of whether it's in emergency mode or not.

    2. Test Case Simplification

    The PR also removes test cases related to RPC emergency mode scenarios in health_check_test.go, simplifying the test coverage for the isCriticalFailure function. The removed test cases specifically tested the behavior when RPC was in emergency mode.

    Understanding the Context

    Emergency Mode in Tyk

    In Tyk, "emergency mode" is a fallback mechanism for RPC connections. When the RPC connection fails, the system can enter emergency mode, which allows it to continue operating using cached data. This is particularly important for MDCB (Multi Data Center Bridge) deployments.

    The emergency mode is managed by the rpc package and is used in various parts of the codebase to handle RPC failures gracefully. For example:

    • When loading policies from RPC, if in emergency mode, it loads from a backup
    • When checking session details, it handles emergency mode differently
    • During resource synchronization, emergency mode can be triggered if all retries fail

    Health Check System

    The health check system in Tyk monitors various components and reports their status. The isCriticalFailure function determines whether a component failure should be considered critical, which affects the overall health status reported by the system.

    The health check is used for both liveness and readiness probes, which are important for Kubernetes deployments to determine if a pod is healthy and ready to serve traffic.

    Impact Analysis

    Functionality Changes

    1. Stricter Health Checks: By removing the emergency mode check, the system will now report RPC failures as critical even when in emergency mode. This means the health check will report a failure even when the system could potentially still operate using cached data.

    2. MDCB Readiness Check: The PR title mentions "regression fixes for failing mdcb readiness check", suggesting that the previous implementation was causing issues with MDCB readiness checks. The change ensures that RPC failures are consistently treated as critical when RPC is the policy source.

    3. Simplified Logic: The change simplifies the failure detection logic, making it more straightforward and easier to understand.

    Potential Concerns

    1. False Positives: This change might lead to more false positive critical failures being reported, as RPC will be considered critical even when the system could operate in emergency mode.

    2. Emergency Mode Behavior: The emergency mode in RPC is a state where the RPC connection is not functioning properly, but the system can still operate using cached data. By removing this check, the health check system will not take this into account.

    Conclusion

    This PR reverts to a simpler, more conservative approach to health checks for RPC components. While it might result in more stringent failure reporting, it ensures that any RPC issues are properly flagged as critical when RPC is being used as the policy source. This change appears to be addressing specific regression issues with MDCB readiness checks.

    The PR is marked as "DO NOT MERGE" in the title, suggesting it might be a temporary fix or under review, and not intended for immediate merging into the main codebase. This could indicate that a more comprehensive solution is being developed, or that further testing is needed before this change can be safely merged.


    Tip: Mention me again using /connectivity <request>.
    Powered by Probe AI\nConnectivity Issues Reviewer Prompt for Tyk Gateway

    Copy link
    Contributor

    Analysis of PR #7231: Revert RPC Emergency Mode Check in Health Check

    This PR reverts changes related to how the Tyk gateway determines if RPC is a critical component during health checks. The changes specifically affect the MDCB (Multi Data Center Bridge) readiness check functionality.

    Key Changes

    1. Simplified RPC Critical Failure Logic

    The PR removes the emergency mode check from the RPC critical failure determination:

    Before:

    // Consider RPC critical only if using RPC and gw not in emergency mode
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
        return true
    }

    After:

    // Consider RPC critical only if using RPC
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
        return true
    }

    2. Simplified Test Coverage

    The PR significantly simplifies the test code:

    • Removes the rpc package import as it's no longer needed
    • Removes test cases specifically for RPC emergency mode scenarios
    • Simplifies the test table structure for isCriticalFailure
    • Renames setUpConfig to setupConfig for consistency
    • Removes the setupRPC and setupFunc fields that were used to set up emergency mode

    3. Technical Impact

    Emergency Mode Behavior

    The previous implementation would not consider RPC as critical when in emergency mode, even if the policy source was set to "rpc". This PR changes that behavior so that RPC is always considered critical when the policy source is "rpc", regardless of emergency mode.

    MDCB Readiness Check

    The PR title mentions "regression fixes for failing mdcb readiness check", indicating that the previous behavior was causing issues with MDCB readiness checks. By reverting to the simpler logic, the PR aims to fix these regression issues.

    Analysis

    Rationale

    The emergency mode in Tyk is a fallback mechanism that activates when RPC connections fail, allowing the gateway to operate with locally cached data. The previous implementation likely aimed to prevent the readiness check from failing when in emergency mode, as RPC would naturally be unavailable in that state.

    However, this appears to have caused regression issues with MDCB readiness checks. The PR reverts to a simpler approach where RPC is always considered critical when configured as the policy source, which may be more appropriate for ensuring system integrity in distributed deployments.

    Potential Concerns

    By removing the emergency mode check, the system will now treat RPC as critical even during emergency mode. This could potentially lead to readiness checks failing when RPC is unavailable, even though the system is designed to operate in emergency mode with local backups.

    Benefits

    • Simpler, more straightforward logic for determining critical components
    • Fixes regression issues with MDCB readiness checks
    • Removes dependency on the rpc package in tests, making them more isolated

    Conclusion

    This PR represents a pragmatic approach to fixing regression issues by reverting to a simpler implementation. While it might seem counterintuitive to consider RPC critical even during emergency mode (when RPC is known to be unavailable), this approach likely provides more consistent behavior for MDCB deployments.

    The PR title's "DO NOT MERGE" prefix suggests this might be a temporary fix or under evaluation, possibly to be replaced with a more comprehensive solution in the future.


    Tip: Mention me again using /security <request>.
    Powered by Probe AI\nSecurity Impact Reviewer Prompt

    Copy link

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants