[Hitless Upgrades] React to maintenance events #3345 #3354

tishun · 2025-07-10T15:48:11Z

Maintenance Events Support

Summary

Adds maintenance events support for Redis hitless upgrades. Enables clients to handle server-side maintenance events and adapt connection behavior during maintenance windows.

Changes

New Components

MaintenanceEventsOptions - Configuration for maintenance event handling
MaintenanceAwareConnectionWatchdog - Handles MOVING, MIGRATING, FAILING_OVER events
MaintenanceAwareExpiryWriter - Applies relaxed timeouts during maintenance
RebindState - Connection rebinding state management

Default Configuration

Maintenance events support is enabled by default on the client side.
10-second relaxed timeout during maintenance operations
Auto-resolve the address type source for endpoint resolution

Client Integration

Updated ClientOptions with maintenance events configuration
Modified ConnectionBuilder with maintenance-aware components
Enhanced RedisHandshake for maintenance event subscription

Testing

Unit tests for maintenance components
Functional tests with fault injection

Backward Compatibility

All changes are backward compatible from an API perspective. Existing applications work without modification.
If the feature is enabled on the Redis server side, a relaxed timeout (10 seconds by default) is applied during maintenance.
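The relaxed-timeout behavior described above can be illustrated with a minimal sketch. It is not the Lettuce implementation: `RelaxedTimeoutSketch` and its methods are hypothetical names, and it assumes the relaxed timeout simply replaces the regular command timeout while a maintenance window is active.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of Lettuce: selects the command timeout
// depending on whether a maintenance window is currently active.
final class RelaxedTimeoutSketch {

    private final Duration normalTimeout;
    private final Duration relaxedTimeout;
    private final AtomicBoolean maintenanceActive = new AtomicBoolean(false);

    RelaxedTimeoutSketch(Duration normalTimeout, Duration relaxedTimeout) {
        this.normalTimeout = normalTimeout;
        this.relaxedTimeout = relaxedTimeout;
    }

    // Assumed to be called when a maintenance notification (for example MIGRATING) arrives.
    void onMaintenanceStarted() {
        maintenanceActive.set(true);
    }

    // Assumed to be called when the maintenance window is over (for example on MIGRATED).
    void onMaintenanceCompleted() {
        maintenanceActive.set(false);
    }

    // Returns the relaxed timeout during maintenance, the normal timeout otherwise.
    Duration effectiveTimeout() {
        return maintenanceActive.get() ? relaxedTimeout : normalTimeout;
    }

    public static void main(String[] args) {
        RelaxedTimeoutSketch timeouts =
                new RelaxedTimeoutSketch(Duration.ofSeconds(1), Duration.ofSeconds(10));
        System.out.println("before maintenance: " + timeouts.effectiveTimeout());
        timeouts.onMaintenanceStarted();
        System.out.println("during maintenance: " + timeouts.effectiveTimeout());
        timeouts.onMaintenanceCompleted();
        System.out.println("after maintenance: " + timeouts.effectiveTimeout());
    }
}
```

The 1-second regular timeout is an arbitrary value chosen for the example; only the 10-second relaxed default comes from this pull request.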

Related Issues

CAE-1303: Enable maintenance events support by default
CAE-1285: Support for notifications with moving-endpoint-type=none
CAE-1130: Timeout tests and configuration
CAE-633: Functional tests for notifications

Make sure that:

You have read the contribution guidelines.
You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
You submit test cases (unit or integration tests) that back your changes.

ggivo

LGTM

* v0.1
* Simple reconnect now working
* Bind address from message is now considered
* Self-register the handler
* Format code
* Filter push messages in a more stable way
* (very hacky) Relax command expire timers globally
* Configure if timeout relaxing should be applied
* Proper way to close channel
* Configure the timeout relaxing
* Sequential handover implemented
* Did not address formatting
* Prolong the rebind window for relaxed timeouts
* PubSub no longer required; CommandExpiryWriter is now channel aware; Polishing
* Use the new MOVING push message from the RE server
* Unit test was not chaining delegates in the same way that the RedisClient/RedisClusterClient was
* Fix REBIND message validation
* Fixed the expiry mechanism
* Polishing
* Fix NPE. Seems like AttributeMap.attr is not accurate and actually returns null, causing some unit test failures.
* Add support for MIGRATING/MIGRATED message handling in command expiry
  This commit adds the ability to listen for MIGRATING and MIGRATED messages and trigger extended command expiry timeouts during Redis shard migration.
  Key changes:
  - Enhanced RebindAwareConnectionWatchdog to detect MIGRATING/MIGRATED messages
  - RebindAwareExpiryWriter to trigger timeout relaxation whenever a MIGRATING message is received
  This feature allows commands to have relaxed timeouts during shard migration operations, preventing unnecessary timeouts when Redis is temporarily busy with migration tasks.
* formatting
* Fix: Disabling relaxTimeouts after upgrade can interfere with an ongoing one from re-bind
* Additional fix for timeout relaxing disabled
* Fix push message listener registered multiple times after rebind.
* Fix: Report correct command timeout when relaxTimeout is configured
* Disable relaxedTimeout after configured grace period
  - Introduce a delay before disabling relaxedTimeout
  - Grace period duration is provided via push notification
* Code clean up
  - Remove reading from pub/sub channel and rely only on push notifications
* Add FAILING_OVER/FAILED_OVER
* Polishing: Rename components to use the word 'maintenance'
---------
Co-authored-by: Igor Malinovskiy <[email protected]>
Co-authored-by: ggivo <[email protected]>
# Conflicts:
# src/main/java/io/lettuce/core/ClientOptions.java

(#3356)
* Unit tests for the maintenance aware classes
* Did not format properly
* Proper license

…ample.java

* initial WIP, with lots of debugging, and some non-working tests
* debug
* more attempts at debugging
* Refactor: Move cluster state management methods from MaintenanceNotificationTest to RedisEnterpriseConfig
  - Moved refreshClusterConfig, recordOriginalClusterState, and restoreOriginalClusterState methods
  - Updated call sites in MaintenanceNotificationTest to use RedisEnterpriseConfig methods
  - Added required imports and static variables to RedisEnterpriseConfig
  - Maintained existing functionality while improving code organization
  Improvements to RelaxedTimeoutConfigurationTest:
  - Simplified traffic generation logic by removing complex multi-phase testing
  - Streamlined BLPOP command execution with better timeout detection
  - Added relaxed timeout detection and recording during maintenance events
  - Improved logging and error handling for timeout analysis
  - Enhanced test assertions to focus on relaxed timeout detection rather than success counts
  - Added MOVING operation duration tracking for better test analysis
* Improve test reliability and cleanup: Add @AfterEach cleanup, enhance endpoint tracking, and improve logging
* add un-relaxed tests. Will investigate further, via diff, why they got broken at some point
* CAE-1130: Update timeout configuration test and watchdog implementation
* Reset MaintenanceAwareConnectionWatchdog.java and log4j2-test.xml to upstream versions
* Clean up debug info and outdated comments from timeout tests
  - Remove large debug block with reflection-based debugging in RelaxedTimeoutConfigurationTest
  - Simplify excessive debug logging and verbose markers (*** and ===)
  - Clean up maintenance notification test logging
  - Improve push notification monitor message formatting
  - Maintain all test functionality while improving code readability
* Refactor: Move inline comments above code and fix string comparisons
  - Move all inline comments to be above the code they reference in:
    * RelaxedTimeoutConfigurationTest.java
    * RedisEnterpriseConfig.java
    * MaintenanceNotificationTest.java
  - Replace string != comparisons with .equals() for proper string comparison
  - Apply code formatting via Maven formatter
  This improves code readability and follows Java best practices for string comparison.

…t-in (#3380)
* Support for Client-side opt-in
  A client can tell the server if it wants to receive maintenance push notifications via the following command:
  CLIENT MAINT_NOTIFICATIONS <ON | OFF> [parameter value parameter value ...]
* update maintenance events to latest format
  - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds.
  - MIGRATED <seq_number> <shard_id-s>: A shard migration ended.
  - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started.
  - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended.
  - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds
* clean up
* Update FAILED_OVER & MIGRATED to include additional time field
* update is private reserver check & add unit tests
  update is private reserver check
* add unit tests for handshake with enabled maintenance events
* add missing copyrights/docs
* format
* address review comments
* Revert address after rebind operation expires
* Update event's validation spec
  - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds.
  - MIGRATED <seq_number> <shard_id-s>: A shard migration ended.
  - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started.
  - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended.
  - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds
* rebase
* format after rebase
* Apply suggestions from code review
  Co-authored-by: Tihomir Krasimirov Mateev <[email protected]>
* javadoc updated
* Update src/main/java/io/lettuce/core/internal/NetUtils.java
  Co-authored-by: Tihomir Krasimirov Mateev <[email protected]>
---------
Co-authored-by: Tihomir Krasimirov Mateev <[email protected]>
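To make the event formats listed in this commit message concrete, here is a rough parsing sketch. It is only an illustration, not the validation code from this pull request: `MaintenanceEventSketch` is a hypothetical class, and it assumes the push message has already been decoded into a list of strings in the order given by the spec above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class, not part of Lettuce: holds one decoded maintenance event.
final class MaintenanceEventSketch {

    final String type;
    final long sequenceNumber;
    final Long timeSeconds; // null for events that carry no <time> field
    final String target;    // raw <shard_id-s> or <endpoint> element

    private MaintenanceEventSketch(String type, long sequenceNumber, Long timeSeconds, String target) {
        this.type = type;
        this.sequenceNumber = sequenceNumber;
        this.timeSeconds = timeSeconds;
        this.target = target;
    }

    // Assumes elements = [type, seq_number, (time), shard ids or endpoint].
    static MaintenanceEventSketch parse(List<String> elements) {
        if (elements.size() < 3) {
            throw new IllegalArgumentException("Too few elements: " + elements);
        }
        String type = elements.get(0);
        long sequenceNumber = Long.parseLong(elements.get(1));
        switch (type) {
            case "MIGRATING":
            case "FAILING_OVER":
            case "MOVING":
                // These events carry a <time> field before the last element.
                if (elements.size() < 4) {
                    throw new IllegalArgumentException(type + " requires a time field: " + elements);
                }
                return new MaintenanceEventSketch(type, sequenceNumber,
                        Long.valueOf(elements.get(2)), elements.get(3));
            case "MIGRATED":
            case "FAILED_OVER":
                // Completion events carry only the sequence number and shard ids.
                return new MaintenanceEventSketch(type, sequenceNumber, null, elements.get(2));
            default:
                throw new IllegalArgumentException("Unknown maintenance event: " + type);
        }
    }

    public static void main(String[] args) {
        // Example values are made up for the demonstration.
        MaintenanceEventSketch moving = parse(Arrays.asList("MOVING", "12", "30", "10.0.0.7:6379"));
        System.out.println(moving.type + " seq=" + moving.sequenceNumber
                + " time=" + moving.timeSeconds + "s target=" + moving.target);
    }
}
```

The sketch rejects unknown event names instead of ignoring them; a real client might prefer to skip unrecognized push messages so that unrelated notifications pass through.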

…e event notifications (CAE-1285) (#3396)
* support MOVING with none
  none indicates that the MOVING message doesn’t need to contain an endpoint. In such a case, the client is expected to schedule a graceful reconnect to its currently configured endpoint after half of the grace period that was communicated by the server is over.
* formatting
* Apply suggestions from code review
  Co-authored-by: Copilot <[email protected]>
* Fix NPE
* Add test to cover null rebindAddress
  A null rebind address can be returned as part of a MOVING notification if the client is connected using 'moving-endpoint-type=none'
* Add java docs to RebindAwareAddressSupplier
---------
Co-authored-by: Copilot <[email protected]>
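The "reconnect after half of the grace period" rule for MOVING with none can be sketched as follows. This is an illustration under assumptions, not the code from this commit: `GracefulReconnectSketch` and `reconnectDelay` are hypothetical names, the grace period is taken to be the `<time>` value in seconds, and the reconnect itself is replaced by a print statement.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Lettuce.
final class GracefulReconnectSketch {

    // Returns the delay before reconnecting: half of the server-communicated grace period.
    static Duration reconnectDelay(Duration gracePeriod) {
        if (gracePeriod.isNegative()) {
            throw new IllegalArgumentException("Grace period must not be negative: " + gracePeriod);
        }
        return gracePeriod.dividedBy(2);
    }

    public static void main(String[] args) throws InterruptedException {
        // A short grace period keeps the demonstration quick; real values are in seconds.
        Duration gracePeriod = Duration.ofMillis(200);
        Duration delay = reconnectDelay(gracePeriod);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for the reconnect to the currently configured endpoint.
        scheduler.schedule(
                () -> System.out.println("reconnecting after " + delay.toMillis() + " ms"),
                delay.toMillis(), TimeUnit.MILLISECONDS);

        // The already scheduled task still runs after shutdown() with the default executor policy.
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```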

…1303) (#3415)
* set default relaxed timeout to 10s
* Enable maintenance events support by default
* Enable maintenance events support by default
* fix tests
  - ensure MaintenanceAwareExpiryWriter is registered for events when watchdog is created
  - command timeout was not applied
* fix tests
  - sporadic failure because of timeout of 50ms
    RedisHandshakeUnitTests.handshakeDelayedCredentialProvider:153 » ConditionTimeout
  - new command introduced during handshake, increase the timeout to 100ms

- reset() - removed with issue#3328 - remove deprecated code from issue#907 (#3395)

- no longer needed.

ggivo

LGTM

tishun added this to the 7.0.0.RELEASE milestone Jul 10, 2025

tishun requested review from dmaier-redislabs, kiryazovi-redis and ggivo July 10, 2025 15:48

tishun added the type: feature A new feature label Jul 10, 2025

redis deleted a comment from dengliming Jul 11, 2025

tishun assigned ggivo Jul 11, 2025

ggivo reviewed Jul 21, 2025

kiryazovi-redis force-pushed the feature/maintenance-events branch from 7f3c59b to a133028 Compare July 28, 2025 10:04

tishun and others added 8 commits August 29, 2025 10:08

[Hitless upgrades] Add unit tests for the newly introduced classes #3355

cd3bc51

(#3356)
* Unit tests for the maintenance aware classes
* Did not format properly
* Proper license

Cae 633 add functional tests notifications (#3357) - excluding JsonEx…

9980399

…ample.java

resolve errors after rebase on main

d3e55e4

- reset() - removed with issue#3328 - remove deprecated code from issue#907 (#3395)

ggivo force-pushed the feature/maintenance-events branch from 3475792 to d3e55e4 Compare August 29, 2025 07:27

tishun marked this pull request as ready for review August 29, 2025 08:29

Remove LettuceMaintenanceEventsDemo.java

844e343

- no longer needed.

ggivo approved these changes Sep 1, 2025

tishun merged commit d7cbe9f into main Sep 1, 2025
15 of 16 checks passed

tishun deleted the feature/maintenance-events branch September 1, 2025 08:49
