Cluster is flaky, sometimes nodes get into a corrupted state #1205

@kerkmann

Description

Describe the bug

I've set up a cluster with 3 nodes and then repeatedly stopped, waited, and started nodes; there was always at least one node online. Restarting some nodes corrupted the database/state of one node. That node is now unable to start because its entire database is corrupted.

To Reproduce

  1. Set up a cluster with 3 nodes.
  2. Stop Node 2.
  3. Wait a few seconds.
  4. Stop Node 3.
  5. Wait a few seconds.
  6. With Node 1 still running, start Node 3.
  7. Stop Node 3.
  8. Wait a few seconds.
  9. Start Node 2.
  10. Wait a few seconds.
  11. Start Node 3.
  12. Node 3 is now corrupted.
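
For reference, the sequence above can be sketched as a small script driving a Docker Compose deployment. The service names (`rauthy-1`, `rauthy-2`, `rauthy-3`) and the sleep duration are assumptions about the local setup, not exact values from this report:

```python
# Hypothetical reproduction script for the stop/start sequence above.
# Assumes one compose service per node; adjust names and timing to your setup.
import subprocess
import time

def compose(action: str, service: str) -> None:
    """Run `docker compose <action> <service>` and fail loudly on errors."""
    subprocess.run(["docker", "compose", action, service], check=True)

def wait(seconds: int = 5) -> None:
    time.sleep(seconds)

# Node 1 ("rauthy-1") stays up the whole time.
compose("stop", "rauthy-2")   # step 2
wait()                        # step 3
compose("stop", "rauthy-3")   # step 4
wait()                        # step 5
compose("start", "rauthy-3")  # step 6 (Node 1 still running)
compose("stop", "rauthy-3")   # step 7
wait()                        # step 8
compose("start", "rauthy-2")  # step 9
wait()                        # step 10
compose("start", "rauthy-3")  # step 11
# Step 12: at this point Node 3 failed to start with the panic shown below.
```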

Expected behavior

A cluster should be more bulletproof with regard to clean restarts.

Additional context

After restarting the leader node, everything worked as intended.

rauthy  |
rauthy  |
rauthy  |                                                   88
rauthy  |                                             ,d    88
rauthy  |                                             88    88
rauthy  |         8b,dPPYba, ,adPPYYba, 88       88 MM88MMM 88,dPPYba,  8b       d8
rauthy  |         88P'   "Y8 ""     `Y8 88       88   88    88P'    "8a `8b     d8'
rauthy  |         88         ,adPPPPP88 88       88   88    88       88  `8b   d8'
rauthy  |         88         88,    ,88 "8a,   ,a88   88,   88       88   `8b,d8'
rauthy  |         88         `"8bbdP"Y8  `"YbbdP'Y8   "Y888 88       88     Y88'
rauthy  |                                                                   d8'
rauthy  |                                                                  d8'
rauthy  |
rauthy  | 2025-09-22T10:47:00.068382Z  INFO rauthy: Starting Rauthy v0.32.3-20250915 (2025-09-15 08:10:42 UTC)
rauthy  | 2025-09-22T10:47:00.068415Z  INFO rauthy: Log Level set to 'INFO'
rauthy  | 2025-09-22T10:47:00.068422Z  INFO rauthy_data::rauthy_config: Public URL: auth.ihate.tech
rauthy  | 2025-09-22T10:47:00.068427Z  INFO rauthy_data::rauthy_config: Issuer: https://auth.ihate.tech/auth/v1
rauthy  | 2025-09-22T10:47:00.068574Z  WARN hiqlite_wal::log_store: LockFile data/logs exists already - this is not a clean start!
rauthy  | 2025-09-22T10:47:00.068989Z  INFO check_integrity:check_repair_data_integrity: hiqlite_wal::wal: Starting detailed WalFile integrity check for data/logs/0000000000000001.wal
rauthy  | 2025-09-22T10:47:00.069171Z  INFO check_integrity:check_repair_data_integrity: hiqlite_wal::wal: No unexpected additional data found in data/logs/0000000000000001.wal
rauthy  | 2025-09-22T10:47:00.069340Z  INFO hiqlite::store::state_machine::sqlite::state_machine: db_exists in state_machine::new(): true
rauthy  | 2025-09-22T10:47:00.069501Z  WARN hiqlite::store::state_machine::sqlite::state_machine: Lock file already exists: data/state_machine/lock
rauthy  | Node did not shut down gracefully - auto-rebuilding State Machine
rauthy  | 2025-09-22T10:47:00.150448Z  INFO hiqlite::store::state_machine::sqlite::writer: Starting snapshot restore from data/state_machine/snapshots/019970fb-9b48-7952-ac10-bc05fcb07e74
rauthy  | Database restore remaining: 50
rauthy  | Database restore remaining: 0
rauthy  | 2025-09-22T10:47:00.154905Z  INFO hiqlite::store::state_machine::sqlite::writer: Snapshot restore finished after 4 ms
rauthy  | 2025-09-22T10:47:00.155753Z  INFO hiqlite::store::state_machine::sqlite::state_machine: applied_state: StateMachineData { last_applied_log_id: Some(LogId { leader_id: LeaderId { term: 1, node_id: 1 }, index: 11 }), last_membership: StoredMembership { log_id: Some(LogId { leader_id: LeaderId { term: 0, node_id: 0 }, index: 0 }), membership: Membership { configs: [{1}], nodes: {1: Node { id: 1, addr_raft: "192.168.1.12:8100", addr_api: "192.168.1.12:8200" }} } }, last_snapshot_id: Some("019970fb-9b48-7952-ac10-bc05fcb07e74") }
rauthy  | 2025-09-22T10:47:00.155810Z  INFO openraft::storage::helper: get_initial_state vote=T4-N3:committed last_purged_log_id=None last_applied=T1-N1-11 committed=None last_log_id=T4-N3-241
rauthy  | 2025-09-22T10:47:00.156070Z  INFO hiqlite::store::state_machine::sqlite::state_machine: applied_state: StateMachineData { last_applied_log_id: Some(LogId { leader_id: LeaderId { term: 1, node_id: 1 }, index: 11 }), last_membership: StoredMembership { log_id: Some(LogId { leader_id: LeaderId { term: 0, node_id: 0 }, index: 0 }), membership: Membership { configs: [{1}], nodes: {1: Node { id: 1, addr_raft: "192.168.1.12:8100", addr_api: "192.168.1.12:8200" }} } }, last_snapshot_id: Some("019970fb-9b48-7952-ac10-bc05fcb07e74") }
rauthy  | 2025-09-22T10:47:00.156104Z  INFO openraft::storage::helper: load membership from log: [12..242)
rauthy  | 2025-09-22T10:47:00.156576Z  INFO openraft::storage::helper: load key log ids from (None,T4-N3-241]
rauthy  | 2025-09-22T10:47:00.156681Z ERROR hiqlite_wal::reader: Error reading logs: Generic("`id_from` is below threshold - id_from: 0 / self.id_from: 12")
rauthy  | 2025-09-22T10:47:00.156721Z  INFO openraft::core::tick: Timer shutdown signal sent: Ok(())
rauthy  |
rauthy  |
rauthy  | thread 'main' panicked at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/hiqlite-0.10.1/src/store/mod.rs:92:6:
rauthy  | Raft create failed: StorageError(Defensive { source: DefensiveError { subject: LogIndex(0), violation: LogIndexNotFound { want: 0, got: None }, backtrace: None } })
rauthy  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
rauthy  | 2025-09-22T10:47:00.157246Z  INFO openraft::core::tick: TickLoop received cancel signal, quit

Metadata

Labels

bug (Something isn't working)
