Commit 6075072
Fix auth transition on edge-cases (#321)
# Summary
**Add Not-Ready Handling for Ongoing Auth Transitions**:
This patch refines our readiness logic to correctly reflect the state of
authentication transitions. Previously, we treated
LastGoalVersionAchieved == GoalVersion as a signal that the cluster was
"Running", but this assumption breaks down when auth transitions are
still in progress.
This happened because we returned "ready" during a wait step
(WaitAuthCanUpdate), and [we generally return ready for all wait
steps](https://github.com/mongodb/mongodb-kubernetes/blob/f0050b8942545701e8cb9e42d54d14f0cb58ee6a/mongodb-community-operator/cmd/readiness/main.go#L139),
regardless of whether the auth transition has completed. Example status:
```
{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}
```
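To make the status above concrete, here is a minimal Go sketch that parses such a step entry and flags an auth wait step that has not completed. The `StepStatus` type, the `parseStep` and `authWaitPending` helpers, and the rule "the step name contains `Auth`" are assumptions for illustration; they are not taken from the operator or the agent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// StepStatus mirrors the JSON keys of the example status.
// The Go type itself is hypothetical.
type StepStatus struct {
	Step          string     `json:"step"`
	StepDoc       string     `json:"stepDoc"`
	IsWaitStep    bool       `json:"isWaitStep"`
	Started       *time.Time `json:"started"`
	Attempts      int        `json:"attempts"`
	LatestAttempt *time.Time `json:"latestAttempt"`
	Completed     *time.Time `json:"completed"` // null while the step is still running
	Result        string     `json:"result"`
}

// parseStep decodes one step entry from its JSON form.
func parseStep(raw string) (StepStatus, error) {
	var s StepStatus
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// authWaitPending reports whether the step is a wait step related to auth
// that has not completed yet.
// Assumption: auth steps can be recognised by "Auth" in the step name.
func authWaitPending(s StepStatus) bool {
	return s.IsWaitStep && s.Completed == nil && strings.Contains(s.Step, "Auth")
}

func main() {
	raw := `{
	  "step": "WaitAuthUpdate",
	  "stepDoc": "Wait to update Auth",
	  "isWaitStep": true,
	  "started": "2025-08-07T14:59:40.213178437Z",
	  "attempts": 512,
	  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
	  "completed": null,
	  "result": "wait"
	}`

	step, err := parseStep(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("auth wait pending:", authWaitPending(step))
}
```

A readiness check that only looks at `isWaitStep` would treat this entry as ready; the extra conditions are what separate an ordinary wait from an unfinished auth transition.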
**Why implemented in the operator and not in the readinessProbe**:
I fixed this in the operator rather than in the readinessProbe:
* if the readinessProbe blocks, new nodes do not come up
* we want new nodes to come up
* but we also want to block new configurations from being applied, which the
automation_status check in the
operator does
**The core idea:**
* Configuration applied ≠ transition fully complete.
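The core idea can be sketched as an operator-side check that requires both conditions: the goal version has been reached and no auth step is still pending. This is a minimal sketch, not the code from this patch. `ProcessStatus`, its `Plan` field, the `authSteps` set, and the process name `config-0` are hypothetical; only `LastGoalVersionAchieved`, `GoalVersion`, and the two step names appear in the description above.

```go
package main

import "fmt"

// ProcessStatus is a hypothetical, trimmed-down view of one process's
// automation status.
type ProcessStatus struct {
	Name                    string
	LastGoalVersionAchieved int
	// Plan lists the step names the agent is still working through
	// (assumed shape).
	Plan []string
}

// authSteps is an assumed set of step names that signal an ongoing
// auth transition.
var authSteps = map[string]bool{
	"WaitAuthUpdate":    true,
	"WaitAuthCanUpdate": true,
}

// pendingAuthStep returns the first auth-related step in the plan, if any.
func pendingAuthStep(p ProcessStatus) (string, bool) {
	for _, step := range p.Plan {
		if authSteps[step] {
			return step, true
		}
	}
	return "", false
}

// notReadyReasons returns one reason per process that either has not
// reached the goal version or is still in an auth transition.
// An empty result means the cluster may be reported as "Running".
func notReadyReasons(goalVersion int, processes []ProcessStatus) []string {
	var reasons []string
	for _, p := range processes {
		if p.LastGoalVersionAchieved != goalVersion {
			reasons = append(reasons, fmt.Sprintf("%s: at version %d, goal is %d",
				p.Name, p.LastGoalVersionAchieved, goalVersion))
			continue
		}
		if step, ok := pendingAuthStep(p); ok {
			reasons = append(reasons, fmt.Sprintf("%s: auth transition in progress (%s)",
				p.Name, step))
		}
	}
	return reasons
}

func main() {
	processes := []ProcessStatus{
		{Name: "node-0", LastGoalVersionAchieved: 5},
		{Name: "config-0", LastGoalVersionAchieved: 5, Plan: []string{"WaitAuthUpdate"}},
	}
	for _, reason := range notReadyReasons(5, processes) {
		fmt.Println("not ready:", reason)
	}
}
```

Both processes have reached the goal version here, yet the cluster is still reported as not ready because one of them has an auth step pending.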
**What happened in our tests**:
* we update auth via the CR: x509 -> scram
* `node-0` completed its auth transition (now uses scram instead of
x509)
* `Config server` hasn't finished its auth transition yet
* We hit a race condition where clusters were marked as "Running" too
early, and so the rolling restart of `node-0` continued
* `node-0` restarted with the old X509 config (see the comment from
the agent code below)
* The X509 process couldn't access the SCRAM automation user
* This leads to the error: "process...doesn't have the automation user"
- in mms-automation there is also a comment indicating that
they handle the edge case where an auth transition was not
successful: they start the process with the old auth to "finish" it. But
this is exactly what causes our race condition
```
// If a process went down unexpectedly in the middle of an auth transition,
// we want to restart it with the old auth args.
// Otherwise, it could be upgraded to the new auth state too soon,
// and not be able to communicate with other shard members.
```
tl;dr: first `node-0` moved to the new auth while `config` had not yet;
`node-0` restarted, and during the restart `config` transitioned to the new
auth while `node-0` was again running the old auth
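The sequence can be replayed with a toy model. This is a sketch under simplifying assumptions, not operator or agent code: the `restart` and `run` helpers are hypothetical, and the only behaviour modelled is the one from the quoted agent comment (a process restarted mid-transition comes back with the old auth).

```go
package main

import "fmt"

// restart models the agent behaviour from the quoted comment: while the
// cluster-wide auth transition is still ongoing, a restarted process comes
// back with the old auth; otherwise it keeps its current auth.
func restart(currentAuth, oldAuth string, transitionOngoing bool) string {
	if transitionOngoing {
		return oldAuth
	}
	return currentAuth
}

// run replays the x509 -> scram transition. When gated is true, the rolling
// restart of node-0 waits until the config server has finished its own
// transition; when false, it proceeds as soon as node-0 looks done.
func run(gated bool) (nodeAuth, configAuth string) {
	nodeAuth, configAuth = "x509", "x509"

	nodeAuth = "scram" // node-0 finishes its transition first

	if gated {
		configAuth = "scram"                       // wait for the config server
		nodeAuth = restart(nodeAuth, "x509", false) // transition is over
	} else {
		nodeAuth = restart(nodeAuth, "x509", true) // restarted mid-transition
		configAuth = "scram"                       // config server finishes meanwhile
	}
	return nodeAuth, configAuth
}

func main() {
	n, c := run(false)
	fmt.Printf("ungated: node-0=%s config=%s\n", n, c) // ungated: node-0=x509 config=scram
	n, c = run(true)
	fmt.Printf("gated:   node-0=%s config=%s\n", n, c) // gated:   node-0=scram config=scram
}
```

In the ungated run the two components end up on different auth mechanisms, which is the state in which the X509 process cannot reach the SCRAM automation user.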
## Proof of Work
- auth change tests are passing multiple times in a row:
[Link](http://spruce.mongodb.com/version/6894b98218a2e90007437e99/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC)
- the most flaky auth tests +
[Link2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_static_mdb_kind_ubi_cloudqa_e2e_sharded_cluster_x509_to_scram_transition_patch_b29fb4ace63eec7102f8f034fd6c553b5d75c1a1_6894c0785c119f0007a58f3c_25_08_07_15_04_26/logs?execution=0)
- from the patch
## Checklist
- [ ] Have you linked a jira ticket and/or is the ticket in the title?
- [x] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you added changelog file?
- use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details

1 parent 6f85608 commit 6075072
File tree
3 files changed: +177 -4 lines changed
- changelog
- controllers/om

Lines changed in the first file: 7 additions & 0 deletions