Description
(ECK version 2.15.0)
Background:
I was recently upgrading a rather large Elasticsearch cluster from 8.16.2 to 8.17.1 and ran into an issue when one of the dedicated master pods was recreated partway through the upgrade process.
Issue
The problem appears to be that when ECK receives an Elasticsearch version upgrade, it updates the version on all StatefulSets right away and only then performs the rolling restart. If a pod is killed or recreated partway through the process, the "order of operations" no longer applies to that pod, and nodes can be upgraded in the wrong order.
Reproduction:
1. Create an Elasticsearch cluster with dedicated masters
2. Upgrade the Elasticsearch cluster
3. Recreate one of the master pods while the upgrade is still working on the non-master nodes
4. Once the new master node has been created, create an index
5. The index is assigned the new Elasticsearch index version and cannot be allocated on the lower-version non-master nodes
6. Observe that the upgrade managed via ECK deadlocks in a yellow state because of the allocation issue from step 5.
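The sequence above can be illustrated with a small toy model. This is only a sketch of the behavior as described in this issue, not ECK's actual code: the `StatefulSet` class, the pod names, and the rule that a new index takes the version of the recreated master are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field

OLD, NEW = "8.16.2", "8.17.1"


def as_tuple(version):
    """Turn '8.17.1' into (8, 17, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


@dataclass
class StatefulSet:
    """Hypothetical stand-in for a StatefulSet: a pod template plus running pods."""
    template_version: str
    pods: dict = field(default_factory=dict)  # pod name -> running version

    def recreate(self, pod):
        # A recreated pod always starts from the current pod template.
        self.pods[pod] = self.template_version


cluster = {
    "data": StatefulSet(OLD, {"data-0": OLD, "data-1": OLD}),
    "master": StatefulSet(OLD, {"master-0": OLD}),
}

# Behavior described in this issue: every template is bumped up front.
for sset in cluster.values():
    sset.template_version = NEW

cluster["data"].recreate("data-0")      # rolling restart reaches the first data pod
cluster["master"].recreate("master-0")  # unplanned recreation, out of order

# Assumption: an index created now gets the (new) version of the master.
index_version = cluster["master"].pods["master-0"]

# Data pods still on a lower version cannot hold shards of that index.
stuck = [
    pod
    for pod, version in cluster["data"].pods.items()
    if as_tuple(version) < as_tuple(index_version)
]
print(stuck)  # ['data-1']
```

In this model the shards of the new index can only be placed on `data-0`, so any copy that would need `data-1` stays unassigned, which corresponds to the yellow state in the last step.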
Expectation:
ECK should update the version on a StatefulSet only when it is ready to perform the rolling restart of that StatefulSet, not well in advance at the start of the upgrade process.
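A rough sketch of that expected behavior, using a toy model rather than anything from ECK itself: the `templates` and `pods` dictionaries, the pod names, and the `recreate` helper are hypothetical. The point is only that if each template is bumped when its turn comes, a master pod recreated early comes back on the old version.

```python
OLD, NEW = "8.16.2", "8.17.1"

# Hypothetical model: one pod template version per StatefulSet.
templates = {"data": OLD, "master": OLD}
pods = {
    "data": {"data-0": OLD, "data-1": OLD},
    "master": {"master-0": OLD},
}


def recreate(sset, pod):
    """A recreated pod starts from its StatefulSet's current template."""
    pods[sset][pod] = templates[sset]


# Turn 1: non-master nodes. Only their template is bumped.
templates["data"] = NEW
recreate("data", "data-0")

# Unplanned recreation of a master while the data nodes are still rolling.
# The master template has not been touched yet, so the pod stays on OLD.
recreate("master", "master-0")
master_version_mid_upgrade = pods["master"]["master-0"]

recreate("data", "data-1")

# Turn 2: masters. Their template is bumped only now.
templates["master"] = NEW
recreate("master", "master-0")

print(master_version_mid_upgrade)  # 8.16.2
print(pods)
```

With this ordering, an index created right after the unplanned recreation would still be on the old version in this model, so the remaining old data pods could hold its shards.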
Workaround:
To work around the deadlock, I had to manually (and carefully) delete each of the remaining non-master pods so that they were recreated with the new version.