Setting the cluster up for the first time is more involved as there are certain
resources where Terraform is unable to handle explicit dependencies. This means
that we have to set up the GKE cluster before we set up any of the Kubernetes
resources, as otherwise the Terraform Kubernetes provider will error out.

## Upgrading/Resetting GitHub ARC

Updating and resetting the GitHub Actions Runner Controller (ARC) within the
cluster involve largely the same process, but some special considerations need
to be made around how ARC interacts with Kubernetes. The process involves
uninstalling the runner scale set charts, deleting the namespaces to ensure
everything is properly cleaned up, optionally bumping the version number if
this is a version upgrade, and then reinstalling the charts to get the cluster
back to accepting production jobs.

It is important not to just blindly delete controller pods or namespaces, as
this (at least empirically) can corrupt the state and custom resources that
ARC manages, which then requires a costly full uninstallation and
reinstallation of at least a runner scale set.

When upgrading/resetting the cluster, jobs will not be lost; they instead remain
queued on the GitHub side. Running build jobs will complete after the helm charts
are uninstalled unless they are forcibly killed. Note that best practice dictates
that the helm charts should just be uninstalled, rather than also setting
`maxRunners` to zero beforehand, as the latter can cause ARC to accept some jobs
but not actually execute them, which could prevent failover in HA cluster
configurations.
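
As an illustration, queued jobs can be observed from the GitHub side with the
`gh` CLI (the repository name here is an assumption; any repository with
Actions enabled works the same way):

```bash
# List workflow runs currently waiting for a runner (hypothetical repository).
gh run list --repo llvm/llvm-project --status queued --limit 10
```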

### Uninstalling the Helm Charts

To begin, uninstall the runner scale set helm charts by using resource
targeting on a terraform destroy command:

```bash
terraform destroy -target helm_release.github_actions_runner_set_linux
terraform destroy -target helm_release.github_actions_runner_set_windows
```
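
Before destroying anything, it can help to confirm the exact resource addresses
present in the state (the names in this document are the expected ones, but the
state is authoritative):

```bash
# Show the helm releases and namespaces terraform currently manages.
terraform state list | grep -E 'helm_release|kubernetes_namespace'
```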

These commands should complete, but even if they do not, things can still be
cleaned up. If everything went smoothly, the only runner pods left will be those
still in the process of executing jobs. You will need to wait for them to
complete before moving on. If they are stuck, you will need to manually delete
them with `kubectl delete`. Follow up the previous terraform commands by
deleting the Kubernetes namespaces that all the resources live in:

```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
```
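
For the manual cleanup case, a sketch of removing a stuck runner pod might look
like the following (the namespace and pod names are assumptions; check the
actual names with `kubectl get` first):

```bash
# List remaining runner pods in the (assumed) Linux runner namespace.
kubectl get pods -n llvm-premerge-linux-runners

# Forcibly remove a pod that refuses to terminate (hypothetical pod name).
kubectl delete pod example-runner-abc12 \
  -n llvm-premerge-linux-runners --force --grace-period=0
```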

If things go smoothly, these should complete quickly. If they do not complete,
there are most likely dangling resources in the namespaces that need to have
their finalizers removed before they can be deleted. You can confirm this by
running `kubectl get namespaces`. If the namespace is listed as `Terminating`,
you most likely need to manually intervene. To find a list of dangling
resources that did not get cleaned up properly, you can run the following
command, making sure to fill in `<namespace>` with the actual namespace of
interest:

```bash
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
```

This will list the stuck resources. For each of them, edit the YAML
configuration of the Kubernetes object to remove the finalizers:

```bash
kubectl edit <resource-name> -n <namespace>
```

Deleting the `finalizers` key along with any entries under it should be
sufficient. After rerunning the command to find dangling resources, you should
see the resource disappear from the list. After doing this for all dangling
resources, the namespace should then delete automatically. This can be
confirmed by running `kubectl get namespaces`.
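
Conceptually, the edit just removes the `metadata.finalizers` list from the
object. The effect can be illustrated offline with `jq` (the object contents
and finalizer name below are invented for illustration):

```bash
# A sample of what a stuck object's metadata might look like.
cat > /tmp/stuck-resource.json <<'EOF'
{
  "metadata": {
    "name": "example-runner",
    "finalizers": ["actions.github.com/cleanup-protection"]
  }
}
EOF

# Dropping the finalizers list is the entire edit.
jq 'del(.metadata.finalizers)' /tmp/stuck-resource.json
```

The same effect can be achieved non-interactively with
`kubectl patch <resource-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'`.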

If you are performing these steps as part of an incident response, you can skip
to the section [Bringing the Cluster Back Up](#bringing-the-cluster-back-up).
If you are bumping the version, you still need to uninstall the controller and
bump the version number beforehand.

### Uninstalling the Controller Helm Chart

Next, the controller helm chart needs to be uninstalled. If you are performing
these steps as part of dealing with an incident, you most likely do not need to
perform this step; it is usually sufficient to destroy and recreate the runner
scale sets to resolve incidents. Uninstalling the controller is necessary for
version upgrades, however.

Start by destroying the helm chart:

```bash
terraform destroy -target helm_release.github_actions_runner_controller
```

Then delete the namespace to ensure there are no dangling resources:

```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_controller
```

### Bumping the Version Number

This step is only necessary when bumping the version of ARC. It simply involves
updating the `version` field of the `helm_release` objects in `main.tf`. Make
sure to commit the changes and push them to `llvm-zorg` so that others working
on the terraform configuration have an up-to-date state when they pull the
repository.
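
As a hypothetical sketch (the actual resource layout in `main.tf` may differ,
and the chart location and version shown here are assumptions), the field to
update looks like this:

```hcl
resource "helm_release" "github_actions_runner_controller" {
  name       = "arc"
  namespace  = kubernetes_namespace.llvm_premerge_controller.metadata[0].name
  repository = "oci://ghcr.io/actions/actions-runner-controller-charts"
  chart      = "gha-runner-scale-set-controller"

  # Bump this value for a version upgrade.
  version = "0.9.3"
}
```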

### Bringing the Cluster Back Up

To get the cluster back up and accepting production jobs again, simply run
`terraform apply`. It will recreate all the resources previously destroyed and
ensure they are in a state consistent with the terraform IaC definitions.
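
It can be worth running a plan first to confirm that only the previously
destroyed resources will be recreated before applying:

```bash
# Preview the resources terraform will recreate, then apply.
terraform plan
terraform apply
```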

### External Resources

[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.