[CI] Write Routine Maintenance Document #578
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
# Routine Maintenance | ||
|
||
The cluster requires routine maintenance to ensure it stays functioning. | ||
Ideally, this maintenance is proactive, being performed before any issues arise | ||
from neglect. This document aims to describe the routine maintenance needed | ||
and how to perform it. | ||
|
||
## Version Updates | ||
|
||
The only routine maintenance that we currently do on the premerge | ||
infrastructure is version updates. The infrastructure utilizes a lot of | ||
different software, and all of it needs to be kept reasonably up to date | ||
to ensure things keep working smoothly and that we are not vulnerable to | ||
security issues. | ||
|
||
### Getting Notified of Version Updates | ||
|
||
There are several pieces of software that we want to upgrade relatively | ||
quickly (like the GitHub Actions Runner binary). Because of that, knowing | ||
when a new version is released is important. The easiest way to do this is | ||
to subscribe to new release notifications on GitHub. If you go to a | ||
repository, you can click on the watch button, select custom, and then | ||
select releases. Any new releases for that repository will show up in | ||
your GitHub notifications. | ||
|
||
Releases from the following repositories generally require an update on the | ||
premerge infrastructure side: | ||
|
||
1. https://github.com/actions/actions-runner-controller | ||
2. https://github.com/actions/runner | ||
3. https://github.com/llvm/llvm-project | ||
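As a complement to watching releases by hand, the latest release tag of each repository above can also be polled from GitHub's public REST API. The following is a minimal sketch, not part of the existing tooling: the helper names (`latest_release_url`, `parse_release_tag`, `fetch_latest_tag`) are hypothetical, and unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.request

# Repositories listed above; releases from these usually require an update.
WATCHED_REPOS = [
    "actions/actions-runner-controller",
    "actions/runner",
    "llvm/llvm-project",
]


def latest_release_url(repo: str) -> str:
    """Build the GitHub REST API URL for a repository's latest release."""
    return f"https://api.github.com/repos/{repo}/releases/latest"


def parse_release_tag(payload: str) -> str:
    """Extract the tag name from a 'latest release' JSON response body."""
    return json.loads(payload)["tag_name"]


def fetch_latest_tag(repo: str) -> str:
    """Fetch the latest release tag for `repo` (requires network access)."""
    request = urllib.request.Request(
        latest_release_url(repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_release_tag(response.read().decode("utf-8"))
```

Calling `fetch_latest_tag(repo)` for each entry in `WATCHED_REPOS` and comparing the result with the version currently deployed would show which components are behind.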
|
||
### GitHub Actions Runner Binary | ||
|
||
The runner binary is what runs inside the containers on the cluster to | ||
execute jobs and report status results back to GitHub. The runner binary | ||
has a relatively short time horizon (about six months) before GitHub stops | ||
supporting it and it no longer works. | ||
|
||
When a new runner binary is released, there are three places that need to | ||
be updated in a PR against the LLVM monorepo: | ||
|
||
1. The Linux CI container - The `Dockerfile` at | ||
`.github/workflows/containers/github-action-ci/Dockerfile` has an environment | ||
variable towards the bottom of the file called `GITHUB_RUNNER_VERSION` that | ||
needs to be updated to the new version. | ||
2. The Windows CI container - The `Dockerfile` at | ||
`.github/workflows/containers/github-action-ci-windows/Dockerfile` has an | ||
argument called `RUNNER_VERSION` near the bottom of the file that needs to | ||
be updated to the new version. | ||
3. The libc++ CI container - The `docker-compose` manifest at | ||
`libcxx/utils/ci/docker-compose.yml` needs to be updated to pull in the latest | ||
Review comment: Actually the entire #3 set of instructions is a bit confusing. The libc++ instructions you reference mention how to grab new container image that already contains a particular runner version, but they don't explicitly state how to create a container with the correct runner version? Which is what #1 and #2 seem to be about? |
||
runner images using the [libc++ instructions](https://libcxx.llvm.org/Contributing.html#updating-the-ci-testing-container-images). | ||
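The version bumps in items 1 and 2 amount to rewriting one variable declaration in each `Dockerfile`. The sketch below illustrates that edit on Dockerfile text. It assumes the variable is declared on its own line as `ENV NAME=value`, `ENV NAME value`, or `ARG NAME=value`; the exact form used in the real files is not confirmed here, and `bump_version` is a hypothetical helper, not an existing script.

```python
import re


def bump_version(dockerfile_text: str, variable: str, new_version: str) -> str:
    """Rewrite the value of an ENV/ARG variable in Dockerfile text.

    Assumption: the variable is declared as `ENV NAME=value`,
    `ENV NAME value`, or `ARG NAME=value` on a single line.
    Raises ValueError if no such declaration is found.
    """
    pattern = re.compile(
        rf"^([ \t]*(?:ENV|ARG)[ \t]+{re.escape(variable)}(?:=|[ \t]+))\S+",
        flags=re.MULTILINE,
    )
    # A function replacement avoids backreference clashes with digits
    # at the start of the new version string.
    updated, count = pattern.subn(
        lambda match: match.group(1) + new_version, dockerfile_text
    )
    if count == 0:
        raise ValueError(f"{variable} is not declared in the given text")
    return updated
```

For example, `bump_version(text, "GITHUB_RUNNER_VERSION", new_version)` would cover the Linux container and `bump_version(text, "RUNNER_VERSION", new_version)` the Windows one. Because the match is anchored on the `ENV`/`ARG` keyword, `RUNNER_VERSION` does not accidentally match inside `GITHUB_RUNNER_VERSION`.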
|
||
### Other Container Image Software | ||
Review comment: Which repo alert would trigger this set of instructions? (It's not clear that any of the repos mentioned above correspond to this). |
||
|
||
The container images also contain many other pieces of software critical | ||
for building LLVM, like CMake, ninja, and the toolchain itself. Keeping | ||
most of these up to date is ideal. | ||
|
||
A large amount of the software comes from the distribution and thus does not | ||
Review comment: "the distribution" -- What distribution? And how do you get notified that it's time to update it? |
||
need to be explicitly updated. We prefer to install software from the | ||
distribution when possible. However, this does mean that distribution | ||
updates are quite important. To update the distribution for the Linux container, | ||
perform the following steps: | ||
|
||
1. Modify `.github/workflows/containers/github-action-ci/Dockerfile` locally | ||
to use the latest `ubuntu:xx.04` image. | ||
2. Ensure that `monolithic-linux.sh` with all options enabled runs successfully. | ||
Review comment: Are there instructions for how to test monolithic-linux.sh with all options enabled? |
||
Push any needed changes. | ||
3. Push the updated `Dockerfile` and update the workflow in | ||
`.github/workflows/build-ci-container.yml` to use the correct image name. | ||
Review comment: For #3: Can you be more explicit about what to update in build-ci-container.yml? Do you mean the "container_name" in the build-ci-container job? Or something else? What's the right value to update it to? |
||
4. Update the runner configuration in zorg (`premerge/linux_runners_values.yaml`) | ||
to point to the new image. | ||
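Step 1 above comes down to changing the tag on the base image line of the `Dockerfile`. A minimal sketch of that edit follows, assuming the base image is referenced as `FROM <image>:<tag>` (optionally with `--platform=...` and an `AS <stage>` suffix). `retag_base_image` is a hypothetical helper and the tags in the usage note are placeholders.

```python
import re


def retag_base_image(dockerfile_text: str, image: str, new_tag: str) -> str:
    """Point every `FROM <image>:<tag>` line at `new_tag`.

    Any `AS <stage>` suffix is left untouched, and FROM lines that
    reference other images or earlier build stages are not modified.
    Raises ValueError if no matching FROM line is found.
    """
    pattern = re.compile(
        rf"^([ \t]*FROM[ \t]+(?:--platform=\S+[ \t]+)?{re.escape(image)}):\S+",
        flags=re.MULTILINE | re.IGNORECASE,
    )
    updated, count = pattern.subn(
        lambda match: f"{match.group(1)}:{new_tag}", dockerfile_text
    )
    if count == 0:
        raise ValueError(f"no FROM line references {image}:<tag>")
    return updated
```

For instance, `retag_base_image(text, "ubuntu", "24.04")` would rewrite a `FROM ubuntu:22.04` line. The remaining steps (running `monolithic-linux.sh`, updating the workflow and the zorg runner configuration) still have to be done by hand.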
|
||
Updating explicitly versioned software (just LLVM in the Linux container, but | ||
most software in the Windows container) just requires bumping the version number | ||
and pushing the new image to the monorepo. In the Linux container, the LLVM | ||
version can be bumped by changing the `LLVM_VERSION` environment variable. In | ||
Review comment: The "containers" you're talking about here refer to the Dockerfiles mentioned above? Under .github/workflows/containers/github-action-ci{-windows}? |
||
the Windows container, versions are controlled by the `--version` flag passed | ||
to `choco install` commands. | ||
Review comment: The Windows Dockerfile doesn't contain any 'choco install' commands for installing LLVM that I can see? |
||
|
||
### Actions Runner Controller | ||
|
||
The actions runner controller orchestrates all of the jobs on the cluster. | ||
Given its key role, upgrading it needs to be done carefully using the | ||
[described steps](cluster-management.md#upgradingresetting-github-arc) to | ||
avoid any downtime. | ||
|
||
It is advised to do this during a portion of the day with light traffic as | ||
any upgrade will involve having an entire cluster down at times, which cuts | ||
capacity in half. | ||
|
||
### Windows Edition | ||
|
||
Whenever a new Windows Server datacenter edition (e.g., 2025) is supported by GKE, | ||
Review comment: How do you get notifications about this? |
||
all of the Windows infrastructure needs to be updated. This involves several | ||
steps that will take significant time: | ||
|
||
1. Modify `.github/workflows/containers/github-action-ci-windows/Dockerfile` | ||
locally to get it building on the new Windows Server version. This requires | ||
having a host at that version as Windows Server containers can only run on a | ||
Review comment: "having a host at that version"? A host Windows machine? Where/how do you get a local Windows machine for this? |
||
host with the same edition. Changing the `FROM` line at the top of the | ||
`Dockerfile` has previously been enough. | ||
2. When that is ready, push your changes in a PR along with changes to | ||
`.github/workflows/build-ci-container-windows.yml` so that it uses the new | ||
Windows Server edition. | ||
Review comment: So, this involved changing all the "windows-20xx" in that file (both the runs-on values and the container_name) to use the new "windows-20yy" version? |
||
3. Test locally that the pushed container image can run the | ||
`monolithic-windows.sh` script with all projects enabled successfully. Make | ||
and submit any changes that are needed. | ||
Review comment: Roughly same question as above: How do you test monolithic-windows.sh with all projects enabled locally? |
||
4. Duplicate all of the versioned Windows resources in the terraform | ||
Review comment: So, for example, if we're upgrading from windows-2022 to windows-2025, I would KEEP all the windows-2022 resources defined in all the *.tf files (under premerge), but would now also have windows-2025 versions of them? Also I assume I would need to copy windows_2022_runner_values.yaml to windows_2025_runner_values.yaml, and update the values inside to use 2025 instead of 2022? Also, both architecture.md and cluster-management.md explicitly mention 'windows_2022' or 'windows-2022', so if the version gets updated then the docs need to be updated too. |
Review comment: Also, should probably mention that this is in the llvm-zorg repo, since most of the other steps are in the llvm-project repo. |
||
configuration and apply it. | ||
5. Switch the workflow in `.github/workflows/premerge.yaml` to the new | ||
runner set. | ||
6. Sunset the existing runner set after 1-2 weeks to give time for people utilizing | ||
Review comment: "Sunset the existing sunset" seems like a typo. Also, you should be more explicit about what 'sunsetting' means here. I assume you mean: Delete all the old resources that you KEPT in step 4 when you duplicated everything. Are there other steps beyond that? |
||
stacked PRs to rebase. |
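Because the Windows edition is embedded in resource names, workflow values, and documentation, it can help to list every place that still mentions the old edition before and after the migration. The sketch below is one hedged way to do that for the text of a single file. `find_version_mentions` is a hypothetical helper, and the `windows-<year>` / `windows_<year>` naming pattern is an assumption drawn from the names quoted in the review comments.

```python
import re


def find_version_mentions(text: str, year: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs mentioning `windows-<year>` or `windows_<year>`.

    Line numbers are 1-based. Matching is case-insensitive.
    """
    pattern = re.compile(rf"windows[-_]{re.escape(year)}", flags=re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

Running this over the terraform files, runner values files, workflows, and docs with the old year gives a checklist of what to duplicate in step 4 and what to delete when the old runner set is retired.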
Review comment: It's not clear what you mean by "The `docker-compose` manifest". Are you talking about the GITHUB_RUNNER_VERSION (in the docker-compose.yml file)? Or something else?