116 changes: 116 additions & 0 deletions premerge/routine-maintenance.md
@@ -0,0 +1,116 @@
# Routine Maintenance

The cluster requires routine maintenance to keep it functioning.
Ideally, this maintenance is proactive and performed before neglect causes
any issues. This document describes the routine maintenance needed and how
to perform it.

## Version Updates

The only routine maintenance that we currently do on the premerge
infrastructure is version updates. The infrastructure relies on many
different pieces of software, and all of it needs to be kept reasonably up to
date to ensure things keep working smoothly and that we are not vulnerable to
security issues.

### Getting Notified of Version Updates

There are several pieces of software that we want to upgrade relatively
quickly (like the GitHub Actions Runner binary). Because of that, knowing
when a new version is released is important. The easiest way to do this is
to subscribe to release notifications on GitHub. On a repository page,
click the Watch button, select Custom, and then select Releases. Any new
releases for that repository will then show up in your GitHub notifications.

Releases from the following repositories generally require an update on the
premerge infrastructure side:

1. https://github.com/actions/actions-runner-controller
2. https://github.com/actions/runner
3. https://github.com/llvm/llvm-project
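
Notifications can be complemented by a small script that compares a pinned
version against the newest release tag. The following is only a minimal
sketch: it assumes the GitHub REST API `releases/latest` endpoint and release
tags that contain a dotted version number. The helper names
(`latest_release_url`, `parse_version`, `needs_update`, `fetch_latest_tag`)
are hypothetical and are not part of the premerge tooling.

```python
import json
import re
import urllib.request

# Repositories from the list above whose releases usually require an update.
WATCHED_REPOS = [
    "actions/actions-runner-controller",
    "actions/runner",
    "llvm/llvm-project",
]


def latest_release_url(repo: str) -> str:
    """Build the GitHub REST API URL for the latest release of owner/name."""
    return f"https://api.github.com/repos/{repo}/releases/latest"


def parse_version(tag: str) -> tuple[int, ...]:
    """Extract the first dotted numeric run from a tag as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", tag)
    if match is None:
        raise ValueError(f"no version number found in tag {tag!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_update(current: str, latest_tag: str) -> bool:
    """Return True when the latest release tag is newer than the pinned version."""
    return parse_version(latest_tag) > parse_version(current)


def fetch_latest_tag(repo: str) -> str:
    """Query GitHub for the latest release tag (performs a network request)."""
    with urllib.request.urlopen(latest_release_url(repo)) as response:
        return json.load(response)["tag_name"]
```

Comparing versions as integer tuples rather than strings avoids ordering
mistakes such as treating `2.9` as newer than `2.10`.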

### GitHub Actions Runner Binary

The runner binary is what runs inside the containers on the cluster to
execute jobs and report status results back to GitHub. GitHub supports each
runner binary release for a relatively short time (about six months), after
which that release stops working.

When a new runner binary is released, there are three places that need to
be updated in a PR against the LLVM monorepo:

1. The Linux CI container - The `Dockerfile` at
`.github/workflows/containers/github-action-ci/Dockerfile` has an environment
variable towards the bottom of the file called `GITHUB_RUNNER_VERSION` that
needs to be updated to the new version.
2. The Windows CI container - The `Dockerfile` at
`.github/workflows/containers/github-action-ci-windows/Dockerfile` has an
argument called `RUNNER_VERSION` near the bottom of the file that needs to
be updated to the new version.
3. The libc++ CI container - The `docker-compose` manifest at
Contributor

It's not clear what you mean by "The docker-compose manifest". Are you talking about the GITHUB_RUNNER_VERSION (in the docker-compose.yml file)? Or something else?

`libcxx/utils/ci/docker-compose.yml` needs to be updated to pull in the latest
Contributor

Actually the entire #3 set of instructions is a bit confusing. The libc++ instructions you reference mention how to grab a new container image that already contains a particular runner version, but they don't explicitly state how to create a container with the correct runner version, which is what #1 and #2 seem to be about?

runner images using the [libc++ instructions](https://libcxx.llvm.org/Contributing.html#updating-the-ci-testing-container-images)
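
As a rough illustration of the first two updates, the sketch below rewrites a
single version definition in `Dockerfile` text. It assumes the variable is
declared on one line as `ENV NAME=value`, `ENV NAME value`, or
`ARG NAME=value`, which has not been checked against the actual files. The
function name `bump_dockerfile_variable` is hypothetical.

```python
import re


def bump_dockerfile_variable(text: str, name: str, new_version: str) -> str:
    """Replace the value of one ENV/ARG definition of `name` with `new_version`."""
    # Assumed layout: `ENV NAME=old`, `ENV NAME old`, or `ARG NAME=old` on one line.
    pattern = re.compile(
        rf"^([ \t]*(?:ENV|ARG)[ \t]+{re.escape(name)}(?:=|[ \t]+))(\S+)[ \t]*$",
        flags=re.MULTILINE,
    )
    updated, count = pattern.subn(lambda m: m.group(1) + new_version, text)
    if count != 1:
        # Refuse to guess when the definition is missing or duplicated.
        raise ValueError(f"expected exactly one definition of {name}, found {count}")
    return updated
```

Calling it with `GITHUB_RUNNER_VERSION` on the Linux `Dockerfile` contents and
with `RUNNER_VERSION` on the Windows one would cover the first two items. It
raises an error instead of silently leaving a file unchanged.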

### Other Container Image Software
Contributor

Which repo alert would trigger this set of instructions? (It's not clear that any of the repos mentioned above correspond to this).


The container images also contain many other pieces of software critical
for building LLVM, like CMake, ninja, and the toolchain itself. Keeping
most of these up to date is ideal.

A large amount of the software comes from the distribution and thus does not
Contributor

@cmtice Aug 27, 2025

"the distribution" -- What distribution? And how do you get notified that it's time to update it?

need to be explicitly updated. We prefer to install software from the
distribution when possible. However, this does mean that distribution
updates are quite important. To update the distribution for the Linux container,
perform the following steps:

1. Modify `.github/workflows/containers/github-action-ci/Dockerfile` locally
to use the latest `ubuntu:xx.04` image.
2. Ensure that `monolithic-linux.sh` with all options enabled runs successfully.
Contributor

Are there instructions for how to test monolithic-linux.sh with all options enabled?

Push any needed changes.
3. Push the updated `Dockerfile` and update the workflow in
`.github/workflows/build-ci-container.yml` to use the correct image name.
Contributor

For #3: Can you be more explicit about what to update in build-ci-container.yml? Do you mean the 'container_name' in the build-ci-container job? Or something else? What's the right value to update it to?

4. Update the runner configuration in zorg (`premerge/linux_runners_values.yaml`)
to point to the new image.
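
For the first step above, a short script can show which base images a
`Dockerfile` uses and retarget the Ubuntu one. This is a sketch under
assumptions: it presumes the base image is referenced as `ubuntu:<tag>`,
optionally with a registry prefix, and the function names `base_images` and
`set_ubuntu_release` are hypothetical.

```python
import re


def base_images(dockerfile_text: str) -> list[str]:
    """Return the image reference of every FROM line, in file order."""
    pattern = re.compile(
        r"^[ \t]*FROM[ \t]+(?:--platform=\S+[ \t]+)?(\S+)",
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return pattern.findall(dockerfile_text)


def set_ubuntu_release(dockerfile_text: str, release: str) -> str:
    """Point every `FROM [registry/]ubuntu:<tag>` line at the given release tag."""
    pattern = re.compile(
        r"^([ \t]*FROM[ \t]+(?:\S+/)?ubuntu:)\S+",
        flags=re.MULTILINE,
    )
    # Only the tag after `ubuntu:` is replaced; stage aliases are left untouched.
    return pattern.sub(lambda m: m.group(1) + release, dockerfile_text)
```

Later build stages that start from an earlier stage name rather than an
Ubuntu image are left alone, so only the distribution tag changes.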

Updating explicitly versioned software (just LLVM in the Linux container, but
most software in the Windows container) just requires bumping the version number
and pushing the new image to the monorepo. In the Linux container, the LLVM
version can be bumped by changing the `LLVM_VERSION` environment variable. In
Contributor

The "containers" you're talking about here refer to the Dockerfiles mentioned above? Under .github/workflows/containers/github-actions-ci{-windows}?

the Windows container, versions are controlled by the `--version` flag passed
to `choco install` commands.
Contributor

The Windows Dockerfile doesn't contain any 'choco install' commands for installing LLVM that I can see?


### Actions Runner Controller

The actions runner controller orchestrates all of the jobs on the cluster.
Given its key role, upgrading it needs to be done carefully using the
[described steps](cluster-management.md#upgradingresetting-github-arc) to
avoid any downtime.

We advise doing this during a low-traffic part of the day, as any upgrade
takes an entire cluster down at times, which cuts capacity in half.

### Windows Edition

Whenever a new Windows Server datacenter edition (e.g., 2025) is supported by GKE,
Contributor

How do you get notifications about this?

all of the Windows infrastructure needs to be updated. This involves several
steps that will take significant time:

1. Modify `.github/workflows/containers/github-action-ci-windows/Dockerfile`
locally to get it building on the new Windows Server version. This requires
having a host at that version as Windows Server containers can only run on a
Contributor

"having a host at that version"? A host windows machine? Where/how do you get a local windows machine for this?

host with the same edition. Changing the `FROM` line at the top of the
`Dockerfile` has previously been enough.
2. When that is ready, push your changes in a PR along with changes to
`.github/workflows/build-ci-container-windows.yml` so that it uses the new
Windows Server edition.
Contributor

So, this involved changing all the "windows-20xx" in that file (both the runs-on values and the container_name) to use the new "windows-20yy" version?

3. Test locally that the pushed container image can run the
`monolithic-windows.sh` script with all projects enabled successfully. Make
and submit any changes that are needed.
Contributor

Roughly same question as above: How do you test monolithic-windows.sh with all projects enabled locally?

4. Duplicate all of the versioned Windows resources in the terraform
Contributor

So, for example, if we're upgrading from windows-2022 to windows-2025, I would KEEP all the windows-2022 resources defined in all the *.tf files (under premerge), but would now also have windows-2025 versions of them?

Also I assume I would need to copy windows_2022_runner_values.yaml to windows_2025_runner_values.yaml, and update the values inside to use 2025 instead of 2022?

Also, both architecture.md and cluster-management.md explicitly mention 'windows_2022' or 'windows-2022', so if the version gets updated then the docs need to be updated too.

Contributor

Also, should probably mention that this is in the llvm-zorg repo, since most of the other steps are in the llvm-project repo.

configuration and apply it.
5. Switch the workflow in `.github/workflows/premerge.yaml` to the new
runner set.
6. Sunset the existing runner set in 1-2 weeks to give time for people utilizing
Contributor

"Sunset the existing sunset" seems like a typo. Also, you should be more explicit about what 'sunsetting' means here. I assume you mean: Delete all the old resources that you KEPT in step 4 when you duplicated everything. Are there other steps beyond that?

stacked PRs to rebase.
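
Before duplicating resources or sunsetting the old ones, it can help to list
every place that names the current edition. The sketch below walks a checkout
and reports lines containing `windows-<year>` or `windows_<year>`. That naming
pattern is taken from the review comments rather than verified against the
repositories, and the function name `find_edition_mentions` is hypothetical.

```python
import re
from pathlib import Path


def find_edition_mentions(root: Path, year: str) -> dict[str, list[int]]:
    """Map each file under `root` to the 1-based line numbers naming the edition."""
    # Assumed naming: `windows-2022` or `windows_2022` style identifiers.
    pattern = re.compile(rf"windows[-_]{re.escape(year)}", flags=re.IGNORECASE)
    hits: dict[str, list[int]] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # Skip binary or unreadable files.
        matched = [
            number
            for number, line in enumerate(lines, start=1)
            if pattern.search(line)
        ]
        if matched:
            hits[path.relative_to(root).as_posix()] = matched
    return hits
```

Running it over both the llvm-project and llvm-zorg checkouts would give a
checklist of terraform files, runner values files, workflows, and docs that
still refer to the old edition.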