Skip to content

Commit 3e365fe

Browse files
feedback
1 parent 41c3fe9 commit 3e365fe

File tree

1 file changed

+111
-56
lines changed

1 file changed

+111
-56
lines changed

premerge/post-submit-testing.md

Lines changed: 111 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -10,69 +10,77 @@ how we plan on implementing this to ensure we get fast feedback scalably.
1010

1111
## Background/Motivation
1212

13-
It is important that we test the premerge configuration postcommit as well.
14-
This enables easily checking the state of `main` at any given time and also
15-
easily pinpointing where exactly `main` has broken which enables figuring
16-
out which commit to revert/fix forward to get everything back to green.
17-
Having information on the state of `main` is also important for certain kinds
18-
of automation, like the planned premerge testing advisor that will let
19-
contributors know if tests failing in their PR are already failing at `main`
20-
and that it should be safe to merge despite the given failures.
21-
22-
Originally, we were looking at running postcommit testing through Github
23-
Actions as well. This is primarily due to it being easy (a single line
24-
change in the Github Actions workflow config) and also easy integration
25-
with the Github API for implementation of the premerge testing advisor.
26-
More detailed motivation for the doing postcommit testing directly through
27-
Github is available in the [discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124)
28-
where we proposed doing this. We eventually decided against implementation in
29-
this way for a couple of reasons:
30-
31-
1. Nonstandard - The standard postcommit testing infrastructure for LLVM is
32-
through Buildbot. Doing postcommit testing for the premerge configuration
33-
through Github would represent a significant deparature from this. This means
34-
we are leaving behind some common infrastructure and are also forcing a new
35-
unfamiliar postcommit interface on LLVM contributors.
36-
2. Notifications - This is the biggest issue. Github currently gives very
37-
little control over the notifications that are sent out when the build
38-
fails or gets cancelled. This is specifically a problem with Github sending
39-
out notifications for build failures even if the previous build has failed.
40-
This can easily create a lot of warning fatigue which is something we are
41-
putting a lot of effort in to avoid so that the premerge system is
42-
percieved as reliable, people trust its results, and most importantly,
43-
people pay attention to failures when they do occur because they are
44-
caused by the patch getting the notification and are actionable.
45-
3. Customization - Buildbot can be customized around issues like notifications
46-
whereas Github cannot. Github is not particularly responsive on feature
47-
requests and their notification story has been poor for a while, so their
48-
lack of customization is a strategic risk.
13+
LLVM has two types of testing upstream: premerge and postcommit. The premerge
14+
testing is performed using Github Actions every time a PR is updated before it
15+
is merged. Premerge testing is performed using this infrastructure (specifically
16+
the `./premerge` folder in llvm-zorg). Landing a PR consists of squashing the
17+
changes into a single commit and adding that commit to the `main` branch in the
18+
monorepo. We care specifically about the state of the `main` branch because it
19+
is what the community considers the canonical tree. Currently, commits can also
20+
be added to the `main` branch by directly pushing to the main branch. After a
21+
new commit lands in the `main` branch, postcommit testing is performed. Most
22+
postcommit testing is performed through the Buildbot infrastructure. The main
23+
Buildbot instance for LLVM has a web instance hosted at
24+
[lab.llvm.org](https:/lab.llvm.org). When a new commit lands in `main` the
25+
Buildbot instance (sometimes referred to as the Buildbot master) will trigger
26+
many different build configurations. These configurations are defined in the
27+
llvm-zorg repository under the `buildbot/` folder. These configurations are run
28+
on Buildbot workers that are hosted by the community.
29+
30+
It is important that we test the premerge configuration postcommit as well. We
31+
need to be able to determine the state of `main` (in terms of whether the build
32+
passed/failed and what tests failed, if any) for certain types of automation.
33+
Postcommit testing enables easily checking the state of `main` at any given point
34+
in time. This data is crucial for figuring out which commit to revert/fix
35+
forward to get everything back to green. Having information on the state of
36+
`main` is also important for certain kinds of automation, like the planned
37+
premerge testing advisor that will let contributors know if tests failing in
38+
their PR are already failing at `main` and that it should be safe to merge
39+
despite the given failures.
4940

5041
## Design
5142

52-
The overall design involves using an annotated builder that will be deployed
53-
on both the central and west clusters for a HA configuration. The annotated
54-
builder will consist of a script that runs builds inside kubernetes pods
55-
to enable autoscaling.
56-
57-
In terms of the full flow, a commit pushed to the LLVM monorepo will get
58-
detected by the buildbot master. The Buildbot master will invoke Buildbot
59-
workers running on our clusters. These Buildbot workers will use annotated
60-
builders to launch a build wrapped in a kubernetes pod and report the results
61-
back to the buildbot master. When the job is finished, the pod will complete
62-
and capacity will be available for another build, or if there is nothing
63-
left to test GKE will see that there is nothing running on one of the nodes
64-
and downscale the node pool.
43+
The LLVM Premerge system has two clusters, namely the central cluster in the GCP
44+
zone `us-central1-a` and the west cluster in the GCP zone `us-west1`. We run
45+
two clusters in different zones for redundancy so that if one fails, we can
46+
still run jobs. For postcommit testing, we plan on setting up builders attached
47+
to the Buildbot master described above. We will run one builder on the central
48+
cluster and one in the west cluster. This ensures the configuration is highly
49+
available (able to tolerate an entire cluster going down), similar to the
50+
premerge testing. The builders will be configured to use a script that will
51+
launch builds on each commit to `main` in a similar configuration to the one run
52+
premerge. The test configuration is intended to be close to the premerge
53+
configuration but will be different in some key ways. The differences and
54+
motivation for them is described more thoroughly in the
55+
[testing configuration](#testing-configuration) section. These builds will be
56+
run inside containers that are distributed onto the cluster inside kubernetes
57+
pods (the fundamental schedulable unit inside kubernetes). This allows for
58+
kubernetes to handle details like what machine a build should run on. Allowing
59+
kubernetes to handle these details also enables GKE to autoscale the node pools
60+
so we are not paying for uneeded capacity. Launching builds inside pods
61+
also allows for each builder to handle multiple builds at the same time.
62+
63+
In terms of the full flow, any commit (which can be from direct pushes or
64+
merging pull requests) pushed to the LLVM monorepo will get detected by the
65+
buildbot master. The Buildbot master will invoke Buildbot workers running on our
66+
clusters. These Buildbot workers will use custom builders to launch a build
67+
wrapped in a kubernetes pod and report the results back to the buildbot master.
68+
When the job is finished, the pod will complete and capacity will be available
69+
for another build, or if there is nothing left to test GKE will see that there
70+
is nothing running on one of the nodes and downscale the node pool.
6571

6672
### Annotated Builder
6773

6874
llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder.
69-
We need the flexibility of the AnnotatedBuilder (essentially a custom python
70-
file that runs the build) to deploy jobs on the cluster. AnnotatedBuilder based
71-
builders also enable deploying changes without needing to restart the buildbot
72-
master. Without this, we have to wait for an administrator of the LLVM buildbot
73-
master to restart it before our changes get deployed. This could significantly
74-
delay updates or responses to incidents, especially before the system is fully
75-
stable and we need to modify it relatively frequently.
75+
AnnotatedBuilders allow for the build to be driven using a custom python script
76+
rather than directly dictating the shell commands that should be run to perform
77+
the build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the
78+
cluster. AnnotatedBuilder based builders also enable deploying changes without
79+
needing to restart the buildbot master. Without this, we have to wait for an
80+
administrator of the LLVM buildbot master to restart it before our changes get
81+
deployed. This could significantly delay updates or responses to incidents,
82+
especially before the system is fully stable and we need to modify it relatively
83+
frequently.
7684

7785
### Build Distribution
7886

@@ -112,3 +120,50 @@ test certain projects even though their dependencies change, and while we do
112120
this because we suspect interactions resulting in test failures would be quite
113121
rare, it is possible, and having a postcommit configuration catch these rare
114122
failures would be useful.
123+
124+
### Data Storage
125+
126+
The hosted Buildbot master instance at [lab.llvm.org](https://lab.llvm.org)
127+
contains results for all recent postcommit runs. We plan on quetying the results
128+
from the buildbot master because they are already available and that is where
129+
they will natively be reported after the infrastructure is set up. Buildbot
130+
supports a [REST API](https://docs.buildbot.net/latest/developer/rest.html) that
131+
would allow for easily querying the state of a commit in `main`.
132+
133+
For the proposed premerge advisor that tells the user what tests/build failures
134+
they can safely ignore, we need to know what is currently failing on `main`.
135+
Each pull request is tested as if it was merged into main, which means the
136+
commit underneath the PR is very recent. If a premerge run fails, the premerge
137+
advisor will find the commit from `main` the PR is being tested on. It will then
138+
query the Buildbot master using the REST API for the status of that commit.
139+
It can then report the appropriate status to the user.
140+
141+
## Alternatives Considered
142+
143+
Originally, we were looking at running postcommit testing through Github
144+
Actions as well. This is primarily due to it being easy (a single line
145+
change in the Github Actions workflow config) and also easy integration
146+
with the Github API for implementation of the premerge testing advisor.
147+
More detailed motivation for the doing postcommit testing directly through
148+
Github is available in the [discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124)
149+
where we proposed doing this. We eventually decided against implementation in
150+
this way for a couple of reasons:
151+
152+
1. Nonstandard - The standard postcommit testing infrastructure for LLVM is
153+
through Buildbot. Doing postcommit testing for the premerge configuration
154+
through Github would represent a significant deparature from this. This means
155+
we are leaving behind some common infrastructure and are also forcing a new
156+
unfamiliar postcommit interface on LLVM contributors.
157+
2. Notifications - This is the biggest issue. Github currently gives very
158+
little control over the notifications that are sent out when the build
159+
fails or gets cancelled. This is specifically a problem with Github sending
160+
out notifications for build failures even if the previous build has failed.
161+
This can easily create a lot of warning fatigue which is something we are
162+
putting a lot of effort in to avoid so that the premerge system is
163+
percieved as reliable, people trust its results, and most importantly,
164+
people pay attention to failures when they do occur because they are
165+
caused by the patch getting the notification and are actionable.
166+
3. Customization - Buildbot can be customized around issues like notifications
167+
whereas Github cannot. Github is not particularly responsive on feature
168+
requests and their notification story has been poor for a while, so their
169+
lack of customization is a strategic risk.

0 commit comments

Comments
 (0)