@@ -10,69 +10,77 @@ how we plan on implementing this to ensure we get fast feedback scalably.
1010
1111## Background/Motivation
1212
13- It is important that we test the premerge configuration postcommit as well.
14- This enables easily checking the state of ` main ` at any given time and also
15- easily pinpointing where exactly ` main ` has broken which enables figuring
16- out which commit to revert/fix forward to get everything back to green.
17- Having information on the state of ` main ` is also important for certain kinds
18- of automation, like the planned premerge testing advisor that will let
19- contributors know if tests failing in their PR are already failing at ` main `
20- and that it should be safe to merge despite the given failures.
21-
22- Originally, we were looking at running postcommit testing through Github
23- Actions as well. This is primarily due to it being easy (a single line
24- change in the Github Actions workflow config) and also easy integration
25- with the Github API for implementation of the premerge testing advisor.
26- More detailed motivation for the doing postcommit testing directly through
27- Github is available in the [ discourse RFC thread] ( https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124 )
28- where we proposed doing this. We eventually decided against implementation in
29- this way for a couple of reasons:
30-
31- 1 . Nonstandard - The standard postcommit testing infrastructure for LLVM is
32- through Buildbot. Doing postcommit testing for the premerge configuration
33- through Github would represent a significant deparature from this. This means
34- we are leaving behind some common infrastructure and are also forcing a new
35- unfamiliar postcommit interface on LLVM contributors.
36- 2 . Notifications - This is the biggest issue. Github currently gives very
37- little control over the notifications that are sent out when the build
38- fails or gets cancelled. This is specifically a problem with Github sending
39- out notifications for build failures even if the previous build has failed.
40- This can easily create a lot of warning fatigue which is something we are
41- putting a lot of effort in to avoid so that the premerge system is
42- percieved as reliable, people trust its results, and most importantly,
43- people pay attention to failures when they do occur because they are
44- caused by the patch getting the notification and are actionable.
45- 3 . Customization - Buildbot can be customized around issues like notifications
46- whereas Github cannot. Github is not particularly responsive on feature
47- requests and their notification story has been poor for a while, so their
48- lack of customization is a strategic risk.
13+ LLVM has two types of testing upstream: premerge and postcommit. The premerge
14+ testing is performed using Github Actions every time a PR is updated before it
15+ is merged. Premerge testing is performed using this infrastructure (specifically
16+ the ` ./premerge ` folder in llvm-zorg). Landing a PR consists of squashing the
17+ changes into a single commit and adding that commit to the ` main ` branch in the
18+ monorepo. We care specifically about the state of the ` main ` branch because it
19+ is what the community considers the canonical tree. Currently, commits can also
20+ be added to the ` main ` branch by directly pushing to the main branch. After a
21+ new commit lands in the ` main ` branch, postcommit testing is performed. Most
22+ postcommit testing is performed through the Buildbot infrastructure. The main
23+ Buildbot instance for LLVM has a web instance hosted at
24+ [ lab.llvm.org] ( https:/lab.llvm.org ) . When a new commit lands in ` main ` the
25+ Buildbot instance (sometimes referred to as the Buildbot master) will trigger
26+ many different build configurations. These configurations are defined in the
27+ llvm-zorg repository under the ` buildbot/ ` folder. These configurations are run
28+ on Buildbot workers that are hosted by the community.
29+
30+ It is important that we test the premerge configuration postcommit as well. We
31+ need to be able to determine the state of ` main ` (in terms of whether the build
32+ passed/failed and what tests failed, if any) for certain types of automation.
33+ Postcommit testing enables easily checking the state of ` main ` at any given point
34+ in time. This data is crucial for figuring out which commit to revert/fix
35+ forward to get everything back to green. Having information on the state of
36+ ` main ` is also important for certain kinds of automation, like the planned
37+ premerge testing advisor that will let contributors know if tests failing in
38+ their PR are already failing at ` main ` and that it should be safe to merge
39+ despite the given failures.
4940
5041## Design
5142
52- The overall design involves using an annotated builder that will be deployed
53- on both the central and west clusters for a HA configuration. The annotated
54- builder will consist of a script that runs builds inside kubernetes pods
55- to enable autoscaling.
56-
57- In terms of the full flow, a commit pushed to the LLVM monorepo will get
58- detected by the buildbot master. The Buildbot master will invoke Buildbot
59- workers running on our clusters. These Buildbot workers will use annotated
60- builders to launch a build wrapped in a kubernetes pod and report the results
61- back to the buildbot master. When the job is finished, the pod will complete
62- and capacity will be available for another build, or if there is nothing
63- left to test GKE will see that there is nothing running on one of the nodes
64- and downscale the node pool.
43+ The LLVM Premerge system has two clusters, namely the central cluster in the GCP
44+ zone ` us-central1-a ` and the west cluster in the GCP zone ` us-west1 ` . We run
45+ two clusters in different zones for redundancy so that if one fails, we can
46+ still run jobs. For postcommit testing, we plan on setting up builders attached
47+ to the Buildbot master described above. We will run one builder on the central
48+ cluster and one in the west cluster. This ensures the configuration is highly
49+ available (able to tolerate an entire cluster going down), similar to the
50+ premerge testing. The builders will be configured to use a script that will
51+ launch builds on each commit to ` main ` in a similar configuration to the one run
52+ premerge. The test configuration is intended to be close to the premerge
53+ configuration but will be different in some key ways. The differences and
54+ motivation for them is described more thoroughly in the
55+ [ testing configuration] ( #testing-configuration ) section. These builds will be
56+ run inside containers that are distributed onto the cluster inside kubernetes
57+ pods (the fundamental schedulable unit inside kubernetes). This allows for
58+ kubernetes to handle details like what machine a build should run on. Allowing
59+ kubernetes to handle these details also enables GKE to autoscale the node pools
60+ so we are not paying for uneeded capacity. Launching builds inside pods
61+ also allows for each builder to handle multiple builds at the same time.
62+
63+ In terms of the full flow, any commit (which can be from direct pushes or
64+ merging pull requests) pushed to the LLVM monorepo will get detected by the
65+ buildbot master. The Buildbot master will invoke Buildbot workers running on our
66+ clusters. These Buildbot workers will use custom builders to launch a build
67+ wrapped in a kubernetes pod and report the results back to the buildbot master.
68+ When the job is finished, the pod will complete and capacity will be available
69+ for another build, or if there is nothing left to test GKE will see that there
70+ is nothing running on one of the nodes and downscale the node pool.
6571
6672### Annotated Builder
6773
6874llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder.
69- We need the flexibility of the AnnotatedBuilder (essentially a custom python
70- file that runs the build) to deploy jobs on the cluster. AnnotatedBuilder based
71- builders also enable deploying changes without needing to restart the buildbot
72- master. Without this, we have to wait for an administrator of the LLVM buildbot
73- master to restart it before our changes get deployed. This could significantly
74- delay updates or responses to incidents, especially before the system is fully
75- stable and we need to modify it relatively frequently.
75+ AnnotatedBuilders allow for the build to be driven using a custom python script
76+ rather than directly dictating the shell commands that should be run to perform
77+ the build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the
78+ cluster. AnnotatedBuilder based builders also enable deploying changes without
79+ needing to restart the buildbot master. Without this, we have to wait for an
80+ administrator of the LLVM buildbot master to restart it before our changes get
81+ deployed. This could significantly delay updates or responses to incidents,
82+ especially before the system is fully stable and we need to modify it relatively
83+ frequently.
7684
7785### Build Distribution
7886
@@ -112,3 +120,50 @@ test certain projects even though their dependencies change, and while we do
112120this because we suspect interactions resulting in test failures would be quite
113121rare, it is possible, and having a postcommit configuration catch these rare
114122failures would be useful.
123+
124+ ### Data Storage
125+
126+ The hosted Buildbot master instance at [ lab.llvm.org] ( https://lab.llvm.org )
127+ contains results for all recent postcommit runs. We plan on quetying the results
128+ from the buildbot master because they are already available and that is where
129+ they will natively be reported after the infrastructure is set up. Buildbot
130+ supports a [ REST API] ( https://docs.buildbot.net/latest/developer/rest.html ) that
131+ would allow for easily querying the state of a commit in ` main ` .
132+
133+ For the proposed premerge advisor that tells the user what tests/build failures
134+ they can safely ignore, we need to know what is currently failing on ` main ` .
135+ Each pull request is tested as if it was merged into main, which means the
136+ commit underneath the PR is very recent. If a premerge run fails, the premerge
137+ advisor will find the commit from ` main ` the PR is being tested on. It will then
138+ query the Buildbot master using the REST API for the status of that commit.
139+ It can then report the appropriate status to the user.
140+
141+ ## Alternatives Considered
142+
143+ Originally, we were looking at running postcommit testing through Github
144+ Actions as well. This is primarily due to it being easy (a single line
145+ change in the Github Actions workflow config) and also easy integration
146+ with the Github API for implementation of the premerge testing advisor.
147+ More detailed motivation for the doing postcommit testing directly through
148+ Github is available in the [ discourse RFC thread] ( https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124 )
149+ where we proposed doing this. We eventually decided against implementation in
150+ this way for a couple of reasons:
151+
152+ 1 . Nonstandard - The standard postcommit testing infrastructure for LLVM is
153+ through Buildbot. Doing postcommit testing for the premerge configuration
154+ through Github would represent a significant deparature from this. This means
155+ we are leaving behind some common infrastructure and are also forcing a new
156+ unfamiliar postcommit interface on LLVM contributors.
157+ 2 . Notifications - This is the biggest issue. Github currently gives very
158+ little control over the notifications that are sent out when the build
159+ fails or gets cancelled. This is specifically a problem with Github sending
160+ out notifications for build failures even if the previous build has failed.
161+ This can easily create a lot of warning fatigue which is something we are
162+ putting a lot of effort in to avoid so that the premerge system is
163+ percieved as reliable, people trust its results, and most importantly,
164+ people pay attention to failures when they do occur because they are
165+ caused by the patch getting the notification and are actionable.
166+ 3 . Customization - Buildbot can be customized around issues like notifications
167+ whereas Github cannot. Github is not particularly responsive on feature
168+ requests and their notification story has been poor for a while, so their
169+ lack of customization is a strategic risk.
0 commit comments