# Premerge Advisor

## Introduction

While the premerge testing infrastructure is generally reliable, tests that are
broken at HEAD or tests that are flaky can significantly degrade the user
experience. Reporting failures unrelated to the PR being worked on can lead to
warning fatigue. Warning fatigue can self-reinforce when people land failing
tests into `main` that they believe are unrelated. This causes more false
positive failures in other premerge testing invocations, leading to still more
warning fatigue. To address these issues we propose to implement a "premerge
advisor". The premerge advisor is designed to run after premerge testing and
signal to the user whether the failures can be safely ignored because they are
flaky or broken at HEAD, or whether they should be investigated further.

## Background/Motivation

The usefulness of the premerge system is significantly impacted by how much
people trust it. People trust the system less when the issues reported to them
are unrelated to the changes that they are currently working on. False positives
occur regularly, whether due to flaky tests or due to tests already being broken
at HEAD (perhaps by a previous commit). Efforts to fix these issues at the
source, keeping `main` always green and deflaking tests, are ongoing, but they
are not a scalable solution to the problem of false positive failures. It is
also not reasonable to expect PR authors to spend time familiarizing themselves
with all known flaky tests and digging through postcommit testing logs to see
whether the failures in their premerge run also occur in `main`. These
alternatives are explored further in the section on
[alternatives considered](#alternatives-considered). Having tooling to
automatically run through the steps that one would otherwise need to perform
manually ensures that the analysis of every failed premerge run is thorough and
likely to be correct.

## Design

The premerge advisor will consist of three main parts: jobs uploading failure
information, a web server and database to store and query failure information,
and tooling to write a verdict about the failures into comments on GitHub. When
a job runs premerge or postcommit testing and there are build/test failures, it
will upload information about each failure, such as the test/build action that
failed and the exact log, to the web server. The web server will then store this
information in a format that makes it easy to query later. Every premerge run
that encounters build/test failures will then query the web server to see if
there are any matching build/test failures for the version of `main` that the PR
has been merged into. If there are, the web server can respond with the failure
information and signal that the failures can be safely ignored for that premerge
run. If the failures are novel, the web server can signal that they should be
investigated more thoroughly. If there are failures, the premerge advisor can
then write a comment on the PR explaining its findings.

### Processing and Storing Failures

A key part of the premerge advisor is infrastructure to store and process build
failure information so that it can be queried later. We plan on having jobs
extract failure logs and upload them to a web server. In addition to the
endpoint that accepts these uploads, the web server will have a query endpoint
that accepts test failure information (logs and file names) from a premerge run
and returns whether the failures are broken at `main`, flaky, or novel failures
caused by the PR.
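
To make the split between the two endpoints concrete, below is a minimal sketch
of what the web server interface might look like. The route names, payload
fields, and in-memory storage are illustrative assumptions rather than final
interface decisions; the real service would back these endpoints with the
storage and matching logic described later in this document.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative in-memory stand-ins for the SQLite-backed storage described below.
active_failures_in_main = {}  # test name -> most recent failure record
known_flaky_tests = set()


@app.post("/failures")
def upload_failures():
    """Accept failure records uploaded by premerge and postcommit jobs."""
    payload = request.get_json()
    for failure in payload["failures"]:
        active_failures_in_main[failure["test_name"]] = failure
    return jsonify({"status": "ok"})


@app.post("/explain")
def explain_failures():
    """Classify each reported failure as broken at main, flaky, or novel."""
    payload = request.get_json()
    verdicts = []
    for failure in payload["failures"]:
        name = failure["test_name"]
        if name in active_failures_in_main:
            verdicts.append({"test_name": name, "verdict": "broken-at-main"})
        elif name in known_flaky_tests:
            verdicts.append({"test_name": name, "verdict": "flaky"})
        else:
            verdicts.append({"test_name": name, "verdict": "novel"})
    return jsonify({"verdicts": verdicts})
```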

For the premerge jobs running through GitHub Actions, we plan on using the
existing `generate_test_report` scripts that are currently used for generating
summaries on job failures. When a job ends with a failure, a script will run
that uses the `generate_test_report` library to extract failure information and
then uploads that information to the web server. Information on how the premerge
jobs will query the server and display results about flaky/already broken tests
is in [the section below](#informing-the-user). We plan on having both the
premerge and [postcommit jobs](post-submit-testing.md) upload failure
information to the web server. This enables the web server to know about
failures that are not the result of mid-air collisions before postcommit testing
has completed. Postcommit testing can take upwards of 30 minutes, during which
the premerge advisor would otherwise not be able to advise that such failures
are due to a broken `main`. Data from both postcommit and premerge testing will
be associated with the commit SHA of the base commit in `main`. Premerge testing
data will additionally have a PR associated with it to enable disambiguating
between failures occurring in `main` at a specific commit and failures occurring
only in a PR.

We plan on implementing the web server in Python with the `flask` library. All
contributors to the premerge infrastructure are already familiar with Python,
building web servers in Python with `flask` is relatively easy, and we do not
need high performance or advanced features. For storage, we plan on using SQLite
as it has support built into Python, does not require any additional complexity
in terms of infrastructure setup, and is reliable.
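
A rough sketch of the storage schema, assuming SQLite, is shown below. The table
and column names are illustrative only; the schema mainly needs to capture the
failing test or build action, the failure log, the base commit, the workflow
run, the PR (when the data comes from premerge rather than postcommit), and a
timestamp, while remaining easy to extend later.

```python
import sqlite3

# Illustrative schema: one row per observed build/test failure.
SCHEMA = """
CREATE TABLE IF NOT EXISTS failures (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    test_name TEXT NOT NULL,        -- test or build action that failed
    failure_log TEXT NOT NULL,      -- exact log output, used for fuzzy matching
    base_commit_sha TEXT NOT NULL,  -- commit in main the run was based on
    workflow_run INTEGER,           -- workflow run number, if applicable
    pr_number INTEGER,              -- NULL for postcommit failures
    observed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_failures_test_name ON failures(test_name);
"""

connection = sqlite3.connect("premerge_advisor.db")
connection.executescript(SCHEMA)
connection.commit()
```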

Given we have two identical clusters that need to be able to function
independently of each other, we also need to duplicate the web server for the
premerge advisor. Queries and failure information uploads will go to the
cluster-local premerge advisor web server. Periodically (e.g., once every thirty
seconds), each web server will query the web server on the other cluster to see
if there is any new data that has not yet been propagated to it. It is easy to
figure out what one side is missing, as GitHub workflow runs are numbered
sequentially and git commits are also explicitly ordered. One side just needs to
send the latest commit SHA and workflow run for which it has all previous data,
and the other side can reply with the data it has past that point. Explicitly
synchronizing every time, without assumptions about the state of the other side,
has benefits over simply writing through, like ensuring that a cluster that has
been down for a significant amount of time can seamlessly recover.
Synchronization does not need to be perfect: test failures that are flaky or
broken at HEAD will almost certainly show up in both clusters relatively
quickly, and minor discrepancies in query results between the clusters are not a
big deal.
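
A minimal sketch of the synchronization step is shown below, assuming a
hypothetical `/sync` endpoint on the peer server and a small storage wrapper
around the SQLite database; both are stand-ins used only for illustration.

```python
import requests

PEER_URL = "https://premerge-advisor.peer-cluster.internal"  # illustrative


def sync_with_peer(store):
    """Pull any failure data from the peer cluster that we do not have yet."""
    cursor = {
        # The newest commit SHA and workflow run this side has complete data for.
        "latest_commit_sha": store.latest_complete_commit(),
        "latest_workflow_run": store.latest_complete_workflow_run(),
    }
    response = requests.post(f"{PEER_URL}/sync", json=cursor, timeout=10)
    response.raise_for_status()
    # The peer replies with everything newer than the cursor.
    for record in response.json()["failures"]:
        store.insert_if_missing(record)
```

A scheduler (or a simple loop with a thirty-second sleep) would call
`sync_with_peer` on each cluster independently.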

### Identifying if Failures are Novel

When the web server is sent a request to explain a list of failures, it needs to
be able to determine whether a failure has occurred previously. To do this, the
web server will keep a list of currently active failures in `main` and a list of
flaky tests. The list of flaky tests will be computed on startup from a rolling
window of historical data. The exact heuristic is left to the implementation as
it will likely require some experimentation. The list of currently active
failures will initially be taken from the last known postcommit run. If a new
postcommit run shows additional failures that are not in the flake list, they
will be added to the currently active failure list. Failures in the currently
active list that are not present in the latest postcommit run will be removed
from the list. In addition to data from postcommit testing, if a PR lands with
premerge failures that the advisor could not explain, those failures are also
added to the currently active failure list. If a PR based on a version of `main`
with currently active failures no longer contains those failures, they will be
removed from the list when that PR lands. We look at information from PRs to
improve the latency with which failures and passes make their way into the
system, given that postcommit testing currently has a minimum latency of about
30 minutes.
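
The bookkeeping described above amounts to simple set operations per postcommit
run. The sketch below assumes failures are keyed by test name purely for
brevity; in practice the key would use the matching logic described in the next
paragraph, and PR-derived updates would be applied in the same way.

```python
def update_active_failures(active_failures, flaky_tests, postcommit_failures):
    """Update the currently active failure list from a new postcommit run.

    All arguments are sets of failure keys (test names in this sketch).
    Returns the new set of currently active failures in main.
    """
    # New postcommit failures that are not known flakes become active failures.
    newly_broken = postcommit_failures - flaky_tests
    # Anything no longer failing in the latest postcommit run is considered fixed.
    still_active = active_failures & postcommit_failures
    return still_active | newly_broken
```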

Failures will be identified as being the same through a combination of the test
name and likely a fuzzy match of the test error output. The exact details are
left to the implementation as they will likely require some experimentation to
get right.
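
One possible matching heuristic, shown purely as an illustration, is an exact
match on the test name combined with a similarity threshold on the error output
using Python's standard library; the threshold value is an assumption that would
need tuning.

```python
from difflib import SequenceMatcher

# Threshold chosen arbitrarily for illustration; tuning it is an implementation
# detail that will require experimentation.
SIMILARITY_THRESHOLD = 0.9


def is_same_failure(known, candidate):
    """Return True if two failure records likely describe the same failure."""
    if known["test_name"] != candidate["test_name"]:
        return False
    similarity = SequenceMatcher(
        None, known["failure_log"], candidate["failure_log"]
    ).ratio()
    return similarity >= SIMILARITY_THRESHOLD
```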

These tradeoffs do leave open the possibility that we incorrectly identify tests
as being broken at HEAD in certain extenuating circumstances. Consider a
situation where someone lands commit A breaking a test, immediately afterwards
lands commit B fixing that test, and around the same time opens a PR, C, that
also breaks the test in the same way. If commit B was pushed directly, or merged
without waiting for premerge to finish, then the test failure will still be in
the currently active failure list. The failure for PR C will then be marked as
failing at HEAD despite it not actually failing at HEAD. Given the low
probability of this occurring, due to the same test needing to be broken, the
error message needing to be extremely similar, and the exact timing
requirements, we deem this an acceptable consequence of the current design.
Alleviating this issue (for example, by relying only on postcommit data) would,
due to the latency of postcommit testing, cause many more failures that
genuinely are broken at `main` to be marked as true failures.

### Informing the User

Once a workflow has completed, no actions related to the premerge advisor will
be performed if there are no failures. If there are failures, they will be
uploaded to the web server. Afterwards, the premerge workflow makes a request to
the server asking if it can explain the failures as either existing in `main` or
as flaky.

After the response from the server has been received, the workflow will
construct a comment. It will contain the failure information and, if relevant,
information/links showing that the tests are flaky (along with the flakiness
rate) or are broken in `main`. If all of the test failures are due to flakes or
failures in `main`, the comment will indicate to the user that they can safely
merge their PR despite the test failures. We plan to construct the comment in a
manner similar to the code formatting action: we will create one comment on the
first workflow failure and then update that comment every time we get new data.
This avoids creating excessive noise. It does mean that the comment might get
buried in long-running reviews, but those are not the common case, and in such
cases people should learn to look for the comment earlier in the thread.
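
A sketch of the create-or-update comment flow is shown below, using the GitHub
REST API directly. The hidden marker string, repository/PR parameters, and the
lack of pagination handling are simplifying assumptions for illustration only.

```python
import requests

API = "https://api.github.com"
MARKER = "<!-- premerge-advisor -->"  # hidden marker used to find our comment


def upsert_advisor_comment(repo, pr_number, body, token):
    """Create the advisor comment on first failure, update it afterwards."""
    headers = {"Authorization": f"Bearer {token}"}
    body = f"{MARKER}\n{body}"
    # PR conversation comments live on the issues endpoint.
    comments = requests.get(
        f"{API}/repos/{repo}/issues/{pr_number}/comments", headers=headers
    ).json()
    existing = next((c for c in comments if MARKER in c["body"]), None)
    if existing:
        requests.patch(
            f"{API}/repos/{repo}/issues/comments/{existing['id']}",
            headers=headers, json={"body": body},
        ).raise_for_status()
    else:
        requests.post(
            f"{API}/repos/{repo}/issues/{pr_number}/comments",
            headers=headers, json={"body": body},
        ).raise_for_status()
```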

If all of the failures in the workflow were successfully explained by the
premerge advisor as flaky or already broken in `main`, then the premerge
workflow will be marked as successful despite the failures. This will be
achieved by having the build/test step always marked as successful. The premerge
advisor will then exit with a non-zero exit code if it comes across
non-explainable failures.
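
A minimal sketch of that final gating step, assuming the advisor script receives
the verdict list returned by the web server:

```python
import sys


def exit_code_for_verdicts(verdicts):
    """Fail the advisor step only if some failure could not be explained."""
    unexplained = [v for v in verdicts if v["verdict"] == "novel"]
    return 1 if unexplained else 0


if __name__ == "__main__":
    # In the real script, `verdicts` would come from the web server's response.
    verdicts = []
    sys.exit(exit_code_for_verdicts(verdicts))
```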

## Alternatives Considered

There are two main alternatives to this work. One would be to do nothing and let
users figure out these failures on their own, potentially with documentation to
better inform them of the process. The other would be to work on keeping `main`
green all the time and deflaking tests rather than working on a premerge
advisor. Both of these alternatives are considered below, along with the option
of using an existing off-the-shelf system.

### Deflaking Tests and Keeping Main Green

Instead of putting effort into building the premerge advisor, we could put
effort into deflaking tests and making process changes to ensure `main` is not
broken. These fixes have the bonus of being useful for more than just premerge,
also improving reliability for buildbots and any downstream testing. While we
probably could achieve this at least temporarily with process changes and a
concentrated deflaking effort, we do not believe this is feasible or scalable.

In order to ensure that `main` is not broken by new patches, we would need to
ensure that every commit is tested directly on top of `main` before landing.
This is not feasible given LLVM's current processes, where pushing directly to
`main` is a critical component of several developers' workflows. We would also
need to reduce the risk of "mid-air collisions": patches that pass premerge
testing but fail on `main` when landed because the patch in its original state
is not compatible with the new state of `main`. Preventing these would most
likely involve merge queues, which would introduce new CI load and, like
mandatory premerge checks, are not compatible with LLVM's existing practices.

A one-time effort to deflake tests is also not scalable from an engineering
perspective. While we might be able to deflake existing tests, additional flaky
tests will get added in the future, and it is likely not feasible to dedicate
enough resources to continually deflake them. Policy improvements around
reverting patches that introduce flaky tests might make this more scalable, but
that relies on quick detection of flaky tests, which can be difficult for tests
that fail flakily only very rarely.

### Not Doing Anything

Alternatively, we could not implement this at all. This system adds quite a bit
of complexity and introduces new failure modes, and false positive failures are
not that frequent. However, even the relatively small percentage of false
positives we are observing significantly impacts trust in the premerge system,
which compounds the problem. People learn not to trust the results, ignore true
failures caused by their patch, and then land it, causing many downstream
failures. The frequency of these incidents (typically multiple times per week)
means that most LLVM developers will likely run into this class of issue sooner
or later.

The complexity is also well confined to the components specific to this new
infrastructure, and the new failure modes can be mitigated through proper error
handling at the interface between the existing premerge system and the new
premerge advisor infrastructure.

### Using an Existing System

There are several platforms that offer post-workflow test analysis around issues
like flaky tests. These platforms include [trunk.io](https://trunk.io),
[Mergify](https://mergify.com), and
[DataDog Test Optimization](https://docs.datadoghq.com/tests/). However, only
one of these platforms, trunk.io, supports marking jobs as green if the only
tests that failed were flaky, through its
[quarantining feature](https://docs.trunk.io/flaky-tests/quarantining). It also
supports little customization of PR commenting (commenting is either always on,
even for successful runs, or always off), and our notification story is of high
importance: we want the signal-to-noise ratio of premerge notifications to be
high, and the system should simply say nothing if everything has gone well.

In addition to being able to mark jobs as green if only flaky tests failed, we
also need to be able to keep track of failures in `main` and not block merging
on them. None of the systems considered support this functionality; they assume
that `main` should always be green (through mechanisms like a merge queue) or
should be fixed quickly. Due to the scale and contribution norms of LLVM,
keeping `main` always green is hard to enforce at scale, so we need to support
this feature.

The cost of using these systems is also nontrivial given how frequently the
premerge pipeline runs and the number of tests involved. These systems are
typically priced per "test span" (a single test invocation), with trunk.io,
Mergify, and DataDog all pricing ingestion at roughly $3 per 1M test spans. We
run roughly 300k tests per premerge run. Using a conservative estimate of 400
premerge runs per weekday and no runs on weekends, we end up with a cost of
almost $100k per year, not including any potential open source discounts.
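
The back-of-the-envelope calculation behind that figure, assuming roughly 260
weekdays per year, is:

```python
tests_per_run = 300_000        # approximate tests per premerge run
runs_per_weekday = 400         # conservative estimate from above
weekdays_per_year = 260        # no runs on weekends
cost_per_million_spans = 3     # USD, vendor list pricing

spans_per_year = tests_per_run * runs_per_weekday * weekdays_per_year
annual_cost = spans_per_year / 1_000_000 * cost_per_million_spans
print(f"${annual_cost:,.0f} per year")  # roughly $93,600 per year
```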