
WEH 2025 - Running WPT under CI: Problems and challenges, tips and tricks, and lessons learned

What do you want to discuss?

  • Dave: Problem using the web runner: some of the tests freeze if you use the test runner, so you have to use a regex to skip the ones that are freezing... and some browsers crash. I tried to open a bug, and the response was that the web runner isn't supported.

  • Nico Burns: Web platform tests with a custom runner are not well supported. Let's specify the format of the tests so it's easier to interact with them, and define a subset of the tests that can be run without depending on other parts of the web platform.

  • Tim (who works on Ladybird): We run a subset of WPT tests in CI, using our own runner infrastructure that is much faster. I'm wondering if people have recommendations for doing a more complete run in CI; what do other browsers do? We are constrained on compute resources.

  • sideshowbarker: For Ladybird, we want to run the CI on every branch, but this can take a really long time (hours?).

  • Tim: What we do is import tests that are relevant to us and that we verify work correctly. They run from the file system rather than over HTTP. We have also done work to speed up running WPT on a single machine. I think other browsers use Taskcluster or similar CI, running in parallel on 8 or 16 different machines.

  • Nico: 20 machines.

  • sideshowbarker: We have our own test harness; it does not run the Python-based WPT web server. The way it works is that we import tests to run in our harness for regression testing. We have a mechanism for recording test expectations, and then we just compare against those expectations to make sure we don't regress (see the sketch below). How long does it take to run the whole local harness?
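
    A minimal sketch of what comparing a run against recorded expectations could look like; the data shapes and test names here are hypothetical, not Ladybird's actual harness:

    ```python
    # Hypothetical expectation store: test path -> recorded result string.
    def find_regressions(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
        """Return tests whose result no longer matches the recorded expectation."""
        return [
            test
            for test, result in actual.items()
            if test in expected and expected[test] != result
        ]

    expected = {"css/a.html": "PASS", "dom/b.html": "FAIL"}
    actual = {"css/a.html": "FAIL", "dom/b.html": "FAIL"}
    print(find_regressions(expected, actual))  # ['css/a.html']
    ```

    The point of recording expectations (rather than requiring PASS) is that known failures don't block CI; only a change in result does.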

  • Tim: 2304 tests, on my local machine with 32 threads, it takes 8 seconds.

  • Nico: Similar: for the Blitz browser, we have a custom runner that compiles into the same binary as the browser. The HTTPS server crashes. We can run 30 thousand tests in 40 seconds.

  • sideshowbarker: We discuss on Discord, and there is a channel where, every time we run the full WPT test suite after a commit, the results (pass/fail) get posted once the run finishes. You can see if you regressed something.

  • Nico: I'm in the process of writing a way to keep track of test passes, so you can see regressions. Since you aren't running all the tests, only certain tests: why not the whole suite?

  • Tim: I think it's a speed restriction. In CI we run with sanitizers, which takes a while. If we imported everything, we would run a lot of tests that would regularly fail (things we haven't implemented yet). We run the WPT tests on the next new Ladybird commit or WPT commit after the last run ends.

  • Andrew (works on Ladybird): When we first started looking at WPT two years ago, alexs added tons of test expectations to the repo, but we weren't stable enough for that information to be useful: between runs, hundreds of tests could suddenly pass or fail. There is a lot of work to figure out which subset of tests is relevant, and which failures are relevant to the PR.

  • Ian (works on WebKit): Could we learn from other runners of WPT which tests are generally flaky?

  • Tim: The main tool that we use to compare Ladybird runs to each other is wpt.fyi. I have sometimes used it to identify flaky tests. Some of the flakes we see also occur in other browsers, but the majority are our fault.

  • Mike: So you reimplement the server or the harness. When we import tests, we copy the test and the associated resources onto the file system.

  • Andrew: Our local test harness is just for our development. No web server; it reads from the file system.

  • Nico: On running tests from the filesystem: in our runner for Blitz, we subvert the network and read from the file system as an option for testing. The Servo project had a lot of flaky tests, so we have a flakiness tracker. We decide whether we want to record a test or not. Sometimes we change the expectation to allow more than one result. We can classify others as "flaky" tests, and when we produce WPT reports, those are reported as such.
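
    For reference, the wptrunner metadata format (the "Firefox expectation format" mentioned later in these notes) can record that more than one result is acceptable; the test name here is hypothetical:

    ```
    [flexbox-justify-001.html]
      expected: [PASS, TIMEOUT]
    ```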

  • Nico: My engine doesn't support JavaScript, so we run reftests, which compare screenshots. We detect what kind of test it is using a regex (I'd love a better way to do this), and replace a callback with my custom one.
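
    A rough sketch of the kind of regex-based classification described here. The patterns are illustrative assumptions (reftests declare a reference via a `<link rel="match">` element; testharness tests include `testharness.js`); WPT's own tooling instead derives the test type from the manifest:

    ```python
    import re

    # Illustrative patterns, not Blitz's actual ones.
    REFTEST_LINK = re.compile(r'<link[^>]+rel=["\'](?:mis)?match["\']', re.IGNORECASE)
    TESTHARNESS = re.compile(r'/resources/testharness\.js')

    def classify(source: str) -> str:
        """Guess the WPT test type from the raw source of a test file."""
        if REFTEST_LINK.search(source):
            return "reftest"       # rendered output compared against a reference page
        if TESTHARNESS.search(source):
            return "testharness"   # JavaScript-based test
        return "unknown"
    ```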

  • Andrew: A description of the design of the test harness: when we run the tests through the WPT runner, we use WebDriver for everything, with a web content process. Our local custom runner spins up mproc for each file we have. (Can Andrew link or fill this in?)

  • sideshowbarker: Let me elaborate on how things work: there was a time in WPT when we relied on manual tests for anything that involved user interaction with the page. Of course there are all kinds of tests we need to run to find regressions, and this is what WebDriver aims to achieve; we got WebDriver supported in WPT years ago. A lot of tests run through WebDriver method calls even if you don't notice it right away. It sends user input and simulates user interaction. This cannot be done in JavaScript (it's not exposed), so you need to do it externally. That's how it works nowadays.

  • Dave: How does that work with the web runner?

  • sideshowbarker: It doesn't; you cannot do it using the web runner. If you run it locally without some special handling, it doesn't work. It needs WebDriver.

  • Luke: Two points. First, there are many tests that rely on WebDriver behind the scenes; I went through some where you have <> for JavaScript because you need to test for this behavior. That's something I noticed where you need special setup to make things work. Second, WebKit's test runner relies on the order of execution, which isn't actually guaranteed, so we had flaky tests for things like Trusted Types: the tests would run in a slightly different order every time. Just to be sure, if you're writing a test runner, don't rely on the order of tests, or you'll have a lot of flakes to worry about.
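
    One way to surface such order dependencies early, sketched under the assumption that your runner controls test order: shuffle with a logged seed, so any order-dependent flake can be replayed deterministically (function name hypothetical):

    ```python
    import random

    def shuffled_order(tests: list[str], seed: int) -> list[str]:
        """Return tests in a randomized but reproducible order."""
        rng = random.Random(seed)   # log the seed so a failing order can be replayed
        shuffled = tests[:]
        rng.shuffle(shuffled)
        return shuffled
    ```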

  • Mike: I'm remembering aspects of that feature. Maybe people are running WPT tests fully automated/unattended? Sometimes you need manual intervention for old tests.

  • sideshowbarker: They can speak to that; in the general case there are people who don't even want to get to that point. There's a WPT test and they figure they can try to run it on a runner. Why are you interested in a web-based runner, Dave?

  • Dave: I have a project at work for the DuckDuckGo browser: we wanted to know how well some features work across all browsers, so we tried using the web runner. We just wanted to get a snapshot.

  • sideshowbarker: There's a lot of work that's gone into WPT to make the CLI a lot simpler, so it's easier to get up and running. It can automatically install dependencies, and it's well documented as well. For supported browsers you can just run `wpt run`; if you don't have the dependencies installed, it'll go ahead and install them for you. An important one is the driver binary for the various browsers, which handles interacting with the various driver APIs. This obviates the need for the web runner. It's not that the web runner isn't useful, but the CLI makes it easier to get up and running in no time.

  • Nico: Installing some of the WPT tooling is hard work, particularly anything related to Python, and WPT doesn't support the latest Python version. It doesn't tell you that that's why it isn't working. I had issues on macOS where the Python libraries have a C dependency. Room for improvement there :)

  • sideshowbarker: It's not a WPT issue; it's an issue with specific Python releases. We have some smart folks, Python experts, working on addressing this. We do have these issues though.

  • Leo (from Deno): We use JavaScript for everything, so why not use Node for the runner? Maybe we should do a partial rewrite in something that the people who use WPT actually use.

  • Ms2ger: I wouldn't underestimate the number of Python dependencies we require. Another issue would be convincing all browser vendors to run Node on their CI infra when Python already works. I don't think a rewrite from scratch is the most efficient way to fix our old issues.

  • Mike: For what it's worth, when switching to Python 3 we had the option to consider other languages, but people decided to stick with Python; we had explicit conversations about it.

  • Nico: With a similar aim but a different suggestion: could we work on specifying or documenting what the current runner does, for reference? I certainly have an interest in writing a new runner in Rust, which would have better throughput. I can run the Python runner in the iterative case.

  • Ms2ger: We should figure out how to help; there are parts which we might be able to expose better, but it's currently out of <>

  • Tim: There is a subcommand on the wpt tool to list tests. You can do it with the manifest.

  • sideshowbarker: The problem with making magic, which is the issue with WPT, is that it's a project with few resources; James and Mike single-handedly did most of the work. Compared to where we were years ago: there used to be browser projects which used PHP to process test results, and someone still uses Perl (the binding generator code is in Perl). What I mean to say is, it could be worse. The other thing is that it has progressed organically. We never assumed Python would be the perfect solution, and the Python HTTP server that we use isn't designed for what we use it for. We've definitely overstretched the limits of the tools we used, but this isn't out of incompetence. Wouldn't it be great if we made it better? Sure, but that's still substantial work. The thing that helped us is that there's a team at Google, and we have to thank them for wpt.fyi; that stuff is a whole different set of magic we should be thankful for. How do we document this better for new folks so they don't waste their time? A lot of our stuff isn't very discoverable because there are a ton of obscure features. We should do an intro to WPT. All of us contributing to browser projects already suffered through this, but we forget about the new people who come in after us.

  • Oriol: I'm working on Servo; we are trying to cover various topics that have been mentioned. We import new tests every week, but we do not run all of them: we have a file where we decide, at the level of a directory, what we run (at first everything is disabled). When we change tests in a PR, we export them. We run all the enabled tests on every PR when we want to merge, split across 20 runners, sometimes on GitHub-hosted runners (very long) and sometimes on self-hosted runners (only 30 minutes). We use the Firefox expectation format, and we have expectations for subtests; if there is no expectation, the test should pass. We attempt three times to match the expectation (sketched below), and we sometimes mark tests as flaky. We could just run tests we know pass, but if you only run tests that previously passed, you might miss tests that suddenly start passing and could then regress later. For features you haven't implemented, it makes sense to skip. As for the runner: in the WPT runner we have an executor for Servo, and it is upstream in WPT. There are also executors for other browsers.
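
    A sketch of the retry policy just described, under the assumption of three attempts; the verdict names and result strings are hypothetical:

    ```python
    from typing import Callable

    def check_with_retries(run: Callable[[], str], expected: str, attempts: int = 3) -> str:
        """Re-run a test until it matches its expectation, up to `attempts` times."""
        results = []
        for _ in range(attempts):
            result = run()
            results.append(result)
            if result == expected:
                # Matched eventually: flaky if earlier attempts disagreed.
                return "EXPECTED" if len(results) == 1 else "FLAKY"
        return "UNEXPECTED"  # no attempt matched the recorded expectation
    ```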

  • sideshowbarker: One thing that Tim alluded to: Ladybird isn't yet at a point of maturity, so a ton of things fail, which isn't as big a problem as things timing out. If you want things to run within a reasonable amount of time, the more timeouts you have, the more time you end up wasting, and there are regressions. We had an incident where this caused some problems. Tim, could you say a bit more about timeout mitigation?

  • Tim: I haven't done anything specific about timeouts, just fixed individual ones. It would be great to reduce the impact of timeouts; I'm not sure there is anything to do other than fixing or skipping?

  • sideshowbarker: Did you do something like reducing the default timeout to help with this issue?

  • Tim: Maybe... I did in the past, but not right now.

  • sideshowbarker: For clarification: you talked about constraints, and in context you meant funding. Can you talk about what we're running under GitHub Actions and what it costs us?

  • Tim: I'm not sure about the cost of our CI. As for our self-hosted infrastructure: less than 100 a month for that.

  • Andrew: We ran all of the Linux CI on the default GitHub Actions runners for open-source projects, and it was free. But the macOS runners were too slow, so we bought a Mac mini. A Y Combinator startup called Blacksmith asked if we wanted to put our jobs on their runners; they are 2x faster. We are barely paying for CI. Obviously it doesn't scale; it's just working for now.

  • sideshowbarker: At the moment it works only on Linux and selected systems; we run into people with fairly esoteric distributions who would like some specific advice. It runs on most distros and on macOS, and you can run it on Windows, but that's basically Linux because of WSL2. There are people who want us to make it work natively; some contribute patches for CMake to build it natively on Windows, but we're making a clear statement that Windows is not supported. There's an Android build, but it's not supported either. That way we don't have to deal with the problem of additionally running Windows automation in CI.

  • Nico: I was going to make some more comments about Servo. I don't think we run WPT on self-hosted bots, because we run 20 GitHub Actions jobs in parallel, and it's hard to beat the throughput of that. It's free! Our self-hosted runners are in our own VM, and they're 5 times faster than GitHub Actions.

  • sideshowbarker: Another concrete thing I can talk about: recently I worked on supporting accessible names in Ladybird. The accessible name isn't exposed via JavaScript; you need some external way, via WebDriver, to get it, and that works as long as you're running the WPT infra. I want to import those tests from the WPT test suite into the Ladybird runner, but I cannot run them as-is because the accessible name method is not exposed. I would have to rewrite the whole thing using the window to be able to run it in the Ladybird test runners. It's extra work and frustration, and ideally we wouldn't have to do that.

  • Tim: Ideally, what we would do for Ladybird is extend our internals mechanism to expose an internals object on the window, expose all the relevant WebDriver endpoints on that object, and use that to call them from JavaScript; then we wouldn't need WebDriver running.

  • sideshowbarker: I can say, from contributing to other projects too: there must be a better way.

  • Nico: Is there anyone in the room from Safari or Firefox? Regarding running these tests, how do you handle it?

  • Ms2ger: For Firefox, it's a lot of compute, running on several machines.

  • sideshowbarker: Is this funded through just normal Mozilla funds? Is there some sponsorship?

  • Ms2ger: As far as I know, Mozilla pays for it for now.

  • Tim: Adding to this point on the main performance limitation: there's an option in the WPT runner to run a specific number of processes from a single runner. In theory it would be able to run 32 processes at the same time, but we've found that this approach doesn't scale beyond 8, and if you want to make things faster you have to <>, and trying to do it on a single machine is difficult and messy. So when we want to run the tests fast locally, it's very difficult. We have ways to alleviate that when it comes to chunks, but I don't know what the correct answer is.
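
    For illustration, a minimal round-robin chunking scheme of the kind the CI setups above use to spread tests over N parallel runners; the names are hypothetical:

    ```python
    def shard(tests: list[str], index: int, total: int) -> list[str]:
        """Return the slice of tests that runner `index` (0-based) of `total` should run."""
        return [t for i, t in enumerate(tests) if i % total == index]

    # e.g. the third of 20 parallel runners executes shard(all_tests, 2, 20)
    ```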

  • Nico: Perhaps we can rewrite some stuff in Rust; let's catch up offline.

  • sideshowbarker: We have an active Discord for the Ladybird project, with a testing channel for this kind of discussion. We sometimes want to hear from other projects.

  • Other venues are the WPT Matrix room and the WebDriver IRC channel.
