Debugging: build an async debugger API on top of `run_concurrent` #11896

In designing our guest debugger functionality, we would like to balance a few requirements:
- It should be relatively straightforward to compose debugging functionality on top of an existing embedder / "main Wasm invocation"; i.e., it should not require deep surgery or awkward refactors, and it should not work only in the Wasmtime CLI.
- It should be possible to give the debugger access to the `Store`, including mutable access. This is needed for eventual "mutable debugger commands" (e.g., updating locals' values) but also even for any access to GC objects (because of the root set).
- The whole `Store` should pause when the debugger has control, so it can observe state without racing with other tasks.
- The debugger needs to be able to run with access to IO, which in a Rust context almost always means it needs to have async entry points.

We have plans to place the debugger implementation mostly inside a Wasm component, which gives us a little more flexibility to have an "ugly API" underneath, but even still, the closer we get the native host API and paradigm to what the eventual Wasm API describes, the less painful and error-prone the glue will be.

All of these requirements generally push toward a "coroutine"-style async design. In our [RFC](https://github.com/bytecodealliance/rfcs/blob/main/accepted/wasmtime-debugging-v2.md) and in a draft PR (#11826), we have sketched out a general approach to a debug API that contains a "debugger" and a "debuggee" as two entities that bounce control back and forth. This is naturally rendered in Rust as an async API that yields a stream of "debug events": the debuggee is stopped whenever an event is received, and runs whenever the debugger is polling for the next event. Such an API allows for a nice debugger implementation style: the debugger can keep its main loop in one place and access the store directly while the debuggee is paused.
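To make the control flow concrete, here is a minimal synchronous sketch of the debugger/debuggee ping-pong. Everything in it (`ToyStore`, `ToyDebuggee`, `DebugEvent`, `next_event`, `run_debugger`) is a hypothetical stand-in, not the API from the RFC or the draft PR; the real thing would be an `async fn` and would run actual Wasm rather than a counter.

```rust
/// Hypothetical stand-in for a Wasmtime `Store`; it only holds one "local".
#[derive(Debug)]
struct ToyStore {
    local0: i32,
}

/// Hypothetical debug events; the real event set is not reproduced here.
#[derive(Debug, PartialEq)]
enum DebugEvent {
    Breakpoint { pc: usize },
    Finished { result: i32 },
}

/// A toy debuggee: a resumable "program" that increments `local0` once per
/// step and reports a breakpoint after every step.
struct ToyDebuggee {
    pc: usize,
    steps: usize,
}

impl ToyDebuggee {
    /// Run the debuggee until its next event. The debuggee holds the store
    /// borrow only for the duration of this call.
    fn next_event(&mut self, store: &mut ToyStore) -> DebugEvent {
        if self.pc == self.steps {
            return DebugEvent::Finished { result: store.local0 };
        }
        store.local0 += 1;
        self.pc += 1;
        DebugEvent::Breakpoint { pc: self.pc }
    }
}

/// The debugger keeps its main loop in one place and touches the store only
/// between events, i.e. while the debuggee is paused.
fn run_debugger(store: &mut ToyStore, debuggee: &mut ToyDebuggee) -> Vec<DebugEvent> {
    let mut log = Vec::new();
    loop {
        let event = debuggee.next_event(store);
        let done = matches!(event, DebugEvent::Finished { .. });
        if let DebugEvent::Breakpoint { pc: 2 } = event {
            // A "mutable debugger command": overwrite a local while paused.
            store.local0 = 100;
        }
        log.push(event);
        if done {
            return log;
        }
    }
}
```

Calling `run_debugger` with a fresh `ToyStore { local0: 0 }` and `ToyDebuggee { pc: 0, steps: 3 }` yields three breakpoints followed by a `Finished` event whose result reflects the debugger's write at the second breakpoint.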

Unfortunately, after a number of conversations, we have determined that this is not sound as implemented in that draft PR. The PR "teleports" a borrow of the `Store` outward from an async yield point, where it performs a fiber yield, back to a `DebugSession` (wrapping the store) on which an `async fn next()` was invoked to get the next debug event. The idea was that the `next()` invocation exclusively owns the store while we pass control back to the guest, and when it returns, we can hand ownership of the store back to the debugger. This is more-or-less like passing a mutable reborrow of the store to a hostcall, except that we plumb it back out to the surface. We could even get the provenance right by passing (via a raw pointer) the reborrow outward. However...

`Future` combinators and dropped futures are a thing, and there is a bad case with a "host code sandwich". Consider: the debugger context calls Wasm, which calls async host code, which calls Wasm again. In the second Wasm activation, we hit a debug event. We could yield all the way back up to the debugger and pass a reborrowed `Store`; but that yield control flow passes through the async host code, by way of a `Poll::Pending`. That async host code may implement some arbitrary future combinator that chooses to (for example) drop the future, in which case we have a dangling reference to the store and to the rest of the debug state we were supposed to examine (e.g., stack frames). One could try to patch this up by holding fibers via reference counting and keeping the fibers alive when paused for debugging; but at that point, we have discovered that...
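The hazard can be reproduced in miniature with only the standard library. In the sketch below, `PausedActivation` stands in for the inner Wasm activation suspended at a debug event, and `GiveUpAfterOnePoll` stands in for async host code that behaves like a timeout or select; both names, and the helper `sandwich_drops_paused_activation`, are made up for illustration. Nothing here is unsafe, so the sketch only shows that the paused state is destroyed while an outer observer would still expect to inspect it.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stands in for the inner Wasm activation "paused at a debug event": it
/// always returns `Pending`, and records when it is dropped.
struct PausedActivation {
    dropped: Rc<Cell<bool>>,
}

impl Future for PausedActivation {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

impl Drop for PausedActivation {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

/// Stands in for the async host code in the middle of the sandwich: it polls
/// the inner future once and, if it is not ready, drops it and moves on.
struct GiveUpAfterOnePoll<F> {
    inner: Option<Pin<Box<F>>>,
}

impl<F: Future<Output = ()>> Future for GiveUpAfterOnePoll<F> {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let pending = match self.inner.as_mut() {
            Some(inner) => inner.as_mut().poll(cx).is_pending(),
            None => false,
        };
        if pending {
            // Dropping the inner future destroys the "paused" activation.
            self.inner = None;
            Poll::Ready("gave up")
        } else {
            Poll::Ready("inner finished")
        }
    }
}

/// A waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll the host-code future once and report (outcome, was the paused
/// activation dropped?).
fn sandwich_drops_paused_activation() -> (&'static str, bool) {
    let dropped = Rc::new(Cell::new(false));
    let inner = PausedActivation { dropped: Rc::clone(&dropped) };
    let mut host = GiveUpAfterOnePoll { inner: Some(Box::pin(inner)) };

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let outcome = match Pin::new(&mut host).poll(&mut cx) {
        Poll::Ready(msg) => msg,
        Poll::Pending => "still pending",
    };
    (outcome, dropped.get())
}
```

In the draft PR's design, a debugger higher up the stack would at this point be holding a raw-pointer reborrow into state owned by the dropped future.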

... we are reinventing a bunch of mechanisms in the component-model async implementation. In particular, (i) the `Accessor` mechanism allows for ownership passing of the `Store` (timeslicing such that access only exists during one poll, with no borrows persisted across suspends) in a way that is already vetted; and (ii) the task model gives us a first-class way to note that a stack is paused for debugging, and to keep it alive. (I'm less sure about the details of (ii), but in principle, the concurrent scheduler is a tiny OS kernel and we can build the moral equivalent of `ptrace` pauses there, I think.)
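The timeslicing idea in (i) can be sketched as closure-scoped access: a handle that never hands out a borrow outliving a single call. `ToyAccessor` and `ToyStore` below are hypothetical and only loosely modeled on that style; the real `Accessor` type and its signatures are not reproduced here.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a `Store`.
struct ToyStore {
    counter: u32,
}

/// Toy handle that grants store access only for the duration of a closure.
/// Because the closure is generic over the borrow's lifetime, its return
/// value `R` cannot contain a reference into the store, so no borrow can
/// survive past `with` (and hence past a suspend point).
#[derive(Clone)]
struct ToyAccessor {
    store: Rc<RefCell<ToyStore>>,
}

impl ToyAccessor {
    fn with<R>(&self, f: impl FnOnce(&mut ToyStore) -> R) -> R {
        let mut guard = self.store.borrow_mut();
        f(&mut *guard)
    }
}

/// Two logical tasks (guest and debugger) take turns; each access is a
/// short time slice. Returns (value the debugger saw, final value).
fn interleaved_access() -> (u32, u32) {
    let accessor = ToyAccessor {
        store: Rc::new(RefCell::new(ToyStore { counter: 0 })),
    };
    let guest = accessor.clone();
    let debugger = accessor.clone();

    guest.with(|s| s.counter += 1); // guest runs, then "suspends"
    let seen = debugger.with(|s| {
        s.counter *= 10; // debugger mutates state while the guest is paused
        s.counter
    });
    guest.with(|s| s.counter += 1); // guest resumes with a fresh borrow
    let last = accessor.with(|s| s.counter);
    (seen, last)
}
```

The point of the shape is that a dropped future cannot leave a dangling store reference behind, because no task ever holds one between polls.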

Given all that, the eventual plan is something like:
- Build a mechanism to set up a concurrent environment with an async debugger that receives a stream of events and can access the store via `Accessor`. The debugger itself needs to be within the context of the `run_concurrent` invocation, but separate/privileged: all tasks except for the debugger pause when the debugger has control.
- Update any point in the Wasmtime runtime that needs to yield a debug event to use the "new-style" async mechanisms, i.e. `Accessor`, to safely give control of the `Store` back to the debugger.
- Move the "top half" of the debugger, which we plan to build temporarily on #11895, over to the new mechanism, and then remove the API from #11895.
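As a rough illustration of the first bullet, here is a toy single-threaded scheduler in which a debug event hands control to a privileged debugger callback, and no other task is stepped until that callback returns. All names (`ToyScheduler`, `ToyTask`, `Step`, `demo_trace`) are hypothetical; this is not how `run_concurrent` is implemented, only a sketch of where the pause could live.

```rust
/// One unit of progress in a toy task's script.
#[derive(Clone, Copy)]
enum Step {
    Work,
    DebugEvent,
}

/// A toy task: a fixed script plus a program counter.
struct ToyTask {
    name: &'static str,
    script: Vec<Step>,
    pc: usize,
}

/// A toy round-robin scheduler that records what happened in `trace`.
struct ToyScheduler {
    tasks: Vec<ToyTask>,
    trace: Vec<String>,
}

impl ToyScheduler {
    /// Step every task round-robin. On a debug event, the debugger runs to
    /// completion with exclusive access before any other task is stepped.
    fn run(&mut self, mut debugger: impl FnMut(&mut ToyTask, &mut Vec<String>)) {
        loop {
            let mut progressed = false;
            for task in self.tasks.iter_mut() {
                let Some(&step) = task.script.get(task.pc) else {
                    continue; // this task has finished its script
                };
                task.pc += 1;
                progressed = true;
                match step {
                    Step::Work => self.trace.push(format!("{} works", task.name)),
                    Step::DebugEvent => {
                        self.trace.push(format!("{} pauses", task.name));
                        debugger(task, &mut self.trace);
                    }
                }
            }
            if !progressed {
                return;
            }
        }
    }
}

/// Run two tasks, one of which hits a debug event, and return the trace.
fn demo_trace() -> Vec<String> {
    let mut sched = ToyScheduler {
        tasks: vec![
            ToyTask { name: "a", script: vec![Step::Work, Step::DebugEvent, Step::Work], pc: 0 },
            ToyTask { name: "b", script: vec![Step::Work, Step::Work], pc: 0 },
        ],
        trace: Vec::new(),
    };
    sched.run(|task, trace| {
        trace.push(format!("debugger inspects {} at pc {}", task.name, task.pc));
    });
    sched.trace
}
```

In the resulting trace, the debugger's entry always sits directly after the pause of the task that raised the event, before any other task makes progress.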
