diff --git a/rfcs/audio-graph.md b/rfcs/audio-graph.md
new file mode 100644
index 00000000..69dff616
--- /dev/null
+++ b/rfcs/audio-graph.md
@@ -0,0 +1,385 @@

# Feature Name: `bevy_audio_graph`

## Summary

A more advanced, graph-based audio system for Bevy, leveraging relationships to define the graph, a
separate schedule to build and drive the audio, and the Web Audio API to create parity between
desktop and web applications.

## Motivation

Right now, audio in Bevy is extremely simple. You add an `AudioPlayer` somewhere in your application,
which contains a handle to an encoded audio file, and the audio file is decoded on the fly and sent
directly to the master mix bus. This works for simple cases, but it has a lot of issues, some of which
are enumerated below.

#### Clipping

Overlapping audio, or audio played too loudly, has the potential to clip the master bus. This is trivial
to solve if it is possible to add effects to the master bus, as you can simply add a
[limiter](https://en.wikipedia.org/wiki/Limiter). In Bevy, however, there is currently no way to do
this.

#### No control over audio

Once a sound has been queued for playback, you lose control over it. You can stop the sound by deleting
the `AudioPlayer` component, but there is no way to change the playback speed, pause and resume,
seek to a new position, et cetera.

#### No concept of channels

It is often very useful to set a maximum number of sounds of a certain type that can play at the same
time, or a cooldown on sounds of a certain type. Maybe you have a lot of enemies and you want to prevent
them from all talking over one another, maybe you want story-relevant dialogue to suppress random
chatter, or maybe you just want to cap the number of explosion sound effects playing at once. Unlike some
other aspects of the system discussed here, this one _can_ be implemented on top of Bevy's existing
system, but it would be quite nice for this functionality to be built in.

#### No audio metadata

The most common kind of metadata attached to audio is subtitles. While you can add subtitles to a Bevy
game by triggering them to start at the same time as the audio, this is usually something that you want
built into the audio system itself. That way, subtitles can be affected by the speed of the playing audio
(even when it changes), and they can reuse some of the same information used by a channel system -
e.g. importance.

#### Sample-accurate syncing is impossible

A very important aspect of modern games is dynamic soundtracks. These are built from multiple "layers"
which can be affected independently while staying in sync. As audio sources have no concept of one
another's current playback progress and no metadata is available - particularly timing data - this is
essentially impossible to do right now.

#### No effects

There are somewhat nasty hacks that can be used to add effects to a single sound, but it is rarely the
case that you want a sound to have its own effects. You usually want multiple sounds to be mixed
together, and only then have effects applied. This should not be a strict hierarchy, but a graph. If you
want to modulate the volume of a sound independently of how much reverb is applied to it (a very common
operation), there is no way to do that with just a series of effects unless every sound has its own
reverb node - which is very expensive.
In my opinion, the most important effect to have is a low-latency limiter on the master bus. This
prevents the audio from clipping if too many sounds are playing simultaneously.

## User-facing explanation

Explain the proposal as if it was already included in the engine and you were teaching it to another Bevy user. That generally means:

- Introducing new named concepts.
- Explaining the feature, ideally through simple examples of solutions to concrete problems.
- Explaining how Bevy users should *think* about the feature, and how it should impact the way they use Bevy. It should explain the impact as concretely as possible.
- If applicable, provide sample error messages, deprecation warnings, or migration guidance.
- If applicable, explain how this feature compares to similar existing features, and in what situations the user would use each one.

## Implementation strategy

This RFC contains two parts: the API that the user will interact with, and the internal implementation
details. As there are many ways to potentially handle the latter, I will give an example of a possible
API design first.

#### API design

The audio graph should be handled using entity relationships, and it should be rebuilt into a more
efficient form when extracted to the audio world. An example set of components is given below, roughly
in order of importance.

```rust
/// Mark that an `AudioSource` should have its audio sent to an `AudioSink`. Any intermediate processing
/// is done by an entity with both an `AudioSource` _and_ an `AudioSink` component, with the sink being
/// used for input and the source being used for output.
///
/// For the purposes of backwards compatibility and simplicity, it is probably desirable to have a
/// concept of a "global sink", where an `AudioSource` that has no outbound connections is automatically
/// connected to a singleton sink which will output to the system's audio.
#[derive(Component, Relationship)]
struct AudioTo {
}

/// Marks that an `AudioSink` + `AudioSource` effect node belongs to an audio effect chain. This is not
/// necessary for building the audio graph, but it _is_ useful for conceptually grouping nodes into a
/// single chain. It could be that this is more effectively represented with the `Parent` relationship,
/// but that probably interferes with spatialization.
///
/// The concept is that an `AudioChain` can be treated as an `AudioSink` + `AudioSource`, with the "sink"
/// and "source" behavior differing depending on whether a node is in that chain or not. If a node
/// that is marked as "in that chain" tries to get the chain as a source, it will read from the audio
/// sent to the `AudioSink`, but if a node outside the chain tries to get that chain as a source, it will
/// read the _output_ of the chain, after FX processing. This allows the effects in a chain to be
/// modified without updating any audio components that are sending audio to/from that chain, and it
/// also allows reading all the effects that are in a single conceptual chain.
///
/// > NOTE: While it may be useful for components to send audio to nodes that aren't the chain root or
/// other FX within a chain, I think for now it's probably best to deny that. This allows us to
/// potentially cache the graph for a chain as a single unit, which gives us more leeway to do
/// expensive graph optimizations when building - for example, JIT compiling multiple effects into
/// a single node.
/// While those optimizations are well out of scope for now, I think it's best to
/// have more limitations now - they can always be circumvented by developers by splitting a chain
/// up into multiple pieces.
#[derive(Component, Relationship)]
struct InChain {
    // Marker(?)
}

/// Marker component for a DSP processing chain. This makes it easier to add helper methods for creating
/// and modifying FX chains.
#[derive(Component)]
#[require(AudioSink, AudioSource)]
struct AudioChain {
    // Marker(?)
}

/// A component marking that an entity can have audio sent to it.
#[derive(Component)]
struct AudioSink {
    // Marker(?)
}

/// A component marking that an entity can have audio sent from it.
#[derive(Component)]
struct AudioSource {
    // Marker(?)
}

/// Exactly the same as the component as it exists in the current version of Bevy. I believe that Bevy
/// already has the limitation that sources need to just be a buffer of audio data (and cannot be
/// arbitrary iterators of samples) - at least for sounds played with the default `AudioPlugin`. This
/// makes the migration path much simpler.
#[derive(Component)]
#[require(AudioSource, AudioPlayback)]
struct AudioPlayer<Source>(pub Handle<Source>)
where
    Source: Asset + Decodable;

/// A component that tells the audio graph to handle the "ear location" of a sink. Exact method of
/// spatialization TBD, as it is probably useful to have something slightly more configurable than just
/// modulating the gain - developers may want to modify the graph routing to bucket sources into
/// different distance levels that send the audio to progressively more aggressive reverb
/// configurations, they may want to modulate a filter based on distance, etc.
#[derive(Component)]
#[require(AudioSink)]
struct SpatialSink {
}

/// A marker component that tells the audio graph to calculate the distance from a `SpatialSink` that
/// this component is routed to.
#[derive(Component)]
#[require(AudioSource)]
struct SpatialSource {
}

// Effects should, ideally, _not_ be implemented fully generically as DSP processing nodes. They should
// mirror the Web Audio API, and so have a set of core components like a convolver, biquad filter, delay,
// etc. that all high-level effects are built in terms of using required components. It may be useful in
// the future to have a lower-level API that allows writing DSP nodes by hand, but for now this should
// not be implemented, to reduce the amount that needs to be designed.
mod effects {
    #[derive(Component)]
    #[require(AudioSource, AudioSink)]
    struct ReverbNode {
        // ..fields..
    }

    #[derive(Component)]
    #[require(AudioSource, AudioSink)]
    struct DelayNode {
        // ..fields..
    }

    #[derive(Component)]
    #[require(AudioSource, AudioSink)]
    struct LimiterNode {
        // ..fields..
    }

    // etc
}
```
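To give a feel for how these components might compose, here is a purely illustrative sketch of routing a
spatialized sound through a shared reverb chain. The constructors are assumptions - the RFC leaves the
relationship payloads and marker fields open - so treat `AudioTo(entity)`, `InChain(entity)`, and the
`Default` impls as placeholders rather than proposed API.

```rust
// Hypothetical usage of the components sketched above. The `AudioTo(Entity)` / `InChain(Entity)`
// constructors and the `Default` impls are placeholders, not part of this RFC's API surface.
fn setup_audio(mut commands: Commands, assets: Res<AssetServer>) {
    // A shared effect chain: anything routed here is mixed first, then run through the reverb.
    let reverb_chain = commands.spawn(AudioChain::default()).id();
    commands.spawn((effects::ReverbNode::default(), InChain(reverb_chain)));

    // Music plays "dry": it has no outbound `AudioTo`, so it falls through to the global sink.
    commands.spawn(AudioPlayer(assets.load("music.ogg")));

    // A spatialized one-shot routed into the reverb chain. The chain itself also has no outbound
    // connection, so its processed output ends up at the global sink as well.
    commands.spawn((
        AudioPlayer(assets.load("footstep.ogg")),
        SpatialSource::default(),
        AudioTo(reverb_chain),
    ));
}
```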
#### Internal design

Notably, this system requires moving away from Rodio. This is unfortunate, but necessary. Rodio simply
does not support the kind of processing that we want - particularly in a web context. The
`web-audio-api` crate (which provides a pure-Rust implementation of Web Audio) is probably the best
option to migrate to, as it allows us to provide a single interface that works for desktop applications
but is guaranteed to map to the audio system supported on web platforms. For the purpose of backwards
compatibility, it may be desirable to still use the `Decodable` trait for `AudioPlayer`, but this should
be the only aspect of `rodio` that remains.

The Web Audio API (and thus the `web-audio-api` crate) has a built-in way to handle graph construction,
which I propose we use directly. It may be useful in the future to coalesce chains of effects into
something more efficient, but as this is a design for the audio engine of a game and not a piece of
audio editing software, I do not expect effect chains to be particularly long or complex, and I think
that we should prioritize limiting the amount that needs to be implemented in Bevy itself. I believe
that `web-audio-api` does _not_ automatically use the actual Web Audio API when compiled for browsers,
so a wrapper will need to be created that abstracts over the pure-Rust implementation and the web
implementation.

While not strictly necessary, I believe it would make the most sense to implement the audio processing
in its own schedule, with its own extraction step. This decouples audio from the rest of the engine,
which gives us more control over the transformation from the form of the audio graph in the ECS to the
internal "compiled" form built using the Web Audio API. It also means that we can prevent the audio
thread from stalling if the main thread is overloaded, and it could potentially allow developers to
specify that they want audio extracted multiple times per frame in order to reduce input-to-sound
latency.
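For concreteness, here is a minimal sketch of the kind of lowering the extraction step would perform,
written directly against `web-audio-api`. It assumes the crate's spec-mirroring API (for example
`AudioContext::default`, `create_gain`, `create_dynamics_compressor`, `decode_audio_data_sync`, and
`connect`) and is illustrative only, not the proposed integration.

```rust
// Rough sketch: a "master bus" gain node feeding a compressor (standing in for a limiter),
// plus one decoded source. In Bevy, this wiring would be derived from `AudioTo` relationships
// during extraction rather than written by hand.
use web_audio_api::context::{AudioContext, BaseAudioContext};
use web_audio_api::node::{AudioNode, AudioScheduledSourceNode};

fn main() -> std::io::Result<()> {
    let context = AudioContext::default();

    // Master bus -> crude limiter -> hardware output.
    let master = context.create_gain();
    let limiter = context.create_dynamics_compressor();
    master.connect(&limiter);
    limiter.connect(&context.destination());

    // Decode a file and route it into the master bus.
    let buffer = context
        .decode_audio_data_sync(std::fs::File::open("assets/music.ogg")?)
        .expect("failed to decode audio");
    let mut source = context.create_buffer_source();
    source.set_buffer(buffer);
    source.connect(&master);
    source.start();

    // Keep the process alive while audio renders on its own thread.
    std::thread::sleep(std::time::Duration::from_secs(5));
    Ok(())
}
```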
## Drawbacks

This design introduces additional complexity into Bevy's audio system, as well as a new crate to rely
on. Conceptually, a graph-based system is more difficult for developers to comprehend than a more
limited one, although this can easily be hidden away when a user does not need that level of control.

## Rationale and alternatives

- This design would allow external crates to introduce a multitude of new features into the audio
  system.
- While I do not propose a precise method of implementing talkback in this RFC, it lays the groundwork
  for adding things like debugging gizmos for sets of sounds - you can send all the sounds you want to
  debug to a single `AudioSink`, which then sends the information needed to draw a debugging interface
  back to the main schedule via some talkback system.
- While an audio graph is conceptually complex when you need its full power, it is trivial to build
  simpler systems on top of. A user who is happy with the existing system will never know it exists -
  they will simply add an `AudioPlayer` just as before, which will be automatically connected to a
  singleton global sink.
- An alternative would be to use Kira, as already implemented in `bevy-kira-audio`. While I believe that
  Kira is an interesting project, in my opinion it unfortunately does not provide enough benefits over
  Bevy's existing system to justify the switch.

## Unresolved questions

- Is `web-audio-api` production-ready? Should we only expose a limited subset in order to reduce the
  possible unpolished corners of this library?
- How do we ensure that this doesn't break external libraries that interact with Bevy's audio system but
  do not fully replace it?
  - I believe that Bevy's existing audio system is currently limited enough that this change is unlikely
    to break too many things.
- Do we want to expose a Web Audio-like API surface, or do we want something more low-level, even if it
  significantly reduces audio performance when compiling for web?

## Future possibilities

#### Talkback

```rust
/// Sent to the main schedule when a buffer has been received.
///
/// In the future, it may be useful to parameterize this by buffer type, as developers may want to
/// do some custom processing in the audio schedule that results in a type that is not an audio
/// buffer. For example, they may want to compress the buffer in order to send it over the network.
#[derive(Event)]
struct AudioBufferReceived {
    pub from: Entity,
    pub buffer: Buffer,
}

#[derive(Component)]
#[require(AudioSink)]
struct Talkback {
    // .. some talkback configuration options, such as how often a buffer is sent ..
}
```

#### Playback control

```rust
/// Control the playback of an `AudioPlayer`.
#[derive(Component)]
struct AudioPlayback {
    /// Set the playback speed, with the default being 1. Future versions could add things like
    /// pitch-independent retiming rather than just slowing down playback (which affects pitch).
    pub speed: f32,
    // .. fields ..
}

#[derive(Event)]
struct Play;

#[derive(Event)]
struct Pause;

enum AudioTime {
    /// This needs to be a `chrono::Duration` to handle negative time.
    Time(chrono::Duration),
    /// Number of samples.
    Sample(u64),
    /// A fraction of the total time of the sound.
    Fraction(f32),
}

enum PlaybackPoint {
    /// Set playback to a specific time from the current playback point of the sound.
    FromCurrent(chrono::Duration),
    /// Set playback to a specific time from the start of the sound.
    FromStart(chrono::Duration),
}

#[derive(Event)]
struct Skip {
    pub new_playback_point: PlaybackPoint,
}
```

#### Source prioritization

As mentioned in the Motivation, it can be useful to limit the number of inputs to a sink. This can be
done on a simple first-come-first-served basis, or we could have some system that allows users to
control importance both per-source and based on some set of properties of that source.

```rust
#[derive(Component)]
struct AudioImportance {
    pub importance: f32,
}

/// Control how to handle sounds that have been rejected because their sink is already at its
/// maximum number of inputs.
#[derive(Default, Component)]
enum IgnoreBehavior {
    /// Delete the sound immediately.
    #[default]
    Drop,
    /// Queue the sound to be played, optionally timing out after a certain period. This can be
    /// useful for UI sounds.
    Queue {
        timeout: Option<Duration>,
    },
    /// Play the sound, but ignore its output until a slot for it is found.
    Mute,
}

/// A component used for limiting the number of inputs to a node.
#[derive(Component)]
#[require(IgnoreBehavior)]
struct MaxInputFilter {
    /// The maximum number of inputs allowed to this node at any one time.
    pub max_inputs: usize,
}

/// Multiply importance by a value based on the peak amplitude of a source.
#[derive(Component)]
#[require(MaxInputFilter)]
struct AmplitudeFilter {
    pub loudness_map: Box<dyn Fn(f32) -> f32 + Send + Sync>,
}

/// Multiply importance by a value based on the perceptual loudness (LUFS) of a source.
#[derive(Component)]
#[require(MaxInputFilter)]
struct LoudnessFilter {
    pub loudness_map: Box<dyn Fn(f32) -> f32 + Send + Sync>,
}

/// Multiply importance by a value based on the distance that a source is spatially from a listener.
/// Note that this is _not_ the same as modulating the gain of a source based on distance - it is
/// for ignoring sources that are too far away when many sounds are playing at once.
#[derive(Component)]
#[require(MaxInputFilter, SpatialSink)]
struct DistanceFilter {
    pub distance_map: Box<dyn Fn(f32) -> f32 + Send + Sync>,
}
```
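As a hypothetical usage sketch (the `SfxBuses` resource, the `AudioTo(Entity)` constructor, and the
`Default`/struct-literal construction of the outlined components are all assumptions for illustration),
capping the number of concurrent explosion sounds might look like this:

```rust
// Illustrative only: an explosion "bus" that allows at most four simultaneous inputs and drops
// any further sounds outright. The exact constructors are placeholders, not proposed API.
#[derive(Resource)]
struct SfxBuses {
    explosions: Entity,
}

fn setup_buses(mut commands: Commands) {
    // The bus has no outbound `AudioTo`, so its mixed output falls through to the global sink.
    let explosions = commands
        .spawn((
            AudioChain::default(),
            MaxInputFilter { max_inputs: 4 },
            IgnoreBehavior::Drop,
        ))
        .id();
    commands.insert_resource(SfxBuses { explosions });
}

fn play_explosion(mut commands: Commands, assets: Res<AssetServer>, buses: Res<SfxBuses>) {
    commands.spawn((
        AudioPlayer(assets.load("explosion.ogg")),
        AudioImportance { importance: 2.0 },
        AudioTo(buses.explosions),
    ));
}
```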
#### Custom DSP

Another useful possibility is custom audio processing. This RFC only proposes a set of hard-coded
effects based on what Web Audio provides. Those are enough to implement a huge number of different
effects, as Web Audio's primitives are very generic, but users may want to implement custom audio
processing units. In my opinion, this would probably be best handled by a system of traits, similar to
how the render graph is implemented, but the precise design is out of scope for this RFC.
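For illustration only, a very rough shape such a trait system could take - the names and signatures here
are placeholders rather than a proposal:

```rust
// Hypothetical: a user-implementable DSP trait, analogous in spirit to custom render graph nodes.
trait DspNode: Send + 'static {
    /// Process one interleaved block of samples in place.
    fn process(&mut self, buffer: &mut [f32], channels: usize, sample_rate: f32);
}

struct BitCrusher {
    levels: f32,
}

impl DspNode for BitCrusher {
    fn process(&mut self, buffer: &mut [f32], _channels: usize, _sample_rate: f32) {
        for sample in buffer {
            // Quantize each sample to a fixed number of levels.
            *sample = (*sample * self.levels).round() / self.levels;
        }
    }
}
```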