2025 Revisiting in‐band text tracks in MediaSource Extensions

WEH 2025 - Revisiting in-band text tracks in MediaSource Extensions

GitHub issue: https://github.com/Igalia/webengineshackfest/issues/59
URL: https://meet.jit.si/WEH2025-MSE

Audience probing and rounds of introductions: most of the audience works primarily with WebKit, and a small handful with Chromium.

Most of the attendees aren't familiar with MediaSource Extensions, so a very brief overview of MSE is provided.

Most of the attendees aren't familiar with WebVTT either, so a brief overview is provided of the WebVTT features that are relevant for demuxers and, by extension, MSE, as well as of the representation of WebVTT in MP4, using the slides from the W3C Breakouts 2025.

Delan asked what value goes in the ctim box if a text track has been retimed in the container.

Alicia answered: the original start time for the cue, as in the .vtt file. This value is only used to compute a delta between it and any delayed parts.
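To make the delta computation concrete, here is a minimal sketch. It assumes that the "delayed parts" are WebVTT cue timestamp tags inside the cue text, and that all times are handled in milliseconds. The function names `parseVttTimestamp` and `resolveDelayedPart` are hypothetical and not taken from any spec or implementation.

```typescript
// Parse a WebVTT timestamp ("mm:ss.ttt" or "hh:mm:ss.ttt") into milliseconds.
function parseVttTimestamp(text: string): number {
  const m = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(text);
  if (m === null) {
    throw new Error("Invalid WebVTT timestamp: " + text);
  }
  const hours = m[1] === undefined ? 0 : Number(m[1]);
  const minutes = Number(m[2]);
  const seconds = Number(m[3]);
  const millis = Number(m[4]);
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}

// Hypothetical helper: given the ctim value (the cue's original start time),
// a timestamp found inside the cue text, and the start time the cue has after
// retiming in the container, return when the delayed part should appear.
function resolveDelayedPart(
  ctim: string,
  inlineTimestamp: string,
  retimedCueStartMs: number
): number {
  // ctim is only used as the reference point for this delta.
  const deltaMs = parseVttTimestamp(inlineTimestamp) - parseVttTimestamp(ctim);
  return retimedCueStartMs + deltaMs;
}
```

For example, with a ctim of `00:00:10.000`, a delayed part tagged `00:00:12.500`, and a cue retimed to start at 70000 ms, the delayed part lands at 72500 ms.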

Delan asked how you can tell whether two MP4 samples correspond to the same cue if there is no vlab.

Alicia answered: the spec requires you to introduce a vlab box if any cue would end up in two samples. However, muxers like MP4Box currently lack this feature, so implementations could end up needing to resort to comparing cue contents as a fallback.
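The fallback described here could look roughly like the following sketch. The `CueFromSample` shape and its field names are assumptions made for illustration: `sourceId` stands for the identifier the muxer is supposed to write when a cue spans samples, and the remaining fields stand for the cue data found in the sample.

```typescript
// Hypothetical, simplified view of one cue as found inside one MP4 sample.
interface CueFromSample {
  sourceId?: number;      // identifier written by the muxer, if any
  id?: string;            // WebVTT cue identifier, if any
  originalStart?: string; // ctim value, if any
  settings?: string;      // WebVTT cue settings, if any
  payload: string;        // cue text
}

// Decide whether two cue occurrences from different samples are the same cue.
function isSameCue(a: CueFromSample, b: CueFromSample): boolean {
  // Preferred path: the muxer provided an identifier for both occurrences.
  if (a.sourceId !== undefined && b.sourceId !== undefined) {
    return a.sourceId === b.sourceId;
  }
  // Fallback: compare contents. This is a heuristic, since two distinct cues
  // with identical contents would be treated as one.
  return (
    a.id === b.id &&
    a.originalStart === b.originalStart &&
    a.settings === b.settings &&
    a.payload === b.payload
  );
}
```

A real implementation would probably also require the two samples to be adjacent in time before treating matching contents as one cue.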

E. Ocaña asked: if it's possible to manipulate text tracks from JS, would it be possible to create a polyfill for text track support in MSE?

Alicia answered: it would be non-trivial but it could be possible, and it may even be useful for advancing support of text tracks in MSE to other browsers.

Alicia explained that it's unclear what a coded frame should be in MSE with regard to WebVTT in MP4: is it each MP4 sample, as is the case for audio and video, or is it each cue? Note that in WebM, each cue is its own Block (the WebM equivalent of an MP4 sample), as WebM can handle overlapping frames.
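The two candidate definitions can be contrasted with a small sketch. It assumes that MP4 samples do not overlap, that each sample lists every cue active during its interval, that times are integers in a common timescale, and that each cue occurrence carries some `key` that identifies the cue across samples. All type and function names are hypothetical.

```typescript
interface CueInSample { key: string; payload: string; }
interface Mp4TextSample { start: number; duration: number; cues: CueInSample[]; }
interface CodedFrame { start: number; end: number; payloads: string[]; }

// Definition 1: one coded frame per MP4 sample, as with audio and video.
function framesPerSample(samples: Mp4TextSample[]): CodedFrame[] {
  return samples.map((s) => ({
    start: s.start,
    end: s.start + s.duration,
    payloads: s.cues.map((c) => c.payload),
  }));
}

// Definition 2: one coded frame per cue. Occurrences of the same cue in
// back-to-back samples are merged, so the resulting frames may overlap.
function framesPerCue(samples: Mp4TextSample[]): CodedFrame[] {
  const frames: CodedFrame[] = [];
  let open: Record<string, CodedFrame> = {};
  for (const s of samples) {
    const end = s.start + s.duration;
    const stillOpen: Record<string, CodedFrame> = {};
    for (const c of s.cues) {
      let frame: CodedFrame | undefined = open[c.key];
      if (frame !== undefined && frame.end === s.start) {
        frame.end = end; // the cue continues from the previous sample
      } else {
        frame = { start: s.start, end, payloads: [c.payload] };
        frames.push(frame);
      }
      stillOpen[c.key] = frame;
    }
    open = stillOpen; // cues absent from this sample are considered finished
  }
  return frames;
}
```

With cue A active over 0 to 4000 and cue B over 2000 to 6000, a muxer would write three samples (A; A and B; B). The first definition yields three non-overlapping frames, while the second yields two overlapping frames, which is closer to what WebM stores.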

E. Ocaña asked: would a JS polyfill implementation of text tracks in MSE avoid the problems in this discussion by not depending on the MSE algorithms?

Alicia answered: for the most part it wouldn't, as it would still need to make the same choices about the definition of a coded frame, parse the bytestream, and do most of the same logic that a browser engine with native support would, but in JS.
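As an illustration of the bytestream parsing such a polyfill would have to do itself, here is a minimal sketch of walking the boxes inside a WebVTT MP4 sample. It only handles the common box header form (32-bit big-endian size followed by a four-character type) and rejects the extended-size and size-to-end forms. The name `parseBoxes` is hypothetical, and `vttc` and `payl` are the cue and cue payload boxes of the WebVTT in MP4 mapping.

```typescript
interface Box { type: string; payload: Uint8Array; }

// Split a byte range into the sequence of boxes it contains.
function parseBoxes(data: Uint8Array): Box[] {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const boxes: Box[] = [];
  let offset = 0;
  while (offset + 8 <= data.byteLength) {
    const size = view.getUint32(offset); // big-endian by default
    const type = String.fromCharCode(
      view.getUint8(offset + 4),
      view.getUint8(offset + 5),
      view.getUint8(offset + 6),
      view.getUint8(offset + 7)
    );
    // Sizes 0 and 1 have special meanings that this sketch does not support.
    if (size < 8 || offset + size > data.byteLength) {
      throw new Error("Unsupported or truncated box: " + type);
    }
    boxes.push({ type, payload: data.subarray(offset + 8, offset + size) });
    offset += size;
  }
  if (offset !== data.byteLength) {
    throw new Error("Trailing bytes after last box");
  }
  return boxes;
}
```

Calling `parseBoxes` on a sample yields its top-level boxes, and calling it again on the payload of a `vttc` box yields that cue's child boxes, such as `payl` with the cue text.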

Jean-Yves Avenard said that, if feasible, it would make sense to choose a definition of coded frame that is consistent among containers.

Przemyslaw Gorszkowski mentioned that he remembers Opera shipping support for WebVTT and TTML in an embedded version of their browser for TVs, so at least one browser engine may have dealt with all of this before. The HbbTV test suite contained tests about text tracks, which could also be a useful resource.

https://www.webengineshackfest.org/
