v1.2.0 — Speaker diarization, VAD segmentation, setup command#1

Merged
skitsanos merged 5 commits into main from develop
Mar 16, 2026

Summary

Major feature release bringing speaker diarization, VAD-based segmentation, self-bootstrapping setup, and dependency upgrades.

New features

  • Speaker diarization via sherpa-onnx C API (pyannote segmentation + speaker embeddings). --speakers N labels transcript segments with speaker identity in VTT, SRT, and manifest.
  • VAD-based segmentation using Silero VAD — speech-aware chunking that avoids mid-word cuts. Replaces FFmpeg silencedetect when --vad-model is set.
  • transcribeit setup command — self-bootstrapping CLI that downloads all components (models, VAD, diarization models, sherpa-onnx shared libraries) with platform auto-detection.
  • Auto-detect model architectures — sherpa-onnx engine detects Whisper, Moonshine, and SenseVoice from model directory contents.
  • BSL 1.1 license — free for non-commercial/evaluation use, commercial license required for production.

Improvements

  • Dependency upgrades: whisper-rs 0.16, reqwest 0.13, indicatif 0.18, bzip2 0.6
  • Fixed whisper-rs set_detect_language(true) bug causing empty transcripts
  • sherpa-onnx is now an optional feature flag (--no-default-features to exclude)
  • C++ stderr suppression for sherpa-onnx warnings
  • Code review fixes: dedup retry loops, static regex, negative timestamp guard
  • download-model extended with --vad and --diarize flags
  • Default output format changed to VTT, added -f short flag
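The "negative timestamp guard" mentioned above can be illustrated with a minimal sketch. The function name `format_vtt_timestamp` and the exact clamping policy are assumptions made for illustration; the PR does not show the project's actual implementation.

```rust
/// Format an offset in seconds as a WebVTT timestamp (HH:MM:SS.mmm).
/// Hypothetical helper: negative, NaN, and infinite inputs are all
/// clamped to zero so a cue can never start before the media does.
fn format_vtt_timestamp(seconds: f64) -> String {
    // Guard: anything that is not a finite, positive offset becomes 0.
    let safe = if seconds.is_finite() && seconds > 0.0 {
        seconds
    } else {
        0.0
    };
    // Work in whole milliseconds to avoid float formatting surprises.
    let total_ms = (safe * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let total_s = total_ms / 1000;
    let s = total_s % 60;
    let m = (total_s / 60) % 60;
    let h = total_s / 3600;
    format!("{h:02}:{m:02}:{s:02}.{ms:03}")
}
```

Clamping rather than returning an error keeps subtitle output well-formed even when an upstream stage reports a slightly negative offset for the first segment.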

Test results

  • 28 tests passing
  • Zero clippy warnings
  • Both feature configurations build clean
  • Full 31-minute medical interview transcribed successfully (7.5x realtime with large-v3-turbo)

Test plan

  • cargo fmt -- --check passes
  • cargo clippy -- -W clippy::all passes
  • cargo test — 28 tests pass
  • cargo build --no-default-features builds
  • Tested on 5min, 10min, and 31min audio samples
  • Diarization tested with 2-speaker interview
  • VAD segmentation tested vs FFmpeg silencedetect
  • transcribeit setup tested (all components)

skitsanos added 5 commits March 16, 2026 08:50

Two-pass pipeline: whisper.cpp transcribes, then sherpa-onnx diarizes
the same audio and assigns speaker labels by timestamp overlap.

- Raw FFI bindings to sherpa-onnx offline speaker diarization C API
  (not yet exposed by the sherpa-onnx Rust crate)
- Dedicated worker thread for diarization (C types are !Send/!Sync)
- CLI: --speakers N --diarize-segmentation-model --diarize-embedding-model
- Env vars: DIARIZE_SEGMENTATION_MODEL, DIARIZE_EMBEDDING_MODEL
- Speaker labels in VTT (<v Speaker 0>), SRT ([Speaker 0]), and manifest JSON
- Segment struct gains optional speaker field
- Gated behind sherpa-onnx feature flag
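
The second pass described above (assigning speaker labels by timestamp overlap) could look roughly like the sketch below. Everything except the optional `speaker` field on `Segment` is an assumption: the other field names, the `SpeakerTurn` type, the "largest single overlap wins" rule, and the exact placement of the label within the cue text are illustrative only.

```rust
/// A transcript segment; `speaker` is the optional field the commit adds.
/// The remaining field names are assumed for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start: f64,
    end: f64,
    text: String,
    speaker: Option<u32>,
}

/// One diarization result: a time range attributed to a speaker (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct SpeakerTurn {
    start: f64,
    end: f64,
    speaker: u32,
}

/// Length of the intersection of two time ranges, in seconds (0 if disjoint).
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// Label each segment with the speaker of the turn it overlaps most.
/// Segments that overlap no turn keep `speaker = None`.
fn assign_speakers(segments: &mut [Segment], turns: &[SpeakerTurn]) {
    for seg in segments.iter_mut() {
        let mut best: Option<(u32, f64)> = None;
        for t in turns {
            let ov = overlap(seg.start, seg.end, t.start, t.end);
            if ov > 0.0 && best.map_or(true, |(_, b)| ov > b) {
                best = Some((t.speaker, ov));
            }
        }
        seg.speaker = best.map(|(id, _)| id);
    }
}

/// Cue text for VTT, using a voice tag when a speaker is known.
fn vtt_text(seg: &Segment) -> String {
    match seg.speaker {
        Some(id) => format!("<v Speaker {id}>{}", seg.text),
        None => seg.text.clone(),
    }
}

/// Cue text for SRT, using a bracketed prefix when a speaker is known.
fn srt_text(seg: &Segment) -> String {
    match seg.speaker {
        Some(id) => format!("[Speaker {id}] {}", seg.text),
        None => seg.text.clone(),
    }
}
```

Because the two passes run independently on the same audio, the only coupling between them is the timestamps, which is why a simple overlap comparison is enough for this sketch.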

VAD segmentation via Silero VAD (sherpa-onnx):
- Detects speech boundaries instead of silence dB thresholds
- 250ms padding protects word boundaries from clipping
- Merges chunks separated by <200ms gaps
- Splits long chunks at lowest-energy points (not arbitrary positions)
- Use --vad-model path/to/silero_vad.onnx to enable
- Falls back to FFmpeg silencedetect when no VAD model is set
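
The padding and merging rules listed above can be sketched as a small post-processing step over the speech ranges a VAD returns. The 250 ms and 200 ms values come from the commit message; the `Chunk` type, the function name, the order of operations (pad first, then merge), and the requirement that input is sorted by start time are assumptions. Splitting long chunks at low-energy points is not shown.

```rust
/// A speech range in milliseconds (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Chunk {
    start_ms: u64,
    end_ms: u64,
}

const PAD_MS: u64 = 250; // padding on each side, per the commit message
const MERGE_GAP_MS: u64 = 200; // chunks closer than this are merged

/// Pad each speech range, clamp it to the audio bounds, and merge
/// neighbours whose remaining gap is below MERGE_GAP_MS.
/// Assumes `speech` is sorted by `start_ms`.
fn pad_and_merge(speech: &[Chunk], audio_len_ms: u64) -> Vec<Chunk> {
    let mut out: Vec<Chunk> = Vec::new();
    for c in speech {
        let padded = Chunk {
            // saturating_sub keeps the start from underflowing below 0.
            start_ms: c.start_ms.saturating_sub(PAD_MS),
            end_ms: (c.end_ms + PAD_MS).min(audio_len_ms),
        };
        if let Some(prev) = out.last_mut() {
            // Also true when padding made the ranges overlap.
            if padded.start_ms < prev.end_ms + MERGE_GAP_MS {
                prev.end_ms = prev.end_ms.max(padded.end_ms);
                continue;
            }
        }
        out.push(padded);
    }
    out
}
```

Padding before merging guarantees the resulting chunks never overlap, so no audio would be transcribed twice under this ordering.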

Dependency upgrades:
- whisper-rs 0.12 → 0.16 (iterator API, updated log callback)
- reqwest 0.12 → 0.13
- indicatif 0.17 → 0.18
- bzip2 0.5 → 0.6 (pure Rust)

Comprehensive docs update for VAD, diarization, and env vars.

transcribeit setup downloads all components for full functionality:
- models: default GGML base model from HuggingFace
- vad: Silero VAD model (~628KB) for speech-aware segmentation
- diarize: pyannote segmentation + wespeaker embedding models
- sherpa-libs: platform-specific sherpa-onnx shared libraries
  (auto-detects macOS/Linux x64/ARM64)
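
Platform auto-detection of this kind can be done with the standard library's compile-time constants. The sketch below is one possible approach, not the project's code: the function names and the returned tag strings (such as `linux-arm64`) are hypothetical, while `std::env::consts::OS` and `std::env::consts::ARCH` are real standard-library constants.

```rust
use std::env::consts::{ARCH, OS};

/// Map an (os, arch) pair to a library bundle tag.
/// The tag strings are made up for this sketch; only the four
/// combinations named in the commit message are recognised.
fn platform_tag(os: &str, arch: &str) -> Option<&'static str> {
    match (os, arch) {
        ("macos", "x86_64") => Some("macos-x64"),
        ("macos", "aarch64") => Some("macos-arm64"),
        ("linux", "x86_64") => Some("linux-x64"),
        ("linux", "aarch64") => Some("linux-arm64"),
        // Unsupported platform: let the caller report a clear error.
        _ => None,
    }
}

/// Tag for the platform this binary was compiled for.
fn current_platform_tag() -> Option<&'static str> {
    platform_tag(OS, ARCH)
}
```

Keeping the mapping in a pure function that takes the OS and architecture as arguments makes every branch testable on a single machine.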

Selective install: transcribeit setup -c vad
Extended download-model: --vad and --diarize flags

Prints env var summary at the end showing what to add to .env.
All downloads are idempotent (skip if already present).
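
An idempotent "skip if already present" download step can be sketched with the standard library alone. The helper name `ensure_file`, the closure-based fetch, and the write-then-rename step are assumptions for illustration; the PR does not show how the real downloads are implemented.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Make sure `dest` exists, calling `fetch` only when it does not.
/// Returns Ok(true) if the file was fetched, Ok(false) if it was skipped.
/// Hypothetical helper: `fetch` stands in for the actual HTTP download.
fn ensure_file<F>(dest: &Path, fetch: F) -> io::Result<bool>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    // Idempotence: a file that is already present is left untouched.
    if dest.exists() {
        return Ok(false);
    }
    if let Some(parent) = dest.parent() {
        fs::create_dir_all(parent)?;
    }
    let bytes = fetch()?;
    // Write to a sibling temp file and rename it into place, so an
    // interrupted run does not leave a partial file at `dest`.
    let tmp = dest.with_extension("part");
    fs::write(&tmp, &bytes)?;
    fs::rename(&tmp, dest)?;
    Ok(true)
}
```

Running the same setup step twice then costs one `exists` check per component on the second run.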

Business Source License 1.1:
- Free for non-commercial and evaluation use
- Commercial/production use requires a separate license
- Converts to Apache 2.0 on 2030-03-16

All dependencies verified compatible (MIT, Apache-2.0, BSD, ISC,
Unlicense — no GPL/copyleft).
skitsanos merged commit c6d60d9 into main on Mar 16, 2026
6 checks passed