Voxize

Voice-to-text overlay for Linux/Wayland/GNOME.

Important

Actively Evolving.

Voxize is under active development. Pin a commit if you need a stable target. See the implementation journal for progress and design decisions.

  • Global hotkey opens a translucent overlay that floats above your workspace. Start speaking immediately.
  • Real-time streaming transcription via the OpenAI Realtime API (gpt-4o-transcribe). Text appears as you talk.
  • AI text cleanup via GPT-5.4 Mini — fixes spelling, punctuation, and filler words while preserving meaning.
  • Clipboard output — cleaned text is copied automatically. Raw transcript is preserved in clipboard history as a fallback.
  • Crash-safe audio — WAV streamed to disk from the first sample. Audio survives process crashes.
  • Session archive with cost tracking — last 8 sessions stored under ~/.local/state/voxize/ with audio, transcripts, and per-session API costs.
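The crash-safe audio behaviour listed above can be approximated by writing a complete WAV header before the first sample and flushing every chunk as it arrives. The following is a minimal sketch under that assumption, not the code in audio.py; the names StreamingWav, wav_header, and UNKNOWN_SIZE are hypothetical, and the 24000 Hz rate in the usage note is only illustrative.

```python
import struct


def wav_header(sample_rate, data_bytes, channels=1, bits=16):
    """Return a 44-byte header for uncompressed PCM WAV data."""
    block_align = channels * bits // 8
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_bytes, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,
        sample_rate * block_align, block_align, bits,
        b"data", data_bytes,
    )


# Placeholder data size written before the real length is known, so a file
# left behind by a crashed process still starts with a complete header.
UNKNOWN_SIZE = 0xFFFFFFFF - 36


class StreamingWav:
    """Append PCM chunks to disk as they arrive (hypothetical helper)."""

    def __init__(self, path, sample_rate):
        self._rate = sample_rate
        self._written = 0
        self._file = open(path, "wb")
        self._file.write(wav_header(sample_rate, UNKNOWN_SIZE))
        self._file.flush()

    def write(self, pcm):
        self._file.write(pcm)
        # flush() hands the bytes to the OS, so they outlive a process crash.
        # Surviving power loss would additionally need os.fsync().
        self._file.flush()
        self._written += len(pcm)

    def close(self):
        # On a clean stop, patch the header with the real sizes.
        self._file.seek(0)
        self._file.write(wav_header(self._rate, self._written))
        self._file.close()
```

Usage would look like `w = StreamingWav("audio.wav", 24000)`, then `w.write(chunk)` for each captured buffer and `w.close()` on stop. If the process dies before `close()`, the samples are still on disk and most audio tools can recover them despite the placeholder size.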

Screenshots

TBD

[Screenshot: recording state]

Requirements

| Requirement | Notes |
| --- | --- |
| NixOS (tested on 25.11) | NixOS-first; flake.nix + shell.nix handle all system deps |
| GNOME / Wayland | Mutter compositor; D-Bus used for focused-window detection |
| Python >= 3.11 | Managed by uv |
| OpenAI API key | Stored in GNOME Keyring via secret-tool |
| wl-clipboard | wl-copy for clipboard output |
| PortAudio | Audio capture backend |
| Window Calls (optional) | GNOME Shell extension for focused-window detection. Required for WHISPER.txt prompt hints; without it, prompt detection is skipped silently |

Installation

Warning

There is no packaged install yet. The steps below are the development workflow — expect rough edges. Proper packaging will be considered once the project stabilizes.

Enter the dev shell (pulls all system dependencies):

nix develop

Store your OpenAI API key in the GNOME Keyring:

secret-tool store --label='OpenAI API Key' service openai key api
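For reference, a key stored with the attributes above can be read back with `secret-tool lookup service openai key api`. The sketch below shows one way to do that from Python; it is an illustration, not necessarily how Voxize reads the key, and the names lookup_command and read_api_key are hypothetical.

```python
import subprocess

# Attribute pairs matching the `secret-tool store` command above.
KEYRING_ATTRS = ("service", "openai", "key", "api")


def lookup_command(attrs=KEYRING_ATTRS):
    """Build the secret-tool invocation that prints the stored secret."""
    return ["secret-tool", "lookup", *attrs]


def read_api_key(run=subprocess.run):
    """Return the stored key, or raise if the keyring has no match.

    `run` is injectable so the function can be exercised without a keyring.
    """
    result = run(lookup_command(), capture_output=True, text=True, check=False)
    key = result.stdout.strip()
    if result.returncode != 0 or not key:
        raise RuntimeError("OpenAI API key not found in GNOME Keyring")
    return key
```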

Run:

uv run python -m voxize

To bind Voxize to a global hotkey (e.g., via GNOME Settings > Keyboard > Custom Shortcuts), use the full command below, which can be invoked from any directory:

nix develop /path/to/voxize --command bash -c "cd /path/to/voxize && uv run python -m voxize"

Tip

The first nix develop invocation evaluates the full shell derivation, which can take several seconds. To avoid this on every hotkey press, use nix-direnv — it caches the evaluated dev shell so subsequent entries are near-instant. If you manage your environment with Home Manager, enable it with programs.direnv.nix-direnv.enable = true. A proper packaging strategy (e.g., a writeShellScript wrapper with pinned dependencies) would eliminate the cold-start cost entirely and is left as an exercise for the reader.

How it works

Voxize runs as a single Python process per invocation. A state machine drives the session through INITIALIZING, RECORDING, CLEANING, and READY. During recording, microphone audio streams over a WebSocket to the OpenAI Realtime API with semantic VAD (low eagerness, tuned for dictation). Transcription deltas appear live in the overlay. On stop, the accumulated transcript is sent to GPT-5.4 Mini for cleanup, which streams corrected text back into the same overlay. The final text is copied to the clipboard. See docs/design.md for the original specification and docs/journal.md for implementation deviations and decisions made during development.
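The session lifecycle described above can be pictured as a small state machine. This is a minimal sketch that assumes a strictly linear progression through the four named states; the real state.py may include further states or transitions (for example for errors or cancellation), and the Session class and its methods are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    INITIALIZING = auto()
    RECORDING = auto()
    CLEANING = auto()
    READY = auto()


# Assumed transitions: each state leads only to the next one.
TRANSITIONS = {
    State.INITIALIZING: {State.RECORDING},
    State.RECORDING: {State.CLEANING},
    State.CLEANING: {State.READY},
    State.READY: set(),
}


class Session:
    """Holds the current state and notifies listeners on every change."""

    def __init__(self):
        self.state = State.INITIALIZING
        self._listeners = []

    def on_change(self, callback):
        self._listeners.append(callback)

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        old_state, self.state = self.state, new_state
        for callback in self._listeners:
            callback(old_state, new_state)
```

A UI layer would register a callback with `on_change` and update the overlay whenever the backend calls `advance`.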

Architecture

Backend modules (state.py, audio.py, transcribe.py, cleanup.py, storage.py, lock.py, clipboard.py, prompt.py) do not import GTK. They use GLib/Gio for thread-safe callbacks and I/O, but no widget code. Only app.py and ui.py touch GTK4. This separation exists as a seam for a future GNOME Shell extension frontend that would spawn the backend as a subprocess and communicate via stdin/stdout JSON Lines. See the architecture decision record for context.
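To illustrate the JSON Lines seam mentioned above: each message is one JSON object on its own line, so a frontend can read the backend's stdout line by line. The sketch below is hypothetical, since the subprocess frontend does not exist yet; the event names ("state", "delta") and field names are assumptions made for the example.

```python
import json


def emit(stream, event_type, **payload):
    """Write one event as a single JSON line and flush it immediately."""
    stream.write(json.dumps({"type": event_type, **payload}) + "\n")
    stream.flush()


def read_events(stream):
    """Yield decoded events from a line-oriented stream, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

A backend would call `emit(sys.stdout, "delta", text="...")`, and a frontend would iterate `read_events(process.stdout)`. Because `json.dumps` escapes newlines inside strings, multi-line transcript text still occupies exactly one line on the wire.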

Configuration

WHISPER.txt — Place a WHISPER.txt file in your working directory. On launch, Voxize resolves the focused window's CWD (via the Window Calls extension and /proc) and loads the file as transcription context, improving accuracy for domain-specific vocabulary. There is no upward search — the file must be in the exact directory the focused process is running from.
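The lookup described above can be sketched as follows, assuming the PID of the focused window's process has already been obtained (in Voxize that comes from the Window Calls extension over D-Bus, which is not shown here). The function name load_prompt_hint and its parameters are hypothetical.

```python
from pathlib import Path


def load_prompt_hint(pid, proc_root=Path("/proc"), filename="WHISPER.txt"):
    """Return the hint file's text from the process's CWD, or None.

    /proc/<pid>/cwd is a symlink to the process's working directory.
    Only that exact directory is checked; parents are never searched.
    """
    try:
        cwd = (proc_root / str(pid) / "cwd").resolve(strict=True)
        text = (cwd / filename).read_text(encoding="utf-8").strip()
    except OSError:
        # Process gone, permission denied, or no hint file: skip silently.
        return None
    return text or None
```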

VOXIZE_AUTOCLOSE — Seconds before the overlay auto-closes in READY state. Default 30. Set to 0 to disable.
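As an illustration of these semantics, the setting could be parsed as shown below. This is a sketch, not the project's code: falling back to the default on unparseable values and treating negative values as disabled are assumptions, and autoclose_seconds is a hypothetical name.

```python
import math
import os

DEFAULT_AUTOCLOSE = 30.0


def autoclose_seconds(environ=os.environ):
    """Return the auto-close delay in seconds, or None when disabled."""
    raw = environ.get("VOXIZE_AUTOCLOSE", "").strip()
    if not raw:
        return DEFAULT_AUTOCLOSE
    try:
        seconds = float(raw)
    except ValueError:
        return DEFAULT_AUTOCLOSE  # assumption: ignore unparseable values
    if not math.isfinite(seconds):
        return DEFAULT_AUTOCLOSE  # assumption: ignore "nan" and "inf"
    # 0 disables auto-close; negatives are treated the same way here.
    return None if seconds <= 0 else seconds
```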

Session data is stored under ~/.local/state/voxize/. Each session directory contains audio.wav, transcription.txt, cleaned.txt, ws_events.jsonl, and debug.log.
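Keeping only the last 8 sessions amounts to pruning the oldest directories after each run. The sketch below assumes session directory names sort chronologically (for example, timestamp-based names); the actual naming scheme in storage.py is not documented here, and prune_sessions is a hypothetical helper.

```python
import shutil
from pathlib import Path


def prune_sessions(root, keep=8):
    """Delete all but the `keep` newest session directories under root.

    Assumes directory names sort oldest-first. Returns the removed paths.
    """
    sessions = sorted(
        (p for p in Path(root).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    stale = sessions[:-keep] if keep > 0 else sessions
    for directory in stale:
        shutil.rmtree(directory)
    return stale
```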

License

AGPL-3.0
