5 changes: 4 additions & 1 deletion .claude/rules/architecture.md
@@ -69,7 +69,8 @@

Additional modules:
screenshot.py ─ Terminal text → PNG rendering (ANSI color, font fallback)
transcribe.py ─ Voice-to-text transcription via OpenAI API (gpt-4o-transcribe)
transcribe.py ─ Voice-to-text: local Whisper (faster-whisper + CTranslate2 + CUDA) + OpenAI API fallback
tts.py ─ Text-to-speech: edge-tts (Microsoft Edge neural voices) → OGG voice messages to Telegram
main.py ─ CLI entry point
utils.py ─ Shared utilities (ccbot_dir, atomic_write_json)

@@ -97,6 +98,8 @@ State files (~/.ccbot/ or $CCBOT_DIR/):
- **Tool use ↔ tool result pairing** — `tool_use_id` tracked across poll cycles; tool result edits the original tool_use Telegram message in-place.
- **MarkdownV2 with fallback** — All messages go through `safe_reply`/`safe_edit`/`safe_send` which convert via `telegramify-markdown` and fall back to plain text on parse failure.
- **No truncation at parse layer** — Full content preserved; splitting at send layer respects Telegram's 4096 char limit with expandable quote atomicity.
- **Local STT with API fallback** — Voice messages transcribed via faster-whisper (CTranslate2 + CUDA, model loaded lazily and resident). Falls back to OpenAI gpt-4o-transcribe API on failure if `OPENAI_API_KEY` is set. Engine selection via `CCBOT_STT_ENGINE` env var.
- **TTS voice responses** — Final assistant messages sent as Telegram voice notes via edge-tts (Microsoft Edge neural voices). Per-user toggle via `/voice` command. Text always sent first; audio appended after. Configurable voice and global auto-enable via `CCBOT_TTS_VOICE` / `CCBOT_TTS_AUTO`.
- Only sessions registered in `session_map.json` (via hook) are monitored.
- Notifications delivered to users via thread bindings (topic → window_id → session).
- **Startup re-resolution** — Window IDs reset on tmux server restart. On startup, `resolve_stale_ids()` matches persisted display names against live windows to re-map IDs. Old state.json files keyed by window name are auto-migrated.
4 changes: 3 additions & 1 deletion CLAUDE.md
@@ -2,7 +2,7 @@

ccmux — Telegram bot that bridges Telegram Forum topics to Claude Code sessions via tmux windows. Each topic is bound to one tmux window running one Claude Code instance.

Tech stack: Python, python-telegram-bot, tmux, uv.
Tech stack: Python, python-telegram-bot, tmux, uv, faster-whisper (CTranslate2 + CUDA), edge-tts (TTS).

## Common Commands

@@ -23,6 +23,8 @@ ccbot hook --install # Auto-install Claude Code SessionStart hook
- **Hook-based session tracking** — `SessionStart` hook writes `session_map.json`; monitor polls it to detect session changes.
- **Message queue per user** — FIFO ordering, message merging (3800 char limit), tool_use/tool_result pairing.
- **Rate limiting** — `AIORateLimiter(max_retries=5)` on the Application (30/s global). On restart, the global bucket is pre-filled to avoid burst against Telegram's server-side counter.
- **Local STT** — Voice messages transcribed via faster-whisper (CTranslate2 + CUDA) by default. OpenAI API as fallback. Model loaded lazily on first voice message, stays resident.
- **TTS** — Responses sent as Telegram voice messages via edge-tts (Microsoft Edge neural voices). Per-user toggle via `/voice` command. Configurable voice and auto-enable via env vars.

## Code Conventions

44 changes: 40 additions & 4 deletions README.md
Expand Up @@ -26,7 +26,7 @@ In fact, CCBot itself was built this way — iterating on itself through Claude
- **Topic-based sessions** — Each Telegram topic maps 1:1 to a tmux window and Claude session
- **Real-time notifications** — Get Telegram messages for assistant responses, thinking content, tool use/result, and local command output
- **Interactive UI** — Navigate AskUserQuestion, ExitPlanMode, and Permission Prompts via inline keyboard
- **Voice messages** — Voice messages are transcribed via OpenAI and forwarded as text
- **Voice messages** — Voice messages are transcribed locally via Whisper (faster-whisper + CUDA) and forwarded as text; the OpenAI API is available as a fallback
- **Send messages** — Forward text to Claude Code via tmux keystrokes
- **Slash command forwarding** — Send any `/command` directly to Claude Code (e.g. `/clear`, `/compact`, `/cost`)
- **Create new sessions** — Start Claude Code sessions from Telegram via directory browser
@@ -95,8 +95,15 @@ ALLOWED_USERS=your_telegram_user_id
| `CLAUDE_COMMAND` | `claude` | Command to run in new windows |
| `MONITOR_POLL_INTERVAL` | `2.0` | Polling interval in seconds |
| `CCBOT_SHOW_HIDDEN_DIRS` | `false` | Show hidden (dot) directories in directory browser |
| `OPENAI_API_KEY` | _(none)_ | OpenAI API key for voice message transcription |
| `CCBOT_STT_ENGINE` | `whisper` | STT engine: `whisper` (local, CUDA) or `openai` (API) |
| `CCBOT_WHISPER_MODEL` | `large-v3` | Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v3`, `large-v3-turbo`) |
| `CCBOT_WHISPER_DEVICE` | `cuda` | Compute device: `cuda` or `cpu` |
| `CCBOT_WHISPER_COMPUTE_TYPE` | `float16` | Compute precision: `float16` (GPU), `int8` (GPU, less VRAM), `int8_float16` (balanced) |
| `OPENAI_API_KEY` | _(none)_ | OpenAI API key (used when `CCBOT_STT_ENGINE=openai`, or as a fallback for local Whisper) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI API base URL (for proxies or compatible APIs) |
| `CCBOT_TTS_ENABLED` | `true` | Enable TTS (text-to-speech) voice message responses |
| `CCBOT_TTS_AUTO` | `false` | Auto-enable TTS for all users (per-user toggle via `/voice`) |
| `CCBOT_TTS_VOICE` | `es-ES-ElviraNeural` | Edge TTS voice name (run `edge-tts --list-voices` for options) |

Message formatting is always HTML via `chatgpt-md-converter` (`chatgpt_md_converter` package).
There is no runtime formatter switch to MarkdownV2.
@@ -151,6 +158,8 @@ uv run ccbot
| `/history` | Message history for this topic |
| `/screenshot` | Capture terminal screenshot |
| `/esc` | Send Escape to interrupt Claude |
| `/voice` | Toggle TTS voice message responses |
| `/unbind` | Unbind topic from session (window stays alive) |

**Claude Code commands (forwarded via tmux):**

@@ -178,7 +187,34 @@

**Sending messages:**

Once a topic is bound to a session, just send text or voice messages in that topic — text gets forwarded to Claude Code via tmux keystrokes, and voice messages are automatically transcribed and forwarded as text.
Once a topic is bound to a session, just send text or voice messages in that topic — text gets forwarded to Claude Code via tmux keystrokes, and voice messages are automatically transcribed (locally via Whisper by default) and forwarded as text.
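
The text-forwarding half reduces to plain `tmux send-keys`; a sketch under assumed window targeting (ccmux's real `tmux_manager` may differ):

```python
import subprocess

def send_to_claude(window_id: str, text: str) -> None:
    # -l sends the text literally (no key-name expansion), then a
    # separate Enter keypress submits it to Claude Code.
    subprocess.run(["tmux", "send-keys", "-t", window_id, "-l", text], check=True)
    subprocess.run(["tmux", "send-keys", "-t", window_id, "Enter"], check=True)
```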

### Voice Messages (STT)

CCBot uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with CTranslate2 for **local, GPU-accelerated** speech-to-text. No API key is required.

**How it works:**
1. You send a voice message in a Telegram topic
2. The bot downloads the OGG audio into memory (no permanent file is written to disk)
3. faster-whisper transcribes it on the local GPU (CUDA)
4. The transcribed text is forwarded to Claude Code via tmux
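
Steps 2–3 can be sketched with faster-whisper's public API — the `get_model`/`transcribe_ogg` helper names are illustrative, not ccmux's actual `transcribe.py`:

```python
import io
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model(model_size: str = "large-v3",
              device: str = "cuda",
              compute_type: str = "float16"):
    # Lazy import + cache: the bot starts fast, the first voice message
    # pays the model-load cost, and later messages reuse the instance.
    from faster_whisper import WhisperModel
    return WhisperModel(model_size, device=device, compute_type=compute_type)

def transcribe_ogg(ogg_bytes: bytes) -> str:
    # faster-whisper accepts a file-like object, so the OGG audio is
    # decoded straight from memory rather than from a temp file.
    segments, _info = get_model().transcribe(io.BytesIO(ogg_bytes))
    return " ".join(seg.text.strip() for seg in segments)
```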

**Supported models** (set via `CCBOT_WHISPER_MODEL`):

| Model | Params | VRAM (float16) | Speed | Accuracy |
|-------|--------|----------------|-------|----------|
| `tiny` | 39M | ~1 GB | Fastest | Basic |
| `base` | 74M | ~1 GB | Very fast | Good |
| `small` | 244M | ~2 GB | Fast | Good |
| `medium` | 769M | ~5 GB | Moderate | Very good |
| `large-v3` | 1550M | ~10 GB | Moderate | Best |
| `large-v3-turbo` | 809M | ~3 GB | Fast | Near-best |

The default `large-v3` provides the best accuracy; `large-v3-turbo` offers a good speed/accuracy balance with lower VRAM usage. The model is downloaded once from the HuggingFace Hub and cached locally.

**Fallback:** If local Whisper fails and `OPENAI_API_KEY` is set, CCBot automatically falls back to OpenAI's `gpt-4o-transcribe` API.
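
The fallback policy amounts to a small wrapper (illustrative names and control flow; ccmux's implementation may differ):

```python
def transcribe_with_fallback(audio, local_fn, api_fn, api_key: str) -> str:
    # Prefer the local engine; fall back to the API only when it fails
    # AND an API key is configured, otherwise re-raise the local error.
    try:
        return local_fn(audio)
    except Exception:
        if not api_key:
            raise
        return api_fn(audio)
```

With stub engines, a failing local pass falls through to the API path only when a key is present; without a key the local error propagates.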

**VRAM note:** The Whisper model stays loaded in GPU memory after the first voice message, occupying the VRAM shown in the table above for as long as the bot runs. If GPU memory is limited, use a smaller model or an `int8` compute type.

**Killing a session:**

@@ -261,7 +297,7 @@ src/ccbot/
├── terminal_parser.py # Terminal pane parsing (interactive UI + status line)
├── html_converter.py # Markdown → Telegram HTML conversion + HTML-aware splitting
├── screenshot.py # Terminal text → PNG image with ANSI color support
├── transcribe.py # Voice-to-text transcription via OpenAI API
├── transcribe.py # Voice-to-text: local Whisper (CTranslate2+CUDA) + OpenAI fallback
├── utils.py # Shared utilities (atomic JSON writes, JSONL helpers)
├── tmux_manager.py # Tmux window management (list, create, send keys, kill)
├── fonts/ # Bundled fonts for screenshot rendering
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -12,6 +12,8 @@ dependencies = [
"Pillow>=10.0.0",
"aiofiles>=24.0.0",
"telegramify-markdown>=0.5.0,<1.0.0",
"faster-whisper>=1.2.1",
"edge-tts>=7.2.8",
]

[project.scripts]
142 changes: 139 additions & 3 deletions src/ccbot/bot.py
@@ -136,6 +136,7 @@
from .tmux_manager import tmux_manager
from .transcribe import close_client as close_transcribe_client
from .transcribe import transcribe_voice
from .tts import get_voice, is_tts_enabled, set_voice, toggle_tts
from .utils import ccbot_dir

logger = logging.getLogger(__name__)
@@ -277,6 +278,134 @@ async def unbind_command(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
)


async def voice_command(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Toggle TTS or change voice.

Usage:
/voice — Toggle TTS on/off
/voice <name> — Set voice (e.g. /voice es-AR-ElenaNeural)
"""
user = update.effective_user
if not user or not is_user_allowed(user.id):
return
if not update.message:
return

if not config.tts_enabled:
await safe_reply(update.message, "❌ TTS is disabled globally (CCBOT_TTS_ENABLED=false).")
return

args = context.args if context.args else []

# /voice <name> — set voice (auto-enable TTS)
if args:
voice_name = args[0]
try:
set_voice(user.id, voice_name)
except ValueError as e:
await safe_reply(update.message, f"❌ {e}\nUse /voices to see available voices.")
return
if not is_tts_enabled(user.id):
toggle_tts(user.id)
await safe_reply(
update.message,
f"🔊 Voice set to `{voice_name}` — TTS ON\n"
"Use /voices to see available options.",
)
return

# /voice — toggle
new_state = toggle_tts(user.id)
status = "ON" if new_state else "OFF"
voice_name = get_voice(user.id)
await safe_reply(
update.message,
f"🔊 TTS {status} (voice: {voice_name})\n"
"Use /voice <name> to change voice, /voices to list options.",
)


async def voices_command(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""List available TTS voices.

Usage:
/voices — Compact index of all locales with voice counts
/voices <locale> — All voices for a locale (e.g. /voices es, /voices en)
"""
user = update.effective_user
if not user or not is_user_allowed(user.id):
return
if not update.message:
return

args = context.args if context.args else []
locale_filter = args[0].lower() if args else ""

try:
import edge_tts

all_voices = await edge_tts.list_voices()

if locale_filter:
            # Detect whether the user typed /voices instead of /voice to set a
            # voice: voice names (e.g. es-AR-ElenaNeural) contain uppercase
            # letters, so check the raw argument — locale_filter is already
            # lowercased and would never match.
            if any(c.isupper() for c in args[0]):
                await safe_reply(
                    update.message,
                    f"💡 Did you mean `/voice {args[0]}`?\n\n"
"/voice — Set a voice (also toggles TTS on)\n"
"/voices — List available voices",
)
return

filtered = [v for v in all_voices if v["Locale"].lower().startswith(locale_filter)]
if not filtered:
await safe_reply(
update.message,
f"❌ No voices found for '{locale_filter}'.\n"
"Use /voices to see available locales.",
)
return
lines = []
current = get_voice(user.id)
for v in sorted(filtered, key=lambda x: (x["Locale"], x["ShortName"])):
gender = "♂" if v["Gender"] == "Male" else "♀"
tag = " ★" if v["ShortName"] == current else ""
lines.append(f"{gender} `{v['ShortName']}` — {v['Locale']}{tag}")
header = f"🗣 {locale_filter} — {len(lines)} voices\n\n"
else:
from collections import Counter

locale_counts = Counter(v["Locale"] for v in all_voices)
locale_flags = {
"ar": "🇸🇦", "bg": "🇧🇬", "cs": "🇨🇿", "da": "🇩🇰", "de": "🇩🇪",
"el": "🇬🇷", "en": "🇬🇧", "es": "🇪🇸", "et": "🇪🇪", "fi": "🇫🇮",
"fr": "🇫🇷", "he": "🇮🇱", "hi": "🇮🇳", "hr": "🇭🇷", "hu": "🇭🇺",
"id": "🇮🇩", "it": "🇮🇹", "ja": "🇯🇵", "ko": "🇰🇷", "lt": "🇱🇹",
"lv": "🇱🇻", "ms": "🇲🇾", "nl": "🇳🇱", "no": "🇳🇴", "pl": "🇵🇱",
"pt": "🇧🇷", "ro": "🇷🇴", "ru": "🇷🇺", "sk": "🇸🇰", "sl": "🇸🇮",
"sv": "🇸🇪", "th": "🇹🇭", "tr": "🇹🇷", "uk": "🇺🇦", "vi": "🇻🇳",
"zh": "🇨🇳",
}
lines = []
for locale, count in sorted(locale_counts.items()):
prefix = locale.split("-")[0]
flag = locale_flags.get(prefix, "🌐")
lines.append(f"{flag} `{locale}` — {count} voices")
header = f"🗣 Available locales ({len(locale_counts)}):\n\n"

await safe_reply(update.message, header + "\n".join(lines))
except Exception as e:
err = str(e)
if "503" in err or "Service Unavailable" in err:
await safe_reply(
update.message,
"⚠ Microsoft TTS service is temporarily unavailable (503).\n"
"Try again in a few seconds.",
)
else:
await safe_reply(update.message, f"❌ Failed to list voices: {e}")


async def esc_command(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Send Escape key to interrupt Claude."""
user = update.effective_user
@@ -642,11 +771,14 @@ async def voice_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
if not update.message or not update.message.voice:
return

if not config.openai_api_key:
stt_available = (
config.stt_engine == "whisper" or config.openai_api_key
)
if not stt_available:
await safe_reply(
update.message,
"⚠ Voice transcription requires an OpenAI API key.\n"
"Set `OPENAI_API_KEY` in your `.env` file and restart the bot.",
"⚠ No STT backend available.\n"
"Set CCBOT_STT_ENGINE=whisper (local) or OPENAI_API_KEY (API) in .env.",
)
return

@@ -1792,6 +1924,8 @@ async def handle_new_message(msg: NewMessage, bot: Bot) -> None:
text=msg.text,
thread_id=thread_id,
image_data=msg.image_data,
role=msg.role,
is_complete=msg.is_complete,
)

# Update user's read offset to current file position
@@ -1895,6 +2029,8 @@ def create_bot() -> Application:
application.add_handler(CommandHandler("screenshot", screenshot_command))
application.add_handler(CommandHandler("esc", esc_command))
application.add_handler(CommandHandler("unbind", unbind_command))
application.add_handler(CommandHandler("voice", voice_command))
application.add_handler(CommandHandler("voices", voices_command))
application.add_handler(CommandHandler("usage", usage_command))
application.add_handler(CallbackQueryHandler(callback_handler))
# Topic closed event — auto-kill associated window
25 changes: 24 additions & 1 deletion src/ccbot/config.py
@@ -101,12 +101,35 @@ def __init__(self) -> None:
os.getenv("CCBOT_SHOW_HIDDEN_DIRS", "").lower() == "true"
)

# OpenAI API for voice message transcription (optional)
# STT engine: "whisper" (local, default) or "openai" (API)
self.stt_engine: str = os.getenv("CCBOT_STT_ENGINE", "whisper")
# Whisper config (local STT via faster-whisper + CTranslate2 + CUDA)
self.whisper_model: str = os.getenv("CCBOT_WHISPER_MODEL", "large-v3")
self.whisper_device: str = os.getenv("CCBOT_WHISPER_DEVICE", "cuda")
self.whisper_compute_type: str = os.getenv(
"CCBOT_WHISPER_COMPUTE_TYPE", "float16"
)
# OpenAI API for voice transcription (fallback when stt_engine=openai)
self.openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
self.openai_base_url: str = os.getenv(
"OPENAI_BASE_URL", "https://api.openai.com/v1"
)

# TTS (Text-to-Speech) via edge-tts (Microsoft Edge neural voices)
self.tts_enabled: bool = os.getenv("CCBOT_TTS_ENABLED", "true").lower() in (
"true",
"1",
"yes",
)
self.tts_auto: bool = os.getenv("CCBOT_TTS_AUTO", "false").lower() in (
"true",
"1",
"yes",
)
self.tts_voice: str = os.getenv(
"CCBOT_TTS_VOICE", "es-ES-ElviraNeural"
)

# Scrub sensitive vars from os.environ so child processes never inherit them.
# Values are already captured in Config attributes above.
for var in SENSITIVE_ENV_VARS: