
blya_bot

Build Docker Images · Python 3.13+ · License: MIT · GitHub release · Deploy to DO

Blya - Russian expletive "shit" (figuratively). Pronounced like "bla", but with a softer "L". It's what you might say if your car won't start in the morning, and you're going to be late for work. (urbandictionary)

This is a Telegram bot that transcribes voice and video notes into text using automatic speech recognition (ASR) models, with a single purpose: to decide who uses more "curse words"...

How it works

This repo contains two pieces:

  • dictgen/ — a dictionary compiler (DSL parsing + optional morphological expansion via pymorphy3)
  • blya_bot/ — the Telegram bot (ASR + matching + replies)

At runtime the bot has three main components:

  • Speech recognition (ASR): vosk or Whisper via faster-whisper / pywhispercpp
  • Fast word matching: a packed dictionary + Aho-Corasick automaton (via ahocorasick_rs)
  • Telegram plumbing: downloads media, optionally caches transcriptions, formats replies

Startup flow

  1. Load a packed dictionary file (e.g. dict.bb, generated by dictgen)
  2. Build the Aho-Corasick matcher in memory
  3. Load the configured ASR model

Message flow (voice/video notes, including in groups)

  1. Download the note to a temporary file
  2. Transcribe audio into text
  3. Find and count dictionary matches in the transcription
  4. Reply with a per-word breakdown and overall stats; if you reply with /t (configurable), the bot also sends the full transcription with highlights
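Steps 3 and 4 reduce to multi-pattern matching over the transcription. The real bot relies on the ahocorasick_rs bindings; purely as an illustration, here is a minimal pure-Python Aho-Corasick counter (an educational sketch, not the project's actual code):

```python
from collections import deque

class AhoCorasick:
    """Tiny Aho-Corasick automaton for counting dictionary hits."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transitions: {char: next_state}
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns that end at each state
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        # Breadth-first pass to compute failure links and merge outputs
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def count_matches(self, text):
        """Return {pattern: occurrences} over the whole text."""
        counts, state = {}, 0
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                counts[pat] = counts.get(pat, 0) + 1
        return counts
```

The bot's counting step works the same way in spirit: feed the transcription through an automaton built from the packed dictionary, then aggregate per-word statistics for the reply.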

Limitations

This project started as a simple joke and was written primarily for Russian.

Although the speech recognition engines support many languages, the "dictionary morphing" step is Russian-specific (it uses pymorphy3). At the moment, only Russian has been tested end-to-end.

If you want to use your own custom dictionary (including non-Russian), generate the packed dictionary with morphing disabled:

  • dictgen ... --no-morphing

This skips pymorphy3-based morphological expansion and keeps only the DSL-generated variants. See the TODO section for the current multi-language roadmap.

Dictionary DSL

#<anything> - comment
!<word> - disables morphing for this word. The token is global and can be placed in any part of the word.
  Expansions will also contain this token, so morphing is disabled for the variants too
~<word> - excludes the word from the dictionary. Applied after variant generation and morphing. Must be placed at the start of the word.
[...|...] - expands to the word with extra variants (suffixes, prefixes).
  The word without these elements is also included, e.g.: he[llo|ll] -> he, hello, hell
  Can also be used with a single variant, e.g.: bad[ass] -> bad, badass
{...|...} - expands to variants only.
  The word without these elements is not included, e.g.: he{llo|ll} -> hello, hell
  A single element in a {} group is pointless, e.g.: he{llo} -> hello
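To make the [...|...] and {...|...} semantics concrete, here is a simplified expansion sketch (illustrative only: the function name `expand` is hypothetical, and the real dictgen parser additionally handles #, !, ~ and morphing):

```python
import itertools
import re

def expand(pattern):
    """Expand a single DSL pattern into its sorted list of word variants.

    [a|b] contributes each variant plus the empty string (the bare word),
    {a|b} contributes each variant only.
    """
    # Split into literal chunks and [...]/{...} group tokens
    tokens = re.split(r"(\[[^\]]*\]|\{[^}]*\})", pattern)
    choices = []
    for tok in tokens:
        if tok.startswith("["):
            choices.append([""] + tok[1:-1].split("|"))   # optional group
        elif tok.startswith("{"):
            choices.append(tok[1:-1].split("|"))          # mandatory group
        else:
            choices.append([tok])                         # literal text
    return sorted({"".join(parts) for parts in itertools.product(*choices)})
```

For example, expand("he[llo|ll]") yields he, hell and hello, while expand("he{llo|ll}") yields only hell and hello.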

Build & Run

Prerequisites

Install uv package manager:

curl -LsSf https://astral.sh/uv/install.sh | sh

Running Locally (Manual Setup)

The project consists of two packages:

  • dictgen/ — Dictionary generator with morphological expansion
  • blya_bot/ — The Telegram bot itself

Step 1: Generate the dictionary

First, generate the packed dictionary file with morphological word forms:

cd dictgen
uv sync
uv run -m dictgen -i ../fixtures/bad_words.txt -o ../fixtures/dict.bb --morphing

This reads the DSL dictionary from fixtures/bad_words.txt, applies morphological expansion, and saves the packed binary dictionary to fixtures/dict.bb.

Step 2: Configure and run the bot

cd blya_bot
uv sync --all-extras  # Install with all speech recognition engines
# Or install with specific engine only:
# uv sync --extra vosk
# uv sync --extra faster-whisper

# Create .env file with your configuration (see Configuration section below)
# Then run:
uv run -m blya_bot

dictgen CLI Reference

usage: python -m dictgen [-h] -i INPUT -o OUTPUT [-c {zstd,none}] [--morphing | --no-morphing]

Options:
  -i, --input         Path to DSL dictionary text file (required)
  -o, --output        Output packed file path (required)
  -c, --compression   Compression type: zstd (default) or none
  --morphing          Enable morphological expansion (default)
  --no-morphing       Disable morphological expansion

Obtaining speech recognition models

Next, you need to gather speech recognition model files.

The current blya_bot implementation supports multiple speech recognition engines:

Vosk

Download vosk models from https://alphacephei.com/vosk/models.

Place the model in any folder you like, and specify the path to it during configuration.

Faster-Whisper

Models will be downloaded automatically on first application start.

Default models folder is ~/.cache/huggingface/hub.

Pywhispercpp

Models will be downloaded automatically on first application start.

Default models folder is ~/.local/share/pywhispercpp/models.

Configuration

Once all dependencies are installed and the model files obtained, configure the settings.

Create a .env file in this directory and populate the required options:

Vosk

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="vosk"
# Currently there is only one option, `model_path` - the path to the downloaded model
RECOGNITION_ENGINE_OPTIONS='{"model_path": "/path/to/vosk-model"}'

Faster-Whisper

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="faster-whisper"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru", "device": "cpu", "compute_type": "int8", "beam_size": 5}'

Pywhispercpp

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="pywhispercpp"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru"}'
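All three engine configurations above follow the same shape: an engine name plus a JSON object of engine options. A minimal sketch of reading and validating them (the environment variable names come from this README; the helper name and the required-field table are assumptions, not the bot's actual code):

```python
import json
import os

def load_recognition_config():
    """Read the engine name and its JSON options from the environment."""
    engine = os.environ.get("RECOGNITION_ENGINE", "vosk")
    options = json.loads(os.environ.get("RECOGNITION_ENGINE_OPTIONS", "{}"))
    # Required keys per engine, as documented in the .env examples above
    required = {
        "vosk": {"model_path"},
        "faster-whisper": {"model", "language"},
        "pywhispercpp": {"model", "language"},
    }
    missing = required.get(engine, set()) - options.keys()
    if missing:
        raise ValueError(f"{engine}: missing required options: {sorted(missing)}")
    return engine, options
```

With the faster-whisper settings above in the environment, this would return the tuple ("faster-whisper", {...options...}); with an empty options blob it fails fast instead of crashing later at model-load time.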

When all required fields are configured, run the bot:

cd blya_bot
uv run -m blya_bot

Docker images

This repository also includes Dockerfiles that build all-in-one blya_bot images. These images contain the bot sources and the desired speech recognition model.

They can be distributed and run without extra volumes.

Vosk-based

Build:

Pass the URL of a vosk model as the MODEL_URL build arg.

docker build --build-arg MODEL_URL="https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22.zip" -t blya_bot:vosk -f dockerfiles/vosk.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:vosk

Faster-Whisper-based

Build:

Pass the MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:faster-whisper -f dockerfiles/faster-whisper.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:faster-whisper

Pywhispercpp-based

Build:

Pass the MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:pywhispercpp -f dockerfiles/pywhispercpp.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:pywhispercpp

TODO

  • Basic STT without external APIs
  • Bad words counting and summarization
  • Highlight all bad words
  • Handle voice notes
  • Handle video notes
  • Caching/storing transcriptions
  • Send transcription to chat
  • Telegram webhook
  • Configurable "transcribe" commands for bot
  • Make bot "language independent"
    • Add other languages support for recognition
    • Add other languages support for morphological analysis
    • Summary templates
    • Remove all language-dependent parts from sources

Contributing

  1. Fork it
  2. Clone it: git clone https://github.com/dokzlo13/blya_bot.git
  3. Create your feature branch: git checkout -b my-new-feature
  4. Make changes and add them: git add .
  5. Commit: git commit -m 'My awesome feature'
  6. Push: git push origin my-new-feature
  7. Pull request
