
blya_bot

Build Docker Images · Python 3.13+ · License: MIT · GitHub release · Deploy to DO

Blya - Russian expletive "shit" (figuratively). Pronounced like "bla", but with a softer "L". It's what you might say if your car won't start in the morning, and you're going to be late for work. (urbandictionary)

This is a Telegram bot that transcribes voice and video notes into text using automatic speech recognition (ASR) models, with a single purpose: to decide who uses more "curse words"...

How it works

This repo contains two pieces:

  • dictgen/ — a dictionary compiler (DSL parsing + optional morphological expansion via pymorphy3)
  • blya_bot/ — the Telegram bot (ASR + matching + replies)

At runtime the bot has three main components:

  • Speech recognition (ASR): vosk or Whisper via faster-whisper / pywhispercpp
  • Fast word matching: a packed dictionary + Aho-Corasick automaton (via ahocorasick_rs)
  • Telegram plumbing: downloads media, optionally caches transcriptions, formats replies

Startup flow

  1. Load a packed dictionary file (e.g. dict.bb, generated by dictgen)
  2. Build the Aho-Corasick matcher in memory
  3. Load the configured ASR model

Message flow (voice/video notes, including in groups)

  1. Download the note to a temporary file
  2. Transcribe audio into text
  3. Find and count dictionary matches in the transcription
  4. Reply with a per-word breakdown and overall stats; if you reply with /t (configurable), the bot also sends the full transcription with highlights
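Steps 3 and 4 reduce to multi-pattern matching over the transcription. The real bot relies on the ahocorasick_rs bindings; purely as an illustration, here is a minimal pure-Python Aho-Corasick counter (an educational sketch, not the project's actual code):

```python
from collections import deque

class AhoCorasick:
    """Tiny Aho-Corasick automaton for counting dictionary hits."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transitions: {char: next_state}
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns that end at each state
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        # Breadth-first pass to compute failure links and merge outputs
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def count_matches(self, text):
        """Return {pattern: occurrences} over the whole text."""
        counts, state = {}, 0
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                counts[pat] = counts.get(pat, 0) + 1
        return counts
```

The bot's counting step works the same way in spirit: feed the transcription through an automaton built from the packed dictionary, then aggregate per-word statistics for the reply.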

Limitations

This project started as a simple joke and was written primarily for Russian.

Although the speech recognition engines support many languages, the "dictionary morphing" step is Russian-specific (it uses pymorphy3). At the moment, only Russian has been tested end-to-end.

If you want to use your own custom dictionary (including non-Russian), generate the packed dictionary with morphing disabled:

  • dictgen ... --no-morphing

This skips pymorphy3-based morphological expansion and keeps only the DSL-generated variants. See the TODO section for the current multi-language roadmap.

Dictionary DSL

#<anything> - comment
!<word> - disables morphing for this word. The token is global and can be placed in any part of the word.
  Expansions will also contain this token, so morphing is disabled for the variants too
~<word> - excludes the word from the dictionary. Applied after variant generation and morphing. Must be placed at the start of the word.
[...|...] - expands to the word with extra variants (suffixes, prefixes).
  The word without these elements is also included, e.g.: he[llo|ll] -> he, hello, hell
  Can also be used with a single variant, e.g.: bad[ass] -> bad, badass
{...|...} - expands to variants only.
  The word without these elements is not included, e.g.: he{llo|ll} -> hello, hell
  A single element in a {} group is pointless, e.g.: he{llo} -> hello
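To make the [...|...] and {...|...} semantics concrete, here is a simplified expansion sketch (illustrative only: the function name `expand` is hypothetical, and the real dictgen parser additionally handles #, !, ~ and morphing):

```python
import itertools
import re

def expand(pattern):
    """Expand a single DSL pattern into its sorted list of word variants.

    [a|b] contributes each variant plus the empty string (the bare word),
    {a|b} contributes each variant only.
    """
    # Split into literal chunks and [...]/{...} group tokens
    tokens = re.split(r"(\[[^\]]*\]|\{[^}]*\})", pattern)
    choices = []
    for tok in tokens:
        if tok.startswith("["):
            choices.append([""] + tok[1:-1].split("|"))   # optional group
        elif tok.startswith("{"):
            choices.append(tok[1:-1].split("|"))          # mandatory group
        else:
            choices.append([tok])                         # literal text
    return sorted({"".join(parts) for parts in itertools.product(*choices)})
```

For example, expand("he[llo|ll]") yields he, hell and hello, while expand("he{llo|ll}") yields only hell and hello.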

Build & Run

Prerequisites

Install uv package manager:

curl -LsSf https://astral.sh/uv/install.sh | sh

Running Locally (Manual Setup)

The project consists of two packages:

  • dictgen/ — Dictionary generator with morphological expansion
  • blya_bot/ — The Telegram bot itself

Step 1: Generate the dictionary

First, generate the packed dictionary file with morphological word forms:

cd dictgen
uv sync
uv run -m dictgen -i ../fixtures/bad_words.txt -o ../fixtures/dict.bb --morphing

This reads the DSL dictionary from fixtures/bad_words.txt, applies morphological expansion, and saves the packed binary dictionary to fixtures/dict.bb.

Step 2: Configure and run the bot

cd blya_bot
uv sync --all-extras  # Install with all speech recognition engines
# Or install with specific engine only:
# uv sync --extra vosk
# uv sync --extra faster-whisper

# Create .env file with your configuration (see Configuration section below)
# Then run:
uv run -m blya_bot

dictgen CLI Reference

usage: python -m dictgen [-h] -i INPUT -o OUTPUT [-c {zstd,none}] [--morphing | --no-morphing]

Options:
  -i, --input         Path to DSL dictionary text file (required)
  -o, --output        Output packed file path (required)
  -c, --compression   Compression type: zstd (default) or none
  --morphing          Enable morphological expansion (default)
  --no-morphing       Disable morphological expansion

Obtaining speech recognition models

Next, you need to gather speech recognition model files.

The current blya_bot implementation supports multiple speech recognition engines:

Vosk

Download vosk models from https://alphacephei.com/vosk/models.

Place the model in any folder you like, and specify the path to it during configuration.

Faster-Whisper

Models will be downloaded automatically on first application start.

Default models folder is ~/.cache/huggingface/hub.

Pywhispercpp

Models will be downloaded automatically on first application start.

Default models folder is ~/.local/share/pywhispercpp/models.

Configuration

Once all dependencies are installed and the model files obtained, configure the settings.

Create a .env file in this directory and populate the required options:

Vosk

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="vosk"
# Currently there is only one option, `model_path` - the path to the downloaded model
RECOGNITION_ENGINE_OPTIONS='{"model_path": "/path/to/vosk-model"}'

Faster-Whisper

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="faster-whisper"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru", "device": "cpu", "compute_type": "int8", "beam_size": 5}'

Pywhispercpp

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="pywhispercpp"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru"}'
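All three engine configurations above follow the same shape: an engine name plus a JSON object of engine options. A minimal sketch of reading and validating them (the environment variable names come from this README; the helper name and the required-field table are assumptions, not the bot's actual code):

```python
import json
import os

def load_recognition_config():
    """Read the engine name and its JSON options from the environment."""
    engine = os.environ.get("RECOGNITION_ENGINE", "vosk")
    options = json.loads(os.environ.get("RECOGNITION_ENGINE_OPTIONS", "{}"))
    # Required keys per engine, as documented in the .env examples above
    required = {
        "vosk": {"model_path"},
        "faster-whisper": {"model", "language"},
        "pywhispercpp": {"model", "language"},
    }
    missing = required.get(engine, set()) - options.keys()
    if missing:
        raise ValueError(f"{engine}: missing required options: {sorted(missing)}")
    return engine, options
```

With the faster-whisper settings above in the environment, this would return the tuple ("faster-whisper", {...options...}); with an empty options blob it fails fast instead of crashing later at model-load time.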

When all required fields are configured, run the bot:

cd blya_bot
uv run -m blya_bot

Docker images

This repository also includes Dockerfiles that build all-in-one blya_bot images. These images contain the bot sources and the desired speech recognition model.

They can be distributed and run without extra volumes.

Vosk-based

Build:

Pass the URL of a vosk model as the MODEL_URL build arg.

docker build --build-arg MODEL_URL="https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22.zip" -t blya_bot:vosk -f dockerfiles/vosk.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:vosk

Faster-Whisper-based

Build:

Pass the MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:faster-whisper -f dockerfiles/faster-whisper.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:faster-whisper

Pywhispercpp-based

Build:

Pass the MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:pywhispercpp -f dockerfiles/pywhispercpp.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:pywhispercpp

TODO

  • Basic STT without external APIs
  • Bad words counting and summarization
  • Highlight all bad words
  • Handle voice notes
  • Handle video notes
  • Caching/storing transcriptions
  • Send transcription to chat
  • Telegram webhook
  • Configurable "transcribe" commands for bot
  • Make bot "language independent"
    • Add other languages support for recognition
    • Add other languages support for morphological analysis
    • Summary templates
    • Remove all language-dependent parts from sources

Contributing

  1. Fork it
  2. Clone it: git clone https://github.com/dokzlo13/blya_bot.git
  3. Create your feature branch: git checkout -b my-new-feature
  4. Make changes and add them: git add .
  5. Commit: git commit -m 'My awesome feature'
  6. Push: git push origin my-new-feature
  7. Pull request
