Billion-scale molecular clustering and visualization on commodity hardware.
Chelombus enables interactive exploration of ultra-large chemical datasets (up to billions of molecules) using Product Quantization and nested TMAPs. Process the entire Enamine REAL database (9.6B molecules) on a single workstation.
Live Demo: https://chelombus.gdb.tools
Chelombus implements the "Nested TMAP" framework for visualizing billion-sized molecular datasets:
SMILES → MQN Fingerprints → PQ Encoding → PQk-means Clustering → Nested TMAPs
Key Features:
- Scalability: Stream billions of molecules without loading everything into memory
- Efficiency: Compress 42-dimensional MQN vectors to 6-byte PQ codes (28x compression)
- GPU acceleration: Optional CUDA support for PQ encoding and cluster assignment (~25x speedup)
- Visualization: Navigate from global overview to individual molecules in two clicks
- Accessibility: Runs on commodity hardware (tested: AMD Ryzen 7, 64GB RAM)
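The 28x compression figure above is straightforward arithmetic, and a minimal sketch makes it concrete: a 42-dimensional MQN fingerprint stored as float32 occupies 168 bytes, while its PQ code (m=6 subvectors, each quantized to one of k=256 centroids) needs only 6 uint8 indices. The variable names below are illustrative, not part of the Chelombus API.

```python
import numpy as np

# A 42-dimensional MQN fingerprint as float32: 42 * 4 = 168 bytes.
fingerprint = np.zeros(42, dtype=np.float32)

# Product Quantization splits it into m=6 subvectors and replaces each
# with the index (0-255) of its nearest sub-codebook centroid, so the
# whole vector becomes 6 uint8 codes = 6 bytes.
pq_code = np.zeros(6, dtype=np.uint8)

print(fingerprint.nbytes // pq_code.nbytes)  # 168 / 6 = 28x compression
```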
pip install chelombus

Or install from source:

git clone https://github.com/afloresep/chelombus.git
cd chelombus
pip install -e .

Apple Silicon (M1/M2/M3): The pqkmeans library is not currently supported on Apple Silicon Macs. A rewrite of pqkmeans with Apple Silicon and GPU support is planned for a future release; for now, the clustering functionality requires an x86_64 system.
Both PQEncoder.transform() and PQKMeans.predict() support optional GPU acceleration via the device parameter. When a CUDA GPU is available, device='auto' (the default) uses the GPU transparently; otherwise it falls back to CPU.
Requirements: torch and triton (both installed with pip install torch).
encoder = PQEncoder.load('encoder.joblib')
clusterer = PQKMeans.load('clusterer.joblib')
# GPU is used automatically when available
pq_codes = encoder.transform(fingerprints) # device='auto' by default
labels = clusterer.predict(pq_codes) # device='auto' by default
# Or force a specific device
labels_cpu = clusterer.predict(pq_codes, device='cpu')
labels_gpu = clusterer.predict(pq_codes, device='gpu')

Benchmarks (20M molecules, K=100,000 clusters, RTX 4070 Ti 16GB):
| Step | GPU | CPU | Speedup |
|---|---|---|---|
| PQ Transform | 7.3s | 45.3s | 6.2x |
| Cluster Assignment | 29.9s | ~879s | 29.4x |
Extrapolated to 9.6B molecules (Enamine REAL):
| Step | GPU | CPU |
|---|---|---|
| PQ Transform | 59 min | 6.0 h |
| Cluster Assignment | 4.0 h | 117 h |
| Combined | 5.0 h | 123 h |
The GPU implementation uses a custom Triton kernel for cluster assignment that tiles over centers with an online argmin, never materializing the N x K distance matrix. VRAM usage is ~10 bytes/point, so even an 8 GB GPU can process hundreds of millions of points per batch.
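The tiled online-argmin strategy can be illustrated in plain NumPy (a CPU stand-in for the Triton kernel, for exposition only; the function name is hypothetical, not the Chelombus API): only an (N, tile) distance block exists at any moment, and a running best distance/index pair replaces the full N x K matrix.

```python
import numpy as np

def assign_clusters_tiled(points, centers, tile=1024):
    """Nearest-center assignment, tiling over centers with an online argmin.

    Never materializes the full (N, K) distance matrix; per-point state is
    just one float (best distance) and one int (best index).
    """
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    for start in range(0, centers.shape[0], tile):
        block = centers[start:start + tile]                      # (tile, D)
        # Squared Euclidean distances for this tile only: (N, tile)
        d = ((points[:, None, :] - block[None, :, :]) ** 2).sum(-1)
        tile_best = d.argmin(axis=1)
        tile_dist = d[np.arange(n), tile_best]
        better = tile_dist < best_dist                           # online argmin update
        best_dist[better] = tile_dist[better]
        best_idx[better] = start + tile_best[better]
    return best_idx

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 6))
ctrs = rng.normal(size=(50, 6))
labels = assign_clusters_tiled(pts, ctrs, tile=16)
```

The result is identical to a full-matrix argmin; only peak memory changes, which is why even an 8 GB GPU can batch hundreds of millions of points.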
To reproduce the benchmarks:
# Decompress the test SMILES (if using the gzipped version)
gunzip -k data/10M_smiles.txt.gz
# Run benchmark (pre-computes and caches fingerprints on first run)
python scripts/benchmark_gpu_predict.py

from chelombus import DataStreamer, FingerprintCalculator, PQEncoder, PQKMeans
# 1. Stream SMILES in chunks
streamer = DataStreamer(path='molecules.smi', chunksize=100000)
# 2. Calculate MQN fingerprints
fp_calc = FingerprintCalculator()
for smiles_chunk in streamer.parse_input():
    fingerprints = fp_calc.FingerprintFromSmiles(smiles_chunk, fp='mqn')
    # Save fingerprints...
# 3. Train PQ encoder on sample
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)
# 4. Transform all fingerprints to PQ codes
pq_codes = encoder.transform(fingerprints)
# 5. Cluster with PQk-means
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)

chelombus/
├── chelombus/
│ ├── encoder/ # Product Quantization encoder
│ ├── clustering/ # PQk-means wrapper
│ ├── streamer/ # Memory-efficient data streaming
│ └── utils/ # Fingerprints, visualization, helpers
├── scripts/ # Pipeline scripts
├── examples/ # Tutorial notebooks
└── tests/ # Unit tests
The scripts/select_k.py script sweeps over k values on a subsample to help pick the right number of clusters. It supports checkpointing: if interrupted, rerun the same command and it resumes from where it left off.
python scripts/select_k.py \
--pq-codes data/pq_codes.npy \
--encoder models/encoder.joblib \
--n-subsample 10000000 \
--k-values 10000 25000 50000 100000 200000 \
--iterations 10 \
--output results/k_selection.csv \
  --plot results/k_selection.png

Results on 100M Enamine REAL molecules (AMD Ryzen 7, 64GB RAM):
| k | Avg Distance | Empty Clusters | Median Cluster Size | Fit Time |
|---|---|---|---|---|
| 10,000 | 3.65 | 6.8% | 8,945 | 1.3 h |
| 25,000 | 2.74 | 13.3% | 3,673 | 3.1 h |
| 50,000 | 2.17 | 19.6% | 1,876 | 6.2 h |
| 100,000 | 1.69 | 26.6% | 956 | 12.6 h |
| 200,000 | 1.30 | 34.7% | 492 | 26.4 h |
Guidelines:
- k = 50,000 is a good default — under 20% empty clusters, median size ~1,900, and the avg distance improvement starts plateauing beyond this point.
- k = 100,000 if you need tighter clusters and can tolerate ~27% empty clusters.
- Beyond 200K, over a third of clusters are empty — diminishing returns.
- Fit time scales linearly with both n and k (e.g., 1B molecules at k=50K ≈ 2.6 days).
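The last guideline can be turned into a back-of-the-envelope estimator. A minimal sketch, assuming fit time ≈ c * n * k and calibrating c from the 100M-molecule, k=50,000 run in the table above (6.2 h); these are rough planning estimates, not measurements.

```python
# Calibrate the linear-scaling constant from the table above:
# 100M molecules at k=50,000 took 6.2 hours.
c = 6.2 / (100e6 * 50e3)  # hours per (molecule * cluster)

def estimate_fit_hours(n_molecules, k_clusters):
    """Rough fit-time estimate assuming time scales linearly in n and k."""
    return c * n_molecules * k_clusters

hours = estimate_fit_hours(1e9, 50e3)
print(round(hours, 1), round(hours / 24, 1))  # 62.0 h, ~2.6 days
```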
- Full docs: https://chelombus.gdb.tools
- Tutorial: See examples/tutorial.ipynb for a hands-on introduction
- Large-scale example: See examples/enamine_1B_clustering.ipynb
- API Reference: See docs/api.md or the hosted docs
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_encoder.py -v

If you use Chelombus in your research, please cite:
@article{chelombus2025,
title={Nested TMAPs to visualize Billions of Molecules},
author={Flores Sepulveda, Alejandro and Reymond, Jean-Louis},
journal={},
year={2025}
}

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Write tests for new functionality
- Submit a pull request
MIT License. See LICENSE for details.