
Chelombus


Billion-scale molecular clustering and visualization on commodity hardware.

Chelombus enables interactive exploration of ultra-large chemical datasets (up to billions of molecules) using Product Quantization and nested TMAPs. Process the entire Enamine REAL database (9.6B molecules) on a single workstation.

Live Demo: https://chelombus.gdb.tools

Overview

Chelombus implements the "Nested TMAP" framework for visualizing billion-sized molecular datasets:

SMILES → MQN Fingerprints → PQ Encoding → PQk-means Clustering → Nested TMAPs

Key Features:

  • Scalability: Stream billions of molecules without loading everything into memory
  • Efficiency: Compress 42-dimensional MQN vectors to 6-byte PQ codes (28x compression)
  • GPU acceleration: Optional CUDA support for PQ encoding and cluster assignment (~25x speedup)
  • Visualization: Navigate from global overview to individual molecules in two clicks
  • Accessibility: Runs on commodity hardware (tested: AMD Ryzen 7, 64GB RAM)
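The 28x figure follows from storing each 42-dimensional MQN vector as float32 (168 bytes) versus a 6-byte PQ code. A minimal sketch of the encoding idea, using plain NumPy with random stand-in codebooks rather than Chelombus's own encoder (in practice the codebooks come from k-means on training data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 42-dimensional "fingerprints", m=6 subspaces of 7 dims each,
# k=256 centroids per subspace, so each sub-code fits in one uint8.
X = rng.random((1000, 42)).astype(np.float32)
m, k = 6, 256
sub_dim = X.shape[1] // m                        # 7 dimensions per subspace
codebooks = rng.random((m, k, sub_dim)).astype(np.float32)

# Encode: for each subspace, store the index of the nearest centroid.
codes = np.empty((X.shape[0], m), dtype=np.uint8)
for j in range(m):
    sub = X[:, j * sub_dim:(j + 1) * sub_dim]
    dists = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
    codes[:, j] = dists.argmin(axis=1)

raw_bytes = X.shape[1] * 4                       # 42 float32 values = 168 bytes/vector
pq_bytes = codes.shape[1]                        # 6 uint8 codes = 6 bytes/vector
print(raw_bytes / pq_bytes)                      # 28.0
```

Each molecule thus shrinks from 168 bytes to 6, and distances between codes can later be approximated from small per-subspace lookup tables.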

Installation

From PyPI (recommended)

pip install chelombus

From Source

git clone https://github.com/afloresep/chelombus.git
cd chelombus
pip install -e .

Platform Notes

Apple Silicon (M1/M2/M3): The pqkmeans library is not currently supported on Apple Silicon Macs, so clustering requires an x86_64 system for now. A rewrite of pqkmeans with Apple Silicon and GPU support is planned for a future release.

GPU Acceleration

Both PQEncoder.transform() and PQKMeans.predict() support optional GPU acceleration via the device parameter. When a CUDA GPU is available, device='auto' (the default) uses the GPU transparently; otherwise it falls back to CPU.

Requirements: torch and triton (both installed with pip install torch).

from chelombus import PQEncoder, PQKMeans

encoder = PQEncoder.load('encoder.joblib')
clusterer = PQKMeans.load('clusterer.joblib')

# GPU is used automatically when available
pq_codes = encoder.transform(fingerprints)    # device='auto' by default
labels = clusterer.predict(pq_codes)          # device='auto' by default

# Or force a specific device
labels_cpu = clusterer.predict(pq_codes, device='cpu')
labels_gpu = clusterer.predict(pq_codes, device='gpu')

Benchmarks (20M molecules, K=100,000 clusters, RTX 4070 Ti 16GB):

Step                  GPU     CPU     Speedup
PQ Transform          7.3s    45.3s   6.2x
Cluster Assignment    29.9s   ~879s   29.4x

Extrapolated to 9.6B molecules (Enamine REAL):

Step                  GPU      CPU
PQ Transform          59 min   6.0 h
Cluster Assignment    4.0 h    117 h
Combined              5.0 h    123 h

The GPU implementation uses a custom Triton kernel for cluster assignment that tiles over centers with an online argmin, never materializing the N x K distance matrix. VRAM usage is ~10 bytes/point, so even an 8 GB GPU can process hundreds of millions of points per batch.
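The tiling-with-online-argmin structure can be sketched in plain NumPy, independent of Triton: distances are computed one tile of centers at a time, and only a running best distance and best label per point are kept, so the full N x K matrix never exists. This is an illustrative re-implementation with squared Euclidean distances standing in for the PQ lookup-table distances the real kernel uses; `assign_tiled` is a hypothetical name, not Chelombus API:

```python
import numpy as np

def assign_tiled(points, centers, tile=1024):
    """Nearest-center assignment without materializing the N x K distance
    matrix: iterate over tiles of centers, keeping a running argmin."""
    n = points.shape[0]
    best_dist = np.full(n, np.inf, dtype=np.float32)
    best_label = np.zeros(n, dtype=np.int64)
    for start in range(0, centers.shape[0], tile):
        chunk = centers[start:start + tile]                     # (tile, D)
        # Distances for this tile only: (N, tile), never (N, K)
        d = ((points[:, None, :] - chunk[None, :, :]) ** 2).sum(-1)
        tile_best = d.argmin(axis=1)
        tile_dist = d[np.arange(n), tile_best]
        improved = tile_dist < best_dist
        best_dist[improved] = tile_dist[improved]
        best_label[improved] = start + tile_best[improved]
    return best_label

rng = np.random.default_rng(1)
pts = rng.random((500, 6)).astype(np.float32)
ctr = rng.random((4000, 6)).astype(np.float32)
labels = assign_tiled(pts, ctr, tile=256)

# Matches the brute-force answer that does build the full matrix
ref = ((pts[:, None, :] - ctr[None, :, :]) ** 2).sum(-1).argmin(axis=1)
print(np.array_equal(labels, ref))
```

Keeping only a running distance and label per point is what bounds the working set to a few bytes per point, consistent with the ~10 bytes/point VRAM figure quoted above.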

To reproduce the benchmarks:

# Decompress the test SMILES (if using the gzipped version)
gunzip -k data/10M_smiles.txt.gz

# Run benchmark (pre-computes and caches fingerprints on first run)
python scripts/benchmark_gpu_predict.py

Quick Start

from chelombus import DataStreamer, FingerprintCalculator, PQEncoder, PQKMeans

# 1. Stream SMILES in chunks
streamer = DataStreamer(path='molecules.smi', chunksize=100000)

# 2. Calculate MQN fingerprints
fp_calc = FingerprintCalculator()
for smiles_chunk in streamer.parse_input():
    fingerprints = fp_calc.FingerprintFromSmiles(smiles_chunk, fp='mqn')
    # Save fingerprints...

# 3. Train PQ encoder on sample
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)

# 4. Transform all fingerprints to PQ codes
pq_codes = encoder.transform(fingerprints)

# 5. Cluster with PQk-means
clusterer = PQKMeans(encoder, k=100000)
labels = clusterer.fit_predict(pq_codes)
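Steps 2 and 4 above are typically run in one streaming loop so the full fingerprint matrix never resides in memory: encode each chunk as it arrives and append the codes to disk. A hedged sketch of that pattern; the per-chunk .npy file layout and the toy quantizer standing in for encoder.transform are my assumptions, not part of the Chelombus API:

```python
import os
import tempfile
import numpy as np

# Stand-in for encoder.transform: any chunkwise map from float vectors
# to 6-byte PQ codes (a toy quantizer here, NOT the Chelombus encoder).
def toy_transform(chunk):
    return (chunk[:, :6] * 255).astype(np.uint8)

out_dir = tempfile.mkdtemp()
rng = np.random.default_rng(2)

# Process the dataset chunk by chunk, writing codes to disk as we go.
paths = []
for i in range(3):  # real runs would iterate over streamer.parse_input()
    fingerprints = rng.random((1000, 42)).astype(np.float32)
    codes = toy_transform(fingerprints)   # with Chelombus: encoder.transform(...)
    path = os.path.join(out_dir, f"pq_codes_{i:05d}.npy")
    np.save(path, codes)
    paths.append(path)

# Later: concatenate (or memory-map) the per-chunk code files for clustering.
all_codes = np.concatenate([np.load(p) for p in paths])
print(all_codes.shape)                    # (3000, 6)
```

Because PQ codes are only 6 bytes per molecule, even a billion-molecule code matrix (~6 GB) fits on commodity hardware once the float fingerprints are discarded chunk by chunk.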

Project Structure

chelombus/
├── chelombus/
│   ├── encoder/          # Product Quantization encoder
│   ├── clustering/       # PQk-means wrapper
│   ├── streamer/         # Memory-efficient data streaming
│   └── utils/            # Fingerprints, visualization, helpers
├── scripts/              # Pipeline scripts
├── examples/             # Tutorial notebooks
└── tests/                # Unit tests

Choosing k (Number of Clusters)

The scripts/select_k.py script sweeps over k values on a subsample to help pick the right number of clusters. It supports checkpointing: if interrupted, rerun the same command and it resumes from where it left off.

python scripts/select_k.py \
    --pq-codes data/pq_codes.npy \
    --encoder models/encoder.joblib \
    --n-subsample 10000000 \
    --k-values 10000 25000 50000 100000 200000 \
    --iterations 10 \
    --output results/k_selection.csv \
    --plot results/k_selection.png

Results on 100M Enamine REAL molecules (AMD Ryzen 7, 64GB RAM):

k        Avg Distance  Empty Clusters  Median Cluster Size  Fit Time
10,000   3.65          6.8%            8,945                1.3 h
25,000   2.74          13.3%           3,673                3.1 h
50,000   2.17          19.6%           1,876                6.2 h
100,000  1.69          26.6%           956                  12.6 h
200,000  1.30          34.7%           492                  26.4 h

Guidelines:

  • k = 50,000 is a good default — under 20% empty clusters, median size ~1,900, and the avg distance improvement starts plateauing beyond this point.
  • k = 100,000 if you need tighter clusters and can tolerate ~27% empty clusters.
  • At k = 200,000, over a third of clusters are already empty; pushing k higher yields diminishing returns.
  • Fit time scales linearly with both n and k (e.g., 1B molecules at k=50K ≈ 2.6 days).
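The plateau claim can be checked directly from the table above: dividing each step's drop in average distance by the number of clusters added shows the marginal benefit per extra cluster falling by more than an order of magnitude across the sweep. The per-cluster framing is my own; the numbers are from the 100M-molecule run:

```python
# (k, avg_distance) pairs from the k-selection table above
results = [(10_000, 3.65), (25_000, 2.74), (50_000, 2.17),
           (100_000, 1.69), (200_000, 1.30)]

for (k0, d0), (k1, d1) in zip(results, results[1:]):
    gain_per_cluster = (d0 - d1) / (k1 - k0)
    print(f"{k0:>7,} -> {k1:>7,}: {gain_per_cluster:.2e} distance per extra cluster")

# Linear scaling in n lets you extrapolate fit time: 100M molecules at
# k=50,000 took 6.2 h, so 1B molecules at the same k is roughly 10x that.
print(6.2 * (1_000_000_000 / 100_000_000), "hours")  # 62.0 hours (~2.6 days)
```

The 10,000 -> 25,000 step buys about 6e-5 distance per added cluster; the 100,000 -> 200,000 step buys about 4e-6, which is the plateau the default recommendation is based on.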

Documentation

  • Full docs: https://chelombus.gdb.tools
  • Tutorial: See examples/tutorial.ipynb for a hands-on introduction
  • Large-scale example: See examples/enamine_1B_clustering.ipynb
  • API Reference: See docs/api.md or the hosted docs

Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_encoder.py -v

Citation

If you use Chelombus in your research, please cite:

@article{chelombus2025,
  title={Nested TMAPs to visualize Billions of Molecules},
  author={Flores Sepulveda, Alejandro and Reymond, Jean-Louis},
  journal={},
  year={2025}
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Write tests for new functionality
  4. Submit a pull request

License

MIT License. See LICENSE for details.

Acknowledgments

  • PQk-means by Matsui et al.
  • TMAP by Probst & Reymond
  • RDKit for cheminformatics functionality
  • Swiss National Science Foundation (grant no. 200020_178998)

