A high-performance FastAPI-based reranking and embedding service with Cohere-compatible endpoints. This service provides state-of-the-art text reranking and embedding capabilities using Hugging Face transformers models.
- Multiple Reranking Modes: Cross-encoder and bi-encoder support
- Multilingual Support: Built-in support for multilingual models
- High Performance: Optimized batch processing and model caching
- Direct Transformers Support: Advanced models like Qwen3-Embedding-8B with native transformers
- Intelligent Model Detection: Automatic detection of models requiring direct transformers vs sentence-transformers
- Flash Attention: Automatic flash attention activation for supported models (2-4x speedup)
- Precision Control: bfloat16/float16/float32 support for optimal GPU performance
- Matryoshka (MRL) Downprojection: Stable dimensionality reduction preserving prefix dimensions
- Extended Context Support: Up to 32k tokens for advanced models
- Advanced PyTorch Optimizations: TF32, high-precision matmul, inference mode
- Cohere-Compatible API: Drop-in replacement for the Cohere rerank API
- Security: API key authentication and CORS configuration
- Monitoring: Comprehensive logging and diagnostics endpoints with version tracking
- Model Warmup: Pre-loading and warmup for faster inference
- Container Ready: Docker and Podman support
- Flexible Configuration: YAML-based configuration system
| Endpoint | Method | Description |
|---|---|---|
| `/healthz` | GET | Health check endpoint |
| `/v1/rerank` | POST | Rerank documents (Cohere-compatible) |
| `/v1/encode` | POST | Generate embeddings |
| `/v1/models` | GET | List available models |
| `/v1/models/reload` | POST | Reload models |
| `/v1/config` | GET | Get current configuration (with version) |
| `/v1/diagnostics` | GET | System diagnostics (with version) |
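For example, once the service is running, the model list endpoint can be queried with a few lines of Python. This is an illustrative sketch: the base URL and API key are placeholders, and it assumes `/v1/models` requires the same bearer key as the other `/v1` routes.

```python
# Illustrative sketch: query the /v1/models endpoint.
# The URL and API key are placeholders; adjust them for your deployment.
import requests

response = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer change-me-123"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```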
- Python 3.8+
- CUDA-compatible GPU (optional, but recommended; see the quick check below)
- At least 8GB RAM (16GB+ recommended for large models)
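If you plan to use a GPU, a quick check that PyTorch can actually see it (and that bfloat16 is supported) can save debugging later. This sketch uses only standard PyTorch calls:

```python
# Quick sanity check: is a CUDA GPU visible to PyTorch, and does it support bfloat16?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())
```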
- Clone the repository

  ```bash
  git clone <repository-url>
  cd simple-reranker
  ```

- Create virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure the service

  ```bash
  cp rerank_config.yaml my_config.yaml
  # Edit my_config.yaml with your settings
  ```

- Set environment variables (optional)

  ```bash
  cp env.example .env
  # Edit .env with your API keys and tokens
  source .env   # On Windows: set in environment variables
  ```

- Using Docker

  ```bash
  docker build -t rerank-service .
  docker run -p 8000:8000 -v ./models:/app/models rerank-service
  ```

- Using Podman Compose

  ```bash
  podman-compose up -d
  ```
⚠️ Important: A configuration file is REQUIRED. The service will not start without a valid YAML configuration.
```bash
python rerank_service.py --config rerank_config.yaml --serve
```

```bash
# Rerank from command line
python rerank_service.py --config rerank_config.yaml \
  --query "search query" \
  --candidates documents.json

# Using stdin
echo -e "document 1\ndocument 2" | \
  python rerank_service.py --config rerank_config.yaml \
  --query "search query" --candidates -
```

The service provides helpful error messages for common configuration issues:
```bash
# Missing config file
python rerank_service.py --serve
# ❌ ERROR: the following arguments are required: --config/-c

# Wrong file type
python rerank_service.py --config rerank_service.py --serve
# ❌ ERROR: You specified a Python file as configuration
# 💡 Fix: Use the YAML configuration file instead: --config rerank_config.yaml
```

```bash
curl -X POST "http://localhost:8000/v1/rerank" \
-H "Authorization: Bearer change-me-123" \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"documents": [
"Machine learning is a subset of AI",
"Python is a programming language",
"Deep learning uses neural networks"
],
"top_k": 2
}'
```

Response:

```json
{
"id": "rerank-abc123",
"model": "BAAI/bge-reranker-v2-m3",
"results": [
{
"index": 0,
"relevance_score": 0.95,
"document": {
"text": "Machine learning is a subset of AI"
}
},
{
"index": 2,
"relevance_score": 0.78,
"document": {
"text": "Deep learning uses neural networks"
}
}
]
}
```

```bash
curl -X POST "http://localhost:8000/v1/encode" \
-H "Authorization: Bearer change-me-123" \
-H "Content-Type: application/json" \
-d '{
"texts": ["Hello world", "Machine learning"],
"model": "intfloat/multilingual-e5-base"
}'
```

Response:

```json
{
"id": "encode-def456",
"model": "intfloat/multilingual-e5-base",
"data": [
{
"object": "embedding",
"embedding": [
-0.0123456789,
0.9876543210,
0.5555555555,
-0.7777777777,
"... (remaining 768 dimensions)"
],
"index": 0
},
{
"object": "embedding",
"embedding": [
0.1111111111,
-0.2222222222,
0.8888888888,
0.3333333333,
"... (remaining 768 dimensions)"
],
"index": 1
}
],
"usage": {
"prompt_tokens": 4,
"total_tokens": 4
}
}
```

The service is configured through a YAML file such as `rerank_config.yaml`:

```yaml
model:
  # Reranking mode: "cross" (cross-encoder) or "bi" (bi-encoder)
  mode: cross

  # Model names from Hugging Face
  cross_name: BAAI/bge-reranker-v2-m3               # Cross-encoder for reranking
  bi_name: sentence-transformers/all-MiniLM-L6-v2   # Bi-encoder (alternative)
  embedding_name: Qwen/Qwen3-Embedding-8B           # Advanced embedding model for /v1/encode

  # Batch processing settings
  batch_size_cross: 32   # Batch size for cross-encoder
  batch_size_bi: 64      # Batch size for bi-encoder

  # Model behavior
  normalize_embeddings: true   # Normalize embedding vectors
  trust_remote_code: true      # Allow custom model code execution

  # Advanced options for direct transformers models (auto-detected for Qwen3, etc.)
  dtype: "bfloat16"            # Precision: bfloat16, float16, float32 (bfloat16 recommended for RTX 5090/A100)
  use_flash_attention: true    # Enable flash attention if available (2-4x speedup on modern GPUs)
  max_tokens: 16384            # Maximum context length (16k recommended, 32k possible on high-end GPUs)
  output_dimension: 1536       # Optional Matryoshka (MRL) downprojection, e.g. 1024. Leave null to keep the native dimension.
```

```yaml
huggingface:
  # Authentication (optional)
  token: null              # HF token for private models or rate limits

  # Cache settings
  cache_dir: null          # Custom cache directory (uses HF default if null)
  model_dir: /app/models   # Directory to store downloaded models

  # Pre-download models on startup
  prefetch:
    - Qwen/Qwen3-Embedding-8B
    - BAAI/bge-reranker-v2-m3
```

```yaml
server:
  host: 0.0.0.0          # Listen address
  port: 8000             # Listen port

  # CORS settings
  cors_origins: ["*"]    # Allowed origins (use specific domains in production)

  # Authentication
  api_keys:              # List of valid API keys
    - change-me-123

  # Model warmup on startup
  warmup:
    enabled: true
    load:
      cross: true        # Pre-load cross-encoder
      bi: false          # Pre-load bi-encoder
      embedding: true    # Pre-load embedding model
    texts: ["warmup", "test"]   # Sample texts for warmup inference
```

```yaml
logging:
  level: INFO     # Log level: DEBUG, INFO, WARNING, ERROR
  format: json    # Log format: "json" or "text"
  file: null      # Log file path (null = stdout)
```

API key authentication:

- Configuration method: Set `api_keys` in the YAML config
- Environment method: Set the `RERANK_API_KEYS` environment variable
- Usage: Include the key in requests as `Authorization: Bearer <your-key>` (a minimal FastAPI sketch of this check follows the hardening tips below)
Security best practices:

- Use specific domains in `cors_origins` instead of `["*"]`
- Generate strong, unique API keys
- Consider using an HTTPS reverse proxy (nginx, traefik)
- Set appropriate resource limits in containers
- Monitor logs for suspicious activity
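For reference, the bearer-key scheme above can be pictured as a small FastAPI dependency. This is a minimal illustrative sketch, not the service's actual implementation; the `verify_api_key` helper and the key-loading logic are hypothetical.

```python
# Illustrative sketch of bearer-key verification in FastAPI.
# Not the service's actual code: verify_api_key and the key loading are hypothetical.
import os

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()

# Keys could come from the YAML config or from RERANK_API_KEYS (comma-separated).
API_KEYS = set(filter(None, os.getenv("RERANK_API_KEYS", "change-me-123").split(",")))

def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    if credentials.credentials not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.get("/v1/models", dependencies=[Depends(verify_api_key)])
def list_models() -> dict:
    return {"models": ["BAAI/bge-reranker-v2-m3", "Qwen/Qwen3-Embedding-8B"]}
```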
```bash
curl http://localhost:8000/healthz

curl -H "Authorization: Bearer your-key" \
  http://localhost:8000/v1/diagnostics

curl -H "Authorization: Bearer your-key" \
  http://localhost:8000/v1/config
```

The service automatically detects and optimizes advanced embedding models like Qwen3-Embedding-8B that require direct transformers integration, providing significant performance improvements over standard sentence-transformers:

```yaml
model:
  embedding_name: Qwen/Qwen3-Embedding-8B   # Auto-detected for direct transformers
  dtype: "bfloat16"                          # Optimal precision for RTX 5090/A100
  use_flash_attention: true                  # Automatic flash attention activation
  max_tokens: 16384                          # Extended context support (up to 32k)
  output_dimension: 1024                     # Optional Matryoshka downprojection
```

Automatic Model Detection:
- Qwen Embedding models (`qwen` + `embedding` in the name): Direct transformers
- Standard models (BGE, E5, sentence-transformers): Sentence-transformers wrapper
- Zero configuration needed: The service intelligently chooses the best approach
Key Benefits:

- Native Performance: Direct PyTorch operations without sentence-transformers overhead
- Flash Attention: Up to 2-4x speedup on supported hardware (RTX 4090/5090, A100)
- Memory Efficiency: bfloat16 reduces memory usage by ~50%
- Flexible Context: Support for up to 32k tokens on high-end GPUs (16k recommended)
- Matryoshka Downprojection: Stable dimensionality reduction preserving prefix dimensions (see the sketch below)
- Advanced Optimizations: TF32 matmul, high-precision settings, inference mode
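Conceptually, Matryoshka downprojection keeps only the leading `output_dimension` components of each embedding and re-normalizes. The sketch below illustrates that idea with NumPy; it is not the service's internal code, and the 4096-dimensional input is just an example.

```python
# Conceptual sketch of Matryoshka (MRL) downprojection: keep the prefix of each
# embedding and L2-normalize. Illustration only, not the service's implementation.
import numpy as np

def matryoshka_truncate(embeddings: np.ndarray, output_dimension: int) -> np.ndarray:
    """Truncate (batch, dim) embeddings to their leading dimensions and re-normalize."""
    truncated = embeddings[:, :output_dimension]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

full = np.random.randn(2, 4096).astype(np.float32)  # example full-size embeddings
reduced = matryoshka_truncate(full, output_dimension=1024)
print(reduced.shape)  # (2, 1024)
```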
| Setting | Options | Best For | Memory Impact |
|---|---|---|---|
| `dtype` | `bfloat16` | RTX 4090/5090, A100 | -50% |
| `dtype` | `float16` | Older GPUs, mobile | -50% |
| `dtype` | `float32` | CPU, compatibility | baseline |
| `use_flash_attention` | `true` | Modern GPUs | +speed, -memory |
| `max_tokens` | `16384` | Standard use | baseline |
| `max_tokens` | `32768` | Long documents | +2x memory |
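To show how these settings typically map onto Hugging Face transformers arguments, here is a hedged loading sketch. The model name and values are examples taken from the tables above, not the service's exact code; flash attention additionally requires the separate `flash-attn` package and a supported GPU.

```python
# Illustrative sketch: how dtype / use_flash_attention / max_tokens settings map to
# transformers loading arguments. Not the service's exact code; assumes a CUDA GPU
# and the flash-attn package for attn_implementation="flash_attention_2".
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-Embedding-8B"  # example model from the table above

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # dtype: "bfloat16"
    attn_implementation="flash_attention_2",  # use_flash_attention: true
    trust_remote_code=True,
).to("cuda").eval()

batch = tokenizer(
    ["example document"],
    truncation=True,
    max_length=16384,                          # max_tokens: 16384
    return_tensors="pt",
).to("cuda")

with torch.inference_mode():
    outputs = model(**batch)
```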
| Model Type | Integration | Auto-Detection | Flash Attention | Optimal dtype | Max Context |
|---|---|---|---|---|---|
| `Qwen/Qwen3-Embedding-8B` | Direct transformers | ✅ Auto | ✅ | bfloat16 | 32k tokens |
| `Qwen/Qwen3-Embedding-4B` | Direct transformers | ✅ Auto | ✅ | bfloat16 | 32k tokens |
| `BAAI/bge-*` | SentenceTransformers | ✅ Auto | ✅ | bfloat16 | 512 tokens |
| `intfloat/e5-*` | SentenceTransformers | ✅ Auto | ✅ | bfloat16 | 512 tokens |
| `sentence-transformers/*` | SentenceTransformers | ✅ Auto | — | float16 | 512 tokens |
Detection Logic:

- Models with `qwen` + `embedding` in the name → direct transformers with extended features (sketched below)
- All other models → standard sentence-transformers wrapper
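Based on the rule above, the detection heuristic can be pictured as a simple name check. The function below is an illustrative sketch with a hypothetical name, not the service's actual code:

```python
# Illustrative sketch of the name-based detection rule described above.
# The function name is hypothetical; this is not the service's actual code.
def uses_direct_transformers(model_name: str) -> bool:
    """Return True if the model should be loaded with direct transformers."""
    name = model_name.lower()
    return "qwen" in name and "embedding" in name

print(uses_direct_transformers("Qwen/Qwen3-Embedding-8B"))                 # True
print(uses_direct_transformers("BAAI/bge-reranker-v2-m3"))                 # False
print(uses_direct_transformers("sentence-transformers/all-MiniLM-L6-v2"))  # False
```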
| Variable | Description | Example |
|---|---|---|
| `RERANK_API_KEYS` | Comma-separated API keys | `key1,key2,key3` |
| `HUGGING_FACE_HUB_TOKEN` | HF authentication token | `hf_xxxxx` |
| `HF_HOME` | Hugging Face cache directory | `/custom/cache` |
You can use any compatible Hugging Face model:
```yaml
model:
  cross_name: your-org/custom-reranker
  embedding_name: your-org/custom-embedder
  # For advanced models requiring direct transformers:
  dtype: "bfloat16"
  use_flash_attention: true
  max_tokens: 8192
```

Advanced Model Examples:
```yaml
# For Qwen family models (auto-detected for direct transformers)
model:
  embedding_name: Qwen/Qwen3-Embedding-8B
  dtype: "bfloat16"              # Optimal for modern GPUs
  use_flash_attention: true      # Auto-enabled for performance
  max_tokens: 16384              # Extended context (up to 32k)
  output_dimension: 1024         # Optional Matryoshka downprojection

# For multilingual E5 models (auto-detected for sentence-transformers)
model:
  embedding_name: intfloat/multilingual-e5-large
  dtype: "bfloat16"
  max_tokens: 512
  normalize_embeddings: true

# For BGE models with flash attention (auto-detected)
model:
  cross_name: BAAI/bge-reranker-v2-m3
  embedding_name: BAAI/bge-m3
  use_flash_attention: true
  dtype: "bfloat16"

# Mixed setup: Advanced embedding + Standard reranking
model:
  mode: cross
  cross_name: BAAI/bge-reranker-v2-m3       # Sentence-transformers
  embedding_name: Qwen/Qwen3-Embedding-8B   # Direct transformers
  dtype: "bfloat16"
  max_tokens: 16384
```

For High-End GPUs (RTX 5090, A100):
```yaml
model:
  batch_size_cross: 64
  batch_size_bi: 128
  dtype: "bfloat16"
  use_flash_attention: true
  max_tokens: 16384
```

For Mid-Range GPUs (RTX 4080, RTX 3090):
```yaml
model:
  batch_size_cross: 32
  batch_size_bi: 64
  dtype: "bfloat16"
  use_flash_attention: true
  max_tokens: 8192
```

For Limited Memory (RTX 3070, RTX 4060):
```yaml
model:
  batch_size_cross: 16
  batch_size_bi: 32
  dtype: "float16"
  use_flash_attention: false
  max_tokens: 4096
```

For CPU or Very Limited GPU:
```yaml
model:
  batch_size_cross: 4
  batch_size_bi: 8
  dtype: "float32"
  use_flash_attention: false
  max_tokens: 1024
```

Build:

```bash
docker build -t rerank-service .
```

Run:
```bash
docker run -d \
  --name rerank-service \
  -p 8000:8000 \
  -v ./models:/app/models \
  -v ./rerank_config.yaml:/app/config.yaml \
  -e RERANK_API_KEYS=your-secure-key \
  rerank-service --config /app/config.yaml --serve
```

See `podman-compose.yaml` for the container orchestration setup.
Test reranking endpoint:

```bash
python -c "
import requests
response = requests.post('http://localhost:8000/v1/rerank',
headers={'Authorization': 'Bearer change-me-123'},
json={
'query': 'python programming',
'documents': ['Python is great', 'Java is popular', 'Python for ML'],
'top_k': 2
}
)
print(response.json())
"The service includes a built-in version management system that displays version information across all interfaces.
- Startup Banner: Version appears in the boot summary header
- API Endpoints: Both `/v1/config` and `/v1/diagnostics` include version information
- Version Manager Script: Comprehensive version management tool
The version manager automatically generates CHANGELOG.md entries from your git commits using Conventional Commits format.
```bash
# Show current version
python version_manager.py current

# Bump version (patch: 1.0.0 → 1.0.1)
python version_manager.py bump patch

# Bump minor version (1.0.0 → 1.1.0)
python version_manager.py bump minor

# Bump major version (1.0.0 → 2.0.0)
python version_manager.py bump major

# Set specific version
python version_manager.py set 2.1.3

# With custom commit message
python version_manager.py bump patch -m "fix: resolve authentication issue"
```

What happens when you bump a version:
- Updates `version.py` with the new version number
- Automatically generates a CHANGELOG.md entry from git commits since the last tag:
  - `feat:` commits → Added section
  - `fix:` commits → Fixed section
  - `refactor:`, `perf:`, `docs:` commits → Changed section
- Uses today's date automatically
- Displays the generated changelog entry for review
- Provides git commands to commit, tag, and push
Commit Message Format:
To get properly categorized changelog entries, use conventional commit format:
```bash
# Feature (goes to "Added" section)
git commit -m "feat: add dark mode support"
git commit -m "feat(api): add new /v1/status endpoint"
# Bug fix (goes to "Fixed" section)
git commit -m "fix: resolve memory leak in model loader"
git commit -m "fix(auth): correct token validation"
# Other changes (go to "Changed" section)
git commit -m "refactor: improve error handling"
git commit -m "perf: optimize batch processing"
git commit -m "docs: update API documentation"AI-Powered Commit Messages with Claude Code:
Use the `/commit` slash command to automatically generate conventional commit messages:
```bash
# Stage your changes
git add .

# In Claude Code, use the slash command
/commit
```

Claude Code will:
- Analyze your git diff
- Generate a properly formatted conventional commit message
- Explain what changes it captures
- Ask for confirmation and create the commit
This ensures all your commits follow the conventional format and will be correctly categorized in the auto-generated changelog.
The `GET /v1/config` response includes:

```json
{
"version": "1.0.0",
"model": { ... },
"huggingface": { ... },
"server": { ... }
}GET /v1/diagnostics response includes:
{
"version": "1.0.0",
"python": "...",
"torch_cuda_available": true,
"huggingface": { ... },
"models_loaded": [...]
}
```

Model download and cache issues:

- Ensure `model_dir` has write permissions
- Check that the `HF_HOME` environment variable is set correctly
- Verify cache directory persistence in containers
Out-of-memory errors:

- Reduce `batch_size_cross` and `batch_size_bi`
- Disable model warmup: `warmup.enabled: false`
- Use smaller models
Authentication errors:

- Verify API keys in config or environment
- Check the `Authorization` header format: `Bearer <key>`
Slow startup:

- Disable prefetch for faster startup: `prefetch: []`
- Use model warmup selectively
- Consider using model mirrors or cache
Enable debug logging for troubleshooting:
```yaml
logging:
  level: DEBUG
  format: text   # More readable for debugging
```

Contributing:

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for transformer models and hub
- Sentence Transformers for embedding models
- FastAPI for the web framework
- BAAI for the BGE reranking models
For issues and questions:
- Check the Troubleshooting section
- Review logs with debug level enabled
- Open an issue with detailed error information and configuration
Happy Reranking!