SLAF is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. Built for large-scale single-cell analysis with memory efficiency and production-ready ML capabilities.
Be Lazy (lazy APIs for AnnData and Scanpy) • Write SQL (arbitrary SQL to query the tables) • Train Foundation Models (with tokenizers and dataloaders)
- ⚡ Fast: SQL-level performance for data operations
- 💾 Memory Efficient: Lazy evaluation; only load what you need
- 🔍 SQL Native: Direct SQL queries on your data
- 🧬 Scanpy Compatible: Drop-in replacement for AnnData workflows
- ⚙️ ML Ready: Efficient tokenization and dataloaders for ML training
- 🔧 Production Ready: Built for large-scale single-cell analysis
The default installation includes core functionality, CLI tools, and data conversion capabilities:
```bash
# Using uv (recommended)
uv add slafdb

# Or pip
pip install slafdb
```

What's included by default:
- ✅ Core SLAF functionality (SQL queries, data structures)
- ✅ CLI tools (`slaf convert`, `slaf query`, etc.)
- ✅ Data conversion tools (scanpy, h5py for h5ad files)
- ✅ Rich console output and progress bars
- ✅ Cross-platform compatibility
What's NOT included by default:

Dependencies for:

- ❌ Machine learning features (PyTorch tokenizers)
- ❌ Advanced single-cell tools (igraph, leidenalg)
Polars Compatibility:

- Linux/Windows: Works with standard `polars`
- macOS (Apple Silicon): May require `polars-lts-cpu` for compatibility
If you encounter polars-related issues on macOS, you have several options:
Option 1: Manual platform-specific installation

```bash
# For macOS Apple Silicon
pip install "polars-lts-cpu>=1.31.0"
pip install slafdb

# For Linux/Windows
pip install slafdb
```

Option 2: Use uv with manual polars specification
```bash
# For macOS Apple Silicon
uv add "polars-lts-cpu>=1.31.0"
uv add slafdb

# For Linux/Windows
uv add slafdb
```

Note: Package managers don't automatically choose between `polars` and `polars-lts-cpu` - you may need to specify the correct version for your platform.
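The platform check described in the note above is easy to automate. A minimal sketch (the `polars_package` helper is hypothetical, not part of slafdb):

```python
import platform

def polars_package() -> str:
    """Pick the polars distribution for this machine.
    Sketch only: Apple Silicon macOS gets the LTS-CPU build per the note above."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "polars-lts-cpu"
    return "polars"

print(polars_package())
```

You could feed the result into `pip install` or `uv add` in a setup script.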
Add specific features as needed:
Using uv:

```bash
uv add "slafdb[ml]"
uv add "slafdb[advanced]"
uv add "slafdb[full]"
uv add "slafdb[dev]"
```

Using pip:

```bash
pip install "slafdb[ml]"
pip install "slafdb[advanced]"
pip install "slafdb[full]"
pip install "slafdb[dev]"
```

For development, clone the repository and install the dev extras:

```bash
git clone https://github.com/slaf-project/slaf.git
cd slaf
uv sync --extra dev --extra test --extra docs
```

Convert your existing single-cell data to SLAF format - no extra dependencies required!
```bash
# Convert AnnData (.h5ad) to SLAF
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf

# Convert 10x Genomics data
slaf convert path/to/10x/filtered_feature_bc_matrix output.slaf
```

```python
from slaf import SLAFArray

# Load a SLAF dataset
slaf = SLAFArray("path/to/dataset.slaf")

# Describe the dataset
print(slaf.info())

# Execute SQL queries directly
results = slaf.query("""
    SELECT batch, COUNT(*) as count
    FROM cells
    GROUP BY batch
    ORDER BY count DESC
""")
print(results)
```

```python
# Filter cells by metadata
filtered_cells = slaf.filter_cells(
    batch="batch1",
    total_counts=">1000"
)

# Filter genes
filtered_genes = slaf.filter_genes(
    highly_variable=True
)

# Get expression submatrix
expression = slaf.get_submatrix(
    cell_selector=filtered_cells,
    gene_selector=filtered_genes
)
```

SLAF provides lazy versions of AnnData and Scanpy operations that only compute when needed:
```python
from slaf.integrations.anndata import read_slaf
import scanpy as sc

# Load as lazy AnnData
adata = read_slaf("path/to/dataset.slaf")
print(f"Type: {type(adata)}")  # LazyAnnData
print(f"Expression matrix type: {type(adata.X)}")  # LazyExpressionMatrix

# Apply scanpy operations (lazy)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)

# Still lazy - no computation yet
print(f"Still lazy: {type(adata.X)}")

# Compute when needed
adata.compute()  # Now it's a real AnnData object
```

```python
# Compute specific parts
expression_matrix = adata.X.compute()  # Just the expression matrix
cell_metadata = adata.obs  # Cell metadata
gene_metadata = adata.var  # Gene metadata

# Or compute everything at once
real_adata = adata.compute()
```

```python
# All slicing operations are lazy
subset = adata[:100, :50]  # Lazy slice
filtered = adata[adata.obs['n_genes_by_counts'] > 1000]  # Lazy filtering
```

SLAF stores data in three main tables that you can query directly with SQL:

- `cells`: Cell metadata and QC metrics
- `genes`: Gene metadata and annotations
- `expression`: Sparse expression matrix data
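To illustrate how the three tables relate, here's a toy sketch using Python's built-in sqlite3 in place of SLAF's engine. Table and column names follow this README; the rows and the `AAAC-1`/`GAPDH`-style identifiers are made up:

```python
import sqlite3

# Miniature version of the cells/genes/expression layout described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cells (cell_integer_id INTEGER, cell_id TEXT, batch TEXT, total_counts INTEGER);
CREATE TABLE genes (gene_integer_id INTEGER, gene_id TEXT);
CREATE TABLE expression (cell_integer_id INTEGER, gene_integer_id INTEGER, value REAL);
INSERT INTO cells VALUES (0, 'AAAC-1', 'batch1', 1200), (1, 'AAAG-1', 'batch2', 800);
INSERT INTO genes VALUES (0, 'GAPDH'), (1, 'ACTB');
INSERT INTO expression VALUES (0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.0);
""")

# The integer IDs are the join keys between metadata and the sparse matrix.
rows = conn.execute("""
    SELECT c.cell_id, COUNT(e.gene_integer_id) AS genes_expressed
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    GROUP BY c.cell_id
    ORDER BY genes_expressed DESC
""").fetchall()
print(rows)  # [('AAAC-1', 2), ('AAAG-1', 1)]
```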
```python
# Get expression data for specific cells
cell_expression = slaf.query("""
    SELECT
        c.cell_id,
        c.total_counts,
        COUNT(e.gene_id) as genes_expressed,
        AVG(e.value) as avg_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE c.batch = 'batch1'
    GROUP BY c.cell_id, c.total_counts
    ORDER BY genes_expressed DESC
    LIMIT 10
""")

# Find highly expressed genes
high_expr_genes = slaf.query("""
    SELECT
        g.gene_id,
        COUNT(e.cell_id) as cells_expressing,
        AVG(e.value) as avg_expression
    FROM genes g
    JOIN expression e ON g.gene_integer_id = e.gene_integer_id
    GROUP BY g.gene_id
    HAVING cells_expressing > 100
    ORDER BY avg_expression DESC
    LIMIT 10
""")
```

SLAF provides efficient tokenization and dataloaders for training foundation models:
```python
from slaf.ml import SLAFTokenizer

# Create tokenizer for Geneformer-style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="geneformer",
    vocab_size=50000,
    n_expression_bins=10
)

# Geneformer tokenization (gene sequence only)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Example gene IDs
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    max_genes=2048
)

# Create tokenizer for scGPT-style tokenization
tokenizer = SLAFTokenizer(
    slaf_array=slaf,
    tokenizer_type="scgpt",
    vocab_size=50000,
    n_expression_bins=10
)

# scGPT tokenization (gene-expression pairs)
gene_sequences = [[1, 2, 3], [4, 5, 6]]  # Gene IDs
expr_sequences = [[0.5, 0.8, 0.2], [0.9, 0.1, 0.7]]  # Expression values
input_ids, attention_mask = tokenizer.tokenize(
    gene_sequences,
    expr_sequences=expr_sequences,
    max_genes=1024
)
```
```python
from slaf.ml import SLAFDataLoader

# Create DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf,
    tokenizer_type="geneformer",  # or "scgpt"
    batch_size=32,
    max_genes=2048
)

# Use with PyTorch training
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]

    # Your training loop here
    loss = model(input_ids, attention_mask=attention_mask)
    loss.backward()
```
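The tokenizer internals aren't shown in this README, but the `n_expression_bins` parameter suggests continuous expression values are discretized into a small vocabulary of bin tokens. A generic, hypothetical sketch of such binning (not SLAF's actual implementation):

```python
def bin_expression(values, n_bins=10):
    """Map normalized expression values in [0, 1] to integer bin tokens 0..n_bins-1.
    Illustrative only - SLAF's real binning strategy may differ."""
    return [min(n_bins - 1, int(v * n_bins)) for v in values]

print(bin_expression([0.05, 0.5, 0.95]))  # [0, 5, 9]
```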
```bash
# Convert AnnData to SLAF (included by default)
slaf convert input.h5ad output.slaf

# Convert HDF5 to SLAF
slaf convert input.h5 output.slaf --format hdf5
```
```bash
# Execute SQL query
slaf query dataset.slaf "SELECT * FROM cells LIMIT 10"

# Save results to CSV
slaf query dataset.slaf "SELECT * FROM cells" --output cells.csv
```

```bash
slaf info dataset.slaf
```

- SLAF Documentation
- Quickstart
- API Reference
- Examples
- User Guide
- Contributing β setup, workflow, and how to contribute
- Maintainers Guide
- Discord β chat, questions, and updates
Built on top of