Open
Description
Problem
The transform step generates large semsql SQLite databases next to the downloaded .owl files in data/raw/:
- chebi.db (~3.4 GB)
- ncbitaxon.db (~11 GB)
- go.db (~385 MB)
This happens because constants.py defines sources as .owl paths and transform code calls get_adapter(f"sqlite:{CHEBI_SOURCE}"), which triggers oaklib to build the .db in-place via semsql.
These files duplicate what oaklib already caches at ~/.data/oaklib/ when using the sqlite:obo: scheme.
Proposed Fix
In transform scripts (e.g., rhea_mappings.py), change:
    get_adapter(f"sqlite:{CHEBI_SOURCE}")  # builds data/raw/chebi.db from .owl
to:
    get_adapter("sqlite:obo:chebi")  # uses ~/.data/oaklib/chebi.db (pre-built, cached)
This avoids ~15 GB of duplicated ontology databases per checkout and uses fresher pre-built databases from the S3 distribution.
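If several transform scripts need the same change, one possible approach is a small shared helper that derives the sqlite:obo: selector from the existing source path. The sketch below is only an illustration: the names obo_selector and get_cached_adapter are hypothetical, and it assumes the .owl file stem (e.g. chebi) matches the OBO ontology ID.

```python
from pathlib import Path


def obo_selector(source_path: str) -> str:
    """Map a local ontology path such as data/raw/chebi.owl to 'sqlite:obo:chebi'."""
    # Assumption: the file stem (e.g. "chebi") matches the OBO ontology ID.
    return f"sqlite:obo:{Path(source_path).stem.lower()}"


def get_cached_adapter(source_path: str):
    """Open the pre-built database from the oaklib cache rather than building one in data/raw/."""
    # Imported lazily so obo_selector stays usable without oaklib installed.
    from oaklib import get_adapter

    return get_adapter(obo_selector(source_path))
```

With this in place, a call such as get_cached_adapter(CHEBI_SOURCE) could replace get_adapter(f"sqlite:{CHEBI_SOURCE}") without changing constants.py.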
Impact
- Saves ~15 GB disk per working copy
- Faster builds (skip semsql compilation of large ontologies)
- Shared cache benefits other oaklib-dependent projects on the same machine
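To check how much space an existing checkout would recover, a short script along these lines could list the generated databases. It is a sketch under assumptions: find_generated_dbs is a hypothetical name, and a .db file is treated as generated only if an .owl file with the same stem sits beside it.

```python
from pathlib import Path


def find_generated_dbs(raw_dir: str) -> list[tuple[Path, int]]:
    """Return (path, size_in_bytes) for each .db file that has a sibling .owl file."""
    results = []
    for db in sorted(Path(raw_dir).glob("*.db")):
        # Only count databases that were built next to a downloaded .owl source.
        if db.with_suffix(".owl").exists():
            results.append((db, db.stat().st_size))
    return results


if __name__ == "__main__":
    found = find_generated_dbs("data/raw")
    for path, size in found:
        print(f"{path}: {size / 1e9:.2f} GB")
    print(f"total: {sum(size for _, size in found) / 1e9:.2f} GB")
```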