Use oaklib shared cache instead of local .db files in data/raw/ #487

@turbomam

Problem

The transform step generates large semsql SQLite databases next to the downloaded .owl files in data/raw/:

  • chebi.db (~3.4 GB)
  • ncbitaxon.db (~11 GB)
  • go.db (~385 MB)

This happens because constants.py defines the ontology sources as .owl paths, and the transform code calls get_adapter(f"sqlite:{CHEBI_SOURCE}"). Given a sqlite: selector that points at an .owl file, oaklib builds the matching .db next to that file via semsql.

These files duplicate what oaklib already caches at ~/.data/oaklib/ when using the sqlite:obo: scheme.

Proposed Fix

In transform scripts (e.g., rhea_mappings.py), change:

get_adapter(f"sqlite:{CHEBI_SOURCE}")  # builds data/raw/chebi.db from .owl

to:

get_adapter("sqlite:obo:chebi")  # uses ~/.data/oaklib/chebi.db (pre-built, cached)

This avoids ~15 GB of duplicated ontology databases per checkout (3.4 GB + 11 GB + 0.4 GB for the three files listed under Problem) and uses the fresher pre-built databases from the S3 distribution.
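
A minimal sketch of how the selector change might be centralized, so that each transform script does not hard-code the string. The helper names shared_cache_selector and open_shared_adapter are hypothetical and are not part of oaklib or this repository; only get_adapter and the sqlite:obo: scheme are taken from the proposal above. The oaklib import is deferred so the selector logic can be checked without oaklib installed.

```python
def shared_cache_selector(ontology_id: str) -> str:
    """Build the oaklib selector for the shared, pre-built database.

    Hypothetical helper: maps an ontology identifier such as "CHEBI"
    to the "sqlite:obo:<id>" form proposed in this issue.
    """
    cleaned = ontology_id.strip().lower()
    if not cleaned or not cleaned.replace("_", "").isalnum():
        raise ValueError(f"unexpected ontology id: {ontology_id!r}")
    return f"sqlite:obo:{cleaned}"


def open_shared_adapter(ontology_id: str):
    """Open an adapter backed by the shared cache instead of data/raw/.

    Hypothetical wrapper around oaklib's get_adapter; the first call for
    a given ontology may download the pre-built database.
    """
    from oaklib import get_adapter  # deferred: only needed when opening

    return get_adapter(shared_cache_selector(ontology_id))


if __name__ == "__main__":
    for name in ("CHEBI", "ncbitaxon", "go"):
        print(name, "->", shared_cache_selector(name))
```

With a helper like this, a transform script would call open_shared_adapter("chebi") where it currently calls get_adapter(f"sqlite:{CHEBI_SOURCE}").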

Impact

  • Saves ~15 GB disk per working copy
  • Faster builds (skips semsql compilation of large ontologies)
  • Shared cache benefits other oaklib-dependent projects on the same machine
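
To gauge the saving on an existing checkout, the sketch below lists the .db files sitting directly in data/raw/ and totals their size. It only reports and deletes nothing. The function names are hypothetical, and it assumes every .db file in that directory is a semsql build, which may not hold if other SQLite files are stored there.

```python
from pathlib import Path


def find_local_dbs(raw_dir):
    """Return (path, size_in_bytes) for each .db file directly under raw_dir.

    Results are sorted largest first. A missing directory yields an empty list.
    """
    raw = Path(raw_dir)
    if not raw.is_dir():
        return []
    found = [(p, p.stat().st_size) for p in raw.glob("*.db") if p.is_file()]
    return sorted(found, key=lambda item: item[1], reverse=True)


def total_bytes(found):
    """Sum the sizes from a find_local_dbs() result."""
    return sum(size for _, size in found)


if __name__ == "__main__":
    gib = 1024 ** 3
    dbs = find_local_dbs("data/raw")
    for path, size in dbs:
        print(f"{path}\t{size / gib:.2f} GiB")
    print(f"total\t{total_bytes(dbs) / gib:.2f} GiB")
```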
