Replace R bedstat pipeline with gtars genomicdist#151

Open
sanghoonio wants to merge 7 commits into main from bedstats_genomicdist

@sanghoonio
Member

bedstat now shells out to gtars genomicdist instead of Rscript regionstat_cli.R. This removes the R runtime dependency entirely.

What changed:

  • bedstat.py: build and run gtars genomicdist CLI command via pypiper, parse its JSON output, extract scalars and partition frequencies into legacy DB columns, store full gtars output as distributions JSONB blob. Open signal matrix auto-download retained as fallback.
  • compress_distributions.py (new): numpy-vectorized compression of raw per-region arrays for DB storage. Uses histogram binning for widths and TSS distances, Gaussian KDE for neighbor distances and GC content (matching bedbase-ui's client-side math), and dense count arrays for region distribution. Compresses 14 MB of raw JSON to ~40 KB.
  • gc_content.py: remove matplotlib plot generation (create_gc_plot); GC content is now compressed to a 512-pt KDE in the distributions blob
  • cli.py: add --chrom-sizes to run_all and run_stats, update --ensdb help to mention gtars prep .bin files for batch speed
  • bedboss.py, bbuploader/main.py: remove all RServiceManager references and r_service parameter threading, pass chrom_sizes through to bedstat
  • test_genomicdist.py (new): benchmark script with --breakdown mode for component-level timing (gtars, JSON read, GC calc, KDE, compress)
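
For reviewers who want a feel for the histogram-binning step described for compress_distributions.py, here is a minimal sketch. It is not the code in this PR: the function name `compress_histogram`, the 99th-percentile trim, and the output keys are all assumptions chosen for illustration.

```python
import numpy as np


def compress_histogram(values, n_bins=50, upper_percentile=99.0):
    """Hypothetical helper: reduce a raw per-region array to fixed-size bins."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return {"edges": [], "counts": [], "n_trimmed": 0, "n": 0}

    # Assumption: trim the long right tail so a few huge values
    # do not squash the rest of the histogram into one bin.
    lower = float(arr.min())
    upper = float(np.percentile(arr, upper_percentile))
    if upper <= lower:
        upper = lower + 1.0  # degenerate case: all values identical

    kept = arr[arr <= upper]
    counts, edges = np.histogram(kept, bins=n_bins, range=(lower, upper))

    # The stored payload depends on n_bins, not on the number of regions.
    return {
        "edges": edges.tolist(),
        "counts": counts.tolist(),
        "n_trimmed": int(arr.size - kept.size),
        "n": int(arr.size),
    }
```

The point of the sketch is that the stored size is set by `n_bins` rather than by the number of regions in the BED file, which is what makes a reduction of the kind quoted above possible.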

Deleted:

  • bedboss/bedstat/r_service.py (R subprocess manager)
  • bedboss/bedstat/tools/r-service.R, regionstat.R, regionstat_cli.R

Plot images (BedPlots) are no longer generated. The old UI will show empty plots for newly processed beds. The new bedbase-ui renders client-side from the distributions JSONB blob.

For batch processing, pre-compile reference files with gtars prep and pass .bin paths to --ensdb/--signal-matrix to avoid re-parsing per BED file.

Changes:

  • ...

TODO:

  • Version of pepdbagent updated in __version__.py file
  • Changelog updated

bedstat now shells out to `gtars genomicdist` instead of Rscript
regionstat_cli.R. This removes the R runtime dependency entirely.

What changed:
- bedstat.py: build and run `gtars genomicdist` CLI command via pypiper,
  parse its JSON output, extract scalars and partition frequencies into
  legacy DB columns, store full gtars output as `distributions` JSONB blob.
  Open signal matrix auto-download retained as fallback.
- compress_distributions.py (new): numpy-vectorized compression of raw
  per-region arrays for DB storage — histogram binning for widths/TSS
  distances, Gaussian KDE for neighbor distances and GC content (matching
  bedbase-ui's client-side math), dense count arrays for region distribution.
  Compresses 14MB raw JSON to ~40KB.
- gc_content.py: remove matplotlib plot generation (create_gc_plot);
  GC content is now compressed to a 512-pt KDE in the distributions blob
- cli.py: add --chrom-sizes to run_all and run_stats, update --ensdb help
  to mention gtars prep .bin files for batch speed
- bedboss.py, bbuploader/main.py: remove all RServiceManager references
  and r_service parameter threading, pass chrom_sizes through to bedstat
- test_genomicdist.py (new): benchmark script with --breakdown mode for
  component-level timing (gtars, JSON read, GC calc, KDE, compress)

Deleted:
- bedboss/bedstat/r_service.py (R subprocess manager)
- bedboss/bedstat/tools/r-service.R, regionstat.R, regionstat_cli.R

Plot images (BedPlots) are no longer generated. The old UI will show
empty plots for newly processed beds. The new bedbase-ui renders
client-side from the distributions JSONB blob.

For batch processing, pre-compile reference files with `gtars prep` and
pass .bin paths to --ensdb/--signal-matrix to avoid re-parsing per BED
file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sanghoonio and others added 2 commits February 25, 2026 01:38
- Increase region distribution bins from 100 to 250
- Pass --compact flag for smaller intermediate JSON
- Add --save flag to test script for output inspection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add symmetric [-100kb, +100kb] TSS histogram (compress_tss_histogram)
  for signed distances from calc_feature_distances
- Add overflow bin count to widths histogram for trimmed outlier visibility
- Fix KDE bandwidth: compute from full trimmed data before downsampling
  and use population std (ddof=0) to match UI's gaussianKde
- Pass promoter-upstream, promoter-downstream, and region-dist-bins
  as bedstat() params through to gtars CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
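
The KDE bandwidth fix in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name `kde_on_grid`, the `max_samples` cap, and the normal-reference bandwidth rule (1.06 · σ · n^(-1/5)) are assumptions, since the commit message only says that the bandwidth is computed from the full trimmed data before downsampling and uses the population std (ddof=0).

```python
import numpy as np


def kde_on_grid(values, n_points=512, max_samples=2000, seed=0):
    """Hypothetical sketch: Gaussian KDE evaluated on a fixed grid."""
    arr = np.asarray(values, dtype=float)
    empty = {"x": [], "density": [], "bandwidth": 0.0}
    if arr.size < 2:
        return empty

    # Bandwidth comes from the FULL data, using the population std (ddof=0),
    # so it does not change with the downsampling done below.
    sigma = float(np.std(arr, ddof=0))
    if sigma == 0.0:
        return empty
    # Assumed rule of thumb; the rule used by the UI is not stated in the PR.
    bandwidth = 1.06 * sigma * arr.size ** (-1 / 5)

    # Downsample only to bound the cost of the pairwise evaluation.
    if arr.size > max_samples:
        rng = np.random.default_rng(seed)
        sample = rng.choice(arr, size=max_samples, replace=False)
    else:
        sample = arr

    x = np.linspace(arr.min(), arr.max(), n_points)
    z = (x[:, None] - sample[None, :]) / bandwidth
    norm = sample.size * bandwidth * np.sqrt(2.0 * np.pi)
    density = np.exp(-0.5 * z**2).sum(axis=1) / norm

    return {"x": x.tolist(), "density": density.tolist(), "bandwidth": bandwidth}
```

Computing the bandwidth after downsampling would widen the kernel (smaller n gives a larger n^(-1/5) factor), which is the kind of mismatch with the client-side curve that the commit describes fixing.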
@sanghoonio
Member Author

In local testing, the gtars pipeline takes around 5 s per BED file; I don't know whether that will hold up after deployment. The compressed distributions JSON that bedstat produces can be tested here: https://bedbase-ui.nsheff.workers.dev/debug

sanghoonio and others added 2 commits February 27, 2026 00:27
- Remove R-based bedsetStat.R, check_R_req.R, installRdeps.R and
  all R requirement checks
- Remove heavy param and create_plots from bedbuncher
- Remove requirements_check, check-requirements, install-requirements CLI commands
- Disable S3 uploads everywhere (upload_s3=False)
- Add precision param threaded through CLI -> run_all -> bedstat
- Import round_floats from bbconf instead of duplicating in tests
- Remove dead subprocess import
- Regenerate docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
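
To illustrate what a precision parameter does to a nested distributions payload, here is a minimal sketch of recursive float rounding. It is not bbconf's `round_floats`; the name `round_floats_sketch` and its signature are assumptions, and the real helper may behave differently.

```python
def round_floats_sketch(obj, precision=3):
    """Hypothetical helper: round every float in a nested JSON-like structure."""
    if isinstance(obj, float):
        return round(obj, precision)
    if isinstance(obj, dict):
        return {key: round_floats_sketch(val, precision) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples become lists, which is also what JSON serialization would do.
        return [round_floats_sketch(item, precision) for item in obj]
    # Ints, strings, bools, and None pass through unchanged.
    return obj
```

Rounding before serialization shortens every float in the blob, which matters when the payload is mostly arrays of densities.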
- Add get_gtf_path(): resolves genome name to Ensembl GTF via static map
  (13 genomes / 22 aliases) or REST API (~350 species), downloads from
  Ensembl FTP, and pre-compiles to .bin with gtars prep
- Support multiple Ensembl divisions: main, grch37, plants, fungi, protists
- Add auto-prep to get_osm_path(): downloaded signal matrices are now
  automatically compiled to .bin
- Wire both auto-downloads into bedstat() so ensdb and open_signal_matrix
  are resolved automatically when not provided
- Update bedboss prep CLI: add --genome flag to download + prep all
  reference files for a genome (e.g. bedboss prep --genome hg38)
- Add smoketests for static map, URL construction, download+prep flows
- Regenerate docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sanghoonio and others added 2 commits March 3, 2026 16:07
- Add get_chrom_sizes_path() using refget SequenceCollectionClient to
  auto-download chrom.sizes from the seqcol API (resolves genome name
  to digest via refgenie, caches locally)
- Rename get_gtf_path → get_gda_path to reflect GDA binary format
- Log warning when GC content calculation is skipped (was silent)
- Add refget dependency
- Remove unused imports across bedboss, bbuploader, bedbuncher
- Update CLI prep command and tests for chromSizes removal
- Update docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace custom Ensembl FTP download and alias resolution with refgenie
rgc.seek/pull for GTF (ensembl_gtf asset) and chrom.sizes (fasta asset).
Keep seqcol API as fallback for chrom.sizes when refgenie doesn't have
the genome. Remove ~150 lines of Ensembl FTP code, static genome map,
and REST API species lookup. Rewrite tests to mock refgenie calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
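
The refgenie-first, seqcol-fallback lookup described in the commit above follows a common pattern, sketched below. Everything here is hypothetical: `resolve_reference` and the resolver callables are stand-ins, and the actual refgenie and seqcol calls are deliberately not shown, because their exact usage in this PR is not visible from the commit message.

```python
import logging
from typing import Callable, Iterable, Optional, Tuple

logger = logging.getLogger(__name__)

# A resolver takes a genome name and returns a local path, or None if unknown.
Resolver = Callable[[str], Optional[str]]


def resolve_reference(
    genome: str, resolvers: Iterable[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Try each named resolver in order; return (source_name, path) or None."""
    for name, resolver in resolvers:
        try:
            path = resolver(genome)
        except Exception as err:  # a failing source should not abort the lookup
            logger.warning("%s could not resolve %s: %s", name, genome, err)
            continue
        if path:
            return name, path
    return None
```

In this sketch the first entry would wrap the refgenie lookup and the second the seqcol API, so a genome missing from refgenie falls through to the next source instead of raising.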