Replace R bedstat pipeline with gtars genomicdist#151
Open
sanghoonio wants to merge 7 commits into main from
bedstat now shells out to `gtars genomicdist` instead of Rscript regionstat_cli.R. This removes the R runtime dependency entirely.

What changed:
- bedstat.py: build and run the `gtars genomicdist` CLI command via pypiper, parse its JSON output, extract scalars and partition frequencies into legacy DB columns, and store the full gtars output as the `distributions` JSONB blob. Open signal matrix auto-download is retained as a fallback.
- compress_distributions.py (new): numpy-vectorized compression of raw per-region arrays for DB storage: histogram binning for widths/TSS distances, Gaussian KDE for neighbor distances and GC content (matching bedbase-ui's client-side math), and dense count arrays for region distribution. Compresses 14MB of raw JSON to ~40KB.
- gc_content.py: remove matplotlib plot generation (create_gc_plot); GC content is now compressed to a 512-pt KDE in the distributions blob.
- cli.py: add --chrom-sizes to run_all and run_stats; update --ensdb help to mention gtars prep .bin files for batch speed.
- bedboss.py, bbuploader/main.py: remove all RServiceManager references and r_service parameter threading; pass chrom_sizes through to bedstat.
- test_genomicdist.py (new): benchmark script with a --breakdown mode for component-level timing (gtars, JSON read, GC calc, KDE, compress).

Deleted:
- bedboss/bedstat/r_service.py (R subprocess manager)
- bedboss/bedstat/tools/r-service.R, regionstat.R, regionstat_cli.R

Plot images (BedPlots) are no longer generated. The old UI will show empty plots for newly processed beds. The new bedbase-ui renders client-side from the distributions JSONB blob.

For batch processing, pre-compile reference files with `gtars prep` and pass .bin paths to --ensdb/--signal-matrix to avoid re-parsing per BED file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
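The histogram compression described for widths and TSS distances can be sketched roughly as below. This is a minimal illustration, not the code in compress_distributions.py: the function name `compress_histogram`, the 99th-percentile trim, the default bin count, and the output keys are all assumptions made for the example.

```python
import numpy as np


def compress_histogram(values, n_bins=100, trim_percentile=99.0):
    """Replace a raw per-region array with fixed-size histogram counts.

    Hypothetical helper: the trim rule and output layout are assumptions,
    not the layout bedboss actually stores.
    """
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return {"x_min": 0.0, "x_max": 0.0, "counts": [0] * n_bins, "total": 0}

    lower = float(arr.min())
    # Trim the long right tail so a few outliers do not flatten the histogram.
    upper = float(np.percentile(arr, trim_percentile))
    if upper <= lower:
        upper = lower + 1.0  # degenerate case: all values identical

    kept = arr[arr <= upper]
    counts, _ = np.histogram(kept, bins=n_bins, range=(lower, upper))
    return {
        "x_min": lower,
        "x_max": upper,
        "counts": counts.tolist(),
        "total": int(kept.size),
    }
```

The stored object has a fixed size regardless of how many regions the BED file contains, which is the property that makes the large reduction in blob size possible.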
fb665b0 to a083176
- Increase region distribution bins from 100 to 250
- Pass --compact flag for smaller intermediate JSON
- Add --save flag to test script for output inspection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
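To illustrate what a dense count array for the region distribution might look like, here is a small sketch. It assumes each chromosome is split into a fixed number of equal-width bins and region start positions are counted per bin; the function name `region_distribution` and the input layout are hypothetical, and the real binning may happen inside gtars rather than in Python.

```python
import numpy as np


def region_distribution(starts_by_chrom, chrom_sizes, n_bins=250):
    """Count region starts per equal-width bin for each chromosome.

    Hypothetical sketch: input shapes and the binning rule are assumptions.
    """
    out = {}
    for chrom, starts in starts_by_chrom.items():
        size = chrom_sizes[chrom]
        pos = np.asarray(starts, dtype=np.int64)
        # Map each start to a bin index; clamp so pos == size stays in range.
        idx = np.minimum((pos * n_bins) // size, n_bins - 1)
        out[chrom] = np.bincount(idx, minlength=n_bins).tolist()
    return out
```

A dense array stores zeros for empty bins, so every chromosome yields exactly `n_bins` integers and the UI can index bins directly without a lookup.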
- Add symmetric [-100kb, +100kb] TSS histogram (compress_tss_histogram) for signed distances from calc_feature_distances
- Add overflow bin count to widths histogram for trimmed outlier visibility
- Fix KDE bandwidth: compute from full trimmed data before downsampling and use population std (ddof=0) to match UI's gaussianKde
- Pass promoter-upstream, promoter-downstream, and region-dist-bins as bedstat() params through to gtars CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
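The bandwidth fix can be sketched as follows: the bandwidth is derived from the full data with the population standard deviation (ddof=0), and only afterwards is the data downsampled for the kernel sum. The Silverman-style constant 1.06, the `max_samples` cap, and the function name `gaussian_kde_curve` are assumptions for illustration; the commit does not state which bandwidth rule the UI's gaussianKde uses.

```python
import numpy as np


def gaussian_kde_curve(values, n_points=512, max_samples=5000, seed=0):
    """Evaluate a Gaussian KDE on a fixed grid.

    Hypothetical sketch: the bandwidth rule (1.06 * sigma * n^-1/5) is an
    assumed stand-in for whatever the UI computes.
    """
    data = np.asarray(values, dtype=float)
    n = data.size
    if n == 0:
        raise ValueError("cannot compute a KDE for empty input")

    # Bandwidth from the FULL data, using population std (ddof=0).
    sigma = data.std(ddof=0)
    h = 1.06 * sigma * n ** (-0.2)
    if h <= 0:
        h = 1.0  # assumption: fallback when all values are identical

    # Downsample only after the bandwidth is fixed.
    if n > max_samples:
        rng = np.random.default_rng(seed)
        sample = rng.choice(data, size=max_samples, replace=False)
    else:
        sample = data

    grid = np.linspace(data.min(), data.max(), n_points)
    z = (grid[:, None] - sample[None, :]) / h
    density = np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))
    return {
        "x_min": float(grid[0]),
        "x_max": float(grid[-1]),
        "bandwidth": float(h),
        "density": density.tolist(),
    }
```

Computing the bandwidth after downsampling would use a smaller n and a noisier standard deviation, giving a different smoothing width than a client that sees the full data, which is the mismatch the commit describes fixing.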
In local testing, the gtars pipeline takes around 5 s per BED file. I don't know whether that will hold up after deployment, but you can test the compressed distributions JSON that bedstat produces here: https://bedbase-ui.nsheff.workers.dev/debug
- Remove R-based bedsetStat.R, check_R_req.R, installRdeps.R and all R requirement checks
- Remove heavy param and create_plots from bedbuncher
- Remove requirements_check, check-requirements, install-requirements CLI commands
- Disable S3 uploads everywhere (upload_s3=False)
- Add precision param threaded through CLI -> run_all -> bedstat
- Import round_floats from bbconf instead of duplicating in tests
- Remove dead subprocess import
- Regenerate docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
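For readers unfamiliar with what a precision parameter does to the stored JSON, a recursive float-rounding helper might look like the sketch below. This is a hypothetical stand-in named `round_floats_sketch`; it is not bbconf's `round_floats`, whose exact signature and behavior are not shown in this PR.

```python
def round_floats_sketch(obj, precision=3):
    """Recursively round every float in a nested dict/list structure.

    Hypothetical stand-in for illustration; not the bbconf implementation.
    """
    if isinstance(obj, float):
        return round(obj, precision)
    if isinstance(obj, dict):
        return {key: round_floats_sketch(val, precision) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [round_floats_sketch(item, precision) for item in obj]
    # Ints, strings, booleans, and None pass through unchanged.
    return obj
```

Rounding before serialization shortens each float's text representation, which is presumably why precision is threaded down to bedstat.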
- Add get_gtf_path(): resolves genome name to Ensembl GTF via static map (13 genomes / 22 aliases) or REST API (~350 species), downloads from Ensembl FTP, and pre-compiles to .bin with gtars prep
- Support multiple Ensembl divisions: main, grch37, plants, fungi, protists
- Add auto-prep to get_osm_path(): downloaded signal matrices are now automatically compiled to .bin
- Wire both auto-downloads into bedstat() so ensdb and open_signal_matrix are resolved automatically when not provided
- Update bedboss prep CLI: add --genome flag to download + prep all reference files for a genome (e.g. bedboss prep --genome hg38)
- Add smoketests for static map, URL construction, download+prep flows
- Regenerate docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e620e8a to 9ef134c
- Add get_chrom_sizes_path() using refget SequenceCollectionClient to auto-download chrom.sizes from the seqcol API (resolves genome name to digest via refgenie, caches locally)
- Rename get_gtf_path → get_gda_path to reflect GDA binary format
- Log warning when GC content calculation is skipped (was silent)
- Add refget dependency
- Remove unused imports across bedboss, bbuploader, bedbuncher
- Update CLI prep command and tests for chromSizes removal
- Update docs/usage.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace custom Ensembl FTP download and alias resolution with refgenie rgc.seek/pull for GTF (ensembl_gtf asset) and chrom.sizes (fasta asset). Keep seqcol API as fallback for chrom.sizes when refgenie doesn't have the genome. Remove ~150 lines of Ensembl FTP code, static genome map, and REST API species lookup. Rewrite tests to mock refgenie calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TODO:
- `__version__.py` file