Replace ood-gen with data-packer and data-scrape #3463
Merged
sabine
approved these changes
Feb 25, 2026
Collaborator
I think the improvement in build times is substantial enough to make this worthwhile and the code is still quite similar to what it used to be.
So I rebased this on main and am inclined to merge this. Let's fix forward if any issues come up later.
I'm running a small test in terms of HTML response parity before I merge.
ETA: Zero differences. The data-packer refactor produces identical HTML output on 26 tested pages (for each type of data).
- data-packer: binary serialization of data/ content (replaces code generation)
- data-scrape: separate tool for scraping external sources (planet, youtube, etc.)

This eliminates code duplication between the tools. data-scrape depends on data-packer for shared utilities and parsers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
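As a rough illustration of what "binary serialization instead of code generation" can look like, here is a minimal sketch using the standard library's Marshal module. The PR text does not say which serialization format data-packer actually uses, so Marshal is an assumption, and the `entry` record and the `pack`/`unpack` names are hypothetical.

```ocaml
(* Hypothetical record standing in for one parsed data/ entry. *)
type entry = { slug : string; title : string; tags : string list }

(* Pack: serialize the whole list into a single binary string.
   Assumption: Marshal is used here only for illustration. *)
let pack (entries : entry list) : string = Marshal.to_string entries []

(* Unpack: Marshal is not type-safe, so the reader must annotate the
   expected type and agree with the writer on it. *)
let unpack (blob : string) : entry list = Marshal.from_string blob 0

let () =
  let entries =
    [ { slug = "intro"; title = "Introduction"; tags = [ "docs" ] };
      { slug = "news"; title = "News"; tags = [] } ]
  in
  let blob = pack entries in
  (* Round trip: the unpacked value is structurally equal to the input. *)
  assert (unpack blob = entries);
  Printf.printf "packed %d entries into %d bytes\n"
    (List.length entries) (String.length blob)
```

The point of the design is that the site links against a small reader like `unpack` rather than compiling a large generated source file.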
ood-gen is replaced by data-packer (binary serialization) and data-scrape (external source scraping).

BENCHMARK.md documents how to compare the two approaches:
- data-packer: 6.5s wall clock, 6.4s CPU
- ood-gen: 7.6s wall clock, 16.3s CPU (2.5x more CPU usage)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
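The "2.5x" figure follows directly from the CPU times quoted in this commit message. A small sketch that recomputes both ratios from those numbers (the `ratio` helper is just for illustration):

```ocaml
(* Ratio of the ood-gen measurement to the data-packer measurement. *)
let ratio ood_gen data_packer = ood_gen /. data_packer

let () =
  (* CPU: 16.3s vs 6.4s; wall clock: 7.6s vs 6.5s (figures from the commit). *)
  Printf.printf "CPU: %.2fx, wall clock: %.2fx\n"
    (ratio 16.3 6.4) (ratio 7.6 6.5)
```

The CPU ratio comes out at roughly 2.5x, while the wall-clock ratio is much smaller, at about 1.2x.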
Replace ~35 calls to get_all() with a single deserialization at module load time. While Lazy.force caches the result, this is cleaner and avoids repeated force checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
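The difference between the two styles can be sketched as follows. The names `deserialize`, `lazy_all`, and `all` are hypothetical, and the counter exists only to show how often the loader runs; only `get_all` and `Lazy.force` are mentioned in the commit message.

```ocaml
(* Hypothetical loader; the counter records how often it actually runs. *)
let loads = ref 0

let deserialize () =
  incr loads;
  [ "a"; "b"; "c" ]

(* Before: a lazy value that every accessor call has to force. *)
let lazy_all = lazy (deserialize ())
let get_all () = Lazy.force lazy_all

(* After: deserialize once, when the module is initialized. *)
let all = deserialize ()

let () =
  ignore (get_all ());
  ignore (get_all ());
  (* One eager load plus one lazy load: the second force hits the cache. *)
  assert (!loads = 2);
  assert (get_all () = all)
```

Both variants run the loader only once each. The eager form simply drops the force check at every call site, at the cost of always paying for deserialization at startup.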
Move scrape_report.ml from ood-gen to data-scrape and integrate upstream's Scrape_report tracking into the scraper pipeline:
- Scrapers now return Scrape_report.entry lists instead of unit
- Add --commit-file and --report-file CLI args for CI reporting
- Add project_display_name for platform release titles
- Delete remaining ood-gen directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
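To illustrate the change from unit-returning scrapers to scrapers that return report entries, here is a minimal sketch. The local `Scrape_report` module below is a hypothetical stand-in: the real entry type and its fields are not shown in this PR page, and the scraper functions and their results are invented for the example.

```ocaml
(* Hypothetical stand-in for the PR's Scrape_report module; the real
   entry type is not shown in the source. *)
module Scrape_report = struct
  type entry = { source : string; new_items : int }

  let to_line e = Printf.sprintf "%s: %d new item(s)" e.source e.new_items
end

(* Each scraper returns a description of what it did instead of unit. *)
let scrape_planet () : Scrape_report.entry list =
  [ { Scrape_report.source = "planet"; new_items = 3 } ]

let scrape_youtube () : Scrape_report.entry list =
  [ { Scrape_report.source = "youtube"; new_items = 0 } ]

(* The pipeline concatenates the per-scraper results into one report. *)
let run_all () = List.concat [ scrape_planet (); scrape_youtube () ]

let report () =
  run_all () |> List.map Scrape_report.to_line |> String.concat "\n"

let () = print_endline (report ())
```

Returning values rather than unit lets the caller decide what to do with the result, for example writing it to a file for CI.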
Summary

This PR replaces the ood-gen code generation tool with two focused tools:
- data-packer: binary serialization of data/ content (YAML/Markdown → 17MB binary blob)
- data-scrape: scraping of external sources

Motivation

The old ood-gen approach generated ~119MB of OCaml source code that then needed to be compiled. This PR:

Benchmark Results

Changes

- tool/data-packer/ - New binary packing tool with parsers and serialization
- tool/data-scrape/ - New scraping tool (depends on data-packer for shared code)
- tool/ood-gen/ - Deleted (replaced by above)
- src/ocamlorg_data/ - Updated to use binary blob instead of generated code
- .github/workflows/scrape*.yml - Updated to use data-scrape
- Makefile - Updated scraping targets

Test plan

- … (make start)
- … (dune exec tool/data-scrape/bin/scrape.exe -- --help)

🤖 Generated with Claude Code