Replace ood-gen with data-packer and data-scrape #3463

Merged
sabine merged 6 commits into main from data-packer
Feb 25, 2026

Conversation

@cuihtlauac
Collaborator

Summary

This PR replaces the ood-gen code generation tool with two focused tools:

  • data-packer: Binary serialization of data/ content (YAML/Markdown → 17MB binary blob)
  • data-scrape: Separate tool for scraping external sources (OCaml Planet, YouTube, watch.ocaml.org, GitHub releases)
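To illustrate the idea behind data-packer, here is a minimal sketch of packing parsed content into a single binary blob and reading it back. It assumes the standard library's `Marshal` module as the encoding; the PR text does not say which serialization format data-packer actually uses, and the `page` record and the `pack`/`unpack` names are hypothetical.

```ocaml
(* Hypothetical record standing in for one parsed data/ entry. *)
type page = { slug : string; title : string; body_md : string }

(* Pack: serialize the whole list into one byte string.
   Assumption: Marshal is used here only for illustration. *)
let pack (pages : page list) : string = Marshal.to_string pages []

(* Unpack: Marshal does not check types, so the reader must agree
   with the writer on the type of the stored value. *)
let unpack (blob : string) : page list = Marshal.from_string blob 0

let () =
  let pages =
    [ { slug = "install"; title = "Install OCaml"; body_md = "# Install" } ]
  in
  let blob = pack pages in
  assert (unpack blob = pages);
  Printf.printf "packed %d page(s) into %d bytes\n"
    (List.length pages) (String.length blob)
```

The point of the sketch is that the build only has to ship and load one blob, rather than compile generated source for every entry.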

Motivation

The old ood-gen approach generated ~119MB of OCaml source code that then needed to be compiled. This PR:

  • Reduces CPU usage by 2.5x (6.4s vs 16.3s user time)
  • Reduces output size by 7x (17MB binary vs ~119MB generated .ml files)
  • Simplifies the build - no generated .ml files to compile
  • Separates concerns - packing and scraping are now independent tools
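The ratios quoted above can be re-derived from the measured figures. The sketch below uses only numbers from this description (the wall-clock values come from the benchmark table); the helper names are made up for the example.

```ocaml
(* Figures copied from the PR description. *)
let user_time_ood_gen = 16.25 (* seconds *)
let user_time_data_packer = 6.37
let output_ood_gen_mb = 119.0
let output_data_packer_mb = 17.0
let wall_ood_gen = 7.6 (* seconds *)
let wall_data_packer = 6.5

(* How many times larger [a] is than [b]. *)
let ratio a b = a /. b

(* Relative reduction from [before] to [after], in percent. *)
let percent_reduction ~before ~after = (before -. after) /. before *. 100.0

let () =
  Printf.printf "user time: %.2fx less\n"
    (ratio user_time_ood_gen user_time_data_packer);
  Printf.printf "output size: %.1fx smaller\n"
    (ratio output_ood_gen_mb output_data_packer_mb);
  Printf.printf "wall clock: %.1f%% faster\n"
    (percent_reduction ~before:wall_ood_gen ~after:wall_data_packer)
```

This gives roughly 2.5x for user time, 7x for output size, and 15% for wall clock, matching the rounded figures.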

Benchmark Results

| Metric | ood-gen | data-packer | Improvement |
|---|---|---|---|
| Wall clock | 7.6s | 6.5s | 15% faster |
| User time | 16.25s | 6.37s | 2.5x less CPU |
| CPU usage | 237% | 106% | Less parallel work |
| Output | ~119MB .ml | 17MB binary | 7x smaller |

Changes

  • tool/data-packer/ - New binary packing tool with parsers and serialization
  • tool/data-scrape/ - New scraping tool (depends on data-packer for shared code)
  • tool/ood-gen/ - Deleted (replaced by above)
  • src/ocamlorg_data/ - Updated to use binary blob instead of generated code
  • .github/workflows/scrape*.yml - Updated to use data-scrape
  • Makefile - Updated scraping targets

Test plan

  • CI passes on Linux and macOS
  • Site builds and runs correctly (make start)
  • Scrapers work (dune exec tool/data-scrape/bin/scrape.exe -- --help)

🤖 Generated with Claude Code

Collaborator

@sabine sabine left a comment

I think the improvement in build times is substantial enough to make this worthwhile and the code is still quite similar to what it used to be.

So I rebased this on main and am inclined to merge this. Let's fix forward if any issues come up later.

I'm running a small test in terms of HTML response parity before I merge.

ETA: Zero differences. The data-packer refactor produces identical HTML output on 26 tested pages (for each type of data).
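A parity check of this kind can be sketched as a comparison of rendered pages keyed by path. This is only an illustration, not the test that was actually run: how the HTML is fetched from the two builds is left out, and `differing_pages` is a hypothetical helper.

```ocaml
(* Return the paths whose HTML differs between two builds.
   A page missing from [after] also counts as a difference. *)
let differing_pages (before : (string * string) list)
    (after : (string * string) list) : string list =
  List.filter_map
    (fun (path, html_before) ->
      match List.assoc_opt path after with
      | Some html_after when String.equal html_before html_after -> None
      | _ -> Some path)
    before

let () =
  (* Placeholder responses; a real run would fetch these from both builds. *)
  let old_build = [ ("/", "<h1>OCaml</h1>"); ("/learn", "<h1>Learn</h1>") ] in
  let new_build = [ ("/", "<h1>OCaml</h1>"); ("/learn", "<h1>Learn</h1>") ] in
  Printf.printf "%d differing page(s)\n"
    (List.length (differing_pages old_build new_build))
```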

cuihtlauac and others added 6 commits February 25, 2026 15:48
- data-packer: binary serialization of data/ content (replaces code generation)
- data-scrape: separate tool for scraping external sources (planet, youtube, etc.)

This eliminates code duplication between the tools. data-scrape depends on
data-packer for shared utilities and parsers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ood-gen is replaced by data-packer (binary serialization) and
data-scrape (external source scraping).

BENCHMARK.md documents how to compare the two approaches:
- data-packer: 6.5s wall clock, 6.4s CPU
- ood-gen: 7.6s wall clock, 16.3s CPU (2.5x more CPU usage)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace ~35 calls to get_all() with a single deserialization
at module load time. While Lazy.force caches the result,
this is cleaner and avoids repeated force checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
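The change described in the commit above can be pictured as follows. This is a simplified sketch, not the code in src/ocamlorg_data/: the `data` record, `deserialize`, and the `loads` counter are hypothetical stand-ins, and only `get_all` is a name taken from the commit message.

```ocaml
(* Hypothetical stand-in for the deserialized content. *)
type data = { tutorials : string list; releases : string list }

(* Counts how often the expensive step runs; for illustration only. *)
let loads = ref 0

let deserialize () : data =
  incr loads;
  { tutorials = [ "first-hour" ]; releases = [ "example-release" ] }

(* Before: a lazy value, forced at every call site through get_all (). *)
let lazy_data = lazy (deserialize ())
let get_all () = Lazy.force lazy_data

(* After: one top-level binding, evaluated once when the module is
   initialised; accessors read plain fields with no force check. *)
let all : data = deserialize ()
let tutorials = all.tutorials
let releases = all.releases

let () =
  ignore (get_all ());
  ignore (get_all ());
  (* One load for [all], one for the first force; the second force is cached. *)
  Printf.printf "deserialize ran %d time(s); %d tutorial(s), %d release(s)\n"
    !loads (List.length tutorials) (List.length releases)
```

Both variants deserialize only once each; the eager binding mainly removes the repeated `Lazy.force` at every call site, at the cost of always paying the load at startup.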
Move scrape_report.ml from ood-gen to data-scrape and integrate
upstream's Scrape_report tracking into the scraper pipeline:
- Scrapers now return Scrape_report.entry lists instead of unit
- Add --commit-file and --report-file CLI args for CI reporting
- Add project_display_name for platform release titles
- Delete remaining ood-gen directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
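As a rough picture of scrapers returning report entries instead of unit, consider the sketch below. The fields of `entry`, the `status` variants, the `summary` function, and the two scraper functions are all assumptions made for the example; the real `Scrape_report` module may look different.

```ocaml
(* Hypothetical shape of a report entry; the real Scrape_report.entry may differ. *)
module Scrape_report = struct
  type status = Added | Unchanged | Failed of string
  type entry = { source : string; item : string; status : status }

  (* Condense a list of entries into a one-line summary. *)
  let summary (entries : entry list) : string =
    let count p = List.length (List.filter p entries) in
    Printf.sprintf "%d added, %d unchanged, %d failed"
      (count (fun e -> e.status = Added))
      (count (fun e -> e.status = Unchanged))
      (count (fun e -> match e.status with Failed _ -> true | _ -> false))
end

(* Each scraper reports what it did rather than returning unit.
   The bodies are placeholders; no network access happens here. *)
let scrape_planet () : Scrape_report.entry list =
  [ Scrape_report.{ source = "planet"; item = "post-a"; status = Added } ]

let scrape_youtube () : Scrape_report.entry list =
  [ Scrape_report.{ source = "youtube"; item = "video-1"; status = Unchanged };
    Scrape_report.{ source = "youtube"; item = "video-2"; status = Failed "timeout" } ]

(* The pipeline concatenates the per-scraper results into one report. *)
let run_all () : Scrape_report.entry list =
  List.concat [ scrape_planet (); scrape_youtube () ]

let () = print_endline (Scrape_report.summary (run_all ()))
```

Returning values makes the scrapers composable: a CI step can write the combined list to whatever file is passed via `--report-file` without each scraper knowing about reporting.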
@sabine sabine merged commit e32fa58 into main Feb 25, 2026
2 of 5 checks passed
@sabine sabine deleted the data-packer branch February 25, 2026 14:49