Pull request overview
This PR refactors the benchmarking infrastructure for the glum library to modernize it and make it easier to maintain and use. The goal is to update benchmarks against newer libraries and simplify the overly complex CLI-based system.
Changes:
- Moved benchmarking code from `src/glum_benchmarks` to a top-level `glum_benchmarks` directory (not part of the distributed package)
- Replaced the CLI-based approach with a single self-contained `run_benchmarks.py` script with configuration at the top
- Added new benchmark libraries: Celer, scikit-learn, and skglm
- Removed old benchmark scripts, CLI code, and unused development code
- Simplified problem definitions by removing CV, weights, and offset variants
- Added basic tests for the benchmark infrastructure
Reviewed changes
Copilot reviewed 36 out of 42 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `glum_benchmarks/run_benchmarks.py` | New main benchmarking script with configurable workflow (run, analyze, plot) |
| `glum_benchmarks/util.py` | Refactored utilities with functions moved from the CLI, added standardization |
| `glum_benchmarks/problems.py` | Updated problem definitions, removed `single_precision` support |
| `glum_benchmarks/libraries/` | New library benchmark implementations for celer, sklearn, skglm |
| `glum_benchmarks/data/` | New data generation modules for housing and insurance datasets |
| `glum_benchmarks/tests/test_benchmarks.py` | New basic tests for benchmark infrastructure |
| `glum_benchmarks/README.md` | New comprehensive documentation for running benchmarks |
| `setup.py` | Removed `glum_benchmarks` from package distribution and CLI entry points |
| `pixi.toml` | Updated benchmark environment configuration and tasks |
| `.gitignore` | Updated to track only `results/docs/results.csv` |
| Tests in `tests/glm/` | Updated imports to use the new module structure |
| Old files | Deleted CLI code, old scripts, and unused development code |
Comments suppressed due to low confidence (3)
`glum_benchmarks/util.py:359`
- The import of `diags` from `scipy.sparse` is placed inside the function. While this works, it's unconventional and could lead to performance issues if this function is called frequently in a loop. Consider moving this import to the top of the file with the other `scipy.sparse` imports.
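For illustration, the suggested change is just to hoist the import to module level, roughly as below; the function name is a hypothetical stand-in for the real call site at `glum_benchmarks/util.py:359`.

```python
# Module-level import: executed once, when util.py is first imported.
from scipy.sparse import diags


def build_weight_diagonal(weights):  # hypothetical stand-in for the real function
    # No function-local `from scipy.sparse import diags` needed here; repeated
    # calls no longer touch the import machinery.
    return diags(weights)
```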
`glum_benchmarks/util.py:300`
- The `get_params_from_fname` function doesn't validate that the number of parts matches the expected parameter count. If a filename has too many or too few underscore-separated parts, this could lead to silent errors where parameters are missing or incorrectly parsed. Consider adding validation to ensure the parts list has the expected length.
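A hedged sketch of the kind of validation being suggested, assuming a simple underscore-separated filename scheme; the real field names, ordering, and parsing logic in `glum_benchmarks/util.py` may differ.

```python
# Hypothetical scheme: "<library>_<problem>_<num_rows>_<regularization>.pickle".
EXPECTED_PARTS = 4


def get_params_from_fname(fname: str) -> dict:
    parts = fname.removesuffix(".pickle").split("_")
    if len(parts) != EXPECTED_PARTS:
        raise ValueError(
            f"Expected {EXPECTED_PARTS} underscore-separated parts in {fname!r}, "
            f"got {len(parts)}: {parts}"
        )
    library, problem, num_rows, regularization = parts
    return {
        "library": library,
        "problem": problem,
        "num_rows": int(num_rows),
        "regularization": float(regularization),
    }
```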
`glum_benchmarks/util.py:289`
- The `parse_value` function will raise a `ValueError` if the value cannot be converted to the expected type (e.g., int or float). This error is not handled and could cause the benchmark analysis to fail. Consider adding try/except error handling to provide more informative error messages when parsing fails.
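A minimal sketch of the suggested error handling; the actual signature of `parse_value` and the types it supports may differ.

```python
def parse_value(raw: str, expected_type: type):
    """Convert a string parameter to its expected type, failing with context."""
    try:
        return expected_type(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Could not parse {raw!r} as {expected_type.__name__} "
            "while analyzing benchmark results"
        ) from exc
```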
stanmart left a comment:
Super promising, and much more user-friendly than the old benchmark suite. Very good job!
Comment on `glum_benchmarks/util.py` (outdated):

```python
result = L(
    dat,
    distribution=P.distribution,
    alpha=params.regularization_strength,
```
I find the terminology with regularization a bit confusing:
- `regularization_strength` is the per-observation alpha for unweighted data.
- `reg_multiplier` is the inverse weight.
- `alpha` is the lower-level name for `regularization_strength`.
I think that the following names would be clearer:
- `alpha` is set in the config, noting in the documentation that this is the per-observation alpha for unweighted data.
- Instead of `reg_multiplier`, we can just pass a `weight` column to `L` and perform the `reg_multiplier` computation internally.
That makes a lot of sense; the old approach was indeed overcomplicated.
The new process is as follows:
- `alpha` is set in the config
- `execute_problem_library` adjusts alpha once if `sample_weight` exists, before it is passed to `L`, so we no longer need to do this within each benchmarking function
- `reg_multiplier` and `regularization_strength` are completely omitted as we don't need them anymore (terminology aligned to use `alpha`)
Disclaimer:
I know this is a fairly large PR, so if anyone has any questions about it or would like me to explain something in more detail, I am more than happy to do so (also during a code walkthrough meeting).
Summary of changes:
- Moved the benchmarks out of the `src` folder into a top-level `glum_benchmarks` folder (see "New structure of the code" below)

Motivation:
The goal is to update the benchmarking to cover more recent libraries so that the outdated results can be updated. Another goal is to simplify the overly complicated CLI setup that hasn't been used recently, to make it easier to generate new benchmark results.
Approach:
The high-level approach I opted for involved separating the two issues that the old CLI/benchmarking system addressed:
These two points should use the same dataset generation and problem definition, but in different ways (e.g. a simple script vs. automated runtime evaluation).
Implementation:
Updated libraries:
As the libraries we were comparing with were quite outdated, I added some more up-to-date ones:
- `bench_celer.py`
- `bench_sklearn.py`
- `bench_skglm.py`

All problems supported by the relevant libraries have been implemented. However, none of them can be run on the full set of problems, as some features are missing.
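For a flavor of what these wrappers do, here is an illustrative sketch of a scikit-learn-style benchmark function. The function name, signature, and result format are assumptions on my part; the actual implementations live in `glum_benchmarks/libraries/`.

```python
import time

import numpy as np
from sklearn.linear_model import PoissonRegressor


def sklearn_bench(X: np.ndarray, y: np.ndarray, alpha: float, max_iter: int = 1000) -> dict:
    """Fit a Poisson GLM with scikit-learn and time only the model fitting."""
    model = PoissonRegressor(alpha=alpha, max_iter=max_iter)
    start = time.perf_counter()
    model.fit(X, y)
    runtime = time.perf_counter() - start
    return {
        "runtime": runtime,
        "n_iter": int(model.n_iter_),
        "intercept": float(model.intercept_),
        "coef": model.coef_,
    }
```

The other `bench_*.py` wrappers presumably follow a similar fit-and-time shape so that the results are comparable across libraries.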
Updated problems:
As there were far too many problems to run the benchmarking on, I decided to focus on the most important ones and delete those that only provided limited new insights. The following variants were useful during development, but don't provide any additional insight into runtime comparisons, as they don't affect the underlying computational costs and just complicate things:
- cross-validation (CV) variants
- weights variants
- offset variants

I therefore adapted `bench_glum.py`, `bench_h2o.py`, and `bench_liblinear.py` so that they no longer cover those cases.

New structure of the code:
The benchmarking code previously lived in the `src/glum_benchmarks` folder, which made it part of the main package. I have now restructured it into a top-level `glum_benchmarks` folder, which is not distributed. Benchmarks are a development and documentation tool and should not be part of the glum library itself.

New benchmarking approach:
This replaces the CLI-based approach with a single self-contained script that runs all three benchmarking steps.
The benchmarking can be done by simply running
`pixi run -e benchmark run-benchmarks`.

Configuration:
All settings are defined in `config.yaml`, including the name of the run (`RUN_NAME`) and the boolean workflow flags described below.
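For illustration only, the configuration might contain settings along these lines; only `RUN_NAME` and the three workflow flags below are named in this description, and the values are made up.

```python
# Written as a Python dict for illustration; the real settings live in config.yaml.
EXAMPLE_CONFIG = {
    "RUN_NAME": "docs_update",   # results are written to results/{RUN_NAME}/
    "RUN_BENCHMARKS": True,      # fit models and save pickles
    "ANALYZE_RESULTS": True,     # build results.csv from the pickles
    "GENERATE_PLOTS": True,      # render bar charts from results.csv
}
```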
Workflow steps:

The different steps can also be separated using the boolean flags. For example, one can regenerate the figures from an existing CSV file without running the benchmarks again.
RUN_BENCHMARKS: Executes each library on each configured problem combination, times the model fitting, and saves detailed results.
- Saves one pickle file per library/problem combination to the `results/{RUN_NAME}/pickles` directory (this step replaces the old `cli_run.py`)
- The config is copied into the `results/{RUN_NAME}` folder, so the results can be reproduced

ANALYZE_RESULTS: Loads all pickle files from the run folder, computes summary metrics, prints a comparison table, and exports to CSV.
- Input: `results/{RUN_NAME}/pickles/`
- Output: `results/{RUN_NAME}/results.csv`
- Can be run without RUN_BENCHMARKS if the pickles already exist

GENERATE_PLOTS: Reads the CSV and generates bar chart comparisons, one per dataset/regularization combination.
- Input: `results/{RUN_NAME}/results.csv`
- Output: figures in `results/{RUN_NAME}/figures/`, broken down by `max_iter`
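Taken together, a rough sketch of how these three gated steps could be wired up; the helper functions here are hypothetical placeholders, and the real logic lives in `run_benchmarks.py` and `util.py`.

```python
import pathlib

import yaml


def run_all_benchmarks(cfg: dict, pickle_dir: pathlib.Path) -> None:
    ...  # placeholder: fit every configured library/problem combination, pickle one result each


def analyze_results(pickle_dir: pathlib.Path, csv_path: pathlib.Path) -> None:
    ...  # placeholder: load pickles, compute summary metrics, print table, write results.csv


def generate_plots(csv_path: pathlib.Path, figure_dir: pathlib.Path) -> None:
    ...  # placeholder: one bar chart per dataset/regularization combination


def main() -> None:
    cfg = yaml.safe_load(pathlib.Path("config.yaml").read_text())
    run_dir = pathlib.Path("results") / cfg["RUN_NAME"]

    if cfg["RUN_BENCHMARKS"]:
        run_all_benchmarks(cfg, run_dir / "pickles")
    if cfg["ANALYZE_RESULTS"]:
        analyze_results(run_dir / "pickles", run_dir / "results.csv")
    if cfg["GENERATE_PLOTS"]:
        generate_plots(run_dir / "results.csv", run_dir / "figures")


if __name__ == "__main__":
    main()
```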
Git Tracking:

Only `results/docs/results.csv` and `config.yaml` are tracked; pickles, figures, and other run folders are gitignored. The reasoning is that only the CSV file and the config used to generate the documentation figures are pushed, while files produced during testing or iteration are not tracked.

Usage:
Once we have agreed on how to set the parameters in the configuration section, the idea is that one can simply run this script periodically to obtain new figures that can be used to update the documentation.
If you want to update the benchmarks, you can simply run the benchmarking script with the checked-in configuration, update the documentation with the generated figures, and push the CSV so that everyone can reproduce it.
Tests:
I also added some basic tests in
`test_benchmarks.py` (those are AI-generated!!).

Cleaning of the code:
I deleted some old leftover code snippets that were used during development and are no longer needed due to the new way of running benchmarks.
- `benchmark_data.csv`, `benchmark_figure.py`: we now handle all of this in `run_benchmarks.py`
- `run_intermediate_benchmarks.sh`, `run_narrow_benchmarks.sh`, `run_real_benchmarks.sh`, `run_wide_benchmarks.sh`: these just called CLI commands that are now replaced by `run_benchmarks.py`
- `benchmark_dense_sandwich.py`, `benchmark_sparse_sandwich.py`: used during development but no longer relevant
- `cli_analyze.py`, `cli_run.py`, `test_cli.py`: the CLI is replaced by `run_benchmarks.py`; the relevant functions were moved to `util.py` so that they can be reused in the new benchmarking script
There are a few things that still need to be tackled:
- `README` and documentation explaining how to run the benchmarks