Fix/update benchmarks #968

Open

DavidEiglspergerQC wants to merge 20 commits into main from fix/update_benchmarks

Conversation

Contributor

DavidEiglspergerQC commented Jan 22, 2026

Disclaimer:
I know this is a fairly large PR, so if anyone has any questions about it or would like me to explain something in more detail, I am more than happy to do so (also during a code walkthrough meeting).

Summary of changes:

  • Added new libraries that we are benchmarking against
  • Simplified benchmarking by creating a configurable script that can easily be run to update the benchmarks
  • Moved benchmarking out of the src folder
  • Removed old, unused code

Motivation:
The goal is to benchmark against more recent libraries so that the outdated results can be refreshed. Another goal is to simplify the overly complicated CLI setup, which hasn't been used recently, making it easier to generate new benchmark results.

Approach:
The high-level approach I opted for involved separating the two issues that the old CLI/benchmarking system addressed:

  1. Benchmarking against similar libraries:
    • This one does not require a complex CLI in which we test many different problems (e.g. offsets and weights), as it only serves to compare glum's main functionality, and we don't report benchmarking results for hundreds of problems anyway
    • This issue is addressed in this PR by reworking the way we conduct benchmarking
  2. Monitoring glum's runtime performance:
    • Independently of the benchmarking, we should monitor glum's performance across changes and report the results automatically, so that we can see the impact of a change on performance directly
    • This has not yet been implemented and will be done in a separate PR (exactly how is still to be decided)
    • For this one, we could consider using the entire suite of problems (e.g. offsets and weights) to ensure that glum remains performant for all these use cases
    • As this will be automated into the CI, we don't need a complex CLI for this either

These two points should use the same dataset generation and problem definition, but in different ways (e.g. a simple script vs. automated runtime evaluation).

Implementation:

Updated libraries:
As the libraries we were comparing with were quite outdated, I added some more up-to-date ones: celer, scikit-learn and skglm.

All problems supported by the relevant libraries have been implemented. However, none of them can be run on the full set of problems as some features are missing.

Updated problems:
As there were far too many problems to run the benchmarking on, I decided to focus on the most important ones and delete those that only provided limited new insights. The following were useful during development, but don't provide any additional insight into runtime comparisons as they don't affect the underlying computational costs and just complicate things:

  • Cross-validation problems
  • Weighted problems
  • Offset problems

I therefore adapted bench_glum.py, bench_h2o.py and bench_liblinear.py so that they no longer cover those cases.

New structure of the code:
The benchmarking code was previously loaded into the src/glum_benchmarks folder, which made it part of the main package. However, I have now restructured it to a top-level glum_benchmarks folder, as this is not distributed. Benchmarks are a development and documentation tool and should not be part of the Glum library itself.

New benchmarking approach:
The CLI-based approach is replaced with a single self-contained script that runs all three benchmarking steps.
The benchmarks can be run by simply executing pixi run -e benchmark run-benchmarks.

Configuration:
All settings are defined in config.yaml:

  • Which steps to run
  • Where to store the results (different runs can be separated by using a different RUN_NAME)
  • Which libraries/problems to benchmark
  • Which settings to use for the benchmarks (e.g. number of threads, rows, iterations)
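
For orientation, a minimal sketch of how run_benchmarks.py might read these settings. Everything except RUN_NAME and the three step flags is a hypothetical key, not the actual schema:

```python
# Minimal sketch of loading the benchmark configuration (assumed schema).
from pathlib import Path

import yaml

config = yaml.safe_load(Path("glum_benchmarks/config.yaml").read_text())

run_name = config["RUN_NAME"]  # separates the outputs of different runs
steps_enabled = {
    "run": config["RUN_BENCHMARKS"],
    "analyze": config["ANALYZE_RESULTS"],
    "plot": config["GENERATE_PLOTS"],
}
results_dir = Path("results") / run_name  # results/{RUN_NAME}/...
```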

Workflow steps:

The individual steps can be toggled with boolean flags. For example, one can regenerate the figures from an existing CSV file without running the benchmarks again.

RUN_BENCHMARKS:
Executes each library on each configured problem combination, times the model fitting, and saves detailed results.

  • Input: Configuration variables (e.g. problems, libraries and benchmarking settings)
  • Output: Pickle files in the results/{RUN_NAME}/pickles directory, with one file for each library/problem combination
  • The specific benchmarks are run and stored in pickles. This is done in the same way as before with the CLI, as we reuse a slightly adapted version of the old functions in cli_run.py
  • Dependencies: Requires a benchmark environment with all library dependencies installed
  • Note: libraries that do not support a given distribution/regularization combination are skipped with a warning
  • Re-running will overwrite existing pickle files with the same parameters
  • The config used to generate the results is also saved in the RUN_NAME folder, so the results can be reproduced
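
For illustration, a minimal sketch of what this step does per library/problem combination; the helper name run_one and the result keys are hypothetical, as the actual code reuses the adapted cli_run.py functions:

```python
# Minimal sketch of the run step: time one library on one problem and pickle
# the result (one pickle per library/problem combination).
import pickle
import time
from pathlib import Path


def run_one(fit_library, problem_data, out_dir: Path, tag: str) -> dict:
    start = time.perf_counter()
    model = fit_library(problem_data)  # fit the benchmarked library
    runtime = time.perf_counter() - start

    result = {"runtime": runtime, "n_iter": getattr(model, "n_iter_", None)}
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{tag}.pickle", "wb") as f:
        pickle.dump(result, f)
    return result
```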

ANALYZE_RESULTS:
Loads all pickle files from the run folder, computes summary metrics, prints a comparison table, and exports to CSV.

  • Input: Pickle files from results/{RUN_NAME}/pickles/
  • Output: results/{RUN_NAME}/results.csv
  • Computed metrics and printed results: exactly the same as before, as we reuse an adapted version of the old functions from cli_analyze.py
  • Dependencies: Requires existing pickle files in the folder (can run without RUN_BENCHMARKS if pickles already exist)
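
A minimal sketch of this step, assuming each pickle holds a flat dict of metrics; the column names and helper name are illustrative, not the script's actual API:

```python
# Minimal sketch of the analyze step: collect every pickle in the run folder
# into one table and export it as results/{RUN_NAME}/results.csv.
import pickle
from pathlib import Path

import pandas as pd


def analyze_run(run_dir: Path) -> pd.DataFrame:
    rows = []
    for path in sorted(run_dir.glob("pickles/*.pickle")):
        with open(path, "rb") as f:
            result = pickle.load(f)
        rows.append({"name": path.stem, **result})
    df = pd.DataFrame(rows)
    df.to_csv(run_dir / "results.csv", index=False)
    return df
```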

GENERATE_PLOTS:
Reads the CSV and generates bar chart comparisons, one per dataset/regularization combination.

  • Input: results/{RUN_NAME}/results.csv
  • Output: PNG files in results/{RUN_NAME}/figures/
  • Generates two plots per combination: raw runtime (seconds) and normalized runtime (relative to glum = 1.0)
  • Bars exceeding 10× the fastest runtime are clipped with the actual value annotated
  • If a library doesn't support a certain problem, this is indicated by hatched bars with an "N/A" label
  • If a library doesn't converge on a problem, we indicate this by hatching the bar and putting a red "NC" on top; "not converged" here means hitting max_iter
  • Dependencies: Only requires the CSV file, can run independently of the other steps
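
A minimal sketch of the clipping logic, assuming a DataFrame with hypothetical library and runtime columns; the hatching for "N/A" and "NC" bars is omitted here:

```python
# Minimal sketch of the plotting step: one bar per library, bars above 10x the
# fastest runtime are clipped and annotated with their actual value.
import matplotlib.pyplot as plt
import pandas as pd


def plot_runtimes(df: pd.DataFrame, out_path: str) -> None:
    fastest = df["runtime"].min()
    clip = 10 * fastest
    heights = df["runtime"].clip(upper=clip)

    fig, ax = plt.subplots()
    bars = ax.bar(df["library"], heights)
    for bar, runtime in zip(bars, df["runtime"]):
        if runtime > clip:  # annotate clipped bars with the true value
            ax.annotate(
                f"{runtime:.1f}s",
                (bar.get_x() + bar.get_width() / 2, clip),
                ha="center",
                va="bottom",
            )
    ax.set_ylabel("runtime (s)")
    fig.savefig(out_path)
    plt.close(fig)
```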

Git Tracking:
Only results/docs/results.csv and config.yaml are tracked. Pickles, figures, and other run folders are gitignored. The reason is that only the CSV file and the config used to generate the documentation figures are pushed; all other files, e.g. those used for testing or iteration, are not tracked.

Usage:
Once we have agreed on how to set the parameters in the configuration section, the idea is that this script can simply be run periodically to obtain new figures for updating the documentation.
To update the benchmarks, run the benchmarking script with the agreed configuration, update the documentation with the generated figures, and push the CSV so that everyone can recreate it.

Tests:
I also added some basic tests in test_benchmarks.py (those are AI-generated!!).

Cleaning up the code:
I deleted some old leftover code snippets that were used during development and are no longer needed due to the new way of running benchmarks.

  • [benchmark_data.csv, benchmark_figure.py] as we now handle all of this in run_benchmarks.py
  • [run_intermediate_benchmarks.sh, run_narrow_benchmarks.sh, run_real_benchmarks.sh, run_wide_benchmarks.sh] as those just called CLI commands that are now replaced by run_benchmarks.py
  • [benchmark_dense_sandwich.py, benchmark_sparse_sandwich.py] which were used during development but are no longer relevant
  • [cli_analyze.py, cli_run.py, test_cli.py] as the CLI is replaced by run_benchmarks.py. The relevant functions were moved to util.py so that they can be reused in the new benchmarking script

Open topics:
There are a few things that still need to be tackled:

  • Once we have agreed on the settings (e.g. whether or not to scale) and the problems we want to run the benchmarks on, we will have to run the benchmarks and update the documentation
  • Automated replacement of the doc scripts (controlled by a bool)
  • Add a case with an (almost) equal number of rows and columns
  • Allow the yaml to specify more flexible combinations instead of always taking the cross-product (the current behaviour is sketched after this list)
  • We have to update the section in the README and documentation explaining how to run the benchmarks
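
For reference, a sketch of the cross-product behaviour mentioned above; the library and problem names are made up, and the real config keys may differ:

```python
# Illustrative sketch: every configured library is paired with every problem.
from itertools import product

libraries = ["glum", "sklearn", "skglm"]  # hypothetical config values
problems = ["narrow-insurance-l2", "wide-insurance-lasso"]

combinations = list(product(libraries, problems))  # full cross-product
# A more flexible schema would let config.yaml list explicit (library, problem)
# pairs instead of always expanding this product.
```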

Base automatically changed from fix/housing_data_generation to main January 22, 2026 13:42
Copilot AI left a comment

Pull request overview

This PR refactors the benchmarking infrastructure for the glum library to modernize it and make it easier to maintain and use. The goal is to update benchmarks against newer libraries and simplify the overly complex CLI-based system.

Changes:

  • Moved benchmarking code from src/glum_benchmarks to a top-level glum_benchmarks directory (not part of the distributed package)
  • Replaced the CLI-based approach with a single self-contained run_benchmarks.py script with configuration at the top
  • Added new benchmark libraries: Celer, scikit-learn, and skglm
  • Removed old benchmark scripts, CLI code, and unused development code
  • Simplified problem definitions by removing CV, weights, and offset variants
  • Added basic tests for the benchmark infrastructure

Reviewed changes

Copilot reviewed 36 out of 42 changed files in this pull request and generated 6 comments.

Summary per file:

  • glum_benchmarks/run_benchmarks.py: New main benchmarking script with configurable workflow (run, analyze, plot)
  • glum_benchmarks/util.py: Refactored utilities with moved functions from CLI, added standardization
  • glum_benchmarks/problems.py: Updated problem definitions, removed single_precision support
  • glum_benchmarks/libraries/: New library benchmark implementations for celer, sklearn, skglm
  • glum_benchmarks/data/: New data generation modules for housing and insurance datasets
  • glum_benchmarks/tests/test_benchmarks.py: New basic tests for benchmark infrastructure
  • glum_benchmarks/README.md: New comprehensive documentation for running benchmarks
  • setup.py: Removed glum_benchmarks from package distribution and CLI entry points
  • pixi.toml: Updated benchmark environment configuration and tasks
  • .gitignore: Updated to track only results/docs/results.csv
  • Tests in tests/glm/: Updated imports to use new module structure
  • Old files: Deleted CLI code, old scripts, and unused development code
Comments suppressed due to low confidence (3)

glum_benchmarks/util.py:359
  • The import of diags from scipy.sparse is placed inside the function. While this works, it's unconventional and could lead to performance issues if this function is called frequently in a loop. Consider moving this import to the top of the file with the other scipy.sparse imports.

glum_benchmarks/util.py:300
  • The get_params_from_fname function doesn't validate that the number of parts matches the expected parameter count. If a filename has too many or too few underscore-separated parts, this could lead to silent errors where parameters are missing or incorrectly parsed. Consider adding validation to ensure the parts list has the expected length.

glum_benchmarks/util.py:289
  • The parse_value function will raise a ValueError if the value cannot be converted to the expected type (e.g., int or float). This error is not handled and could cause the benchmark analysis to fail. Consider adding try-except error handling to provide more informative error messages when parsing fails.


Collaborator

stanmart left a comment

Super promising, and much more user-friendly than the old benchmark suite. Very good job!

result = L(
    dat,
    distribution=P.distribution,
    alpha=params.regularization_strength,
Contributor

MatthiasSchmidtblaicherQC commented Feb 3, 2026

I find the terminology with regularization a bit confusing:

  • regularization_strength is per-observation alpha for unweighted data.
  • reg_multiplier is the inverse weight.
  • alpha is the lower level name for regularization_strength.

I think that the following names would be clearer:

  • alpha is set in the config, noting in the documentation that this is per-observation alpha for unweighted data.
  • Instead of reg_multiplier, we can just pass a weight column to L and perform the reg_multiplier computation internally.

Contributor Author

DavidEiglspergerQC replied

That makes a lot of sense; the old approach was indeed overcomplicated.

The new process is as follows:

  • alpha is set in the config
  • execute_problem_library adjusts alpha once if sample_weight exists, before it is passed to L, so we no longer need to do this within each benchmarking function (a sketch of this adjustment follows below)
  • reg_multiplier and regularization_strength are completely omitted as we don't need them anymore (terminology aligned to use alpha)
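
For illustration, a sketch of the kind of adjustment described; the scaling by the mean sample weight is an assumption, and the actual formula lives in execute_problem_library:

```python
# Illustrative sketch of the one-time alpha adjustment for weighted problems.
import numpy as np


def adjust_alpha(alpha: float, sample_weight=None) -> float:
    """Rescale the per-observation alpha once before it is passed to L."""
    if sample_weight is None:
        return alpha
    # Assumed scaling by the mean sample weight; the real formula may differ.
    return alpha * float(np.mean(sample_weight))
```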
