Fix/update benchmarks #968

Open

DavidEiglspergerQC wants to merge 20 commits into main from fix/update_benchmarks

Conversation

Contributor

DavidEiglspergerQC commented Jan 22, 2026

Disclaimer:
I know this is a fairly large PR, so if anyone has any questions about it or would like me to explain something in more detail, I am more than happy to do so (also during a code walkthrough meeting).

Summary of changes:

  • Added new libraries that we are benchmarking against
  • Simplified benchmarking by creating a configurable script that can easily be run to update the benchmarks
  • Moved benchmarking out of the src folder
  • Removed old, unused code

Motivation:
The goal is to benchmark against more recent libraries so that the outdated results can be refreshed. Another goal is to simplify the overly complicated CLI setup, which hasn't been used recently, making it easier to generate new benchmark results.

Approach:
The high-level approach I opted for involved separating the two issues that the old CLI/benchmarking system addressed:

  1. Benchmarking against similar libraries:
    • This one does not require a complex CLI in which we test many different problems (e.g. offsets and weights), as it only serves to compare glum's main functionality, and we don't report benchmarking results for hundreds of problems anyway
    • This issue is addressed in this PR by reworking the way we conduct benchmarking
  2. Monitoring glum's runtime performance:
    • Independently of the benchmarking, we should monitor glum's performance across changes and report the results automatically, so that we can see the impact of a change on performance directly
    • This has not yet been implemented and will be done in a separate PR (exactly how is still to be decided)
    • For this one, we could consider using the entire suite of problems (e.g. offsets and weights) to ensure that glum remains performant for all these use cases
    • As this will be automated into the CI, we don't need a complex CLI for this either

These two points should use the same dataset generation and problem definition, but in different ways (e.g. a simple script vs. automated runtime evaluation).

Implementation:

Updated libraries:
As the libraries we were comparing with were quite outdated, I added some more up-to-date ones: celer, scikit-learn and skglm.

All problems supported by the relevant libraries have been implemented. However, none of them can be run on the full set of problems as some features are missing.

Updated problems:
As there were far too many problems to run the benchmarking on, I decided to focus on the most important ones and delete those that only provided limited new insights. The following were useful during development, but don't provide any additional insight into runtime comparisons as they don't affect the underlying computational costs and just complicate things:

  • Cross-validation problems
  • Weighted problems
  • Offset problems

I therefore adapted bench_glum.py, bench_h2o.py and bench_liblinear.py so that they no longer cover those cases.

New structure of the code:
The benchmarking code was previously loaded into the src/glum_benchmarks folder, which made it part of the main package. However, I have now restructured it to a top-level glum_benchmarks folder, as this is not distributed. Benchmarks are a development and documentation tool and should not be part of the Glum library itself.

New benchmarking approach:
The CLI-based approach is replaced with a single self-contained script that runs all three benchmarking steps.
The benchmarks can be run by simply executing pixi run -e benchmark run-benchmarks.

Configuration:
All settings are defined in config.yaml:

  • Which steps to run
  • Where to store the results (different runs can be separated by using a different RUN_NAME)
  • Which libraries/problems to benchmark
  • Which settings to use for the benchmarks (e.g. number of threads, rows, iterations)
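
For orientation, a minimal sketch of how run_benchmarks.py might read these settings. Everything except RUN_NAME and the three step flags is a hypothetical key, not the actual schema:

```python
# Minimal sketch of loading the benchmark configuration (assumed schema).
from pathlib import Path

import yaml

config = yaml.safe_load(Path("glum_benchmarks/config.yaml").read_text())

run_name = config["RUN_NAME"]  # separates the outputs of different runs
steps_enabled = {
    "run": config["RUN_BENCHMARKS"],
    "analyze": config["ANALYZE_RESULTS"],
    "plot": config["GENERATE_PLOTS"],
}
results_dir = Path("results") / run_name  # results/{RUN_NAME}/...
```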

Workflow steps:

The individual steps can be toggled with boolean flags. For example, one can regenerate the figures from an existing CSV file without running the benchmarks again.

RUN_BENCHMARKS:
Executes each library on each configured problem combination, times the model fitting, and saves detailed results.

  • Input: Configuration variables (e.g. problems, libraries and benchmarking settings)
  • Output: Pickle files in the results/{RUN_NAME}/pickles directory, with one file for each library/problem combination
  • The specific benchmarks are run and stored in pickles. This is done in the same way as before with the CLI, as we reuse a slightly adapted version of the old functions in cli_run.py
  • Dependencies: Requires a benchmark environment with all library dependencies installed
  • Note: libraries that do not support a given distribution/regularization combination are skipped with a warning
  • Re-running will overwrite existing pickle files with the same parameters
  • The config used to generate the results is also saved in the RUN_NAME folder, so the results can be reproduced
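
For illustration, a minimal sketch of what this step does per library/problem combination; the helper name run_one and the result keys are hypothetical, as the actual code reuses the adapted cli_run.py functions:

```python
# Minimal sketch of the run step: time one library on one problem and pickle
# the result (one pickle per library/problem combination).
import pickle
import time
from pathlib import Path


def run_one(fit_library, problem_data, out_dir: Path, tag: str) -> dict:
    start = time.perf_counter()
    model = fit_library(problem_data)  # fit the benchmarked library
    runtime = time.perf_counter() - start

    result = {"runtime": runtime, "n_iter": getattr(model, "n_iter_", None)}
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{tag}.pickle", "wb") as f:
        pickle.dump(result, f)
    return result
```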

ANALYZE_RESULTS:
Loads all pickle files from the run folder, computes summary metrics, prints a comparison table, and exports to CSV.

  • Input: Pickle files from results/{RUN_NAME}/pickles/
  • Output: results/{RUN_NAME}/results.csv
  • Computed metrics and printed results: exactly the same as before, as we reuse an adapted version of the old functions from cli_analyze.py
  • Dependencies: Requires existing pickle files in the folder (can run without RUN_BENCHMARKS if pickles already exist)
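
A minimal sketch of this step, assuming each pickle holds a flat dict of metrics; the column names and helper name are illustrative, not the script's actual API:

```python
# Minimal sketch of the analyze step: collect every pickle in the run folder
# into one table and export it as results/{RUN_NAME}/results.csv.
import pickle
from pathlib import Path

import pandas as pd


def analyze_run(run_dir: Path) -> pd.DataFrame:
    rows = []
    for path in sorted(run_dir.glob("pickles/*.pickle")):
        with open(path, "rb") as f:
            result = pickle.load(f)
        rows.append({"name": path.stem, **result})
    df = pd.DataFrame(rows)
    df.to_csv(run_dir / "results.csv", index=False)
    return df
```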

GENERATE_PLOTS:
Reads the CSV and generates bar chart comparisons, one per dataset/regularization combination.

  • Input: results/{RUN_NAME}/results.csv
  • Output: PNG files in results/{RUN_NAME}/figures/
  • Generates two plots per combination: raw runtime (seconds) and normalized runtime (relative to glum = 1.0)
  • Bars exceeding 10× the fastest runtime are clipped with the actual value annotated
  • If a library doesn't support a certain problem, this is indicated by hatched bars with an "N/A" label
  • If a library doesn't converge on a problem, we indicate this by hatching the bar and putting a red "NC" on top; "not converged" here means hitting max_iter
  • Dependencies: Only requires the CSV file, can run independently of the other steps
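
A minimal sketch of the clipping logic, assuming a DataFrame with hypothetical library and runtime columns; the hatching for "N/A" and "NC" bars is omitted here:

```python
# Minimal sketch of the plotting step: one bar per library, bars above 10x the
# fastest runtime are clipped and annotated with their actual value.
import matplotlib.pyplot as plt
import pandas as pd


def plot_runtimes(df: pd.DataFrame, out_path: str) -> None:
    fastest = df["runtime"].min()
    clip = 10 * fastest
    heights = df["runtime"].clip(upper=clip)

    fig, ax = plt.subplots()
    bars = ax.bar(df["library"], heights)
    for bar, runtime in zip(bars, df["runtime"]):
        if runtime > clip:  # annotate clipped bars with the true value
            ax.annotate(
                f"{runtime:.1f}s",
                (bar.get_x() + bar.get_width() / 2, clip),
                ha="center",
                va="bottom",
            )
    ax.set_ylabel("runtime (s)")
    fig.savefig(out_path)
    plt.close(fig)
```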

Git Tracking:
Only results/docs/results.csv and config.yaml are tracked. Pickles, figures, and other run folders are gitignored. The reason is that only the CSV file and the config used to generate the documentation figures are pushed; all other files, e.g. those used for testing or iteration, are not tracked.

Usage:
Once we have agreed on how to set the parameters in the configuration section, the idea is that this script can simply be run periodically to obtain new figures for updating the documentation.
To update the benchmarks, run the benchmarking script with the agreed configuration, update the documentation with the generated figures, and push the CSV so that everyone can recreate it.

Tests:
I also added some basic tests in test_benchmarks.py (those are AI-generated!!).

Cleaning up the code:
I deleted some old leftover code snippets that were used during development and are no longer needed due to the new way of running benchmarks.

  • [benchmark_data.csv, benchmark_figure.py] as we now handle all of this in run_benchmarks.py
  • [run_intermediate_benchmarks.sh, run_narrow_benchmarks.sh, run_real_benchmarks.sh, run_wide_benchmarks.sh] as those just called CLI commands that are now replaced by run_benchmarks.py
  • [benchmark_dense_sandwich.py, benchmark_sparse_sandwich.py] which were used during development but are no longer relevant
  • [cli_analyze.py, cli_run.py, test_cli.py] as the CLI is replaced by run_benchmarks.py. The relevant functions were moved to util.py so that they can be reused in the new benchmarking script

Open topics:
There are a few things that still need to be tackled:

  • Once we have agreed on the settings (e.g. whether or not to scale) and the problems we want to run the benchmarks on, we will have to run the benchmarks and update the documentation
  • Automated replacement of the doc scripts (controlled by a bool)
  • Add a case with an (almost) equal number of rows and columns
  • Allow the yaml to specify more flexible combinations instead of always taking the cross-product (the current behaviour is sketched after this list)
  • We have to update the section in the README and documentation explaining how to run the benchmarks
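
For reference, a sketch of the cross-product behaviour mentioned above; the library and problem names are made up, and the real config keys may differ:

```python
# Illustrative sketch: every configured library is paired with every problem.
from itertools import product

libraries = ["glum", "sklearn", "skglm"]  # hypothetical config values
problems = ["narrow-insurance-l2", "wide-insurance-lasso"]

combinations = list(product(libraries, problems))  # full cross-product
# A more flexible schema would let config.yaml list explicit (library, problem)
# pairs instead of always expanding this product.
```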

Base automatically changed from fix/housing_data_generation to main January 22, 2026 13:42
Copilot AI left a comment

Pull request overview

This PR refactors the benchmarking infrastructure for the glum library to modernize it and make it easier to maintain and use. The goal is to update benchmarks against newer libraries and simplify the overly complex CLI-based system.

Changes:

  • Moved benchmarking code from src/glum_benchmarks to a top-level glum_benchmarks directory (not part of the distributed package)
  • Replaced the CLI-based approach with a single self-contained run_benchmarks.py script with configuration at the top
  • Added new benchmark libraries: Celer, scikit-learn, and skglm
  • Removed old benchmark scripts, CLI code, and unused development code
  • Simplified problem definitions by removing CV, weights, and offset variants
  • Added basic tests for the benchmark infrastructure

Reviewed changes

Copilot reviewed 36 out of 42 changed files in this pull request and generated 6 comments.

Summary per file:

  • glum_benchmarks/run_benchmarks.py: New main benchmarking script with configurable workflow (run, analyze, plot)
  • glum_benchmarks/util.py: Refactored utilities with moved functions from CLI, added standardization
  • glum_benchmarks/problems.py: Updated problem definitions, removed single_precision support
  • glum_benchmarks/libraries/: New library benchmark implementations for celer, sklearn, skglm
  • glum_benchmarks/data/: New data generation modules for housing and insurance datasets
  • glum_benchmarks/tests/test_benchmarks.py: New basic tests for benchmark infrastructure
  • glum_benchmarks/README.md: New comprehensive documentation for running benchmarks
  • setup.py: Removed glum_benchmarks from package distribution and CLI entry points
  • pixi.toml: Updated benchmark environment configuration and tasks
  • .gitignore: Updated to track only results/docs/results.csv
  • Tests in tests/glm/: Updated imports to use new module structure
  • Old files: Deleted CLI code, old scripts, and unused development code
Comments suppressed due to low confidence (3)

glum_benchmarks/util.py:359
  • The import of diags from scipy.sparse is placed inside the function. While this works, it's unconventional and could lead to performance issues if this function is called frequently in a loop. Consider moving this import to the top of the file with the other scipy.sparse imports.

glum_benchmarks/util.py:300
  • The get_params_from_fname function doesn't validate that the number of parts matches the expected parameter count. If a filename has too many or too few underscore-separated parts, this could lead to silent errors where parameters are missing or incorrectly parsed. Consider adding validation to ensure the parts list has the expected length.

glum_benchmarks/util.py:289
  • The parse_value function will raise a ValueError if the value cannot be converted to the expected type (e.g., int or float). This error is not handled and could cause the benchmark analysis to fail. Consider adding try-except error handling to provide more informative error messages when parsing fails.


Collaborator

stanmart left a comment

Super promising, and much more user-friendly than the old benchmark suite. Very good job!

result = L(
    dat,
    distribution=P.distribution,
    alpha=params.regularization_strength,
Contributor

MatthiasSchmidtblaicherQC commented Feb 3, 2026

I find the terminology with regularization a bit confusing:

  • regularization_strength is per-observation alpha for unweighted data.
  • reg_multiplier is the inverse weight.
  • alpha is the lower level name for regularization_strength.

I think that the following names would be clearer:

  • alpha is set in the config, noting in the documentation that this is per-observation alpha for unweighted data.
  • Instead of reg_multiplier, we can just pass a weight column to L and perform the reg_multiplier computation internally.

Contributor Author

DavidEiglspergerQC replied

That makes a lot of sense; the old approach was indeed overcomplicated.

The new process is as follows:

  • alpha is set in the config
  • execute_problem_library adjusts alpha once if sample_weight exists, before it is passed to L, so we no longer need to do this within each benchmarking function (a sketch of this adjustment follows below)
  • reg_multiplier and regularization_strength are completely omitted as we don't need them anymore (terminology aligned to use alpha)
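
For illustration, a sketch of the kind of adjustment described; the scaling by the mean sample weight is an assumption, and the actual formula lives in execute_problem_library:

```python
# Illustrative sketch of the one-time alpha adjustment for weighted problems.
import numpy as np


def adjust_alpha(alpha: float, sample_weight=None) -> float:
    """Rescale the per-observation alpha once before it is passed to L."""
    if sample_weight is None:
        return alpha
    # Assumed scaling by the mean sample weight; the real formula may differ.
    return alpha * float(np.mean(sample_weight))
```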
