Benchmarking against other quantized kernels#458

Open
ncylich wants to merge 30 commits into main from int4-benchmark
Conversation


@ncylich ncylich commented Feb 26, 2026

Summary

  • Adds a unified matmul benchmark suite (tests/bench/) that compares Cactus INT8/INT4 kernels against 6 other inference frameworks (GGML, MLX, MLC-LLM, LiteRT, ONNX Runtime, ExecuTorch) on identical 1024x1024 workloads
  • Moves third-party dependencies from third_party/ to ../third_party/ (outside the repo root) to keep the repository clean
  • Adds missing PARAKEET model type, num_mel_bins, encoder_hidden_act, pad_token_id, and hann_periodic fields to engine config to fix build on this branch

Benchmark architecture

Each backend implements a single run_kernel(M, weights, activations, act_int8, act_scales, output, reference) function pointer. When output/reference are null (the hot path), the kernel runs with zero capture overhead. When non-null (called once for accuracy), the kernel writes fp32 output and an optional dequantized reference for NRMSE checking.
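As a sketch of this interface (illustrative only; the exact signature lives in tests/bench/bench_driver.h, and `RunKernelFn`, `Backend`, and `noop_kernel` here are hypothetical names):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the per-backend entry point described above.
// When out/ref are null, the kernel runs the timed hot path with zero
// capture overhead; when non-null (called once per backend), it writes
// fp32 output plus an optional dequantized reference for NRMSE checking.
using RunKernelFn = void (*)(size_t M,
                             const void* weights,
                             const float* activations,
                             const int8_t* act_int8,
                             const float* act_scales,
                             float* out,   // null on the timed path
                             float* ref);  // null on the timed path

struct Backend {
    const char* name;
    RunKernelFn run;
};

// Trivial stand-in kernel showing the null-check convention.
static void noop_kernel(size_t M, const void*, const float*,
                        const int8_t*, const float*, float* out, float* ref) {
    if (out) for (size_t i = 0; i < M; ++i) out[i] = 0.0f;
    if (ref) for (size_t i = 0; i < M; ++i) ref[i] = 0.0f;
}
```

The single-function-pointer design keeps the driver backend-agnostic: registering a framework is just filling in one `Backend` entry.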

The timed loop cycles through 64 distinct weight matrices per iteration to force L2 cache misses on every call, matching real inference where each transformer layer has unique weights. On the M4 Pro (16 MB L2), 64 matrices x ~1 MB each = 64 MB, exceeding both L2 and SLC.
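The cache-defeating loop can be sketched as follows (a minimal illustration; `time_kernel` and `kNumMats` are assumed names, not the PR's actual code):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sketch of the timed loop described above: rotating through 64
// distinct weight buffers per iteration means each call sees cold
// weights, mimicking real inference where every transformer layer
// has its own matrix.
constexpr size_t kNumMats = 64;

template <typename Kernel, typename WeightSet>
double time_kernel(Kernel&& run, const WeightSet& weights, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it)
        run(weights[it % kNumMats]);   // cycle matrices -> L2 misses
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}
```

Timing a single hot matrix instead would measure an unrealistically warm cache and overstate every backend's throughput.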

Two matmul sizes are tested: 1x1024x1024 (GEMV, single-token decode) and 1024x1024x1024 (GEMM, batched/prefill).

Backends (11 total)

| Framework | Backends | Build flag |
| --- | --- | --- |
| Cactus | `cactus_int8`, `cactus_int4` | always on |
| GGML | `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, `ggml_q8_0_graph` | `-DWITH_GGML=ON` |
| MLX | `mlx_q4_cpu`, `mlx_q8_cpu`, `mlx_q4_gpu`, `mlx_q8_gpu` | `-DWITH_MLX=ON` |
| MLC-LLM | `mlc_int4`, `mlc_int8` | `-DWITH_MLC=ON` |
| LiteRT | `litert_neon`, `ruy`, `litert_4bit_neon` | `-DWITH_LITERT=ON` |
| ONNX Runtime | `onnxrt_int8`, `onnxrt_int4` | `-DWITH_ONNXRT=ON` |
| ExecuTorch | `executorch_int8`, `executorch_int4` | `-DWITH_EXECUTORCH=ON` |

Key results (M4 Pro, 14 cores)

GEMV (M=1): Cactus INT8 at 32.5 us is 2nd only to LiteRT NEON (23.6 us), beating GGML (57 us), MLX (140 us), and ONNX Runtime (587 us). Cactus INT4 at 33.6 us is 2nd to LiteRT 4bit (10.8 us).

GEMM (M=1024): Cactus INT8 at 828 us (2593 GOPS) is 2nd to MLX (606 us, using Accelerate/AMX), beating Ruy (1348 us), ONNX Runtime (1616 us), and GGML (3269 us). Cactus INT4 at 1192 us is again 2nd to MLX (595 us).
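The GOPS figure follows from counting 2·M·N·K operations per dense GEMM (one multiply plus one add per MAC). A quick arithmetic check (`gemm_gops` is an illustrative helper, not part of the suite):

```cpp
// Arithmetic behind the quoted throughput: a dense M x N x K GEMM
// performs 2*M*N*K operations (one multiply + one add per MAC).
inline double gemm_gops(double M, double N, double K, double micros) {
    const double ops = 2.0 * M * N * K;
    return ops / (micros * 1e-6) / 1e9;  // ops per second, in GOPS
}
```

Plugging in the table's numbers, `gemm_gops(1024, 1024, 1024, 828.0)` lands at roughly 2594 GOPS, consistent with the 2593 GOPS quoted above.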

Changes

  • tests/bench/ — new benchmark suite (matmul_bench.cpp, bench_driver, bench_common, 7 backend files, README with full results)
  • tests/CMakeLists.txt — build rules for benchmark executable and optional framework backends, updated third_party paths to ../../third_party/
  • cactus/engine/engine.h — added PARAKEET enum value, num_mel_bins, encoder_hidden_act, pad_token_id, hann_periodic fields (fixes build)

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…e still keeping core looping/cache flushing logic. Properly implemented/fixed MLX. Optimized thread counts.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Copilot AI review requested due to automatic review settings February 26, 2026 09:29
Signed-off-by: Noah Cylich <noahcylich@gmail.com>

Copilot AI left a comment


Pull request overview

This PR adds comprehensive benchmarking infrastructure to compare Cactus's quantized matrix multiplication kernels (INT4/INT8) against other popular inference frameworks including GGML, LiteRT (TensorFlow Lite), MLX, MLC-LLM, ONNX Runtime, and ExecuTorch/XNNPACK. The benchmark suite measures performance and accuracy on standard matrix sizes (1x1024x1024 for GEMV and 1024x1024x1024 for GEMM), enabling objective comparisons of CPU-based quantized inference performance.

Changes:

  • Added modular benchmark driver infrastructure with backend registration system
  • Implemented 7+ backend adapters for external frameworks with optional CMake integration
  • Added shell script for automated framework detection and build configuration

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 18 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `tests/run_benchmark.sh` | Shell script for building and running benchmarks with automatic framework detection |
| `tests/bench/matmul_bench.cpp` | Main benchmark entry point that orchestrates test execution |
| `tests/bench/bench_driver.{h,cpp}` | Core benchmark driver with backend registry and execution logic |
| `tests/bench/bench_common.{h,cpp}` | Shared quantization utilities and accuracy checking functions |
| `tests/bench/backend_*.cpp` | Backend implementations for Cactus, GGML, LiteRT, MLX, MLC, ONNX Runtime, and ExecuTorch |
| `tests/bench/README.md` | Comprehensive documentation with setup instructions and benchmark results |
| `tests/CMakeLists.txt` | CMake configuration with conditional framework integration |
| `cactus/engine/engine.h` | Model configuration updates (unrelated to benchmarking) |
Comments suppressed due to low confidence (1)

tests/run_benchmark.sh:83

  • The executable path references test_matmul_bench but the CMakeLists.txt defines the executable as matmul_bench (line 241). This will cause the script to fail when attempting to run the benchmark. The path should be "$BUILD_DIR/matmul_bench".


```bash
cmake .. -DWITH_LITERT=ON
```

Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.

Copilot AI Feb 26, 2026


The documentation mentions backend names ruy_mc and ruy_1c that are not registered in backend_litert.cpp. The actual registered backends are litert_neon, ruy, and litert_4bit_neon. The documentation should be updated to reflect the actual backend names.

Suggested change
Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.
Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy`, and `litert_4bit_neon` backends.

## Quick Start (Cactus-only, no third-party deps)

```bash
cactus build
```

Copilot AI Feb 26, 2026


The Quick Start instructions reference a "cactus build" command which is not a standard command in the repository. The run_benchmark.sh script sources "./setup" instead (line 24). The documentation should be consistent with the actual build process, or explain what "cactus build" refers to.

Suggested change
cactus build
./setup

Comment on lines +701 to +702
```cpp
float preemphasis = 0.0f;
bool hann_periodic = true;
```

Copilot AI Feb 26, 2026


The addition of preemphasis and hann_periodic fields to SpectrogramConfig appears unrelated to the benchmarking PR. These audio processing configuration changes should be in a separate PR.

Comment on lines +8 to +28
```cpp
static void quantize_per_group(const std::vector<float>& src, size_t N, size_t K,
                               std::vector<int8_t>& dst, std::vector<float>& scales,
                               int qmax, int qmin) {
    const size_t num_groups = K / kGroupSize;
    dst.resize(N * K);
    scales.resize(N * num_groups);
    for (size_t n = 0; n < N; ++n) {
        for (size_t g = 0; g < num_groups; ++g) {
            float max_abs = 0.0f;
            const size_t base = n * K + g * kGroupSize;
            for (size_t k = 0; k < kGroupSize; ++k)
                max_abs = std::max(max_abs, std::abs(src[base + k]));
            float scale = std::max(max_abs / static_cast<float>(qmax), 1e-10f);
            scales[n * num_groups + g] = scale;
            for (size_t k = 0; k < kGroupSize; ++k) {
                int q = static_cast<int>(std::round(src[base + k] / scale));
                dst[base + k] = static_cast<int8_t>(std::max(qmin, std::min(qmax, q)));
            }
        }
    }
}
```

Copilot AI Feb 26, 2026


Missing validation: The function assumes K is evenly divisible by kGroupSize (32). If K is not a multiple of kGroupSize, the function will access out-of-bounds memory or produce incorrect results. The code should either validate this precondition or handle partial groups.
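A minimal guard along the lines this comment suggests (a sketch; `check_group_divisibility` is an illustrative name, and the PR may resolve the issue differently):

```cpp
#include <cstddef>
#include <stdexcept>

constexpr size_t kGroupSize = 32;  // matches the group size used above

// Validate the divisibility precondition before quantizing; failing
// loudly here turns silent out-of-bounds access into a clear error.
inline void check_group_divisibility(size_t K) {
    if (K % kGroupSize != 0)
        throw std::invalid_argument("K must be a multiple of kGroupSize");
}
```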

Comment on lines +81 to +91
```cpp
std::vector<uint8_t> pack_int4_pairs(const std::vector<int8_t>& interleaved) {
    std::vector<uint8_t> packed(interleaved.size() / 2);
    for (size_t i = 0; i < interleaved.size(); i += 32) {
        for (size_t j = 0; j < 16; ++j) {
            const uint8_t lo = static_cast<uint8_t>(interleaved[i + j] & 0x0F);
            const uint8_t hi = static_cast<uint8_t>((interleaved[i + 16 + j] & 0x0F) << 4);
            packed[i / 2 + j] = lo | hi;
        }
    }
    return packed;
}
```

Copilot AI Feb 26, 2026


Missing validation: The function assumes interleaved.size() is evenly divisible by 32. If the size is not a multiple of 32, the loop will access out-of-bounds memory when i + 16 + j exceeds the vector size. The code should either validate this precondition or handle the remainder properly.
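One way to add the guard this comment asks for (illustrative; `pack_int4_pairs_checked` is a hypothetical name, not code from the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Guarded variant of the 32-element int4 packing quoted above: rejects
// sizes that are not a multiple of 32 instead of reading past the end
// of the vector.
inline std::vector<uint8_t> pack_int4_pairs_checked(const std::vector<int8_t>& v) {
    assert(v.size() % 32 == 0 && "input must be a multiple of 32 nibbles");
    std::vector<uint8_t> packed(v.size() / 2);
    for (size_t i = 0; i < v.size(); i += 32)
        for (size_t j = 0; j < 16; ++j)
            packed[i / 2 + j] =
                static_cast<uint8_t>((v[i + j] & 0x0F) |
                                     ((v[i + 16 + j] & 0x0F) << 4));
    return packed;
}
```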

```cmake
option(WITH_ONNXRT "Build with ONNX Runtime for matmul benchmarks" OFF)

if(WITH_LITERT)
  set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
```

Copilot AI Feb 26, 2026


Missing directory validation: The LITERT_DIR path is constructed without checking if it exists. If the directory doesn't exist, the build will fail with cryptic errors. Consider adding a check like: `if(NOT EXISTS "${LITERT_DIR}") message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}") endif()`.

Suggested change
set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
if(NOT EXISTS "${LITERT_DIR}")
message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}")
endif()

Comment on lines +172 to +174
```cpp
tflite::optimized_4bit::api::Prepack(
    w->prepacked, litert_source.data(),
    w->lhs_layout_rows, w->lhs_layout_cols,
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->prepacked will be set to nullptr. This will likely cause a crash when passed to the Prepack function on line 176. The code should check the return value and handle allocation failures appropriately.

Comment on lines +183 to +184
```cpp
    w->lhs_layout_rows, w->lhs_layout_cols,
    static_cast<int>(N), static_cast<int>(K),
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->ref_prepacked will be set to nullptr. This will likely cause a crash when passed to ReferencePrepack on line 185. The code should check the return value and handle allocation failures appropriately.

```cpp
float default_max_tps = -1.0f;
float default_cloud_handoff_threshold = 0.0f;

uint32_t num_mel_bins = 80;
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of num_mel_bins to the Config struct appears unrelated to the benchmarking infrastructure described in the PR title and description. This change should be in a separate PR focused on model configuration updates.

```bash
cmake .. -DWITH_GGML=ON
```

Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation mentions backend names ggml_q4_0_graph and ggml_q8_0_graph that are commented out in the actual code (backend_ggml.cpp lines 292-299). These backends are not actually registered or available, which contradicts the documentation. Either remove these from the documentation or uncomment the registration code.

Suggested change
Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.
Builds GGML from source. Enables `ggml_q4_0` and `ggml_q8_0` backends.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
… ggml, also included dequantization step in litert

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
also added proper threading setup (to enable max and default use cases)

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ct head sizes of 64.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ormance

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…eaved format

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
… and interleave to optimize attn decode hybrid kernel

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Add CSV, PNG, log files, corpus/, kv_profile_results/, tests/results/,
and tests/analysis/ to prevent benchmark output from being tracked.

Signed-off-by: Noah Cylich <noah@desertai.io>
Evaluates model perplexity across KV cache configurations (FP16, INT8,
INT4, K8V4) with cached and uncached scoring modes. Supports configurable
context windows, chunk sizes, and per-position accuracy bucketing.

Signed-off-by: Noah Cylich <noah@desertai.io>
ncylich added 5 commits March 2, 2026 14:24
Structured benchmark suite measuring task success rate alongside decode
performance. Covers code generation (20 problems), math reasoning (15
GSM8K-style), and instruction following (10 problems). Supports baseline
save/compare modes for regression detection.

Signed-off-by: Noah Cylich <noah@desertai.io>
Three experiment scripts for INT4 KV cache analysis:
- exp_0c: keys vs values sensitivity (K4V8 vs K8V4 vs K4V4)
- exp_1d: long-context error accumulation across decode steps
- exp_1e: per-layer sensitivity (INT4 one layer at a time)

Signed-off-by: Noah Cylich <noah@desertai.io>
Per-token timing benchmark (bench_e2e_per_token.py) using streaming
callback for precise per-token measurements. Branch comparison script
(bench_e2e_decode.sh) that builds and benchmarks current branch vs main,
outputting CSV with decode_tps, prefill_tps, and latency metrics.

Signed-off-by: Noah Cylich <noah@desertai.io>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>