Benchmarking against other quantized kernels#458

Open
ncylich wants to merge 30 commits into main from int4-benchmark
Conversation


@ncylich ncylich commented Feb 26, 2026

Summary

  • Adds a unified matmul benchmark suite (tests/bench/) that compares Cactus INT8/INT4 kernels against 6 other inference frameworks (GGML, MLX, MLC-LLM, LiteRT, ONNX Runtime, ExecuTorch) on identical 1024x1024 workloads
  • Moves third-party dependencies from third_party/ to ../third_party/ (outside the repo root) to keep the repository clean
  • Adds missing PARAKEET model type, num_mel_bins, encoder_hidden_act, pad_token_id, and hann_periodic fields to engine config to fix build on this branch

Benchmark architecture

Each backend implements a single run_kernel(M, weights, activations, act_int8, act_scales, output, reference) function pointer. When output/reference are null (the hot path), the kernel runs with zero capture overhead. When non-null (called once for accuracy), the kernel writes fp32 output and an optional dequantized reference for NRMSE checking.
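As a sketch of this interface (illustrative only; the exact signature lives in tests/bench/bench_driver.h, and `RunKernelFn`, `Backend`, and `noop_kernel` here are hypothetical names):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the per-backend entry point described above.
// When out/ref are null, the kernel runs the timed hot path with zero
// capture overhead; when non-null (called once per backend), it writes
// fp32 output plus an optional dequantized reference for NRMSE checking.
using RunKernelFn = void (*)(size_t M,
                             const void* weights,
                             const float* activations,
                             const int8_t* act_int8,
                             const float* act_scales,
                             float* out,   // null on the timed path
                             float* ref);  // null on the timed path

struct Backend {
    const char* name;
    RunKernelFn run;
};

// Trivial stand-in kernel showing the null-check convention.
static void noop_kernel(size_t M, const void*, const float*,
                        const int8_t*, const float*, float* out, float* ref) {
    if (out) for (size_t i = 0; i < M; ++i) out[i] = 0.0f;
    if (ref) for (size_t i = 0; i < M; ++i) ref[i] = 0.0f;
}
```

The single-function-pointer design keeps the driver backend-agnostic: registering a framework is just filling in one `Backend` entry.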

The timed loop cycles through 64 distinct weight matrices per iteration to force L2 cache misses on every call, matching real inference where each transformer layer has unique weights. On the M4 Pro (16 MB L2), 64 matrices x ~1 MB each = 64 MB, exceeding both L2 and SLC.
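The cache-defeating loop can be sketched as follows (a minimal illustration; `time_kernel` and `kNumMats` are assumed names, not the PR's actual code):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sketch of the timed loop described above: rotating through 64
// distinct weight buffers per iteration means each call sees cold
// weights, mimicking real inference where every transformer layer
// has its own matrix.
constexpr size_t kNumMats = 64;

template <typename Kernel, typename WeightSet>
double time_kernel(Kernel&& run, const WeightSet& weights, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it)
        run(weights[it % kNumMats]);   // cycle matrices -> L2 misses
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}
```

Timing a single hot matrix instead would measure an unrealistically warm cache and overstate every backend's throughput.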

Two matmul sizes are tested: 1x1024x1024 (GEMV, single-token decode) and 1024x1024x1024 (GEMM, batched/prefill).

Backends (11 total)

| Framework | Backends | Build flag |
| --- | --- | --- |
| Cactus | `cactus_int8`, `cactus_int4` | always on |
| GGML | `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, `ggml_q8_0_graph` | `-DWITH_GGML=ON` |
| MLX | `mlx_q4_cpu`, `mlx_q8_cpu`, `mlx_q4_gpu`, `mlx_q8_gpu` | `-DWITH_MLX=ON` |
| MLC-LLM | `mlc_int4`, `mlc_int8` | `-DWITH_MLC=ON` |
| LiteRT | `litert_neon`, `ruy`, `litert_4bit_neon` | `-DWITH_LITERT=ON` |
| ONNX Runtime | `onnxrt_int8`, `onnxrt_int4` | `-DWITH_ONNXRT=ON` |
| ExecuTorch | `executorch_int8`, `executorch_int4` | `-DWITH_EXECUTORCH=ON` |

Key results (M4 Pro, 14 cores)

GEMV (M=1): Cactus INT8 at 32.5 us is 2nd only to LiteRT NEON (23.6 us), beating GGML (57 us), MLX (140 us), and ONNX Runtime (587 us). Cactus INT4 at 33.6 us is 2nd to LiteRT 4bit (10.8 us).

GEMM (M=1024): Cactus INT8 at 828 us (2593 GOPS) is 2nd to MLX (606 us, using Accelerate/AMX), beating Ruy (1348 us), ONNX Runtime (1616 us), and GGML (3269 us). Cactus INT4 at 1192 us is again 2nd to MLX (595 us).
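The GOPS figure follows from counting 2·M·N·K operations per dense GEMM (one multiply plus one add per MAC). A quick arithmetic check (`gemm_gops` is an illustrative helper, not part of the suite):

```cpp
// Arithmetic behind the quoted throughput: a dense M x N x K GEMM
// performs 2*M*N*K operations (one multiply + one add per MAC).
inline double gemm_gops(double M, double N, double K, double micros) {
    const double ops = 2.0 * M * N * K;
    return ops / (micros * 1e-6) / 1e9;  // ops per second, in GOPS
}
```

Plugging in the table's numbers, `gemm_gops(1024, 1024, 1024, 828.0)` lands at roughly 2594 GOPS, consistent with the 2593 GOPS quoted above.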

Changes

  • tests/bench/ — new benchmark suite (matmul_bench.cpp, bench_driver, bench_common, 7 backend files, README with full results)
  • tests/CMakeLists.txt — build rules for benchmark executable and optional framework backends, updated third_party paths to ../../third_party/
  • cactus/engine/engine.h — added PARAKEET enum value, num_mel_bins, encoder_hidden_act, pad_token_id, hann_periodic fields (fixes build)

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…e still keeping core looping/cache flushing logic. Properly implemented/fixed MLX. Optimized thread counts.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Copilot AI review requested due to automatic review settings February 26, 2026 09:29
Signed-off-by: Noah Cylich <noahcylich@gmail.com>

Copilot AI left a comment


Pull request overview

This PR adds comprehensive benchmarking infrastructure to compare Cactus's quantized matrix multiplication kernels (INT4/INT8) against other popular inference frameworks including GGML, LiteRT (TensorFlow Lite), MLX, MLC-LLM, ONNX Runtime, and ExecuTorch/XNNPACK. The benchmark suite measures performance and accuracy on standard matrix sizes (1x1024x1024 for GEMV and 1024x1024x1024 for GEMM), enabling objective comparisons of CPU-based quantized inference performance.

Changes:

  • Added modular benchmark driver infrastructure with backend registration system
  • Implemented 7+ backend adapters for external frameworks with optional CMake integration
  • Added shell script for automated framework detection and build configuration

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 18 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `tests/run_benchmark.sh` | Shell script for building and running benchmarks with automatic framework detection |
| `tests/bench/matmul_bench.cpp` | Main benchmark entry point that orchestrates test execution |
| `tests/bench/bench_driver.{h,cpp}` | Core benchmark driver with backend registry and execution logic |
| `tests/bench/bench_common.{h,cpp}` | Shared quantization utilities and accuracy checking functions |
| `tests/bench/backend_*.cpp` | Backend implementations for Cactus, GGML, LiteRT, MLX, MLC, ONNX Runtime, and ExecuTorch |
| `tests/bench/README.md` | Comprehensive documentation with setup instructions and benchmark results |
| `tests/CMakeLists.txt` | CMake configuration with conditional framework integration |
| `cactus/engine/engine.h` | Model configuration updates (unrelated to benchmarking) |
Comments suppressed due to low confidence (1)

tests/run_benchmark.sh:83

  • The executable path references test_matmul_bench but the CMakeLists.txt defines the executable as matmul_bench (line 241). This will cause the script to fail when attempting to run the benchmark. The path should be "$BUILD_DIR/matmul_bench".


```bash
cmake .. -DWITH_LITERT=ON
```

Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.

Copilot AI Feb 26, 2026


The documentation mentions backend names ruy_mc and ruy_1c that are not registered in backend_litert.cpp. The actual registered backends are litert_neon, ruy, and litert_4bit_neon. The documentation should be updated to reflect the actual backend names.

Suggested change
Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.
Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy`, and `litert_4bit_neon` backends.

## Quick Start (Cactus-only, no third-party deps)

```bash
cactus build
```

Copilot AI Feb 26, 2026


The Quick Start instructions reference a "cactus build" command which is not a standard command in the repository. The run_benchmark.sh script sources "./setup" instead (line 24). The documentation should be consistent with the actual build process, or explain what "cactus build" refers to.

Suggested change
cactus build
./setup

Comment on lines +701 to +702
```cpp
float preemphasis = 0.0f;
bool hann_periodic = true;
```

Copilot AI Feb 26, 2026


The addition of preemphasis and hann_periodic fields to SpectrogramConfig appears unrelated to the benchmarking PR. These audio processing configuration changes should be in a separate PR.

Comment on lines +8 to +28
```cpp
static void quantize_per_group(const std::vector<float>& src, size_t N, size_t K,
                               std::vector<int8_t>& dst, std::vector<float>& scales,
                               int qmax, int qmin) {
    const size_t num_groups = K / kGroupSize;
    dst.resize(N * K);
    scales.resize(N * num_groups);
    for (size_t n = 0; n < N; ++n) {
        for (size_t g = 0; g < num_groups; ++g) {
            float max_abs = 0.0f;
            const size_t base = n * K + g * kGroupSize;
            for (size_t k = 0; k < kGroupSize; ++k)
                max_abs = std::max(max_abs, std::abs(src[base + k]));
            float scale = std::max(max_abs / static_cast<float>(qmax), 1e-10f);
            scales[n * num_groups + g] = scale;
            for (size_t k = 0; k < kGroupSize; ++k) {
                int q = static_cast<int>(std::round(src[base + k] / scale));
                dst[base + k] = static_cast<int8_t>(std::max(qmin, std::min(qmax, q)));
            }
        }
    }
}
```

Copilot AI Feb 26, 2026


Missing validation: The function assumes K is evenly divisible by kGroupSize (32). If K is not a multiple of kGroupSize, the function will access out-of-bounds memory or produce incorrect results. The code should either validate this precondition or handle partial groups.
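A minimal guard along the lines this comment suggests (a sketch; `check_group_divisibility` is an illustrative name, and the PR may resolve the issue differently):

```cpp
#include <cstddef>
#include <stdexcept>

constexpr size_t kGroupSize = 32;  // matches the group size used above

// Validate the divisibility precondition before quantizing; failing
// loudly here turns silent out-of-bounds access into a clear error.
inline void check_group_divisibility(size_t K) {
    if (K % kGroupSize != 0)
        throw std::invalid_argument("K must be a multiple of kGroupSize");
}
```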

Comment on lines +81 to +91
```cpp
std::vector<uint8_t> pack_int4_pairs(const std::vector<int8_t>& interleaved) {
    std::vector<uint8_t> packed(interleaved.size() / 2);
    for (size_t i = 0; i < interleaved.size(); i += 32) {
        for (size_t j = 0; j < 16; ++j) {
            const uint8_t lo = static_cast<uint8_t>(interleaved[i + j] & 0x0F);
            const uint8_t hi = static_cast<uint8_t>((interleaved[i + 16 + j] & 0x0F) << 4);
            packed[i / 2 + j] = lo | hi;
        }
    }
    return packed;
}
```

Copilot AI Feb 26, 2026


Missing validation: The function assumes interleaved.size() is evenly divisible by 32. If the size is not a multiple of 32, the loop will access out-of-bounds memory when i + 16 + j exceeds the vector size. The code should either validate this precondition or handle the remainder properly.
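One way to add the guard this comment asks for (illustrative; `pack_int4_pairs_checked` is a hypothetical name, not code from the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Guarded variant of the 32-element int4 packing quoted above: rejects
// sizes that are not a multiple of 32 instead of reading past the end
// of the vector.
inline std::vector<uint8_t> pack_int4_pairs_checked(const std::vector<int8_t>& v) {
    assert(v.size() % 32 == 0 && "input must be a multiple of 32 nibbles");
    std::vector<uint8_t> packed(v.size() / 2);
    for (size_t i = 0; i < v.size(); i += 32)
        for (size_t j = 0; j < 16; ++j)
            packed[i / 2 + j] =
                static_cast<uint8_t>((v[i + j] & 0x0F) |
                                     ((v[i + 16 + j] & 0x0F) << 4));
    return packed;
}
```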

```cmake
option(WITH_ONNXRT "Build with ONNX Runtime for matmul benchmarks" OFF)

if(WITH_LITERT)
  set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
```

Copilot AI Feb 26, 2026


Missing directory validation: The LITERT_DIR path is constructed without checking if it exists. If the directory doesn't exist, the build will fail with cryptic errors. Consider adding a check like: `if(NOT EXISTS "${LITERT_DIR}") message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}") endif()`.

Suggested change
set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
if(NOT EXISTS "${LITERT_DIR}")
message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}")
endif()

Comment on lines +172 to +174
```cpp
tflite::optimized_4bit::api::Prepack(
    w->prepacked, litert_source.data(),
    w->lhs_layout_rows, w->lhs_layout_cols,
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->prepacked will be set to nullptr. This will likely cause a crash when passed to the Prepack function on line 176. The code should check the return value and handle allocation failures appropriately.

Comment on lines +183 to +184
```cpp
    w->lhs_layout_rows, w->lhs_layout_cols,
    static_cast<int>(N), static_cast<int>(K),
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->ref_prepacked will be set to nullptr. This will likely cause a crash when passed to ReferencePrepack on line 185. The code should check the return value and handle allocation failures appropriately.

```cpp
float default_max_tps = -1.0f;
float default_cloud_handoff_threshold = 0.0f;

uint32_t num_mel_bins = 80;
```

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of num_mel_bins to the Config struct appears unrelated to the benchmarking infrastructure described in the PR title and description. This change should be in a separate PR focused on model configuration updates.

```bash
cmake .. -DWITH_GGML=ON
```

Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation mentions backend names ggml_q4_0_graph and ggml_q8_0_graph that are commented out in the actual code (backend_ggml.cpp lines 292-299). These backends are not actually registered or available, which contradicts the documentation. Either remove these from the documentation or uncomment the registration code.

Suggested change
Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.
Builds GGML from source. Enables `ggml_q4_0` and `ggml_q8_0` backends.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
… ggml, also included dequantization step in litert

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
also added proper threading setup (to enable max and default use cases)

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ct head sizes of 64.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ormance

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…eaved format

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
… and interleave to optimize attn decode hybrid kernel

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Add CSV, PNG, log files, corpus/, kv_profile_results/, tests/results/,
and tests/analysis/ to prevent benchmark output from being tracked.

Signed-off-by: Noah Cylich <noah@desertai.io>
Evaluates model perplexity across KV cache configurations (FP16, INT8,
INT4, K8V4) with cached and uncached scoring modes. Supports configurable
context windows, chunk sizes, and per-position accuracy bucketing.

Signed-off-by: Noah Cylich <noah@desertai.io>
ncylich added 5 commits March 2, 2026 14:24
Structured benchmark suite measuring task success rate alongside decode
performance. Covers code generation (20 problems), math reasoning (15
GSM8K-style), and instruction following (10 problems). Supports baseline
save/compare modes for regression detection.

Signed-off-by: Noah Cylich <noah@desertai.io>
Three experiment scripts for INT4 KV cache analysis:
- exp_0c: keys vs values sensitivity (K4V8 vs K8V4 vs K4V4)
- exp_1d: long-context error accumulation across decode steps
- exp_1e: per-layer sensitivity (INT4 one layer at a time)

Signed-off-by: Noah Cylich <noah@desertai.io>
Per-token timing benchmark (bench_e2e_per_token.py) using streaming
callback for precise per-token measurements. Branch comparison script
(bench_e2e_decode.sh) that builds and benchmarks current branch vs main,
outputting CSV with decode_tps, prefill_tps, and latency metrics.

Signed-off-by: Noah Cylich <noah@desertai.io>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>