Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…e still keeping core looping/cache flushing logic. Properly implemented/fixed MLX. Optimized thread counts. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Pull request overview
This PR adds comprehensive benchmarking infrastructure to compare Cactus's quantized matrix multiplication kernels (INT4/INT8) against other popular inference frameworks including GGML, LiteRT (TensorFlow Lite), MLX, MLC-LLM, ONNX Runtime, and ExecuTorch/XNNPACK. The benchmark suite measures performance and accuracy on standard matrix sizes (1x1024x1024 for GEMV and 1024x1024x1024 for GEMM), enabling objective comparisons of CPU-based quantized inference performance.
Changes:
- Added modular benchmark driver infrastructure with backend registration system
- Implemented 7+ backend adapters for external frameworks with optional CMake integration
- Added shell script for automated framework detection and build configuration
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| tests/run_benchmark.sh | Shell script for building and running benchmarks with automatic framework detection |
| tests/bench/matmul_bench.cpp | Main benchmark entry point that orchestrates test execution |
| tests/bench/bench_driver.{h,cpp} | Core benchmark driver with backend registry and execution logic |
| tests/bench/bench_common.{h,cpp} | Shared quantization utilities and accuracy checking functions |
| tests/bench/backend_*.cpp | Backend implementations for Cactus, GGML, LiteRT, MLX, MLC, ONNX Runtime, and ExecuTorch |
| tests/bench/README.md | Comprehensive documentation with setup instructions and benchmark results |
| tests/CMakeLists.txt | CMake configuration with conditional framework integration |
| cactus/engine/engine.h | Model configuration updates (unrelated to benchmarking) |
Comments suppressed due to low confidence (1)
tests/run_benchmark.sh:83
- The executable path references test_matmul_bench but the CMakeLists.txt defines the executable as matmul_bench (line 241). This will cause the script to fail when attempting to run the benchmark. The path should be "$BUILD_DIR/matmul_bench".
tests/bench/README.md
Outdated
```
cmake .. -DWITH_LITERT=ON
```

Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.
The documentation mentions backend names ruy_mc and ruy_1c that are not registered in backend_litert.cpp. The actual registered backends are litert_neon, ruy, and litert_4bit_neon. The documentation should be updated to reflect the actual backend names.
```diff
- Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy_mc`, `ruy_1c`, and `litert_4bit_neon` backends.
+ Fetches FlatBuffers + TFLite deps on first build (requires network). Enables `litert_neon`, `ruy`, and `litert_4bit_neon` backends.
```
````markdown
## Quick Start (Cactus-only, no third-party deps)

```bash
cactus build
```
````
The Quick Start instructions reference a "cactus build" command which is not a standard command in the repository. The run_benchmark.sh script sources "./setup" instead (line 24). The documentation should be consistent with the actual build process, or explain what "cactus build" refers to.
```diff
- cactus build
+ ./setup
```
```cpp
float preemphasis = 0.0f;
bool hann_periodic = true;
```
The addition of preemphasis and hann_periodic fields to SpectrogramConfig appears unrelated to the benchmarking PR. These audio processing configuration changes should be in a separate PR.
```cpp
static void quantize_per_group(const std::vector<float>& src, size_t N, size_t K,
                               std::vector<int8_t>& dst, std::vector<float>& scales,
                               int qmax, int qmin) {
    const size_t num_groups = K / kGroupSize;
    dst.resize(N * K);
    scales.resize(N * num_groups);
    for (size_t n = 0; n < N; ++n) {
        for (size_t g = 0; g < num_groups; ++g) {
            float max_abs = 0.0f;
            const size_t base = n * K + g * kGroupSize;
            for (size_t k = 0; k < kGroupSize; ++k)
                max_abs = std::max(max_abs, std::abs(src[base + k]));
            float scale = std::max(max_abs / static_cast<float>(qmax), 1e-10f);
            scales[n * num_groups + g] = scale;
            for (size_t k = 0; k < kGroupSize; ++k) {
                int q = static_cast<int>(std::round(src[base + k] / scale));
                dst[base + k] = static_cast<int8_t>(std::max(qmin, std::min(qmax, q)));
            }
        }
    }
}
```
Missing validation: The function assumes K is evenly divisible by kGroupSize (32). If K is not a multiple of kGroupSize, the function will access out-of-bounds memory or produce incorrect results. The code should either validate this precondition or handle partial groups.
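A minimal guard for that precondition might look like the following sketch (throwing is just one possible policy; `check_quant_shape` is a hypothetical helper, not code from the PR):

```cpp
#include <cstddef>
#include <stdexcept>

constexpr std::size_t kGroupSize = 32;

// Sketch: validate the shape before quantizing, so a K that is not a
// multiple of kGroupSize fails loudly instead of reading out of bounds.
inline void check_quant_shape(std::size_t N, std::size_t K) {
    if (N == 0 || K == 0 || K % kGroupSize != 0)
        throw std::invalid_argument(
            "quantize_per_group: K must be a non-zero multiple of kGroupSize");
}
```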
```cpp
std::vector<uint8_t> pack_int4_pairs(const std::vector<int8_t>& interleaved) {
    std::vector<uint8_t> packed(interleaved.size() / 2);
    for (size_t i = 0; i < interleaved.size(); i += 32) {
        for (size_t j = 0; j < 16; ++j) {
            const uint8_t lo = static_cast<uint8_t>(interleaved[i + j] & 0x0F);
            const uint8_t hi = static_cast<uint8_t>((interleaved[i + 16 + j] & 0x0F) << 4);
            packed[i / 2 + j] = lo | hi;
        }
    }
    return packed;
}
```
Missing validation: The function assumes interleaved.size() is evenly divisible by 32. If the size is not a multiple of 32, the loop will access out-of-bounds memory when i + 16 + j exceeds the vector size. The code should either validate this precondition or handle the remainder properly.
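If partial groups must be supported rather than rejected, one option is to zero-pad the interleaved buffer up to a whole 32-element block before packing (a sketch only; whether zero padding is safe depends on how downstream kernels interpret the padded nibbles):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch: round the interleaved buffer up to a multiple of 32 by
// appending zero nibbles, so the packing loop never reads past the end.
inline std::vector<int8_t> pad_to_block(std::vector<int8_t> v) {
    const std::size_t rem = v.size() % 32;
    if (rem != 0) v.resize(v.size() + (32 - rem), 0);
    return v;
}
```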
```cmake
option(WITH_ONNXRT "Build with ONNX Runtime for matmul benchmarks" OFF)

if(WITH_LITERT)
  set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
```
Missing directory validation: The LITERT_DIR path is constructed without checking whether it exists. If the directory doesn't exist, the build will fail with cryptic errors. Consider adding a check like: if(NOT EXISTS "${LITERT_DIR}") message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}") endif().
```diff
  set(LITERT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../../third_party/litert)
+ if(NOT EXISTS "${LITERT_DIR}")
+   message(FATAL_ERROR "LiteRT directory not found at ${LITERT_DIR}")
+ endif()
```
```cpp
tflite::optimized_4bit::api::Prepack(
    w->prepacked, litert_source.data(),
    w->lhs_layout_rows, w->lhs_layout_cols,
```
Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->prepacked will be set to nullptr. This will likely cause a crash when passed to the Prepack function on line 176. The code should check the return value and handle allocation failures appropriately.
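A checked wrapper is one way to address this; the sketch below assumes nothing from the PR beyond the use of `posix_memalign` (`aligned_alloc_checked` is a hypothetical helper name):

```cpp
#include <stdio.h>
#include <stdlib.h>

// Sketch: wrap posix_memalign so a failed allocation is reported and
// surfaces as nullptr instead of an uninitialized pointer.
inline void* aligned_alloc_checked(size_t alignment, size_t size) {
    void* raw = nullptr;
    const int rc = posix_memalign(&raw, alignment, size);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign(%zu, %zu) failed: rc=%d\n",
                alignment, size, rc);
        return nullptr;
    }
    return raw;
}
```

Callers can then bail out (or fall back to an unaligned path) when the helper returns nullptr, instead of handing a null buffer to Prepack.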
tests/bench/backend_litert.cpp
Outdated
```cpp
    w->lhs_layout_rows, w->lhs_layout_cols,
    static_cast<int>(N), static_cast<int>(K),
```
Missing error handling: The posix_memalign function can fail (return non-zero), but the code doesn't check the return value. If allocation fails, raw will remain nullptr, and w->ref_prepacked will be set to nullptr. This will likely cause a crash when passed to ReferencePrepack on line 185. The code should check the return value and handle allocation failures appropriately.
cactus/engine/engine.h
Outdated
```cpp
float default_max_tps = -1.0f;
float default_cloud_handoff_threshold = 0.0f;

uint32_t num_mel_bins = 80;
```
The addition of num_mel_bins to the Config struct appears unrelated to the benchmarking infrastructure described in the PR title and description. This change should be in a separate PR focused on model configuration updates.
tests/bench/README.md
Outdated
```
cmake .. -DWITH_GGML=ON
```

Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.
The documentation mentions backend names ggml_q4_0_graph and ggml_q8_0_graph that are commented out in the actual code (backend_ggml.cpp lines 292-299). These backends are not actually registered or available, which contradicts the documentation. Either remove these from the documentation or uncomment the registration code.
```diff
- Builds GGML from source. Enables `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, and `ggml_q8_0_graph` backends.
+ Builds GGML from source. Enables `ggml_q4_0` and `ggml_q8_0` backends.
```
… ggml, also included dequantization step in litert Signed-off-by: Noah Cylich <noahcylich@gmail.com>
also added proper threading setup (to enable max and default use cases) Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ct head sizes of 64. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…ormance Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…eaved format Signed-off-by: Noah Cylich <noahcylich@gmail.com>
… and interleave to optimize attn decode hybrid kernel Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Add CSV, PNG, log files, corpus/, kv_profile_results/, tests/results/, and tests/analysis/ to prevent benchmark output from being tracked. Signed-off-by: Noah Cylich <noah@desertai.io>
Evaluates model perplexity across KV cache configurations (FP16, INT8, INT4, K8V4) with cached and uncached scoring modes. Supports configurable context windows, chunk sizes, and per-position accuracy bucketing. Signed-off-by: Noah Cylich <noah@desertai.io>
Structured benchmark suite measuring task success rate alongside decode performance. Covers code generation (20 problems), math reasoning (15 GSM8K-style), and instruction following (10 problems). Supports baseline save/compare modes for regression detection. Signed-off-by: Noah Cylich <noah@desertai.io>
Three experiment scripts for INT4 KV cache analysis: - exp_0c: keys vs values sensitivity (K4V8 vs K8V4 vs K4V4) - exp_1d: long-context error accumulation across decode steps - exp_1e: per-layer sensitivity (INT4 one layer at a time) Signed-off-by: Noah Cylich <noah@desertai.io>
Per-token timing benchmark (bench_e2e_per_token.py) using streaming callback for precise per-token measurements. Branch comparison script (bench_e2e_decode.sh) that builds and benchmarks current branch vs main, outputting CSV with decode_tps, prefill_tps, and latency metrics. Signed-off-by: Noah Cylich <noah@desertai.io>
Summary
- New benchmark suite (`tests/bench/`) that compares Cactus INT8/INT4 kernels against 6 other inference frameworks (GGML, MLX, MLC-LLM, LiteRT, ONNX Runtime, ExecuTorch) on identical 1024x1024 workloads
- Moved `third_party/` to `../third_party/` (outside the repo root) to keep the repository clean
- Added the `PARAKEET` model type and `num_mel_bins`, `encoder_hidden_act`, `pad_token_id`, and `hann_periodic` fields to the engine config to fix the build on this branch

Benchmark architecture
Each backend implements a single `run_kernel(M, weights, activations, act_int8, act_scales, output, reference)` function pointer. When `output`/`reference` are null (the hot path), the kernel runs with zero capture overhead. When non-null (called once for accuracy), the kernel writes fp32 output and an optional dequantized reference for NRMSE checking.

The timed loop cycles through 64 distinct weight matrices per iteration to force L2 cache misses on every call, matching real inference where each transformer layer has unique weights. On the M4 Pro (16 MB L2), 64 matrices x ~1 MB each = 64 MB, exceeding both L2 and SLC.
Two matmul sizes are tested: 1x1024x1024 (GEMV, single-token decode) and 1024x1024x1024 (GEMM, batched/prefill).
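The cache-thrashing timed loop described above can be sketched like this (illustrative only; `Weights`, `RunKernelFn`, and `bench_us_per_call` are simplified stand-ins, not the driver's actual types):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

struct Weights { std::vector<int8_t> data; };
using RunKernelFn = void (*)(const Weights&);

// Sketch of the timed loop: cycle through 64 distinct ~1 MB weight
// matrices so every call misses L2, as in real layer-by-layer inference.
inline double bench_us_per_call(RunKernelFn run_kernel, int iters) {
    constexpr int kNumMatrices = 64;  // 64 x ~1 MB = 64 MB, beyond L2 + SLC
    std::vector<Weights> weights(kNumMatrices);
    for (auto& w : weights) w.data.assign(1024 * 1024, 1);  // INT8 1024x1024
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        run_kernel(weights[i % kNumMatrices]);  // distinct matrix every call
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}
```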
Backends (11 total)
- `cactus_int8`, `cactus_int4`
- `ggml_q4_0`, `ggml_q8_0`, `ggml_q4_0_graph`, `ggml_q8_0_graph` (`-DWITH_GGML=ON`)
- `mlx_q4_cpu`, `mlx_q8_cpu`, `mlx_q4_gpu`, `mlx_q8_gpu` (`-DWITH_MLX=ON`)
- `mlc_int4`, `mlc_int8` (`-DWITH_MLC=ON`)
- `litert_neon`, `ruy`, `litert_4bit_neon` (`-DWITH_LITERT=ON`)
- `onnxrt_int8`, `onnxrt_int4` (`-DWITH_ONNXRT=ON`)
- `executorch_int8`, `executorch_int4` (`-DWITH_EXECUTORCH=ON`)

Key results (M4 Pro, 14 cores)
GEMV (M=1): Cactus INT8 at 32.5 us is 2nd only to LiteRT NEON (23.6 us), beating GGML (57 us), MLX (140 us), and ONNX Runtime (587 us). Cactus INT4 at 33.6 us is 2nd to LiteRT 4bit (10.8 us).
GEMM (M=1024): Cactus INT8 at 828 us (2593 GOPS) is 2nd to MLX (606 us, using Accelerate/AMX), beating Ruy (1348 us), ONNX Runtime (1616 us), and GGML (3269 us). Cactus INT4 at 1192 us is again 2nd to MLX (595 us).
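As a sanity check on those throughput figures, GOPS is just `2*M*N*K` multiply-accumulate operations divided by wall time; a small helper (not part of the PR) reproduces the quoted number:

```cpp
// 2*M*N*K ops per matmul, divided by the measured time (in microseconds),
// expressed in giga-ops per second.
inline double gops(double m, double n, double k, double micros) {
    return 2.0 * m * n * k / (micros * 1e-6) / 1e9;
}
// gops(1024, 1024, 1024, 828) is ~2593.6, matching the quoted 2593 GOPS.
```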
Changes
- `tests/bench/`: new benchmark suite (matmul_bench.cpp, bench_driver, bench_common, 7 backend files, README with full results)
- `tests/CMakeLists.txt`: build rules for the benchmark executable and optional framework backends; updated third_party paths to `../../third_party/`
- `cactus/engine/engine.h`: added `PARAKEET` enum value and `num_mel_bins`, `encoder_hidden_act`, `pad_token_id`, `hann_periodic` fields (fixes build)