SIMD-accelerated blitter operations (SSE2, NEON) #101
JoeMatt merged 7 commits into libretro:master
Conversation
Pull request overview
Introduces a compile-time-selected SIMD dispatch interface for blitter hot-path operations, with SSE2/NEON implementations and a scalar fallback, plus a new bit-exactness + benchmarking test harness.
Changes:
- Adds a `blitter_simd_ops_t` interface and architecture-specific implementations (SSE2, NEON, scalar).
- Replaces several hot-path logic blocks in `blitter.c` with calls through `blitter_simd_ops`.
- Adds a dedicated test/benchmark program and Makefile logic to select the implementation at build time.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/test_blitter_simd.c | Adds bit-exactness tests and a micro-benchmark for the SIMD ops. |
| src/blitter_simd_sse2.c | Implements LFU/DCOMP/ZCOMP/byte_merge using SSE2 intrinsics. |
| src/blitter_simd_neon.c | Implements LFU/DCOMP/ZCOMP/byte_merge using ARM NEON intrinsics. |
| src/blitter_simd_scalar.c | Provides the portable scalar fallback/reference implementation. |
| src/blitter_simd.h | Declares the SIMD ops dispatch struct and exported instance. |
| src/blitter.c | Switches existing blitter logic to call the new SIMD ops. |
| Makefile.common | Adds arch-conditional selection of the SIMD implementation source file. |
```makefile
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c

# x86/x64: use SSE2 (baseline for all x86_64, available since Pentium 4)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif

# ARM64 (AArch64): NEON is always available
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# iOS/tvOS ARM64 cross-compilation
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# ARMv7 with NEON (set by Makefile for classic_armv7 targets)
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
```
Selecting the implementation based on uname -m uses the build host architecture, which is incorrect for common cross-compilation scenarios (e.g., building an ARM target on an x86 host). Prefer selecting via existing target variables ($(platform), $(ARCH)) and/or compiler target flags, and allow an explicit override (e.g., BLITTER_SIMD=sse2|neon|scalar) so packaging/CI can force the correct choice deterministically.
Suggested change (replacing the block quoted above):

```makefile
# BLITTER_SIMD may be set explicitly to one of: scalar, sse2, neon.
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
ifneq ($(BLITTER_SIMD),)
ifeq (,$(filter scalar sse2 neon,$(BLITTER_SIMD)))
$(error Unsupported BLITTER_SIMD '$(BLITTER_SIMD)'; expected one of: scalar sse2 neon)
endif
endif
ifeq ($(BLITTER_SIMD),sse2)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
else ifeq ($(BLITTER_SIMD),neon)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
else ifeq ($(BLITTER_SIMD),scalar)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
else
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif
```
```c
uint8_t (*zcomp)(uint64_t srcz, uint64_t dstz, uint8_t zmode);

/* Byte Mask Merge: select bytes from src or dst based on 16-bit mask.
 * Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
```
The mask description is inaccurate: byte 0 is controlled by mask bits 0–7 (per-bit within the byte), not just bit 0. Updating this comment matters because implementations (and tests) rely on the full low 8 bits having meaning.
Suggested change:

```diff
- * Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
+ * Bits 0-7 control byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
```
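For reference, the corrected semantics can be written out as a portable scalar routine. This is only a sketch: `ref_byte_merge` is a hypothetical name, and the select direction (a set bit choosing the src byte, a clear bit choosing dst) is an assumption not stated in the quoted comment.

```c
#include <stdint.h>

/* Sketch of the byte-merge semantics described above (hypothetical helper;
 * assumes a set selection bit picks the src byte/bit, clear picks dst). */
static uint64_t ref_byte_merge(uint64_t src, uint64_t dst, uint16_t mask)
{
    /* Byte 0: mask bits 0-7 select per-bit within the byte. */
    uint64_t sel = (uint64_t)(mask & 0xFF);
    int i;

    /* Bytes 1-7: mask bits 8-14 select whole bytes. */
    for (i = 1; i < 8; i++)
        if (mask & (1u << (i + 7)))
            sel |= 0xFFull << (8 * i);

    return (src & sel) | (dst & ~sel);
}
```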
```c
/* Build an 8-byte selection mask:
 * byte 0 = low 8 bits of mask (per-bit)
 * bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint64_t sel64;
__builtin_memcpy(&sel64, sel, 8);
```
This hot-path function does a stack array fill + memcpy every call to build the mask, which can erode SIMD gains. Consider constructing sel64 directly with bit operations (or using a small lookup table for bits 8–14 since there are only 128 possibilities) and then loading it into an SSE register, avoiding the per-call byte array and memcpy.
Suggested change (replacing the stack array + memcpy above):

```c
/* Build an 8-byte selection mask directly in a 64-bit value:
 * byte 0 = low 8 bits of mask (per-bit)
 * bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint64_t sel64 =
    ((uint64_t)(mask & 0x00FF)) |
    ((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
    ((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
    ((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
    ((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
    ((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
    ((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
    ((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);
```
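The review also floats a 128-entry lookup table for mask bits 8-14. A minimal sketch of that alternative follows; `sel_hi_lut`, `init_sel_hi_lut`, and `build_sel64` are hypothetical names, not code from the PR.

```c
#include <stdint.h>

/* Hypothetical 128-entry LUT: each 7-bit pattern of mask bits 8-14
 * expands to the all-ones/all-zero contents of selection bytes 1-7. */
static uint64_t sel_hi_lut[128];

static void init_sel_hi_lut(void)
{
    int m, i;
    for (m = 0; m < 128; m++) {
        uint64_t v = 0;
        for (i = 0; i < 7; i++)
            if (m & (1 << i))
                v |= 0xFFull << (8 * (i + 1));
        sel_hi_lut[m] = v;
    }
}

/* Byte 0 keeps its per-bit low 8 mask bits; bytes 1-7 come from the LUT. */
static uint64_t build_sel64(uint16_t mask)
{
    return (uint64_t)(mask & 0xFF) | sel_hi_lut[(mask >> 8) & 0x7F];
}
```

The table is only 1 KiB and can be filled once at startup, trading a tiny amount of cache footprint for fewer per-call instructions.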
```c
/* Multiply each lane by a power-of-2 weight and horizontal add.
 * Weights: 1, 2, 4, 8, 16, 32, 64, 128 */
static const uint8_t weights[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
uint8x8_t vw = vld1_u8(weights);

/* AND with weights (0xFF & weight = weight, 0 & weight = 0) */
uint8x8_t vbits = vand_u8(vcmp, vw);

/* Pairwise add to collapse 8 bytes -> 4 -> 2 -> 1 */
uint8x8_t sum1 = vpadd_u8(vbits, vbits); /* 4 sums */
uint8x8_t sum2 = vpadd_u8(sum1, sum1);   /* 2 sums */
uint8x8_t sum3 = vpadd_u8(sum2, sum2);   /* 1 sum */

return vget_lane_u8(sum3, 0);
```
For a function intended to be a hot-path primitive, loading weights from memory and doing three vpadd stages is relatively heavy. A lighter approach is to avoid the memory load by using an immediate vector for weights (so the compiler can constant-fold), or use a bit-extraction strategy (e.g., shift the compare result to 0/1 per lane then reduce) to cut instruction count.
Suggested change (replacing the weights load + vpadd chain above):

```c
/* Convert compare lanes from 0xFF/0x00 to 1/0, then pack into the
 * result byte directly to avoid the weights load and vpadd chain. */
uint8x8_t vbits = vshr_n_u8(vcmp, 7);
return (uint8_t)(
    (vget_lane_u8(vbits, 0) << 0) |
    (vget_lane_u8(vbits, 1) << 1) |
    (vget_lane_u8(vbits, 2) << 2) |
    (vget_lane_u8(vbits, 3) << 3) |
    (vget_lane_u8(vbits, 4) << 4) |
    (vget_lane_u8(vbits, 5) << 5) |
    (vget_lane_u8(vbits, 6) << 6) |
    (vget_lane_u8(vbits, 7) << 7));
```
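Either variant must satisfy the same contract: bit i of the result is set exactly when byte i of the two operands match. A portable cross-check for that contract can be written in plain C (a reference sketch; `ref_dcomp_scalar` is a hypothetical name):

```c
#include <stdint.h>

/* Portable reference for the 8-byte equality comparator: bit i of the
 * result is set iff byte i of a equals byte i of b. */
static uint8_t ref_dcomp_scalar(uint64_t a, uint64_t b)
{
    uint8_t r = 0;
    int i;
    for (i = 0; i < 8; i++)
        if (((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF))
            r |= (uint8_t)(1 << i);
    return r;
}
```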
```c
static double elapsed_ns(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e9
         + (double)(end.tv_nsec - start.tv_nsec);
}
```
clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.
```c
clock_gettime(CLOCK_MONOTONIC, &t0);
for (i = 0; i < BENCH_ITERS; i++)
    sink += ref_lfu(0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL, (uint8_t)(i & 0x0F));
clock_gettime(CLOCK_MONOTONIC, &t1);
```
clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.
…e, cleaner intrinsics
- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows), fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 40d942a to 1b4a86a.
…, lighter NEON dcomp
- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH) variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```makefile
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
```
The auto-selection logic forces the NEON implementation whenever ARCH or platform matches arm, even when HAVE_NEON is not set. This will break builds on ARM targets without NEON (e.g., the generic ARM block in Makefile sets ARCH = arm but does not guarantee -mfpu=neon / NEON support). Restrict NEON selection to aarch64/arm64 or HAVE_NEON=1 (and avoid matching plain arm unless NEON is explicitly enabled).
Suggested change (replacing the block quoted above):

```makefile
# ARM targets: prefer NEON only when the target configuration says it is
# available, or when targeting architectures/platforms where NEON is implied.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
```
```makefile
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
```
When selecting the SSE2 implementation for 32-bit x86 targets (x86/i686/i386), the build may fail unless the compiler is explicitly told to enable SSE2 intrinsics (e.g., -msse2 for GCC/Clang). Currently the Makefiles choose blitter_simd_sse2.c but do not add any corresponding compile flags. Consider either (1) limiting automatic SSE2 selection to x86_64 only, or (2) appending -msse2 (and any required defines) when building the SSE2 source on 32-bit x86.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```makefile
# Native build fallback: auto-detect from host architecture
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
```
The uname -m auto-detection at the end will select the SIMD source based on the build host, which can break cross-compiles to non-x86/ARM targets (e.g. PS3/PSP/etc.) by pulling in SSE2/NEON intrinsics for the wrong architecture. Consider removing the uname fallback or gating it behind a condition that guarantees a native build (or an explicit BLITTER_SIMD setting).
Suggested change (drop the host auto-detection, keeping only the closing `endif`):

```diff
-# Native build fallback: auto-detect from host architecture
-ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
-BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
-endif
-ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
-BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
-endif
-endif
+endif
```
```diff
 uint64_t func2 = funcmask[(lfu_func >> 2) & 0x01];
 uint64_t func3 = funcmask[(lfu_func >> 3) & 0x01];
-uint64_t lfu = (~srcd & ~dstd & func0) | (~srcd & dstd & func1) | (srcd & ~dstd & func2) | (srcd & dstd & func3);
+uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
```
These calls go through function pointers (blitter_simd_ops.*), which introduces an indirect-call overhead on a very hot path and contradicts the PR description's “zero runtime dispatch overhead” claim unless LTO/devirtualization is guaranteed. If the goal is truly zero overhead, consider selecting the implementation via static inline functions/macros at compile time (or exposing direct symbols) so the compiler can inline/constant-fold.
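The static-inline route the comment describes can be sketched directly from the truth-table logic shown in the diff above. This is an illustrative sketch, not the PR's code; `blitter_simd_lfu_inline` is a hypothetical name:

```c
#include <stdint.h>

/* Hypothetical static-inline form of the LFU truth table, so the compiler
 * can inline it at every call site instead of going through a pointer.
 * lfu_func bit k enables the minterm for (src,dst) combination k. */
static inline uint64_t blitter_simd_lfu_inline(uint64_t srcd, uint64_t dstd,
                                               uint8_t lfu_func)
{
    static const uint64_t funcmask[2] = { 0, ~(uint64_t)0 };
    uint64_t f0 = funcmask[lfu_func & 0x01];
    uint64_t f1 = funcmask[(lfu_func >> 1) & 0x01];
    uint64_t f2 = funcmask[(lfu_func >> 2) & 0x01];
    uint64_t f3 = funcmask[(lfu_func >> 3) & 0x01];
    return (~srcd & ~dstd & f0) | (~srcd & dstd & f1) |
           ( srcd & ~dstd & f2) | ( srcd & dstd & f3);
}
```

Placed in `blitter_simd.h` behind the same compile-time arch selection, this keeps the one-implementation-per-build property while letting the optimizer fold the function into the inner loop.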
Force-pushed from c143045 to 70bebb7.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
```c
/* Build 8-byte selection mask */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint8x8_t vmask = vld1_u8(sel);
```
neon_byte_merge builds an 8-byte mask in a stack array and then does vld1_u8(sel) every call. Since this function is intended for a hot blitter datapath, the extra stores + load can materially reduce the NEON speedup. Consider computing the 64-bit sel value purely in registers (similar to the SSE2 version) and using vcreate_u64 + reinterpret instead of a memory load.
Suggested change (replacing the stack array + vld1_u8 above):

```c
/* Build 8-byte selection mask entirely in registers. Byte 0 uses the
 * low 8 bits directly; bytes 1..7 are all-ones/all-zero from mask bits. */
uint64_t sel =
    ((uint64_t)(mask & 0x00FF)) |
    ((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
    ((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
    ((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
    ((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
    ((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
    ((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
    ((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);
uint8x8_t vmask = vreinterpret_u8_u64(vcreate_u64(sel));
```
```makefile
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
# 32-bit x86 may need explicit SSE2 enable
ifneq (,$(filter i686 i386 x86,$(ARCH)))
CFLAGS += -msse2
endif
endif
```
-msse2 is only appended when ARCH matches i686/i386/x86, but the SSE2 source can be selected via platform matches, MSYSTEM, or uname -m even when ARCH is unset. On 32-bit x86 toolchains (e.g., MinGW32), this can cause compilation failures due to SSE2 intrinsics being used without enabling SSE2. Consider tying the -msse2 flag to the condition that selects the SSE2 source for 32-bit targets (e.g., based on platform, uname -m, or MSYSTEM).
```c
uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
bool mir_bit, mir_byte;
```
These hot-path calls go through function pointers (blitter_simd_ops.*). Unless LTO is reliably enabled for all builds, this introduces an indirect call that the compiler typically cannot inline/devirtualize, which conflicts with the stated goal of “zero runtime dispatch overhead” and can materially reduce the SIMD speedup in inner loops. Consider using direct symbols (e.g., blitter_simd_lfu() etc.) selected at link-time by compiling exactly one implementation file, or provide static inline wrappers that can be inlined when the implementation is known.
```makefile
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
```
The NEON implementation is selected for $(platform) values matching armv7% and armhf, but those do not universally guarantee NEON availability or the necessary compiler flags (e.g., -mfpu=neon). This can break builds on ARM targets without NEON (or with toolchains not defaulting to NEON). Consider selecting blitter_simd_neon.c only when HAVE_NEON=1 (or when targeting aarch64/arm64 / iOS-arm64 where NEON is mandatory), otherwise fall back to scalar unless BLITTER_SIMD=neon is explicitly requested.
Suggested change (replacing the block quoted above):

```makefile
# ARM targets: prefer NEON only when the target configuration says it is available
# or when targeting ARM64/AArch64 where NEON is mandatory.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it.
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8%,$(platform)))
```
```makefile
# Native build fallback: auto-detect from host when no SIMD was selected above.
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
endif
```
The uname -m fallback can pick a SIMD implementation based on the build host even when platform targets a different architecture (cross-compiles like vita, ps3, etc. don't consistently set ARCH here). That can accidentally compile the SSE2 file for non-x86 targets. Consider restricting host auto-detection to true native builds (e.g., when platform is unix/osx/win or platform == system_platform) and otherwise default to the scalar implementation unless the target arch is explicitly detected via platform/ARCH or BLITTER_SIMD.
Extract four hot-path operations from blitter.c DATA() into a dispatch interface (blitter_simd_ops_t) with arch-specific implementations:
- LFU: 64-bit truth table (4 boolean functions)
- DCOMP: 8-byte equality comparator
- ZCOMP: 4x 16-bit Z-buffer comparison
- byte_merge: per-byte mask select for pixel and Z data

Files:
- blitter_simd.h: dispatch struct typedef
- blitter_simd_scalar.c: portable C reference (fallback)
- blitter_simd_sse2.c: x86/x64 SSE2 intrinsics
- blitter_simd_neon.c: ARM NEON intrinsics
- test/test_blitter_simd.c: 40k+ bit-exactness tests + benchmark

Makefile.common selects implementation at compile time based on target architecture (zero runtime dispatch cost).

Addresses libretro#99.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove plain 'arm' from NEON ARCH filter (armv5/armv6 lack NEON)
- Don't initialize BLITTER_SIMD_SRC to scalar — let auto-detection run by keeping it empty until a match is found
- Add -msse2 flag for 32-bit x86 targets
- Add scalar fallback for exotic platforms with no SIMD support
- uname -m fallback now only fires when no prior rule matched

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 70bebb7 to b51b63f.
Makefile.common:
- Restrict armv7/armhf NEON auto-selection to HAVE_NEON=1 only
- Guard uname -m fallback to native-build platforms (unix/osx/win)
- Add -msse2 flag consistently for all 32-bit x86 selection paths

blitter_simd_neon.c:
- Build neon_byte_merge mask in registers via vcreate_u64 instead of stack array + vld1_u8 memory load

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs the test harness for both the platform-native SIMD implementation and the scalar fallback on every build, covering SSE2 (linux-x64, windows), NEON (linux-arm64, macos-arm64), and a scalar cross-check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Extracts four hot-path operations from `blitter.c` `DATA()` into a dispatch interface (`blitter_simd_ops_t`) with architecture-specific implementations
- Implementation selected at compile time in `Makefile.common` — zero runtime dispatch overhead

Operations accelerated
- DCOMP: `cmpeq_epi8` + `movemask` / `vceq_u8`
- ZCOMP: `cmpgt_epi16` (biased) / `vclt_u16`
- byte_merge: `blendv` / `vbsl`

Files
- `src/blitter_simd.h` — dispatch struct typedef
- `src/blitter_simd_scalar.c` — portable C reference (fallback for any platform)
- `src/blitter_simd_sse2.c` — x86/x64 SSE2 intrinsics
- `src/blitter_simd_neon.c` — ARM NEON intrinsics
- `src/blitter.c` — 5 call sites replaced (net -57 lines)
- `Makefile.common` — arch-conditional source selection
- `test/test_blitter_simd.c` — bit-exactness + benchmark

Addresses #99.

Test plan
- Bit-exactness tests and micro-benchmark (`./test/test_blitter_simd --bench`)

🤖 Generated with Claude Code
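The "(biased)" note next to the 16-bit ZCOMP compare reflects that SSE2 has no unsigned 16-bit greater-than: both operands are XORed with 0x8000 to map the unsigned range onto the signed range before the signed `cmpgt_epi16`. A scalar model of the trick (function name hypothetical, not from the PR):

```c
#include <stdint.h>

/* Hypothetical scalar model of the biased compare: XOR both operands with
 * 0x8000 so unsigned 16-bit ordering becomes signed ordering, then use the
 * signed greater-than that cmpgt_epi16 provides. */
static int ugt16_biased(uint16_t a, uint16_t b)
{
    int16_t sa = (int16_t)(a ^ 0x8000);
    int16_t sb = (int16_t)(b ^ 0x8000);
    return sa > sb;
}
```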