SIMD-accelerated blitter operations (SSE2, NEON) #101
JoeMatt merged 7 commits into libretro:master
Conversation
Pull request overview
Introduces a compile-time-selected SIMD dispatch interface for blitter hot-path operations, with SSE2/NEON implementations and a scalar fallback, plus a new bit-exactness + benchmarking test harness.
Changes:
- Adds a `blitter_simd_ops_t` interface and architecture-specific implementations (SSE2, NEON, scalar).
- Replaces several hot-path logic blocks in `blitter.c` with calls through `blitter_simd_ops`.
- Adds a dedicated test/benchmark program and Makefile logic to select the implementation at build time.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/test_blitter_simd.c | Adds bit-exactness tests and a micro-benchmark for the SIMD ops. |
| src/blitter_simd_sse2.c | Implements LFU/DCOMP/ZCOMP/byte_merge using SSE2 intrinsics. |
| src/blitter_simd_neon.c | Implements LFU/DCOMP/ZCOMP/byte_merge using ARM NEON intrinsics. |
| src/blitter_simd_scalar.c | Provides the portable scalar fallback/reference implementation. |
| src/blitter_simd.h | Declares the SIMD ops dispatch struct and exported instance. |
| src/blitter.c | Switches existing blitter logic to call the new SIMD ops. |
| Makefile.common | Adds arch-conditional selection of the SIMD implementation source file. |
```makefile
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c

# x86/x64: use SSE2 (baseline for all x86_64, available since Pentium 4)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif

# ARM64 (AArch64): NEON is always available
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# iOS/tvOS ARM64 cross-compilation
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# ARMv7 with NEON (set by Makefile for classic_armv7 targets)
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
```
Selecting the implementation based on uname -m uses the build host architecture, which is incorrect for common cross-compilation scenarios (e.g., building an ARM target on an x86 host). Prefer selecting via existing target variables ($(platform), $(ARCH)) and/or compiler target flags, and allow an explicit override (e.g., BLITTER_SIMD=sse2|neon|scalar) so packaging/CI can force the correct choice deterministically.
Suggested change (replacing the block quoted above):

```makefile
# BLITTER_SIMD may be set explicitly to one of: scalar, sse2, neon.
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
ifneq ($(BLITTER_SIMD),)
ifeq (,$(filter scalar sse2 neon,$(BLITTER_SIMD)))
$(error Unsupported BLITTER_SIMD '$(BLITTER_SIMD)'; expected one of: scalar sse2 neon)
endif
endif
ifeq ($(BLITTER_SIMD),sse2)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
else ifeq ($(BLITTER_SIMD),neon)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
else ifeq ($(BLITTER_SIMD),scalar)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
else
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif
```
```c
uint8_t (*zcomp)(uint64_t srcz, uint64_t dstz, uint8_t zmode);

/* Byte Mask Merge: select bytes from src or dst based on 16-bit mask.
 * Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
```
The mask description is inaccurate: byte 0 is controlled by mask bits 0–7 (per-bit within the byte), not just bit 0. Updating this comment matters because implementations (and tests) rely on the full low 8 bits having meaning.
Suggested change:

```diff
- * Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
+ * Bits 0-7 control byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
```
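For reference, the corrected semantics can be written out as a portable scalar routine. This is only a sketch: `ref_byte_merge` is a hypothetical name, and the select direction (a set bit choosing the src byte, a clear bit choosing dst) is an assumption not stated in the quoted comment.

```c
#include <stdint.h>

/* Sketch of the byte-merge semantics described above (hypothetical helper;
 * assumes a set selection bit picks the src byte/bit, clear picks dst). */
static uint64_t ref_byte_merge(uint64_t src, uint64_t dst, uint16_t mask)
{
    /* Byte 0: mask bits 0-7 select per-bit within the byte. */
    uint64_t sel = (uint64_t)(mask & 0xFF);
    int i;

    /* Bytes 1-7: mask bits 8-14 select whole bytes. */
    for (i = 1; i < 8; i++)
        if (mask & (1u << (i + 7)))
            sel |= 0xFFull << (8 * i);

    return (src & sel) | (dst & ~sel);
}
```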
```c
/* Build an 8-byte selection mask:
 * byte 0 = low 8 bits of mask (per-bit)
 * bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint64_t sel64;
__builtin_memcpy(&sel64, sel, 8);
```
This hot-path function does a stack array fill + memcpy every call to build the mask, which can erode SIMD gains. Consider constructing sel64 directly with bit operations (or using a small lookup table for bits 8–14 since there are only 128 possibilities) and then loading it into an SSE register, avoiding the per-call byte array and memcpy.
Suggested change (replacing the stack array + memcpy above):

```c
/* Build an 8-byte selection mask directly in a 64-bit value:
 * byte 0 = low 8 bits of mask (per-bit)
 * bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint64_t sel64 =
    ((uint64_t)(mask & 0x00FF)) |
    ((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
    ((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
    ((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
    ((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
    ((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
    ((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
    ((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);
```
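The review also floats a 128-entry lookup table for mask bits 8-14. A minimal sketch of that alternative follows; `sel_hi_lut`, `init_sel_hi_lut`, and `build_sel64` are hypothetical names, not code from the PR.

```c
#include <stdint.h>

/* Hypothetical 128-entry LUT: each 7-bit pattern of mask bits 8-14
 * expands to the all-ones/all-zero contents of selection bytes 1-7. */
static uint64_t sel_hi_lut[128];

static void init_sel_hi_lut(void)
{
    int m, i;
    for (m = 0; m < 128; m++) {
        uint64_t v = 0;
        for (i = 0; i < 7; i++)
            if (m & (1 << i))
                v |= 0xFFull << (8 * (i + 1));
        sel_hi_lut[m] = v;
    }
}

/* Byte 0 keeps its per-bit low 8 mask bits; bytes 1-7 come from the LUT. */
static uint64_t build_sel64(uint16_t mask)
{
    return (uint64_t)(mask & 0xFF) | sel_hi_lut[(mask >> 8) & 0x7F];
}
```

The table is only 1 KiB and can be filled once at startup, trading a tiny amount of cache footprint for fewer per-call instructions.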
```c
/* Multiply each lane by a power-of-2 weight and horizontal add.
 * Weights: 1, 2, 4, 8, 16, 32, 64, 128 */
static const uint8_t weights[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
uint8x8_t vw = vld1_u8(weights);

/* AND with weights (0xFF & weight = weight, 0 & weight = 0) */
uint8x8_t vbits = vand_u8(vcmp, vw);

/* Pairwise add to collapse 8 bytes -> 4 -> 2 -> 1 */
uint8x8_t sum1 = vpadd_u8(vbits, vbits); /* 4 sums */
uint8x8_t sum2 = vpadd_u8(sum1, sum1);   /* 2 sums */
uint8x8_t sum3 = vpadd_u8(sum2, sum2);   /* 1 sum */

return vget_lane_u8(sum3, 0);
```
For a function intended to be a hot-path primitive, loading weights from memory and doing three vpadd stages is relatively heavy. A lighter approach is to avoid the memory load by using an immediate vector for weights (so the compiler can constant-fold), or use a bit-extraction strategy (e.g., shift the compare result to 0/1 per lane then reduce) to cut instruction count.
Suggested change (replacing the weights load + vpadd chain above):

```c
/* Convert compare lanes from 0xFF/0x00 to 1/0, then pack into the
 * result byte directly to avoid the weights load and vpadd chain. */
uint8x8_t vbits = vshr_n_u8(vcmp, 7);
return (uint8_t)(
    (vget_lane_u8(vbits, 0) << 0) |
    (vget_lane_u8(vbits, 1) << 1) |
    (vget_lane_u8(vbits, 2) << 2) |
    (vget_lane_u8(vbits, 3) << 3) |
    (vget_lane_u8(vbits, 4) << 4) |
    (vget_lane_u8(vbits, 5) << 5) |
    (vget_lane_u8(vbits, 6) << 6) |
    (vget_lane_u8(vbits, 7) << 7));
```
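Either variant must satisfy the same contract: bit i of the result is set exactly when byte i of the two operands match. A portable cross-check for that contract can be written in plain C (a reference sketch; `ref_dcomp_scalar` is a hypothetical name):

```c
#include <stdint.h>

/* Portable reference for the 8-byte equality comparator: bit i of the
 * result is set iff byte i of a equals byte i of b. */
static uint8_t ref_dcomp_scalar(uint64_t a, uint64_t b)
{
    uint8_t r = 0;
    int i;
    for (i = 0; i < 8; i++)
        if (((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF))
            r |= (uint8_t)(1 << i);
    return r;
}
```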
```c
static double elapsed_ns(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e9
         + (double)(end.tv_nsec - start.tv_nsec);
}
```
clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.
```c
clock_gettime(CLOCK_MONOTONIC, &t0);
for (i = 0; i < BENCH_ITERS; i++)
    sink += ref_lfu(0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL, (uint8_t)(i & 0x0F));
clock_gettime(CLOCK_MONOTONIC, &t1);
```
clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.
…e, cleaner intrinsics
- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows), fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 40d942a to 1b4a86a.
…, lighter NEON dcomp
- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH) variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```makefile
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
```
The auto-selection logic forces the NEON implementation whenever ARCH or platform matches arm, even when HAVE_NEON is not set. This will break builds on ARM targets without NEON (e.g., the generic ARM block in Makefile sets ARCH = arm but does not guarantee -mfpu=neon / NEON support). Restrict NEON selection to aarch64/arm64 or HAVE_NEON=1 (and avoid matching plain arm unless NEON is explicitly enabled).
Suggested change (replacing the block quoted above):

```makefile
# ARM targets: prefer NEON only when the target configuration says it is
# available, or when targeting architectures/platforms where NEON is implied.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
```
```makefile
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
```
When selecting the SSE2 implementation for 32-bit x86 targets (x86/i686/i386), the build may fail unless the compiler is explicitly told to enable SSE2 intrinsics (e.g., -msse2 for GCC/Clang). Currently the Makefiles choose blitter_simd_sse2.c but do not add any corresponding compile flags. Consider either (1) limiting automatic SSE2 selection to x86_64 only, or (2) appending -msse2 (and any required defines) when building the SSE2 source on 32-bit x86.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```makefile
# Native build fallback: auto-detect from host architecture
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
```
The uname -m auto-detection at the end will select the SIMD source based on the build host, which can break cross-compiles to non-x86/ARM targets (e.g. PS3/PSP/etc.) by pulling in SSE2/NEON intrinsics for the wrong architecture. Consider removing the uname fallback or gating it behind a condition that guarantees a native build (or an explicit BLITTER_SIMD setting).
Suggested change (drop the host auto-detection, keeping only the closing `endif`):

```diff
-# Native build fallback: auto-detect from host architecture
-ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
-BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
-endif
-ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
-BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
-endif
-endif
+endif
```
```diff
 uint64_t func2 = funcmask[(lfu_func >> 2) & 0x01];
 uint64_t func3 = funcmask[(lfu_func >> 3) & 0x01];
-uint64_t lfu = (~srcd & ~dstd & func0) | (~srcd & dstd & func1) | (srcd & ~dstd & func2) | (srcd & dstd & func3);
+uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
```
These calls go through function pointers (blitter_simd_ops.*), which introduces an indirect-call overhead on a very hot path and contradicts the PR description's “zero runtime dispatch overhead” claim unless LTO/devirtualization is guaranteed. If the goal is truly zero overhead, consider selecting the implementation via static inline functions/macros at compile time (or exposing direct symbols) so the compiler can inline/constant-fold.
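The static-inline route the comment describes can be sketched directly from the truth-table logic shown in the diff above. This is an illustrative sketch, not the PR's code; `blitter_simd_lfu_inline` is a hypothetical name:

```c
#include <stdint.h>

/* Hypothetical static-inline form of the LFU truth table, so the compiler
 * can inline it at every call site instead of going through a pointer.
 * lfu_func bit k enables the minterm for (src,dst) combination k. */
static inline uint64_t blitter_simd_lfu_inline(uint64_t srcd, uint64_t dstd,
                                               uint8_t lfu_func)
{
    static const uint64_t funcmask[2] = { 0, ~(uint64_t)0 };
    uint64_t f0 = funcmask[lfu_func & 0x01];
    uint64_t f1 = funcmask[(lfu_func >> 1) & 0x01];
    uint64_t f2 = funcmask[(lfu_func >> 2) & 0x01];
    uint64_t f3 = funcmask[(lfu_func >> 3) & 0x01];
    return (~srcd & ~dstd & f0) | (~srcd & dstd & f1) |
           ( srcd & ~dstd & f2) | ( srcd & dstd & f3);
}
```

Placed in `blitter_simd.h` behind the same compile-time arch selection, this keeps the one-implementation-per-build property while letting the optimizer fold the function into the inner loop.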
Force-pushed from c143045 to 70bebb7.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
```c
/* Build 8-byte selection mask */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint8x8_t vmask = vld1_u8(sel);
```
neon_byte_merge builds an 8-byte mask in a stack array and then does vld1_u8(sel) every call. Since this function is intended for a hot blitter datapath, the extra stores + load can materially reduce the NEON speedup. Consider computing the 64-bit sel value purely in registers (similar to the SSE2 version) and using vcreate_u64 + reinterpret instead of a memory load.
Suggested change (replacing the stack array + vld1_u8 above):

```c
/* Build 8-byte selection mask entirely in registers. Byte 0 uses the
 * low 8 bits directly; bytes 1..7 are all-ones/all-zero from mask bits. */
uint64_t sel =
    ((uint64_t)(mask & 0x00FF)) |
    ((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
    ((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
    ((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
    ((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
    ((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
    ((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
    ((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);
uint8x8_t vmask = vreinterpret_u8_u64(vcreate_u64(sel));
```
```makefile
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
# 32-bit x86 may need explicit SSE2 enable
ifneq (,$(filter i686 i386 x86,$(ARCH)))
CFLAGS += -msse2
endif
endif
```
-msse2 is only appended when ARCH matches i686/i386/x86, but the SSE2 source can be selected via platform matches, MSYSTEM, or uname -m even when ARCH is unset. On 32-bit x86 toolchains (e.g., MinGW32), this can cause compilation failures due to SSE2 intrinsics being used without enabling SSE2. Consider tying the -msse2 flag to the condition that selects the SSE2 source for 32-bit targets (e.g., based on platform, uname -m, or MSYSTEM).
```c
uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
bool mir_bit, mir_byte;
```
These hot-path calls go through function pointers (blitter_simd_ops.*). Unless LTO is reliably enabled for all builds, this introduces an indirect call that the compiler typically cannot inline/devirtualize, which conflicts with the stated goal of “zero runtime dispatch overhead” and can materially reduce the SIMD speedup in inner loops. Consider using direct symbols (e.g., blitter_simd_lfu() etc.) selected at link-time by compiling exactly one implementation file, or provide static inline wrappers that can be inlined when the implementation is known.
```makefile
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
```
The NEON implementation is selected for $(platform) values matching armv7% and armhf, but those do not universally guarantee NEON availability or the necessary compiler flags (e.g., -mfpu=neon). This can break builds on ARM targets without NEON (or with toolchains not defaulting to NEON). Consider selecting blitter_simd_neon.c only when HAVE_NEON=1 (or when targeting aarch64/arm64 / iOS-arm64 where NEON is mandatory), otherwise fall back to scalar unless BLITTER_SIMD=neon is explicitly requested.
Suggested change (replacing the block quoted above):

```makefile
# ARM targets: prefer NEON only when the target configuration says it is available
# or when targeting ARM64/AArch64 where NEON is mandatory.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it.
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8%,$(platform)))
```
```makefile
# Native build fallback: auto-detect from host when no SIMD was selected above.
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
endif
```
The uname -m fallback can pick a SIMD implementation based on the build host even when platform targets a different architecture (cross-compiles like vita, ps3, etc. don't consistently set ARCH here). That can accidentally compile the SSE2 file for non-x86 targets. Consider restricting host auto-detection to true native builds (e.g., when platform is unix/osx/win or platform == system_platform) and otherwise default to the scalar implementation unless the target arch is explicitly detected via platform/ARCH or BLITTER_SIMD.
Extract four hot-path operations from blitter.c DATA() into a dispatch interface (blitter_simd_ops_t) with arch-specific implementations:
- LFU: 64-bit truth table (4 boolean functions)
- DCOMP: 8-byte equality comparator
- ZCOMP: 4x 16-bit Z-buffer comparison
- byte_merge: per-byte mask select for pixel and Z data

Files:
- blitter_simd.h: dispatch struct typedef
- blitter_simd_scalar.c: portable C reference (fallback)
- blitter_simd_sse2.c: x86/x64 SSE2 intrinsics
- blitter_simd_neon.c: ARM NEON intrinsics
- test/test_blitter_simd.c: 40k+ bit-exactness tests + benchmark

Makefile.common selects implementation at compile time based on target architecture (zero runtime dispatch cost).

Addresses libretro#99.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove plain 'arm' from NEON ARCH filter (armv5/armv6 lack NEON)
- Don't initialize BLITTER_SIMD_SRC to scalar — let auto-detection run by keeping it empty until a match is found
- Add -msse2 flag for 32-bit x86 targets
- Add scalar fallback for exotic platforms with no SIMD support
- uname -m fallback now only fires when no prior rule matched

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 70bebb7 to b51b63f.
Makefile.common:
- Restrict armv7/armhf NEON auto-selection to HAVE_NEON=1 only
- Guard uname -m fallback to native-build platforms (unix/osx/win)
- Add -msse2 flag consistently for all 32-bit x86 selection paths

blitter_simd_neon.c:
- Build neon_byte_merge mask in registers via vcreate_u64 instead of stack array + vld1_u8 memory load

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs the test harness for both the platform-native SIMD implementation and the scalar fallback on every build, covering SSE2 (linux-x64, windows), NEON (linux-arm64, macos-arm64), and a scalar cross-check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Extracts four hot-path operations from `blitter.c` `DATA()` into a dispatch interface (`blitter_simd_ops_t`) with architecture-specific implementations
- Implementation selected at compile time in `Makefile.common` — zero runtime dispatch overhead

Operations accelerated
- DCOMP: `cmpeq_epi8` + `movemask` / `vceq_u8`
- ZCOMP: `cmpgt_epi16` (biased) / `vclt_u16`
- byte_merge: `blendv` / `vbsl`

Files
- `src/blitter_simd.h` — dispatch struct typedef
- `src/blitter_simd_scalar.c` — portable C reference (fallback for any platform)
- `src/blitter_simd_sse2.c` — x86/x64 SSE2 intrinsics
- `src/blitter_simd_neon.c` — ARM NEON intrinsics
- `src/blitter.c` — 5 call sites replaced (net -57 lines)
- `Makefile.common` — arch-conditional source selection
- `test/test_blitter_simd.c` — bit-exactness + benchmark

Addresses #99.

Test plan
- Bit-exactness tests and micro-benchmark (`./test/test_blitter_simd --bench`)

🤖 Generated with Claude Code
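The "(biased)" note next to the 16-bit ZCOMP compare reflects that SSE2 has no unsigned 16-bit greater-than: both operands are XORed with 0x8000 to map the unsigned range onto the signed range before the signed `cmpgt_epi16`. A scalar model of the trick (function name hypothetical, not from the PR):

```c
#include <stdint.h>

/* Hypothetical scalar model of the biased compare: XOR both operands with
 * 0x8000 so unsigned 16-bit ordering becomes signed ordering, then use the
 * signed greater-than that cmpgt_epi16 provides. */
static int ugt16_biased(uint16_t a, uint16_t b)
{
    int16_t sa = (int16_t)(a ^ 0x8000);
    int16_t sb = (int16_t)(b ^ 0x8000);
    return sa > sb;
}
```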