Skip to content

SIMD-accelerated blitter operations (SSE2, NEON)#101

Merged
JoeMatt merged 7 commits intolibretro:masterfrom
Provenance-Emu:libretro/feature/simd-blitter
Apr 16, 2026
Merged

SIMD-accelerated blitter operations (SSE2, NEON)#101
JoeMatt merged 7 commits intolibretro:masterfrom
Provenance-Emu:libretro/feature/simd-blitter

Conversation

@JoeMatt
Copy link
Copy Markdown
Collaborator

@JoeMatt JoeMatt commented Apr 3, 2026

Summary

  • Extracts four hot-path operations from blitter.c DATA() into a dispatch interface (blitter_simd_ops_t) with architecture-specific implementations
  • SSE2 (x86/x64), NEON (ARM/ARM64), and scalar (portable fallback) implementations
  • Compile-time architecture selection via Makefile.common — zero runtime dispatch overhead
  • 40,000+ bit-exactness tests with performance benchmarking

Operations accelerated

Operation What it does SIMD approach
LFU 64-bit boolean truth table (16 functions) Parallel AND/OR/XOR
DCOMP 8-byte equality comparator cmpeq_epi8 + movemask / vceq_u8
ZCOMP 4×16-bit Z-buffer comparison cmpgt_epi16 (biased) / vclt_u16
byte_merge Per-byte mask select blendv / vbsl

Files

  • src/blitter_simd.h — dispatch struct typedef
  • src/blitter_simd_scalar.c — portable C reference (fallback for any platform)
  • src/blitter_simd_sse2.c — x86/x64 SSE2 intrinsics
  • src/blitter_simd_neon.c — ARM NEON intrinsics
  • src/blitter.c — 5 call sites replaced (net -57 lines)
  • Makefile.common — arch-conditional source selection
  • test/test_blitter_simd.c — bit-exactness + benchmark

Addresses #99.

Test plan

  • Core builds on macOS ARM64 (NEON selected)
  • 40,067 bit-exactness tests pass (all ops, all edge cases, 10k random inputs each)
  • Performance benchmark runs (./test/test_blitter_simd --bench)
  • CI build on Ubuntu x86_64 (SSE2 selected)
  • CI build on Windows (SSE2 via MSYS2)
  • Regression test with JagNICCC2000 baseline (pixel-identical output)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 3, 2026 05:47
@JoeMatt JoeMatt linked an issue Apr 3, 2026 that may be closed by this pull request
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a compile-time-selected SIMD dispatch interface for blitter hot-path operations, with SSE2/NEON implementations and a scalar fallback, plus a new bit-exactness + benchmarking test harness.

Changes:

  • Adds blitter_simd_ops_t interface and architecture-specific implementations (SSE2, NEON, scalar).
  • Replaces several hot-path logic blocks in blitter.c with calls through blitter_simd_ops.
  • Adds a dedicated test/benchmark program and Makefile logic to select the implementation at build time.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test/test_blitter_simd.c Adds bit-exactness tests and a micro-benchmark for the SIMD ops.
src/blitter_simd_sse2.c Implements LFU/DCOMP/ZCOMP/byte_merge using SSE2 intrinsics.
src/blitter_simd_neon.c Implements LFU/DCOMP/ZCOMP/byte_merge using ARM NEON intrinsics.
src/blitter_simd_scalar.c Provides the portable scalar fallback/reference implementation.
src/blitter_simd.h Declares the SIMD ops dispatch struct and exported instance.
src/blitter.c Switches existing blitter logic to call the new SIMD ops.
Makefile.common Adds arch-conditional selection of the SIMD implementation source file.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile.common Outdated
Comment on lines +53 to +76
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c

# x86/x64: use SSE2 (baseline for all x86_64, available since Pentium 4)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif

# ARM64 (AArch64): NEON is always available
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# iOS/tvOS ARM64 cross-compilation
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# ARMv7 with NEON (set by Makefile for classic_armv7 targets)
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif

Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selecting the implementation based on uname -m uses the build host architecture, which is incorrect for common cross-compilation scenarios (e.g., building an ARM target on an x86 host). Prefer selecting via existing target variables ($(platform), $(ARCH)) and/or compiler target flags, and allow an explicit override (e.g., BLITTER_SIMD=sse2|neon|scalar) so packaging/CI can force the correct choice deterministically.

Suggested change
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
# x86/x64: use SSE2 (baseline for all x86_64, available since Pentium 4)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# ARM64 (AArch64): NEON is always available
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# iOS/tvOS ARM64 cross-compilation
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# ARMv7 with NEON (set by Makefile for classic_armv7 targets)
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# BLITTER_SIMD may be set explicitly to one of: scalar, sse2, neon.
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
ifneq ($(BLITTER_SIMD),)
ifeq (,$(filter scalar sse2 neon,$(BLITTER_SIMD)))
$(error Unsupported BLITTER_SIMD '$(BLITTER_SIMD)'; expected one of: scalar sse2 neon)
endif
endif
ifeq ($(BLITTER_SIMD),sse2)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
else ifeq ($(BLITTER_SIMD),neon)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
else ifeq ($(BLITTER_SIMD),scalar)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_scalar.c
else
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
# MSYS2/MinGW on x86_64
ifneq (,$(filter MINGW64%,$(MSYSTEM)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif

Copilot uses AI. Check for mistakes.
Comment thread src/blitter_simd.h Outdated
uint8_t (*zcomp)(uint64_t srcz, uint64_t dstz, uint8_t zmode);

/* Byte Mask Merge: select bytes from src or dst based on 16-bit mask.
* Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mask description is inaccurate: byte 0 is controlled by mask bits 0–7 (per-bit within the byte), not just bit 0. Updating this comment matters because implementations (and tests) rely on the full low 8 bits having meaning.

Suggested change
* Bit 0 controls byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.
* Bits 0-7 control byte 0 (per-bit within byte 0), bits 8-14 control bytes 1-7.

Copilot uses AI. Check for mistakes.
Comment thread src/blitter_simd_sse2.c Outdated
Comment on lines +127 to +142
/* Build an 8-byte selection mask:
* byte 0 = low 8 bits of mask (per-bit)
* bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint64_t sel64;
__builtin_memcpy(&sel64, sel, 8);

Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hot-path function does a stack array fill + memcpy every call to build the mask, which can erode SIMD gains. Consider constructing sel64 directly with bit operations (or using a small lookup table for bits 8–14 since there are only 128 possibilities) and then loading it into an SSE register, avoiding the per-call byte array and memcpy.

Suggested change
/* Build an 8-byte selection mask:
* byte 0 = low 8 bits of mask (per-bit)
* bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;
uint64_t sel64;
__builtin_memcpy(&sel64, sel, 8);
/* Build an 8-byte selection mask directly in a 64-bit value:
* byte 0 = low 8 bits of mask (per-bit)
* bytes 1-7 = 0xFF or 0x00 based on mask bits 8-14 */
uint64_t sel64 =
((uint64_t)(mask & 0x00FF)) |
((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);

Copilot uses AI. Check for mistakes.
Comment thread src/blitter_simd_neon.c Outdated
Comment on lines +58 to +71
* Multiply each lane by a power-of-2 weight and horizontal add.
* Weights: 1, 2, 4, 8, 16, 32, 64, 128 */
static const uint8_t weights[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
uint8x8_t vw = vld1_u8(weights);

/* AND with weights (0xFF & weight = weight, 0 & weight = 0) */
uint8x8_t vbits = vand_u8(vcmp, vw);

/* Pairwise add to collapse 8 bytes -> 4 -> 2 -> 1 */
uint8x8_t sum1 = vpadd_u8(vbits, vbits); /* 4 sums */
uint8x8_t sum2 = vpadd_u8(sum1, sum1); /* 2 sums */
uint8x8_t sum3 = vpadd_u8(sum2, sum2); /* 1 sum */

return vget_lane_u8(sum3, 0);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a function intended to be a hot-path primitive, loading weights from memory and doing three vpadd stages is relatively heavy. A lighter approach is to avoid the memory load by using an immediate vector for weights (so the compiler can constant-fold), or use a bit-extraction strategy (e.g., shift the compare result to 0/1 per lane then reduce) to cut instruction count.

Suggested change
* Multiply each lane by a power-of-2 weight and horizontal add.
* Weights: 1, 2, 4, 8, 16, 32, 64, 128 */
static const uint8_t weights[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
uint8x8_t vw = vld1_u8(weights);
/* AND with weights (0xFF & weight = weight, 0 & weight = 0) */
uint8x8_t vbits = vand_u8(vcmp, vw);
/* Pairwise add to collapse 8 bytes -> 4 -> 2 -> 1 */
uint8x8_t sum1 = vpadd_u8(vbits, vbits); /* 4 sums */
uint8x8_t sum2 = vpadd_u8(sum1, sum1); /* 2 sums */
uint8x8_t sum3 = vpadd_u8(sum2, sum2); /* 1 sum */
return vget_lane_u8(sum3, 0);
* Convert compare lanes from 0xFF/0x00 to 1/0, then pack into the
* result byte directly to avoid the weights load and vpadd chain. */
uint8x8_t vbits = vshr_n_u8(vcmp, 7);
return (uint8_t)(
(vget_lane_u8(vbits, 0) << 0) |
(vget_lane_u8(vbits, 1) << 1) |
(vget_lane_u8(vbits, 2) << 2) |
(vget_lane_u8(vbits, 3) << 3) |
(vget_lane_u8(vbits, 4) << 4) |
(vget_lane_u8(vbits, 5) << 5) |
(vget_lane_u8(vbits, 6) << 6) |
(vget_lane_u8(vbits, 7) << 7));

Copilot uses AI. Check for mistakes.
Comment thread test/test_blitter_simd.c Outdated
Comment on lines +300 to +304
static double elapsed_ns(struct timespec start, struct timespec end)
{
return (double)(end.tv_sec - start.tv_sec) * 1e9
+ (double)(end.tv_nsec - start.tv_nsec);
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.

Copilot uses AI. Check for mistakes.
Comment thread test/test_blitter_simd.c Outdated
Comment on lines +313 to +316
clock_gettime(CLOCK_MONOTONIC, &t0);
for (i = 0; i < BENCH_ITERS; i++)
sink += ref_lfu(0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL, (uint8_t)(i & 0x0F));
clock_gettime(CLOCK_MONOTONIC, &t1);
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock_gettime(CLOCK_MONOTONIC, ...) is not available on all platforms/toolchains (notably typical Windows builds), but the PR description calls out Windows CI as a target. Consider adding a small portability shim for timing (or #ifdef-guard the benchmark path) so the test program still builds everywhere even if benchmarking is disabled on some platforms.

Copilot uses AI. Check for mistakes.
JoeMatt added a commit to Provenance-Emu/virtualjaguar-libretro that referenced this pull request Apr 3, 2026
…e, cleaner intrinsics

- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows),
  fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
JoeMatt added a commit to Provenance-Emu/virtualjaguar-libretro that referenced this pull request Apr 3, 2026
…e, cleaner intrinsics

- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows),
  fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JoeMatt JoeMatt force-pushed the libretro/feature/simd-blitter branch from 40d942a to 1b4a86a Compare April 3, 2026 07:07
JoeMatt added a commit to Provenance-Emu/virtualjaguar-libretro that referenced this pull request Apr 3, 2026
…, lighter NEON dcomp

- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH)
  variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with
  vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JoeMatt JoeMatt requested a review from Copilot April 3, 2026 07:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile.common Outdated
Comment on lines +68 to +78
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-selection logic forces the NEON implementation whenever ARCH or platform matches arm, even when HAVE_NEON is not set. This will break builds on ARM targets without NEON (e.g., the generic ARM block in Makefile sets ARCH = arm but does not guarantee -mfpu=neon / NEON support). Restrict NEON selection to aarch64/arm64 or HAVE_NEON=1 (and avoid matching plain arm unless NEON is explicitly enabled).

Suggested change
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64 arm,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf arm,$(platform)))
# ARM targets: prefer NEON only when the target configuration says it is
# available, or when targeting architectures/platforms where NEON is implied.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))

Copilot uses AI. Check for mistakes.
Comment thread Makefile.common
Comment on lines +82 to +88
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter x86_64 x86 i686 i386 win-x64 win32,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When selecting the SSE2 implementation for 32-bit x86 targets (x86/i686/i386), the build may fail unless the compiler is explicitly told to enable SSE2 intrinsics (e.g., -msse2 for GCC/Clang). Currently the Makefiles choose blitter_simd_sse2.c but do not add any corresponding compile flags. Consider either (1) limiting automatic SSE2 selection to x86_64 only, or (2) appending -msse2 (and any required defines) when building the SSE2 source on 32-bit x86.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile.common
Comment on lines +93 to +101

# Native build fallback: auto-detect from host architecture
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uname -m auto-detection at the end will select the SIMD source based on the build host, which can break cross-compiles to non-x86/ARM targets (e.g. PS3/PSP/etc.) by pulling in SSE2/NEON intrinsics for the wrong architecture. Consider removing the uname fallback or gating it behind a condition that guarantees a native build (or an explicit BLITTER_SIMD setting).

Suggested change
# Native build fallback: auto-detect from host architecture
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
endif

Copilot uses AI. Check for mistakes.
Comment thread src/blitter.c
uint64_t func2 = funcmask[(lfu_func >> 2) & 0x01];
uint64_t func3 = funcmask[(lfu_func >> 3) & 0x01];
uint64_t lfu = (~srcd & ~dstd & func0) | (~srcd & dstd & func1) | (srcd & ~dstd & func2) | (srcd & dstd & func3);
uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These calls go through function pointers (blitter_simd_ops.*), which introduces an indirect-call overhead on a very hot path and contradicts the PR description's “zero runtime dispatch overhead” claim unless LTO/devirtualization is guaranteed. If the goal is truly zero overhead, consider selecting the implementation via static inline functions/macros at compile time (or exposing direct symbols) so the compiler can inline/constant-fold.

Copilot uses AI. Check for mistakes.
JoeMatt added a commit that referenced this pull request Apr 15, 2026
…er intrinsics

- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows),
  fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
JoeMatt added a commit that referenced this pull request Apr 15, 2026
…r NEON dcomp

- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH)
  variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with
  vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
JoeMatt added a commit to Provenance-Emu/virtualjaguar-libretro that referenced this pull request Apr 15, 2026
…e, cleaner intrinsics

- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows),
  fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
JoeMatt added a commit to Provenance-Emu/virtualjaguar-libretro that referenced this pull request Apr 15, 2026
…, lighter NEON dcomp

- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH)
  variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with
  vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JoeMatt JoeMatt force-pushed the libretro/feature/simd-blitter branch from c143045 to 70bebb7 Compare April 15, 2026 23:01
@JoeMatt JoeMatt requested a review from Copilot April 16, 2026 03:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/blitter_simd_neon.c Outdated
Comment on lines +111 to +122
/* Build 8-byte selection mask */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;

uint8x8_t vmask = vld1_u8(sel);
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neon_byte_merge builds an 8-byte mask in a stack array and then does vld1_u8(sel) every call. Since this function is intended for a hot blitter datapath, the extra stores + load can materially reduce the NEON speedup. Consider computing the 64-bit sel value purely in registers (similar to the SSE2 version) and using vcreate_u64 + reinterpret instead of a memory load.

Suggested change
/* Build 8-byte selection mask */
uint8_t sel[8];
sel[0] = (uint8_t)(mask & 0xFF);
sel[1] = (mask & 0x0100) ? 0xFF : 0x00;
sel[2] = (mask & 0x0200) ? 0xFF : 0x00;
sel[3] = (mask & 0x0400) ? 0xFF : 0x00;
sel[4] = (mask & 0x0800) ? 0xFF : 0x00;
sel[5] = (mask & 0x1000) ? 0xFF : 0x00;
sel[6] = (mask & 0x2000) ? 0xFF : 0x00;
sel[7] = (mask & 0x4000) ? 0xFF : 0x00;
uint8x8_t vmask = vld1_u8(sel);
/* Build 8-byte selection mask entirely in registers. Byte 0 uses the
* low 8 bits directly; bytes 1..7 are all-ones/all-zero from mask bits. */
uint64_t sel =
((uint64_t)(mask & 0x00FF)) |
((uint64_t)((mask & 0x0100) ? 0xFF : 0x00) << 8) |
((uint64_t)((mask & 0x0200) ? 0xFF : 0x00) << 16) |
((uint64_t)((mask & 0x0400) ? 0xFF : 0x00) << 24) |
((uint64_t)((mask & 0x0800) ? 0xFF : 0x00) << 32) |
((uint64_t)((mask & 0x1000) ? 0xFF : 0x00) << 40) |
((uint64_t)((mask & 0x2000) ? 0xFF : 0x00) << 48) |
((uint64_t)((mask & 0x4000) ? 0xFF : 0x00) << 56);
uint8x8_t vmask = vreinterpret_u8_u64(vcreate_u64(sel));

Copilot uses AI. Check for mistakes.
Comment thread Makefile.common
Comment on lines +83 to +90
# x86/x64 targets: use SSE2.
ifneq (,$(filter x86_64 x86 i686 i386,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
# 32-bit x86 may need explicit SSE2 enable
ifneq (,$(filter i686 i386 x86,$(ARCH)))
CFLAGS += -msse2
endif
endif
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-msse2 is only appended when ARCH matches i686/i386/x86, but the SSE2 source can be selected via platform matches, MSYSTEM, or uname -m even when ARCH is unset. On 32-bit x86 toolchains (e.g., MinGW32), this can cause compilation failures due to SSE2 intrinsics being used without enabling SSE2. Consider tying the -msse2 flag to the condition that selects the SSE2 source for 32-bit targets (e.g., based on platform, uname -m, or MSYSTEM).

Copilot uses AI. Check for mistakes.
Comment thread src/blitter.c
Comment on lines +2837 to 2838
uint64_t lfu = blitter_simd_ops.lfu(srcd, dstd, lfu_func);
bool mir_bit, mir_byte;
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These hot-path calls go through function pointers (blitter_simd_ops.*). Unless LTO is reliably enabled for all builds, this introduces an indirect call that the compiler typically cannot inline/devirtualize, which conflicts with the stated goal of “zero runtime dispatch overhead” and can materially reduce the SIMD speedup in inner loops. Consider using direct symbols (e.g., blitter_simd_lfu() etc.) selected at link-time by compiling exactly one implementation file, or provide static inline wrappers that can be inlined when the implementation is known.

Copilot uses AI. Check for mistakes.
Comment thread Makefile.common Outdated
Comment on lines +68 to +79
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NEON implementation is selected for $(platform) values matching armv7% and armhf, but those do not universally guarantee NEON availability or the necessary compiler flags (e.g., -mfpu=neon). This can break builds on ARM targets without NEON (or with toolchains not defaulting to NEON). Consider selecting blitter_simd_neon.c only when HAVE_NEON=1 (or when targeting aarch64/arm64 / iOS-arm64 where NEON is mandatory), otherwise fall back to scalar unless BLITTER_SIMD=neon is explicitly requested.

Suggested change
# ARM targets: prefer NEON when the target configuration says it is available.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8% armv7% armhf,$(platform)))
# ARM targets: prefer NEON only when the target configuration says it is available
# or when targeting ARM64/AArch64 where NEON is mandatory.
ifeq ($(HAVE_NEON), 1)
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter ios-arm64 tvos-arm64,$(platform)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
# Only aarch64/arm64 guaranteed to have NEON; plain 'arm' may lack it.
ifneq (,$(filter aarch64 arm64,$(ARCH)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
ifneq (,$(filter arm64 armv8%,$(platform)))

Copilot uses AI. Check for mistakes.
Comment thread Makefile.common Outdated
Comment on lines +99 to +110
# Native build fallback: auto-detect from host when no SIMD was selected above.
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter x86_64 x86 i686 i386,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_sse2.c
endif
endif
ifeq ($(BLITTER_SIMD_SRC),)
ifneq (,$(filter aarch64 arm64,$(shell uname -m 2>/dev/null)))
BLITTER_SIMD_SRC := $(CORE_DIR)/src/blitter_simd_neon.c
endif
endif
endif
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uname -m fallback can pick a SIMD implementation based on the build host even when platform targets a different architecture (cross-compiles like vita, ps3, etc. don't consistently set ARCH here). That can accidentally compile the SSE2 file for non-x86 targets. Consider restricting host auto-detection to true native builds (e.g., when platform is unix/osx/win or platform == system_platform) and otherwise default to the scalar implementation unless the target arch is explicitly detected via platform/ARCH or BLITTER_SIMD.

Copilot uses AI. Check for mistakes.
JoeMatt and others added 5 commits April 16, 2026 00:13
Extract four hot-path operations from blitter.c DATA() into a
dispatch interface (blitter_simd_ops_t) with arch-specific
implementations:

- LFU: 64-bit truth table (4 boolean functions)
- DCOMP: 8-byte equality comparator
- ZCOMP: 4x 16-bit Z-buffer comparison
- byte_merge: per-byte mask select for pixel and Z data

Files:
- blitter_simd.h: dispatch struct typedef
- blitter_simd_scalar.c: portable C reference (fallback)
- blitter_simd_sse2.c: x86/x64 SSE2 intrinsics
- blitter_simd_neon.c: ARM NEON intrinsics
- test/test_blitter_simd.c: 40k+ bit-exactness tests + benchmark

Makefile.common selects implementation at compile time based on
target architecture (zero runtime dispatch cost).

Addresses libretro#99.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e, cleaner intrinsics

- Makefile.common: add BLITTER_SIMD=scalar|sse2|neon override for cross-compilation
- blitter_simd.h: clarify byte_merge mask bit semantics in comment
- blitter_simd_sse2.c: replace stack array + memcpy with direct bit ops in byte_merge
- blitter_simd_neon.c: replace static const array with inline vcreate_u8
- test_blitter_simd.c: portable TIMER_DECL/START/STOP/NS macros (POSIX + Windows),
  fix build instructions (link one impl, not both)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, lighter NEON dcomp

- Makefile.common: validate BLITTER_SIMD value, detect via $(platform)/$(ARCH)
  variables for cross-compilation, keep uname -m as native fallback only
- blitter_simd_neon.c: replace vcreate_u8 weights + vpadd chain with
  vshr_n_u8 + lane extraction for fewer instructions in dcomp
- test_blitter_simd.c: fix duplicate lines from rebase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove plain 'arm' from NEON ARCH filter (armv5/armv6 lack NEON)
- Don't initialize BLITTER_SIMD_SRC to scalar — let auto-detection
  run by keeping it empty until a match is found
- Add -msse2 flag for 32-bit x86 targets
- Add scalar fallback for exotic platforms with no SIMD support
- uname -m fallback now only fires when no prior rule matched

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JoeMatt JoeMatt force-pushed the libretro/feature/simd-blitter branch from 70bebb7 to b51b63f Compare April 16, 2026 04:14
JoeMatt and others added 2 commits April 16, 2026 00:29
Makefile.common:
- Restrict armv7/armhf NEON auto-selection to HAVE_NEON=1 only
- Guard uname -m fallback to native-build platforms (unix/osx/win)
- Add -msse2 flag consistently for all 32-bit x86 selection paths

blitter_simd_neon.c:
- Build neon_byte_merge mask in registers via vcreate_u64 instead of
  stack array + vld1_u8 memory load

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs the test harness for both the platform-native SIMD implementation
and the scalar fallback on every build, covering SSE2 (linux-x64,
windows), NEON (linux-arm64, macos-arm64), and scalar cross-check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JoeMatt JoeMatt merged commit dd44259 into libretro:master Apr 16, 2026
8 of 9 checks passed
@JoeMatt JoeMatt deleted the libretro/feature/simd-blitter branch April 16, 2026 04:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] SIMD-accelerated blitter

2 participants