refactor/march based reorganization by richyreachy · Pull Request #193 · alibaba/zvec

richyreachy · 2026-03-03T08:15:59Z

march based reorganization

greptile-apps · 2026-03-03T08:19:07Z

Greptile Summary

This PR performs a large-scale refactoring of SIMD math kernel files, splitting monolithic per-type files (e.g. euclidean_distance_matrix_fp32.cc) into separate _sse, _avx, _avx2, _avx512, _neon, and _dispatch translation units, each compiled with targeted -march flags via the new setup_compiler_march_for_x86 CMake helper.

Key issues found:

cmake/option.cmake:79 — Undefined variable in ARM march check: The _setup_armv8_march() function references undefined variable _ver instead of _arch in the compiler flag check, causing the validation to test an empty flag and trivially succeed.
src/ailego/CMakeLists.txt:98 — Unset MATH_MARCH_FLAG_NEON variable: The ARM code path uses ${MATH_MARCH_FLAG_NEON} in per-file compiler flags but never assigns the variable, resulting in empty compile flags for all ARM NEON/dispatch files instead of the intended -march=armv8-a+simd (or similar).
src/ailego/math/inner_product_matrix_fp16_dispatch.cc:145 — Dead code in FP16 sparse dispatch: The preprocessor checks __AVX__ before __AVX512FP16__. Since AVX512FP16 implies AVX (both are defined when compiling with -march=sapphirerapids or similar), the #elif defined(__AVX512FP16__) branch is unreachable, preventing the more optimal InnerProductSparseInSegmentAVX512FP16 path from ever being used.
src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc:67 — Off-by-one boundary check: After the 16-wide main loop, when exactly 8 elements remain, the condition dim + 8 < dimensionality is false and those elements fall through to the scalar loop instead of the 8-wide SIMD path. Use <= to capture that boundary case.

Confidence Score: 2/5

Not safe to merge — build system bugs will silently misconfigure ARM targets, and ISA dispatch/boundary check bugs degrade performance on x86.
Two CMake-level bugs (undefined _ver in ARM flag check, unset MATH_MARCH_FLAG_NEON) mean ARM SIMD files will be compiled without intended march flags. The inverted ISA priority in FP16 sparse dispatch makes the AVX512FP16 path permanently unreachable. The boundary check off-by-one skips the 8-wide SIMD path for edge cases. Together these affect correctness of the build system and performance of production code paths.
cmake/option.cmake and src/ailego/CMakeLists.txt (ARM march variable bugs) must be fixed before any ARM target builds; src/ailego/math/inner_product_matrix_fp16_dispatch.cc needs ISA check order corrected; src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc boundary condition fix.

_{Last reviewed commit: a353f42}

greptile-apps

_{60 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-03-03T08:19:14Z

src/ailego/CMakeLists.txt

+file(GLOB_RECURSE MATH_FILES_AVX512
+    ${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.c
+    ${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.c
+    ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.cc
+    ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.c
+)


Dispatch files should NOT be compiled with AVX512 flags. They contain runtime CPU detection logic and must be compiled with baseline flags (like the default broadwell) to run safely on all CPUs. Currently, these dispatch files will crash with illegal instruction errors on non-AVX512 CPUs.

Suggested change

file(GLOB_RECURSE MATH_FILES_AVX512

${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.cc

${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.c

${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.cc

${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.c

${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.cc

${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.c

)

file(GLOB_RECURSE MATH_FILES_AVX512

${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.cc

${CMAKE_CURRENT_SOURCE_DIR}/math/*_avx512.c

${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.cc

${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_avx512.c

)

file(GLOB_RECURSE MATH_FILES_DISPATCH

${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.cc

${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.c

)

greptile-apps · 2026-03-03T08:19:19Z

src/ailego/CMakeLists.txt

+foreach(MATH_FILE ${MATH_FILES_AVX512})
+    set_source_files_properties(
+        ${MATH_FILE}
+        PROPERTIES
+        COMPILE_FLAGS "${MATH_MARCH_FLAG_AVX512}"
+    )
+endforeach()


Need separate loop for dispatch files with baseline flags

Suggested change

foreach(MATH_FILE ${MATH_FILES_AVX512})

set_source_files_properties(

${MATH_FILE}

PROPERTIES

COMPILE_FLAGS "${MATH_MARCH_FLAG_AVX512}"

)

endforeach()

foreach(MATH_FILE ${MATH_FILES_AVX512})

set_source_files_properties(

${MATH_FILE}

PROPERTIES

COMPILE_FLAGS "${MATH_MARCH_FLAG_AVX512}"

)

endforeach()

# Dispatch files should use baseline flags for runtime CPU detection

foreach(MATH_FILE ${MATH_FILES_DISPATCH})

set_source_files_properties(

${MATH_FILE}

PROPERTIES

COMPILE_FLAGS "-march=broadwell"

)

endforeach()

iaojnh · 2026-03-06T09:19:43Z

@greptile

greptile-apps · 2026-03-06T09:30:11Z

cmake/option.cmake

-  message(WARNING "No ARMv8 architecture flag supported by compiler.")
+function(_setup_armv8_march)
+  set(_arch "armv8")
+  check_c_compiler_flag("-march=${_ver}" _COMP_SUPP_${_arch})


Undefined variable _ver breaks ARM architecture check

_ver is undefined in this scope — it was the loop variable from the old _detect_armv8_best function. The call therefore expands to check_c_compiler_flag("-march=" ...), which tests an empty flag rather than -march=armv8. Most compilers will accept an empty flag and succeed, so _COMP_SUPP_armv8 will be TRUE regardless of whether armv8 is actually supported — masking any real incompatibility.

Suggested change

check_c_compiler_flag("-march=${_ver}" _COMP_SUPP_${_arch})

check_c_compiler_flag("-march=${_arch}" _COMP_SUPP_${_arch})

greptile-apps · 2026-03-06T09:30:12Z

src/ailego/CMakeLists.txt

+elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64|ARM64")
+  file(GLOB_RECURSE MATH_FILES_NEON
+      ${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.cc
+      ${CMAKE_CURRENT_SOURCE_DIR}/math/*_dispatch.c
+      ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_dispatch.cc
+      ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_dispatch.c
+      ${CMAKE_CURRENT_SOURCE_DIR}/math/*_neon.cc
+      ${CMAKE_CURRENT_SOURCE_DIR}/math/*_neon.c
+      ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_neon.cc
+      ${CMAKE_CURRENT_SOURCE_DIR}/math_batch/*_neon.c
+  )
+
+  foreach(MATH_FILE ${MATH_FILES_NEON})
+      set_source_files_properties(
+          ${MATH_FILE}
+          PROPERTIES
+          COMPILE_FLAGS "${MATH_MARCH_FLAG_NEON}"
+      )
+  endforeach()
+endif()


MATH_MARCH_FLAG_NEON is never set — ARM per-file flags are silently empty

For ARM builds, MATH_MARCH_FLAG_NEON is never assigned anywhere (the x86 path has setup_compiler_march_for_x86(...) which sets the three flag variables, but there is no equivalent call for ARM). set_source_files_properties(...COMPILE_FLAGS "${MATH_MARCH_FLAG_NEON}") therefore expands to an empty string, so all NEON/dispatch source files are compiled with whatever the global default flags happen to be instead of the intended -march=armv8-a+simd (or similar).

You need to define the variable before using it. For example:

elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64|ARM64") set(MATH_MARCH_FLAG_NEON "-march=armv8-a+simd") # or armv8.2-a+fp16, etc. file(GLOB_RECURSE MATH_FILES_NEON ... )

greptile-apps · 2026-03-06T09:30:13Z

src/ailego/math/inner_product_matrix_fp16_dispatch.cc

+#if defined(__AVX__)
+  return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,
+                                        m_sparse_value, q_sparse_count,
+                                        q_sparse_index, q_sparse_value);
+#elif defined(__AVX512FP16__)
+  return InnerProductSparseInSegmentAVX512FP16(m_sparse_count, m_sparse_index,
+                                               m_sparse_value, q_sparse_count,
+                                               q_sparse_index, q_sparse_value);
+#else
+  return InnerProductSparseInSegment(m_sparse_count, m_sparse_index,
+                                     m_sparse_value, q_sparse_count,
+                                     q_sparse_index, q_sparse_value);
+#endif


AVX512FP16 sparse path is dead code — __AVX__ is always defined when __AVX512FP16__ is

The preprocessor checks __AVX__ before __AVX512FP16__. Because AVX512FP16 implies AVX (compiling with -march=sapphirerapids or similar also defines __AVX__), the #elif defined(__AVX512FP16__) branch is unreachable — the more optimal InnerProductSparseInSegmentAVX512FP16 path will never be taken. The higher-capability ISA must be checked first:

Suggested change

#if defined(__AVX__)

return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#elif defined(__AVX512FP16__)

return InnerProductSparseInSegmentAVX512FP16(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#else

return InnerProductSparseInSegment(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#endif

#if defined(__AVX512FP16__)

return InnerProductSparseInSegmentAVX512FP16(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#elif defined(__AVX__)

return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#else

return InnerProductSparseInSegment(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#endif

greptile-apps · 2026-03-06T09:30:14Z

src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc

+    }
+  }
+
+  if (dim + 8 < dimensionality) {


Off-by-one boundary check skips the 8-wide SIMD path for exactly 8 remaining elements

After the main 16-at-a-time loop, when exactly 8 elements remain (dim + 8 == dimensionality), the strict < makes this condition false. Those 8 elements fall through to the scalar loop instead of being handled by the _mm256_cvtph_ps + _mm256_fmadd_ps path. Use <= to capture that case:

Suggested change

if (dim + 8 < dimensionality) {

if (dim + 8 <= dimensionality) {

greptile-apps · 2026-03-06T09:30:15Z

cmake/option.cmake

-  message(WARNING "No ARMv8 architecture flag supported by compiler.")
+function(_setup_armv8_march)
+  set(_arch "armv8")
+  check_c_compiler_flag("-march=${_ver}" _COMP_SUPP_${_arch})


Undefined variable _ver breaks ARM architecture check.

_ver is undefined in this scope. The function defines _arch = "armv8" on line 78, but the flag check references _ver, causing the check to expand to -march= (empty flag). Most compilers will accept an empty flag and succeed, so the ARM march support test becomes vacuous.

Suggested change

check_c_compiler_flag("-march=${_ver}" _COMP_SUPP_${_arch})

check_c_compiler_flag("-march=${_arch}" _COMP_SUPP_${_arch})

greptile-apps · 2026-03-06T09:30:16Z

src/ailego/CMakeLists.txt

+      set_source_files_properties(
+          ${MATH_FILE}
+          PROPERTIES
+          COMPILE_FLAGS "${MATH_MARCH_FLAG_NEON}"


MATH_MARCH_FLAG_NEON is never set — ARM per-file SIMD flags are silently empty.

The x86 path calls setup_compiler_march_for_x86(...) to set MATH_MARCH_FLAG_SSE, MATH_MARCH_FLAG_AVX2, and MATH_MARCH_FLAG_AVX512 (lines 22–23), but the ARM path does not set MATH_MARCH_FLAG_NEON before using it on line 98. As a result, all NEON/dispatch source files are compiled with empty per-file COMPILE_FLAGS instead of the intended -march=armv8-a+simd (or similar).

Define the variable before using it:

elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64|ARM64") set(CMAKE_CXX_FLAGS "-march=armv8-a") set(CMAKE_C_FLAGS "-march=armv8-a") set(MATH_MARCH_FLAG_NEON "-march=armv8-a")

greptile-apps · 2026-03-06T09:30:17Z

src/ailego/math/inner_product_matrix_fp16_dispatch.cc

+#if defined(__AVX__)
+  return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,
+                                        m_sparse_value, q_sparse_count,
+                                        q_sparse_index, q_sparse_value);
+#elif defined(__AVX512FP16__)


AVX512FP16 sparse path is dead code — __AVX__ is always defined when __AVX512FP16__ is available.

When compiling with AVX512FP16 support (e.g., -march=sapphirerapids), the preprocessor defines both __AVX512FP16__ and __AVX__. Since __AVX__ is checked first, the #elif defined(__AVX512FP16__) branch is unreachable, and the more optimal InnerProductSparseInSegmentAVX512FP16 path is never taken. Check for the more specific ISA first:

Suggested change

#if defined(__AVX__)

return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#elif defined(__AVX512FP16__)

#if defined(__AVX512FP16__)

return InnerProductSparseInSegmentAVX512FP16(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#elif defined(__AVX__)

return InnerProductSparseInSegmentAVX(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#else

return InnerProductSparseInSegment(m_sparse_count, m_sparse_index,

m_sparse_value, q_sparse_count,

q_sparse_index, q_sparse_value);

#endif

greptile-apps · 2026-03-06T09:30:18Z

src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc

+    }
+  }
+
+  if (dim + 8 < dimensionality) {


Off-by-one boundary check skips the 8-wide SIMD path for exactly 8 remaining elements.

The main loop processes 16 elements at a time (line 38). After it exits, when exactly 8 elements remain (e.g., dim + 8 == dimensionality), the strict < makes this condition false, and those 8 elements fall through to the scalar loop (lines 85–89) instead of being handled by the optimized _mm256_cvtph_ps + _mm256_fmadd_ps path (lines 68–75). Use <= to include that boundary case:

Suggested change

if (dim + 8 < dimensionality) {

if (dim + 8 <= dimensionality) {

richyreachy added 19 commits February 26, 2026 19:14

add for march compatible test

a2326a8

update codes

d4c9b6e

update code

8f61b0d

remove avx

2b0742d

update code

c907dc0

add euclidean dispatch

04fabee

add l2 int4 dispatch

9930d64

refactor codes

a8a238e

refactor mips codes

bd0060a

refactor inner product

c8a8126

refactor inner product metric

b64639a

update codes

1f78251

fix params

dd8c170

fix params

e92cf59

fix params

5bfbf80

fix params

8e5af92

fix mips metric

7096507

remove inline

b74d256

fix mips metric

4370619

greptile-apps bot reviewed Mar 3, 2026

View reviewed changes

richyreachy and others added 9 commits March 3, 2026 19:51

fix mips metric

2f2fba5

remove MxN distances

012fa99

update cmake

f0e2bab

refactor one2many distances

205b85a

fix inner product

ee477dd

add specialized function

5c22c92

fix compile

f9d162a

fix euclidean

af8f724

remove inner product MxN interface

fcbbeca

iaojnh and others added 8 commits March 5, 2026 11:33

fix euclidean fp16

5091a37

fix interfaces

7168a19

suppress warning

e46129e

remove pragma

c4c73b2

suppress warning

52803d5

add cmake option

3c8ceaa

update cmake

e5ca323

merge main

4a0b4f4

richyreachy changed the title ~~refactor/march based reorganization [WIP]~~ refactor/march based reorganization Mar 5, 2026

iaojnh and others added 18 commits March 5, 2026 19:27

clang-format

aff8ea6

clang-format

d6df228

change cmake config

9989f56

move static array

65707a7

fix: move static array

289d21a

refactor: cmake config

1dbc9af

fix: typo

637095c

refactor: neon cmake config

57e5685

upd cflags

d1959b0

Merge branch 'main' into refactor/march_based_reorganization

aadc2a3

fix: fix typo

f049e54

fix: condition error

608339f

enhance android build ci

d5b4f29

fix android build ci

01aa52a

refactor: turn on sse using a unique switch

b81e67b

refactor: use macro instead of static var

1019f3d

fix: fix typo

1fcdc00

refactor: cmake update

a353f42

greptile-apps bot reviewed Mar 6, 2026

View reviewed changes

	check_c_compiler_flag("-march=${_ver}" _COMP_SUPP_${_arch})
	check_c_compiler_flag("-march=${_arch}" _COMP_SUPP_${_arch})

	if (dim + 8 < dimensionality) {
	if (dim + 8 <= dimensionality) {

Conversation

richyreachy commented Mar 3, 2026

Uh oh!

greptile-apps bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 2/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

iaojnh commented Mar 6, 2026

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps bot commented Mar 3, 2026 •

edited

Loading