@ggerganov
Member

No description provided.

ggerganov and others added 30 commits August 18, 2025 19:31
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat
* SYCL: Add set_rows support for quantized types

This commit adds support for GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16
type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead

ggml-ci

* Fix formatting

ggml-ci

* Use const for ne11 and size_t variables in set_rows_sycl_q

ggml-ci

* Increase block size for q kernel to 256

ggml-ci

* Cleanup imports

* Add float.h to cpy.hpp
* remove redundant code in riscv

* remove redundant code in arm

* remove redundant code in loongarch

* remove redundant code in ppc

* remove redundant code in s390

* remove redundant code in wasm

* remove redundant code in x86

* remove fallback headers

* fix x86 ggml_vec_dot_q8_0_q8_0
* CUDA: add roll

* Make everything const, use __restrict__
LLVM with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. As in other places in GGML, let's simply ignore this warning.
…(llama/14930)

This is useful for testing for regressions on GCN with CDNA hardware.

With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
…t. (llama/14985)

* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <[email protected]>
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tile sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <[email protected]>

---------

Co-authored-by: 0cc4m <[email protected]>
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
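
A minimal sketch of the split_k selection described in the list above, in Python for readability. Only the `split_k == 3` window (more than 1/2 and at most 2/3 of the SMs occupied) and the 256 alignment come from the commit message; the function names, the `sm_count // tiles` rule for the lightly loaded case, and the loop that drops empty trailing splits are assumptions made for illustration, not the backend's actual code.

```python
def round_up(x, multiple):
    """Round x up to the next multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple


def guess_split_k(tiles, sm_count, k, align=256):
    """Return (split_k, k_per_split) for a matmul that launches `tiles` tiles.

    Hypothetical helper: names and the non-3 branches are assumptions.
    """
    if 2 * tiles <= sm_count:
        # Assumption: with at most half the SMs busy, split enough to fill them.
        split_k = sm_count // tiles
    elif 3 * tiles <= 2 * sm_count:
        # From the commit message: >1/2 and <=2/3 of the SMs used -> split_k == 3.
        split_k = 3
    else:
        split_k = 1

    # Each split covers a multiple of `align` elements along K. If rounding up
    # would leave the last split empty, use fewer splits (assumed behaviour).
    while split_k > 1:
        k_per_split = round_up(-(-k // split_k), align)
        if k_per_split * (split_k - 1) < k:
            return split_k, k_per_split
        split_k -= 1
    return 1, k


if __name__ == "__main__":
    print(guess_split_k(tiles=40, sm_count=64, k=4096))
```

Note that the split count no longer has to be a power of two: 3 is a valid result, and the alignment step is what keeps each split's K range a multiple of 256.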
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
ggerganov and others added 27 commits August 18, 2025 19:31
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
2. use `GGML_MROPE_SECTIONS` instead of 4.

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
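
For readers unfamiliar with the SGD commit above, here is a minimal sketch of the update it describes, assuming the usual decoupled weight-decay form in which each weight is scaled by `1 - alpha*wd` before the gradient step. The function names and the exact shape of the learning-rate schedule are hypothetical; the commit message only states that `-lr-half` is a half-life and that `-lr-halvings 3` caps the decay at 1/8 of the initial `-lr`.

```python
def sgd_step(weights, grads, alpha, wd):
    """One plain SGD step with decoupled weight decay.

    Unlike AdamW, no per-parameter m/v state is kept, which is where the
    memory saving reported in the commit message comes from.
    """
    keep = 1.0 - alpha * wd  # the cached (1 - wd*alpha) factor
    return [w * keep - alpha * g for w, g in zip(weights, grads)]


def decayed_lr(lr0, epoch, half_life, max_halvings):
    """Assumed schedule: halve lr every `half_life` epochs, at most
    `max_halvings` times, so the terminal lr is lr0 / 2**max_halvings."""
    halvings = min(epoch / half_life, max_halvings)
    return lr0 * 0.5 ** halvings


if __name__ == "__main__":
    w = [1.0, -2.0]
    g = [0.5, 0.5]
    for epoch in range(4):
        lr = decayed_lr(lr0=0.1, epoch=epoch, half_life=1, max_halvings=3)
        w = sgd_step(w, g, alpha=lr, wd=1e-9)
        print(epoch, lr, w)
```

With `max_halvings=3` the learning rate bottoms out at 1/8 of its starting value, matching the `-lr-halvings 3` example in the commit message.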
* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead
* ggml: fix ggml_conv_1d_dw bug

* Fixed conv1d_dw weight tensor dimension.
* vulkan: perf_logger improvements

- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.

* use <=mul_mat_vec_max_cols rather than ==1
add explicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Lower the switch-over to the new hipblas api to rocm 6.5, as this version is used for rocm 7.0 previews

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* ggml-zdnn: initial backend impl

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: temp change z17 to arch15

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: fix build bugs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: tensor->extra logging check

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: add layout name mapping, ztensor information

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: separate logging into its own line

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: add shape comparison

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: add ggml_tensor shape log

Signed-off-by: Aaron Teo <[email protected]>

ggml-zdnn: fix incorrect shape logging

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add output buffer check

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: run compute and store into tensor->extra

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add set_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add more loggers

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update set_tensor logging to check only for matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: last working matmul version

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add comments to prevent accidentally deleting lines

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: support op out_prod

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update op out_prod to use tensor->extra

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rewrite the backend implementation

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: bugfix new impl

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix compiler warnings and bugfixes

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: test ztensor finding in init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: implement at least 1 op to test

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: assign tensor->extra to buffer

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add check for view tensors to prevent init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rework init_tensor to create new buffers

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: switch to std vector instead of array

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: switch buffers back and set to arbitrary number

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: impl init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update supports_op matmul matrix

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: impl matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix compiler error missing type

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix missing data transform call

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add bias init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: tighten memory usage, change string allocation

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add bias ztensor and data free

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add bias data transform

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add more debug info for extra buffer transform

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add logger to check if mat mul ops go through set_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: activate bias transform in matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: move weights transform into mulmat

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add more safeguards in matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix sequencing of transforms

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: bugfix transform ztensor vs origtensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: figure out why sigtrap is happening

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix sigsegv

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: move everything back to local declaration

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: move bias data to local also

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: bring back working matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rewrite into mre

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix missing vector import

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix missing vector import in header

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt to fix sigsegv

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix missing load tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix invalid ztensor buffer release

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add logging to debug free buffer

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: remove free_buffer debug info

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add parmblkformat detections

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add nnpa installed detection

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add zdnn_init call for static libs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt at fixing invalid buffer

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: switch to using deque to fix pointer deref problem

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add weights logging to check

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt to use unique ptr

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add tensor to pre_tfm_desc logging

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add inputs logging

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: disable op_none initialisation for testing

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix missing return from init_tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: load ztensors in cgraph exec

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: work on moving output ztensor as well

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: disable logging and breakpoints for full test

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt at manually changing the layout

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt at using default nhwc format instead

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: disable global load ztensor for now

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix erroneous output load tensor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: add guards to prevent loading ztensor if transformed

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: bring load ztensor back to init routine

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix ztensor deallocation abort

stabilise ggml <-> zdnn api

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: clean up matmul selection

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: clean up project structure

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update documentation, prepare for upstream

Signed-off-by: Aaron Teo <[email protected]>

* chore: add codeowners

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: disable batched matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: attempt at fixing tensor views during matmul

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: deny all view tensors directly

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix pr comments

Signed-off-by: Aaron Teo <[email protected]>

* docs: update ops docs for zdnn

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: redo test-backend-ops for ops.md

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: fix typo in build-s390x.md

Signed-off-by: Aaron Teo <[email protected]>

* codeowners: remove taronaeo for now

Signed-off-by: Aaron Teo <[email protected]>

* Revert "codeowners: remove taronaeo for now"

This reverts commit 411ea4ed78d08778967bd0bd33a6538cfcbe082f.

* ggml-zdnn: remove unused ggml_zdnn macro

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 transpose fix for Adreno

---------

Co-authored-by: shawngu-quic <[email protected]>
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
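
To make the descriptor arithmetic in the fuse-adds commit above concrete: a chain of n same-shape adds reads n + 1 tensors and writes 1, so 8 descriptors allow at most 6 fused adds. The sketch below models that on the CPU with Python lists. The greedy splitting of longer chains and all function names are assumptions for illustration, not the Vulkan backend's implementation.

```python
MAX_DESCRIPTORS = 8                    # limit quoted in the commit message
MAX_FUSED_ADDS = MAX_DESCRIPTORS - 2   # n adds -> n + 1 inputs + 1 output


def plan_dispatches(num_adds):
    """Greedily split a chain of adds into fused dispatches (assumed policy)."""
    plan = []
    remaining = num_adds
    while remaining > 0:
        fused = min(remaining, MAX_FUSED_ADDS)
        plan.append(fused)
        remaining -= fused
    return plan


def fused_add(operands):
    """Element-wise sum of equally sized operands in one pass."""
    return [sum(vals) for vals in zip(*operands)]


def run_chain(operands):
    """Sum all operands, one fused dispatch at a time."""
    acc = operands[0]
    idx = 1
    for fused in plan_dispatches(len(operands) - 1):
        inputs = [acc] + operands[idx:idx + fused]
        # Inputs plus one output must fit in the descriptor budget.
        assert len(inputs) + 1 <= MAX_DESCRIPTORS
        acc = fused_add(inputs)
        idx += fused
    return acc


if __name__ == "__main__":
    print(plan_dispatches(8))
    print(run_chain([[1, 2]] * 9))
```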
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
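
The points in the list above can be modelled on the CPU as follows. This sketch assumes a bitonic sorting network, which the commit message does not name, and uses hypothetical function names. It pads the index array to the next power of two, always sorts ascending with padding entries sinking to the end, and applies the requested order only when storing the result.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (1 for n <= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def argsort_bitonic(values, descending=False):
    """Argsort via a bitonic network over a power-of-two padded index array."""
    n = len(values)
    n_pad = next_pow2(n)          # one "invocation" per padded column
    idx = list(range(n_pad))      # indices >= n are padding

    def key(i):
        # Padding compares greater than any real value, so it ends up last.
        return (1, 0) if i >= n else (0, values[i])

    k = 2
    while k <= n_pad:
        j = k // 2
        while j > 0:
            # Pairs (col, col ^ j) are disjoint, so this sequential loop
            # matches what parallel invocations would do in one step.
            for col in range(n_pad):
                partner = col ^ j
                if partner > col:
                    a, b = key(idx[col]), key(idx[partner])
                    ascending_block = (col & k) == 0
                    if (a > b) if ascending_block else (a < b):
                        idx[col], idx[partner] = idx[partner], idx[col]
            j //= 2
        k *= 2

    # The sort itself is always ascending; the order option only changes
    # where each result is stored.
    out = [0] * n
    for col in range(n):
        dst = n - 1 - col if descending else col
        out[dst] = idx[col]
    return out


if __name__ == "__main__":
    data = [3.0, 1.0, 2.0, 5.0, 4.0]
    print(argsort_bitonic(data))
    print(argsort_bitonic(data, descending=True))
```

When the input length is already a power of two there is no padding, which corresponds to the case where bounds checking can be skipped.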
…15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <[email protected]>

---------

Co-authored-by: 0cc4m <[email protected]>
…/15379)

* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
@ggerganov ggerganov merged commit fc45bb8 into master Aug 18, 2025
62 of 63 checks passed
@ggerganov ggerganov deleted the sync-ggml-25-08-18 branch August 18, 2025 17:30