sync : ggml #3383
Merged
Conversation
The `pipeline` member can be cast to VkPipeline. This is a VkPipeline_T* on 64-bit but a uint64_t on 32-bit. Cf. the VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
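For context, the Vulkan macro in question expands roughly as sketched below (paraphrased; the real definition in vulkan_core.h uses a longer 64-bit detection condition), which is why the handle is an opaque pointer type on 64-bit targets but a plain `uint64_t` on 32-bit targets:

```cpp
#include <cstdint>

// Simplified sketch of VK_DEFINE_NON_DISPATCHABLE_HANDLE; see vulkan_core.h
// for the authoritative definition (the #if condition is abbreviated here).
#if defined(__LP64__) || defined(_WIN64)
    // 64-bit: a distinct opaque pointer type, e.g. VkPipeline == VkPipeline_T*
    #define VK_DEFINE_NON_DISPATCHABLE_HANDLE(object) typedef struct object##_T * object;
#else
    // 32-bit: just a 64-bit integer, so pointer casts are not valid here
    #define VK_DEFINE_NON_DISPATCHABLE_HANDLE(object) typedef uint64_t object;
#endif

VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkPipeline)
```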
* sycl: quantization to q8_1 refactor
* Refactored src1 copy logic in op_mul_mat
* SYCL: Add set_rows support for quantized types

  This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code.
* Use get_global_linear_id() instead

  ggml-ci
* Fix formatting

  ggml-ci
* Use const for ne11 and size_t variables in set_rows_sycl_q

  ggml-ci
* Increase block size for q kernel to 256

  ggml-ci
* Cleanup imports
* Add float.h to cpy.hpp
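As a rough illustration of what a quantizing copy for one of these types has to do, here is a minimal host-side sketch of Q8_0-style row quantization (blocks of 32 values, one scale plus 32 int8 quants). The struct and function names are illustrative only, and the scale is kept as a float for brevity (ggml stores it as fp16); the real SYCL kernels work per work-item on the backend's own block structs.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Illustrative Q8_0-style block: one scale + 32 signed 8-bit quants.
struct block_q8_0_sketch {
    float  d;        // scale (fp16 in the real format)
    int8_t qs[32];   // quantized values
};

// Quantize one row of n floats (n a multiple of 32) into blocks.
static void quantize_row_q8_0_sketch(const float * x, block_q8_0_sketch * y, size_t n) {
    for (size_t ib = 0; ib < n/32; ++ib) {
        float amax = 0.0f;                       // absolute max of the block
        for (int j = 0; j < 32; ++j) {
            amax = std::fmax(amax, std::fabs(x[ib*32 + j]));
        }
        const float d  = amax / 127.0f;          // map [-amax, amax] to [-127, 127]
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        y[ib].d = d;
        for (int j = 0; j < 32; ++j) {
            y[ib].qs[j] = (int8_t) std::lround(x[ib*32 + j] * id);
        }
    }
}
```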
* remove redundant code in riscv
* remove redundant code in arm
* remove redundant code in loongarch
* remove redundant code in ppc
* remove redundant code in s390
* remove redundant code in wasm
* remove redundant code in x86
* remove fallback headers
* fix x86 ggml_vec_dot_q8_0_q8_0
* CUDA: add roll
* Make everything const, use __restrict__
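The roll operation itself has simple semantics: each output element reads the input at a wrapped-around offset along a dimension. The 1-D sketch below illustrates the idea only (it is not the CUDA kernel, and the sign convention of the shift may differ from the actual op):

```cpp
#include <cstddef>
#include <vector>

// 1-D "roll" with wrap-around: dst[i] = src[(i - shift) mod n].
// The tensor op applies the same idea independently along each dimension.
static std::vector<float> roll_1d(const std::vector<float> & src, int shift) {
    const int n = (int) src.size();
    std::vector<float> dst(n);
    for (int i = 0; i < n; ++i) {
        int j = (i - shift) % n;
        if (j < 0) j += n;       // keep the index in [0, n)
        dst[i] = src[j];
    }
    return dst;
}
```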
llvm with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. As in other places in GGML, let's simply ignore this warning.
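A common way to silence such a diagnostic only around the affected code is a targeted pragma, as in the sketch below; the exact warning group used in the ggml HIP sources may differ, so treat this as the general pattern rather than the actual change.

```cpp
// Suppress clang/LLVM "loop not unrolled" style diagnostics for this region only.
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wpass-failed"
#endif

// ... kernel code containing the loop with a conditional break ...

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```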
…(llama/14930) This is useful for testing for regressions on GCN with CDNA hardware. With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is essentially GCN renamed, with MFMA and limited-use ACC registers added, this provides a good alternative for regression testing when GCN hardware is not available.
…AMD targets (llama/14945)
…apes (llama/14949)
…t. (llama/14985)
* CANN: Improve loading efficiency after converting weights to NZ format.
* CANN: fix typo
* vulkan: fix debug mode issues
* vulkan: remove broken check_results GGML_OP_SET_ROWS support
* docker: add cann build pipeline
* docker: add cann build pipeline
* docker: fix cann devops
* cann : fix multi card hccl
* Update ggml/src/ggml-cann/ggml-cann.cpp

  Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Update ggml-cann.cpp
---------
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <[email protected]>
* vulkan: optimizations for direct convolution

  - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
  - Fix shmem bank conflicts. 16B padding should work with coopmat.
  - Some explicit loop unrolling.
  - Skip math/stores work for parts of the tile that are OOB.
  - Apply fastdiv opt.
  - Disable shuffles for NV.
* Three tile sizes for CONV_2D, and a heuristic to choose
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf

  Co-authored-by: 0cc4m <[email protected]>
---------
Co-authored-by: 0cc4m <[email protected]>
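For reference, the "fastdiv" optimization replaces a division by a runtime-constant divisor with a precomputed multiply plus shift. The sketch below uses the well-known 64-bit reciprocal variant (Lemire et al.) and requires a compiler with `unsigned __int128` (GCC/Clang); the shader uses its own 32-bit formulation, so this is only an illustration of the technique, with hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Precompute ceil(2^64 / d) once for a fixed divisor d (valid for d >= 2).
struct fastdiv_t {
    uint64_t m;
    uint32_t d;
};

static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d >= 2);
    return { UINT64_MAX / d + 1, d };
}

// n / d as the high 64 bits of m * n (n is a 32-bit value).
static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    return (uint32_t) (((unsigned __int128) f.m * n) >> 64);
}

// n % d via the low 64 bits of m * n, multiplied back by d.
static uint32_t fastmod(uint32_t n, fastdiv_t f) {
    const uint64_t lowbits = f.m * n;
    return (uint32_t) (((unsigned __int128) lowbits * f.d) >> 64);
}
```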
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it interacts with split_k
- Allow larger/non-power-of-two split_k, and make the splits a multiple of 256
- Use split_k==3 when more than 1/2 and at most 2/3 of the SMs would have been used
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

  ggml-ci
* cont : fix cont types

  ggml-ci
* cont : adopt variable names and comment from the other branch
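For context, ggml broadcasts the two batch dimensions when one tensor has extent 1 there; the case being fixed (ne02 == 1 with ne03 > 1) means the source batch index has to be derived by wrapping. The sketch below shows the indexing idea only, not the backend's actual code:

```cpp
// Map a destination batch index (i2, i3) back to a source tensor's batch
// index under ggml-style broadcasting: a dimension of size 1 is repeated.
static inline void broadcast_batch_index(int i2, int i3,
                                         int ne02, int ne03,
                                         int * i02, int * i03) {
    *i02 = i2 % ne02;   // ne02 == 1  ->  always 0
    *i03 = i3 % ne03;   // ne03 == 1  ->  always 0
}
```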
* update `rope_multi`:
  1. add `ggml_rope_multi_inplace`;
  2. use `GGML_MROPE_SECTIONS` instead of 4.
* Apply suggestions from code review

  Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

  Add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the m, v tensors. Support the finetune.cpp arg -opt SGD (or sgd); the default is adamw as before.

  llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw (wikipedia 100 lines finetune).

  Using the same GPU memory, adamw can only do 512 batch/context before OOM, reaching:

  train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
  val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

  SGD is superior, though it converges slower, with a max of 1728 batch/context before OOM (esp. see the better validation perf):

  train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
  val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00

  Note: when finetuning long enough (or with enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting').

  The -lr-half (halflife) option is useful for SGD to avoid oscillation or super slow underdamped learning (it makes setting -lr more forgiving). The terminal -lr for now is set by -lr-halvings, i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.

  Note: objective loss is not directly comparable between adamw and sgd - check perplexity or accuracy, or consider relative improvements for convergence.

  New finetune args: -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before).

  Caching (1 - wd*alpha) in the 'adamw' opt struct gave no noticeable perf benefit and is disabled (still done for the new SGD though).

  Since opt. memory is pre-allocated, ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch, but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet).

  test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs).
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param stores weight decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
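The memory saving follows directly from the update rules: AdamW keeps two extra state tensors per parameter tensor (first and second moments m and v), while plain SGD with weight decay needs no per-parameter state. Below is a minimal sketch of both updates in their generic textbook form (not the ggml kernels; hyper-parameter names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// SGD with weight decay: no per-parameter optimizer state required.
static void sgd_step(float * w, const float * g, size_t n, float lr, float wd) {
    for (size_t i = 0; i < n; ++i) {
        w[i] -= lr * (g[i] + wd * w[i]);
    }
}

// AdamW: needs m[] and v[] (same size as w) kept across steps, i.e. ~2x extra memory.
static void adamw_step(float * w, const float * g, float * m, float * v, size_t n,
                       float lr, float wd, float beta1, float beta2, float eps, int t) {
    for (size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        const float mhat = m[i] / (1.0f - std::pow(beta1, (float) t));   // bias correction
        const float vhat = v[i] / (1.0f - std::pow(beta2, (float) t));
        w[i] -= lr * (mhat / (std::sqrt(vhat) + eps) + wd * w[i]);       // decoupled weight decay
    }
}
```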
* fix USE_CUDA_GRAPH=OFF

  ggml-ci
* check capture status
* completely disable capturing check instead
* ggml: fix ggml_conv_1d_dw bug
* Fixed conv1d_dw weight tensor dimension.
* vulkan: perf_logger improvements

  - Account for batch dimension in flops calculation.
  - Fix how "_VEC" is detected for mat_mul_id.
  - Fix "n" dimension for mat_mul_id (in case of broadcasting).
  - Include a->type in name.
* use <=mul_mat_vec_max_cols rather than ==1
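On the first point, accounting for the batch dimension just means multiplying the usual 2·M·N·K matrix-multiply flop count by the number of matrices in the batch; a trivial sketch (illustrative helper, not the logger's actual code):

```cpp
#include <cstdint>

// Approximate flop count for a batched matrix multiply: each of the `batch`
// M x K by K x N products costs about 2*M*N*K flops (multiply + add).
static inline uint64_t batched_matmul_flops(uint64_t m, uint64_t n, uint64_t k, uint64_t batch) {
    return 2 * m * n * k * batch;
}
```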
Add explicit conversion operator to support older versions of rocm.
Switch over to hip_bf16 from legacy hip_bfloat16.
Simplify RDNA3 define.
Reduce the swap-over to the new hipblas api to rocm 6.5, as this version is used for rocm 7.0 previews.
---------
Co-authored-by: Johannes Gäßler <[email protected]>
All of the following squashed commits are signed off by Aaron Teo <[email protected]>:

* ggml-zdnn: initial backend impl
* ggml-zdnn: temp change z17 to arch15
* ggml-zdnn: fix build bugs
* ggml-zdnn: tensor->extra logging check
* ggml-zdnn: add layout name mapping, ztensor information
* ggml-zdnn: separate logging into its own line
* ggml-zdnn: add shape comparison
* ggml-zdnn: add ggml_tensor shape log
* ggml-zdnn: fix incorrect shape logging
* ggml-zdnn: add output buffer check
* ggml-zdnn: run compute and store into tensor->extra
* ggml-zdnn: add set_tensor
* ggml-zdnn: add more loggers
* ggml-zdnn: update set_tensor logging to check only for matmul
* ggml-zdnn: last working matmul version
* ggml-zdnn: add comments to prevent accidentally deleting lines
* ggml-zdnn: support op out_prod
* ggml-zdnn: update op out_prod to use tensor->extra
* ggml-zdnn: rewrite the backend implementation
* ggml-zdnn: bugfix new impl
* ggml-zdnn: fix compiler warnings and bugfixes
* ggml-zdnn: test ztensor finding in init_tensor
* ggml-zdnn: implement at least 1 op to test
* ggml-zdnn: assign tensor->extra to buffer
* ggml-zdnn: add check for view tensors to prevent init_tensor
* ggml-zdnn: rework init_tensor to create new buffers
* ggml-zdnn: switch to std vector instead of array
* ggml-zdnn: switch buffers back and set to arbitrary number
* ggml-zdnn: impl init_tensor
* ggml-zdnn: update supports_op matmul matrix
* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding
* ggml-zdnn: code clean up
* ggml-zdnn: impl matmul
* ggml-zdnn: fix compiler error missing type
* ggml-zdnn: fix missing data transform call
* ggml-zdnn: add bias init_tensor
* ggml-zdnn: tighten memory usage, change string allocation
* ggml-zdnn: add bias ztensor and data free
* ggml-zdnn: add bias data transform
* ggml-zdnn: add more debug info for extra buffer transform
* ggml-zdnn: add logger to check if mat mul ops go through set_tensor
* ggml-zdnn: activate bias transform in matmul
* ggml-zdnn: move weights transform into mulmat
* ggml-zdnn: add more safeguards in matmul
* ggml-zdnn: fix sequencing of transforms
* ggml-zdnn: bugfix transform ztensor vs origtensor
* ggml-zdnn: figure out why sigtrap is happening
* ggml-zdnn: fix sigsegv
* ggml-zdnn: move everything back to local declaration
* ggml-zdnn: move bias data to local also
* ggml-zdnn: bring back working matmul
* ggml-zdnn: rewrite into mre
* ggml-zdnn: fix missing vector import
* ggml-zdnn: fix missing vector import in header
* ggml-zdnn: attempt to fix sigsegv
* ggml-zdnn: fix missing load tensor
* ggml-zdnn: fix invalid ztensor buffer release
* ggml-zdnn: add logging to debug free buffer
* ggml-zdnn: remove free_buffer debug info
* ggml-zdnn: add parmblkformat detections
* ggml-zdnn: add nnpa installed detection
* ggml-zdnn: add zdnn_init call for static libs
* ggml-zdnn: add init_tensor
* ggml-zdnn: attempt at fixing invalid buffer
* ggml-zdnn: switch to using deque to fix pointer deref problem
* ggml-zdnn: add weights logging to check
* ggml-zdnn: attempt to use unique ptr
* ggml-zdnn: add tensor to pre_tfm_desc logging
* ggml-zdnn: add inputs logging
* ggml-zdnn: disable op_none initialisation for testing
* ggml-zdnn: fix missing return from init_tensor
* ggml-zdnn: load ztensors in cgraph exec
* ggml-zdnn: work on moving output ztensor as well
* ggml-zdnn: disable logging and breakpoints for full test
* ggml-zdnn: attempt at manually changing the layout
* ggml-zdnn: attempt at using default nwhc format instead
* ggml-zdnn: disable global load ztensor for now
* ggml-zdnn: fix erroneous output load tensor
* ggml-zdnn: add guards to prevent loading ztensor if transformed
* ggml-zdnn: code cleanup
* ggml-zdnn: bring load ztensor back to init routine
* ggml-zdnn: code clean up
* ggml-zdnn: fix ztensor deallocation abort, stabilise ggml <-> zdnn api
* ggml-zdnn: clean up matmul selection
* ggml-zdnn: clean up project structure
* ggml-zdnn: update documentation, prepare for upstream
* chore: add codeowners
* ggml-zdnn: disable batched matmul
* ggml-zdnn: attempt at fixing tensor views during matmul
* ggml-zdnn: deny all view tensors directly
* ggml-zdnn: fix pr comments
* docs: update ops docs for zdnn
* ggml-zdnn: redo test-backend-ops for ops.md
* ggml-zdnn: fix typo in build-s390x.md
* codeowners: remove taronaeo for now
* Revert "codeowners: remove taronaeo for now" (this reverts commit 411ea4ed78d08778967bd0bd33a6538cfcbe082f)
* ggml-zdnn: remove unused ggml_zdnn macro
---------
Signed-off-by: Aaron Teo <[email protected]>
* opencl: add reference `mul_mv_mxfp4_f32`
* opencl: add reference `mul_mv_id` for mxfp4
* Q4_0 transpose fix for Adreno
---------
Co-authored-by: shawngu-quic <[email protected]>
* add F16/F16 fa support
* fix kernel init
* use mad instead of fma
* use inline function
* mark FA with sinks as unsupported for now
* add pragma unroll to loops
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id
* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

  - There's no explicit way to request f32 precision for mul_mat_id, but there probably should be, and this gets the code in place for that.
  - A couple fixes to check_results.
  - Remove casts to fp16 in coopmat1 FA shader (found by inspection).
* vulkan: fuse adds

  Fuse adds that have the same shape, which are common in MoE models. It will currently fuse up to 6 adds, because we assume no more than 8 descriptors per dispatch. But this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to likely driver bug
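Conceptually, fusing the adds means walking the data once and summing all the same-shape sources into the destination, instead of launching one add kernel per edge of the chain (6 adds means 7 inputs plus 1 output, i.e. 8 descriptors). A scalar sketch of the idea, not the Vulkan shader:

```cpp
#include <cstddef>

// Fused elementwise adds: dst[i] = srcs[0][i] + srcs[1][i] + ... + srcs[n_srcs-1][i],
// computed in a single pass instead of n_srcs - 1 separate add kernels.
static void multi_add(float * dst, const float * const * srcs, int n_srcs, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int s = 0; s < n_srcs; ++s) {
            acc += srcs[s][i];
        }
        dst[i] = acc;
    }
}
```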
- Launch an appropriate number of invocations (next larger power of two). 32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
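The first point above relies on rounding the element count up to the next power of two, so the sorting network runs with as few invocations as possible; a host-side sketch of that rounding (illustrative only):

```cpp
#include <cstdint>

// Smallest power of two >= n (for n >= 1). Launching this many invocations
// keeps the sorting network valid without oversizing the workgroup,
// e.g. 32 invocations for up to 32 elements.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}
```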
…15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

  Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups.
* update heuristic for amd/intel

  Co-authored-by: 0cc4m <[email protected]>
---------
Co-authored-by: 0cc4m <[email protected]>
…/15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors
* ggml-quants : avoid division by zero in make_q3_quants
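The usual shape of such a fix is to guard the scale computation so an all-zero group of weights never produces a division by zero (and hence NaNs downstream). A generic sketch of the guard, not the exact ggml-quants code:

```cpp
// Compute an inverse scale for a block whose largest magnitude is `amax`,
// mapping values into [-q_max, q_max]. If the block is all zeros, fall back
// to a zero scale instead of dividing by zero (which would yield inf/NaN).
static inline float safe_inverse_scale(float amax, float q_max) {
    return amax > 0.0f ? q_max / amax : 0.0f;
}
```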
ggml-ci
ggml-ci
danbev approved these changes on Aug 18, 2025