Prerequisites
Feature Description
Running with -sm tensor and -ctk/-ctv set is refused at model init:
llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented
It appears that replacing the 2D reshape before the attention-rotation matmul with a 4D reshape that keeps the head dimension visible, together with a matching case in the meta backend's matmul split-axis inference, lifts the refusal without breaking anything I've been able to measure.
Motivation
I'm filing this as a feature request rather than a PR because the fix was AI-assisted and I'm not confident enough in the meta backend internals to claim I fully understand why it works. On my test rig (2× MI60 32 GiB running GLM-4.5 Air 106B-A12B at IQ4_XS), the model only fits at about 32k context without quantized KV and about 64k with it, which currently forces a choice between -sm tensor and the longer context that quantized KV allows under -sm layer.
Considering that attention-rotate and tensor-parallel were both added recently, I'd like these two features to play nicely together.
Possible Implementation
My working branch is here.
It touches ggml-backend-meta.cpp, llama-graph.cpp, and llama-kv-cache.cpp (~25 lines total). I validated it two ways, each time with both -sm layer and -sm tensor at q8_0 on both k and v.
- First, I ran a multi-step, long-form math reasoning problem: the classic two-trains meeting problem. Both modes converged on identical numeric answers through identical algebraic setups.
- I then ran a multi-needle retrieval diff at 8.8k and 14.5k tokens; both outputs came back byte-identical.
It appears to work well, but someone who actually understands the meta backend should review it before any of it is merged.
ggml/src/ggml-backend-meta.cpp (inserted at line 553, before the existing AXIS_0/AXIS_0 branch)
// A mirrored, B split on a pass-through dim (dims >= 2 of B are carried to the output unchanged).
// This covers the PR #21038 rotation matmul where activations are reshaped to preserve the head
// axis and the head dim ends up as dim 2 or 3 of the 4D input.
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
    (src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_2 || src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_3)) {
    ggml_backend_meta_split_state ret = src_ss[1];
    ret.n_segments = 1;
    return ret;
}
src/llama-graph.cpp (replaces the previous ggml_reshape_2d at line 67)
GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);
src/llama-kv-cache.cpp (same change, at line 68)
GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);