
Feature Request: Allow SPLIT_MODE_TENSOR with KV cache quantization #21788

@JackBinary

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Running with -sm tensor while -ctk/-ctv are set is refused at model init:

llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented

It appears that replacing the 2D reshape before the attention-rotation matmul with a 4D reshape that keeps the head dimension visible, plus a matching case in the meta backend's matmul split-axis inference, lifts the refusal without breaking anything I've been able to measure.

Motivation

I'm filing this as a feature request rather than a PR because the fix was AI-assisted and I'm not confident enough in the meta backend internals to claim I fully understand why it works. On my test rig (2× MI60 32 GiB running GLM-4.5 Air 106B-A12B at IQ4_XS), the model only fits at about 32k context without quantized KV and about 64k with it, so I'm forced to choose between -sm tensor and the longer context that -sm layer with quantized KV allows.

Considering that attention-rotate and tensor-parallel were both added recently, I'd like these two features to play nicely together.

Possible Implementation

My working branch is here.

It touches ggml-backend-meta.cpp, llama-graph.cpp, and llama-kv-cache.cpp (~25 lines total). I validated it in two ways, each time under both -sm layer and -sm tensor with q8_0 for both K and V.

  1. First, I ran a multi-step, long-form math reasoning problem (the classic two-trains meeting problem). Both modes converged on identical numeric answers through identical algebraic setups.
  2. Then I ran a multi-needle retrieval diff at 8.8k and 14.5k tokens; both outputs came back byte-identical.

It appears to work quite well, but someone who actually understands the meta backend should review it before any of it gets merged.

ggml/src/ggml-backend-meta.cpp (inserted at line 553, before the existing AXIS_0/AXIS_0 branch)

// A mirrored, B split on a pass-through dim (dims >= 2 of B are carried to the output unchanged).
// This covers the PR #21038 rotation matmul where activations are reshaped to preserve the head
// axis and the head dim ends up as dim 2 or 3 of the 4D input.
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
    (src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_2 || src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_3)) {
    ggml_backend_meta_split_state ret = src_ss[1];
    ret.n_segments = 1;
    return ret;
}

src/llama-graph.cpp (replaces the previous ggml_reshape_2d at line 67)

GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat  (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);

src/llama-kv-cache.cpp (same change, at line 68)

GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat  (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);
