
Feature Request: Allow SPLIT_MODE_TENSOR with KV cache quantization #21788

@JackBinary

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Running with -sm tensor while -ctk/-ctv are set is refused at model init:

llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented

It appears that replacing the 2D reshape before the attention-rotation matmul with a 4D reshape that keeps the head dimension visible, plus a matching case in the meta backend's matmul split-axis inference, lifts the refusal without breaking anything I've been able to measure.

Motivation

I'm filing this as a feature request rather than a PR because the fix was AI-assisted and I'm not confident enough in the meta backend internals to claim I fully understand why it works. On my test rig (2× MI60 32 GiB running GLM-4.5 Air 106B-A12B at IQ4_XS), the model only fits at about 32k context without quantized KV and about 64k with it, so I'm forced to choose between -sm tensor and the longer context that -sm layer with quantized KV allows.

Considering that attention-rotate and tensor-parallel were both added recently, I'd like these two features to play nicely together.

Possible Implementation

My working branch is here.

It touches ggml-backend-meta.cpp, llama-graph.cpp, and llama-kv-cache.cpp (~25 lines total). I validated it in two ways, each time under both -sm layer and -sm tensor with q8_0 for both K and V.

  1. First, I ran a multi-step, long-form math reasoning problem (the classic two-trains meeting problem). Both modes converged on identical numeric answers through identical algebraic setups.
  2. Then I ran a multi-needle retrieval diff at 8.8k and 14.5k tokens; both outputs came back byte-identical.

It appears to work quite well, but someone who actually understands the meta backend should review it before any of it gets merged.

ggml/src/ggml-backend-meta.cpp (inserted at line 553, before the existing AXIS_0/AXIS_0 branch)

// A mirrored, B split on a pass-through dim (dims >= 2 of B are carried to the output unchanged).
// This covers the PR #21038 rotation matmul where activations are reshaped to preserve the head
// axis and the head dim ends up as dim 2 or 3 of the 4D input.
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
    (src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_2 || src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_3)) {
    ggml_backend_meta_split_state ret = src_ss[1];
    ret.n_segments = 1;
    return ret;
}

src/llama-graph.cpp (replaces the previous ggml_reshape_2d at line 67)

GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat  (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);

src/llama-kv-cache.cpp (same change, at line 68)

GGML_ASSERT(cur->ne[0] % n == 0);
// Preserve the head dim through the matmul so SPLIT_MODE_TENSOR's split-axis
// inference can track a head-axis split. Collapsing heads and tokens together
// (reshape_2d) drops that information and trips the meta backend.
res = ggml_reshape_4d(ctx, cur, n, cur->ne[0]/n, cur->ne[1], cur->ne[2]*cur->ne[3]);
res = ggml_mul_mat  (ctx, rot, res);
res = ggml_reshape_4d(ctx, res, cur->ne[0], cur->ne[1], cur->ne[2], cur->ne[3]);
