
Prepare TensorStorage and Tensor for non-dense storage backends#2051

Open
kali wants to merge 18 commits into main from block-quant-storage-prep

Conversation

kali (Collaborator) commented on Mar 20, 2026

Add Downcast bound to TensorStorage trait (via downcast-rs) so downstream crates can downcast dyn TensorStorage to concrete types.

New trait methods: dyn_hash, eq_storage.
New Tensor methods: storage_as, try_storage_as, from_storage. Fix Hash, Drop, deep_clone, PartialEq to branch on dense vs non-dense instead of panicking on StorageKind::Other.
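The downcast machinery described above can be sketched without the real tract or downcast-rs types. The `as_any` indirection below is what the downcast-rs `Downcast` bound generates automatically; the trait, struct, and helper names here (`Storage`, `DenseStorage`, `try_storage_as`) are illustrative stand-ins, not the actual tract API:

```rust
use std::any::Any;

// Hypothetical minimal trait standing in for TensorStorage. downcast-rs's
// `Downcast` supertrait bound provides this `as_any` plumbing for free.
trait Storage: Any {
    fn as_any(&self) -> &dyn Any;
}

struct DenseStorage(Vec<f32>);
struct BlockQuantStorage { m: usize, k: usize }

impl Storage for DenseStorage {
    fn as_any(&self) -> &dyn Any { self }
}
impl Storage for BlockQuantStorage {
    fn as_any(&self) -> &dyn Any { self }
}

// Counterpart of the new Tensor::try_storage_as::<T>() accessor: returns
// Some only when the type-erased storage is actually a T.
fn try_storage_as<T: 'static>(storage: &dyn Storage) -> Option<&T> {
    storage.as_any().downcast_ref::<T>()
}

fn main() {
    let s: Box<dyn Storage> = Box::new(BlockQuantStorage { m: 4, k: 32 });
    assert!(try_storage_as::<BlockQuantStorage>(s.as_ref()).is_some());
    assert!(try_storage_as::<DenseStorage>(s.as_ref()).is_none());
    println!("downcast ok");
}
```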

kali force-pushed the block-quant-storage-prep branch 2 times, most recently from abd9f60 to 4c12e36 on March 21, 2026 at 12:08
kali added 16 commits on March 21, 2026 at 14:46
Add Downcast bound to TensorStorage trait (via downcast-rs) so
downstream crates can downcast `dyn TensorStorage` to concrete types.

New trait methods: dyn_hash, eq_storage.
New Tensor methods: storage_as<T>, try_storage_as<T>, from_storage.
Fix Hash, Drop, deep_clone, PartialEq to branch on dense vs non-dense
instead of panicking on StorageKind::Other.
Adds the concrete storage type for block-quantized weights in linalg.
Stores format, m, k dimensions, and one or more Arc<Blob> groups.
Implements TensorStorage with proper hash, eq, clone, and debug support.
Includes into_tensor() and to_block_quant_fact() convenience methods.
Preparatory methods for the BlobWithFact migration:
- with_shape(): clone with updated m/k dimensions
- split_m(): partition single-group into multi-group by splitting m rows
Replace the double-downcast pattern (Opaque → BlobWithFact → BlockQuantFact)
with single-downcast via Tensor::try_storage_as::<BlockQuantStorage>()
across all creation and access sites in 10 crates.

Key changes:
- All 14 creation sites now use BlockQuantStorage::new(...).into_tensor()
- All ~20 access sites use tensor.try_storage_as::<BlockQuantStorage>()
- SplitGroupBlockQuant output shape changes from [n] to [] (scalar)
- Tensor::is_uniform() returns false for non-dense storage
- pad_q40, as_q40_tensor, gather eval_bq updated signatures
No longer needed after migration to BlockQuantStorage.
BlobWithFact stored an explicit ND shape (e.g. [b, n, k]).  The migration
to BlockQuantStorage lost that structure by keeping only flat m and k, and
into_tensor() produced a rank-0 tensor.  get_concrete_shapes was then
chaining the outer tensor shape ([]) with bqf.shape() ([b*n, k]), giving
a 2D weight shape that confused the batch-matmul kernel for b > 1.

- Replace separate m/k fields in BlockQuantStorage with shape: TVec<usize>;
  m() and k() are now derived from the shape.
- Add new_with_shape() constructor for callers that need ND shapes.
- into_tensor() uses self.shape() so the tensor carries its full logical shape.
- get_concrete_shapes: drop the Q4_0 chain(bqf.shape()) and use b.shape()
  directly, since the shape now lives in the tensor.
- pad_q40: preserve the ND shape when rebuilding the padded storage.
- Test: use new_with_shape([b, n, k]) so batch structure survives the
  into_tensor / into_device round-trip.
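The shape-derived accessors described above can be sketched as follows; the struct and method names (`new_with_shape`, `m`, `k`) come from the commit message, but the bodies are an assumption based on the stated semantics (all leading axes flatten into m, the last axis is k):

```rust
// Sketch: with `shape` replacing the separate m/k fields, the flat matmul
// dimensions are derived by flattening every leading axis into m.
struct BlockQuantStorage { shape: Vec<usize> }

impl BlockQuantStorage {
    fn new_with_shape(shape: &[usize]) -> Self {
        Self { shape: shape.to_vec() }
    }
    /// Rows of the flattened matmul view: product of all but the last axis.
    fn m(&self) -> usize {
        self.shape[..self.shape.len() - 1].iter().product()
    }
    /// Inner matmul dimension: the last axis.
    fn k(&self) -> usize {
        *self.shape.last().unwrap()
    }
}

fn main() {
    let s = BlockQuantStorage::new_with_shape(&[2, 3, 32]); // [b, n, k]
    assert_eq!(s.m(), 6);  // b * n
    assert_eq!(s.k(), 32);
    println!("m={} k={}", s.m(), s.k());
}
```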

Same fix as the CUDA side: get_concrete_shapes and output_facts were
building the weight shape by chaining the outer tensor shape with
bqf.shape(). Now that BlockQuantStorage::into_tensor() gives the tensor
its full logical shape, b.shape() is already correct and the chain
produces duplicated dims.

- kernels/matmul/mod.rs (eval + dispatch_eval): use b.shape() directly
- ops/gemm.rs (output_facts + eval_with_session): same
- tests: use new_with_shape([batch, n, k]) to preserve batch structure

block_quant_aware_input_shape was chaining fact.shape + bqf.shape(), which
was correct when the tensor was rank-0 (fact.shape=[]) but now produces
double dims (e.g. [n,k,n,k]) since the tensor carries its shape.
Use fact.shape directly, matching the non-opaque path.
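The double-dims failure mode can be shown with a tiny stand-in for the chaining logic (the real code operates on tract shape facts; `chained` here is purely illustrative):

```rust
// Chaining the outer tensor shape with the BlockQuantFact shape was only
// correct while the opaque tensor was rank-0.
fn chained(outer: &[usize], bqf: &[usize]) -> Vec<usize> {
    outer.iter().chain(bqf.iter()).copied().collect()
}

fn main() {
    // Old representation: rank-0 opaque tensor, shape lives in the fact.
    assert_eq!(chained(&[], &[6, 32]), vec![6, 32]);
    // New representation: the tensor already carries [n, k], so the chain
    // duplicates the dims ([n, k, n, k]) as described above.
    assert_eq!(chained(&[6, 32], &[6, 32]), vec![6, 32, 6, 32]);
    println!("chain illustration ok");
}
```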

CudaTensor::to_host had a "scalar opaque" assertion assuming rank-0 tensors,
and reconstructed storage with flat m/k from bqf. Remove the assertion and
use new_with_shape with the full bqf shape so the round-trip preserves the
tensor's logical shape.
Two issues introduced when BlockQuantStorage::into_tensor() started using
the full logical shape (e.g. [m, k]) instead of a rank-0 tensor:

1. Const::change_axes now succeeded on opaque tensors (the rank-2 shape
   allowed AxisOp::Rm to proceed), causing ChangeAxes to strip the m-axis
   from Q4_0 weight consts.  inject_m_or_n_axis then re-added it via an
   AddAxis node, but AddAxis::output_facts does not propagate konst, so
   optimized_mat_mul saw a non-const A input and panicked on the
   PackedFormat downcast.
   Fix: block axis changes on opaque Const tensors.

2. OptSimpleMatMulPack::output_facts copied inputs[0].shape (now [m, k])
   as its output shape, but eval() still emits a rank-0 tensor.  The
   mismatch caused PropConst to fail with a "trying to substitute" error.
   Fix: always emit a scalar (rank-0) output fact, consistent with eval.
Two stale assumptions based on block-quant consts being rank-0 scalars:

1. rewrite_block_quant_const_to_scalar: the rule was previously a no-op
   for block-quant consts (rank-0 hit the len()==0 early-return).  Now
   that consts have shape [m, k] the rule fired and hit the assert that
   volume()==1.  Fix: add !volume().is_one() to the early-return guard so
   properly-shaped consts are left alone; the NNEF serializer already
   writes the correct shape via bqs.to_block_quant_fact().shape().

2. matmul deserializer: the two ensure!(shape.volume().is_one()) guards
   were checking for the old rank-0 representation.  With the new code the
   loaded tensor has shape [m, k] so volume > 1.  The guards are simply
   wrong now; remove them.
Previously only the `required_rank > actual_rank` branch propagated the
BlockQuantFact shape and the `konst` field through axis operations.  With
rank-0 BQ tensors every AxisOp fell into that branch.  Now that
BlockQuantStorage carries a full logical shape ([m, k]) the tensor rank
equals the required rank, so the normal path is used instead – and it
neither updated the BQF shape nor forwarded `konst`.

Fix the normal path to:
1. Update the BlockQuantFact shape in sync with the tensor shape.
2. Propagate `konst` by applying the same axis transformation to the
   const tensor (falls back gracefully if the transform fails, e.g. for
   symbolic dims).

Also include the previously-derived fix for EinSum::axes_mapping that
computes logical_rank - actual_rank correctly for the mixed
BlockQuantFact / Packed* opaque-fact world.
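The "apply the same axis transformation, fall back gracefully on failure" logic can be sketched with a hypothetical two-variant subset of AxisOp (the real tract AxisOp has more variants; `apply` returning `None` models the graceful-fallback path):

```rust
// Hypothetical minimal AxisOp subset to illustrate the fix: mirror the
// axis change onto the fact/const shape, returning None when it cannot
// be applied so the caller can fall back.
#[derive(Clone, Copy)]
enum AxisOp {
    Add(usize), // insert a 1-sized axis at this position
    Rm(usize),  // remove a 1-sized axis at this position
}

fn apply(op: AxisOp, shape: &[usize]) -> Option<Vec<usize>> {
    let mut s = shape.to_vec();
    match op {
        AxisOp::Add(at) if at <= s.len() => { s.insert(at, 1); Some(s) }
        AxisOp::Rm(at) if at < s.len() && s[at] == 1 => { s.remove(at); Some(s) }
        _ => None, // e.g. removing a non-trivial axis: fall back
    }
}

fn main() {
    assert_eq!(apply(AxisOp::Add(0), &[6, 32]), Some(vec![1, 6, 32]));
    assert_eq!(apply(AxisOp::Rm(0), &[1, 6, 32]), Some(vec![6, 32]));
    assert_eq!(apply(AxisOp::Rm(0), &[6, 32]), None); // fallback path
    println!("axis ops ok");
}
```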
SplitGroupBlockQuant::output_facts had a hardcoded rank==0 check and
produced a rank-0 output fact — both relics of the old rank-0 BQ
representation. Remove the check and build the output fact with the
correct rank-2 shape.

BlockQuantIntoShape::output_facts was cloning the input fact (including
its old tensor shape) and only replacing the opaque metadata, so the
tensor shape in the output fact was stale. Build a fresh fact using
self.shape instead.

Conv::wire_kernel_as_g_o_ihw was also guarding with `fact.rank() == 0`
when validating the kernel format for BQ inputs. Drop that check so
rank-2 BQ kernels are accepted.
BlockQuantStorage goes back to plain `m: usize, k: usize` instead of
`shape: TVec<usize>`.  The full logical kernel shape (e.g. [OC, IC, kH])
is carried by the Tensor and by BlockQuantFact; the storage only needs
the flat matmul dimensions.

The NNEF reader now stores the full header shape on the Tensor while
keeping the BQS flat, and the variable deserializer builds the
BlockQuantFact from the NNEF-declared shape so that conv deserialization
sees the correct kernel dimensions.

Conv wiring fixes:
- OptSimpleMatMulPack gains a `num_groups` field so output_facts returns
  rank-0 (scalar) for single-group BQ and rank-1 [G] for multi-group,
  matching what eval produces.
- wire_mm_weights_bias skips the group axis mapping when the packed
  kernel is a rank-0 scalar, and forces trivial_packing=true in that
  case to avoid navigating a rank-0 tensor via offset_axis.
- declutter_as_einsum bails out early for opaque (BQ) kernels whose
  wire_kernel_as_g_o_ihw output does not carry a leading group axis.
- consistent() relaxes the BQF/BQS cross-check to compare m(), k(), and
  format only, since the fact may carry a full logical shape while the
  storage is always flat.

All conv-q40 NNEF test cases and the Llama-3.2-1B q40ef16 LLM test pass.
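The relaxed cross-check can be sketched as below; the struct layouts and the `consistent` body are assumptions from the description (fact carries a full logical shape, storage keeps flat m/k, only flattened dims and format are compared):

```rust
// Sketch of the relaxed consistent() check: the fact's logical shape is
// flattened to (m, k) before comparing against the flat storage.
#[allow(non_camel_case_types)]
#[derive(PartialEq, Debug, Clone, Copy)]
enum Format { Q4_0 }

struct Fact { shape: Vec<usize>, format: Format }
struct Storage { m: usize, k: usize, format: Format }

fn consistent(fact: &Fact, storage: &Storage) -> bool {
    let k = *fact.shape.last().unwrap();
    let m: usize = fact.shape[..fact.shape.len() - 1].iter().product();
    m == storage.m && k == storage.k && fact.format == storage.format
}

fn main() {
    // Logical [b, n, k] = [2, 3, 32] flattens to m = 6, k = 32.
    let fact = Fact { shape: vec![2, 3, 32], format: Format::Q4_0 };
    let storage = Storage { m: 6, k: 32, format: Format::Q4_0 };
    assert!(consistent(&fact, &storage));
    println!("consistent");
}
```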

BlockQuantStorage::into_tensor() now produces [m, k] shaped tensors.
GgmlGemm::output_shape expects the weights tensor to carry its full
[b, n, k] shape to compute the correct output dimensions, so use
Tensor::from_storage directly with the 3D shape in the test helper.

Same issue as the cuda fix: BlockQuantStorage::into_tensor() now produces
[m, k] shaped tensors, but GemmImpl::output_shape expects weights to carry
their full [b, n, k] shape. Use Tensor::from_storage with the 3D shape.
kali force-pushed the block-quant-storage-prep branch from 4c12e36 to cbe98ce on March 21, 2026 at 13:46
kali added 2 commits on March 21, 2026 at 15:12
pad_q40_weights was calling padded_bqs.into_tensor() which produces a
flat [m, k+pad] tensor, but GgmlGemm expects the weight tensor to carry
its full logical shape (e.g. [b, n, k+pad]). Reconstruct the padded
tensor from the original shape with only the last dimension updated.

Fixes cuda-lovelace openelm-270M q40ef16 CI failure.
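The shape reconstruction described above amounts to keeping the original logical shape and updating only the last (k) dimension; a minimal sketch with an assumed helper name:

```rust
// Sketch: rebuild the padded weight shape from the original logical shape,
// replacing only the last axis with the padded k.
fn padded_weight_shape(original: &[usize], padded_k: usize) -> Vec<usize> {
    let mut shape = original.to_vec();
    *shape.last_mut().expect("weights need at least one axis") = padded_k;
    shape
}

fn main() {
    // e.g. logical [b, n, k] = [1, 64, 100] padded so k becomes 128
    assert_eq!(padded_weight_shape(&[1, 64, 100], 128), vec![1, 64, 128]);
    println!("{:?}", padded_weight_shape(&[1, 64, 100], 128));
}
```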
