Prepare TensorStorage and Tensor for non-dense storage backends#2051
Add a `Downcast` bound to the `TensorStorage` trait (via downcast-rs) so downstream crates can downcast `dyn TensorStorage` to concrete types.
- New trait methods: `dyn_hash`, `eq_storage`.
- New `Tensor` methods: `storage_as<T>`, `try_storage_as<T>`, `from_storage`.
- Fix `Hash`, `Drop`, `deep_clone`, and `PartialEq` to branch on dense vs non-dense storage instead of panicking on `StorageKind::Other`.
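The downcast mechanism this enables can be sketched with `std::any::Any` standing in for the downcast-rs `Downcast` bound. Only the trait name `TensorStorage` and the spirit of `try_storage_as` come from the PR; the struct, field, and helper names below are illustrative:

```rust
use std::any::Any;

// Stand-in for the PR's `TensorStorage` trait; downcast-rs adds an
// equivalent downcasting capability, modeled here with `Any`.
trait TensorStorage: Any {
    fn as_any(&self) -> &dyn Any;
}

// A hypothetical concrete storage backend.
struct DenseStorage {
    bytes: Vec<u8>,
}

impl TensorStorage for DenseStorage {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Mirrors the spirit of `Tensor::try_storage_as::<T>()`: returns None
// when the storage is some other concrete type.
fn try_storage_as<T: TensorStorage>(storage: &dyn TensorStorage) -> Option<&T> {
    storage.as_any().downcast_ref::<T>()
}

fn main() {
    let storage: Box<dyn TensorStorage> = Box::new(DenseStorage { bytes: vec![1, 2, 3] });
    let dense = try_storage_as::<DenseStorage>(storage.as_ref());
    assert!(dense.is_some());
    assert_eq!(dense.unwrap().bytes, vec![1, 2, 3]);
    println!("downcast ok");
}
```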
Adds the concrete storage type for block-quantized weights in linalg. Stores format, m, k dimensions, and one or more Arc<Blob> groups. Implements TensorStorage with proper hash, eq, clone, and debug support. Includes into_tensor() and to_block_quant_fact() convenience methods.
Preparatory methods for the BlobWithFact migration:
- `with_shape()`: clone with updated m/k dimensions
- `split_m()`: partition a single-group storage into multi-group by splitting the m rows
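The `split_m()` idea can be illustrated on a plain row-major buffer. The function name follows the PR; the flat buffer layout and the split API below are assumptions for the sketch:

```rust
// Hypothetical sketch: split `m` rows of a row-major [m, k] buffer into
// groups, as `split_m()` does when turning a single-group
// BlockQuantStorage into a multi-group one.
fn split_m(data: &[f32], m: usize, k: usize, splits: &[usize]) -> Vec<Vec<f32>> {
    assert_eq!(data.len(), m * k, "buffer must hold m * k elements");
    assert_eq!(splits.iter().sum::<usize>(), m, "splits must cover all m rows");
    let mut groups = Vec::with_capacity(splits.len());
    let mut row = 0;
    for &rows in splits {
        // Each group takes a contiguous band of whole rows.
        groups.push(data[row * k..(row + rows) * k].to_vec());
        row += rows;
    }
    groups
}

fn main() {
    // 4 rows of 2 columns, split into groups of 1 and 3 rows.
    let data: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let groups = split_m(&data, 4, 2, &[1, 3]);
    assert_eq!(groups[0], vec![0.0, 1.0]);
    assert_eq!(groups[1], vec![2.0, 3.0, 4.0, 5.0, 6.0, 7.0]);
    println!("split ok");
}
```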
Replace the double-downcast pattern (Opaque → BlobWithFact → BlockQuantFact) with a single downcast via `Tensor::try_storage_as::<BlockQuantStorage>()` across all creation and access sites in 10 crates. Key changes:
- All 14 creation sites now use `BlockQuantStorage::new(...).into_tensor()`
- All ~20 access sites use `tensor.try_storage_as::<BlockQuantStorage>()`
- SplitGroupBlockQuant output shape changes from [n] to [] (scalar)
- `Tensor::is_uniform()` returns false for non-dense storage
- `pad_q40`, `as_q40_tensor`, and gather `eval_bq` signatures updated
No longer needed after migration to BlockQuantStorage.
BlobWithFact stored an explicit ND shape (e.g. [b, n, k]). The migration to BlockQuantStorage lost that structure by keeping only flat m and k, and into_tensor() produced a rank-0 tensor. get_concrete_shapes was then chaining the outer tensor shape ([]) with bqf.shape() ([b*n, k]), giving a 2D weight shape that confused the batch-matmul kernel for b > 1.
- Replace the separate m/k fields in BlockQuantStorage with `shape: TVec<usize>`; `m()` and `k()` are now derived from the shape.
- Add a `new_with_shape()` constructor for callers that need ND shapes.
- `into_tensor()` uses `self.shape()` so the tensor carries its full logical shape.
- `get_concrete_shapes`: drop the Q4_0 `chain(bqf.shape())` and use `b.shape()` directly, since the shape now lives in the tensor.
- `pad_q40`: preserve the ND shape when rebuilding the padded storage.
- Test: use `new_with_shape([b, n, k])` so the batch structure survives the into_tensor / into_device round-trip.
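The derived accessors can be sketched as follows. The convention that m is the product of the leading dimensions and k the trailing one is an assumption, inferred from the commit's [b, n, k] shape flattening to [b*n, k]:

```rust
// Hypothetical sketch of a shape-carrying storage where the flat matmul
// dimensions are derived from the logical shape rather than stored.
struct BlockQuantStorage {
    shape: Vec<usize>, // full logical shape, e.g. [b, n, k]
}

impl BlockQuantStorage {
    // m: product of all leading dimensions ([b, n, k] -> b * n).
    fn m(&self) -> usize {
        self.shape[..self.shape.len() - 1].iter().product()
    }

    // k: the trailing dimension.
    fn k(&self) -> usize {
        *self.shape.last().unwrap()
    }
}

fn main() {
    let s = BlockQuantStorage { shape: vec![2, 3, 4] };
    assert_eq!(s.m(), 6);
    assert_eq!(s.k(), 4);
    println!("m={} k={}", s.m(), s.k());
}
```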
…shape

Same fix as the CUDA side: get_concrete_shapes and output_facts were building the weight shape by chaining the outer tensor shape with bqf.shape(). Now that BlockQuantStorage::into_tensor() gives the tensor its full logical shape, b.shape() is already correct and the chain produces duplicated dims.
- kernels/matmul/mod.rs (eval + dispatch_eval): use `b.shape()` directly
- ops/gemm.rs (output_facts + eval_with_session): same
- tests: use `new_with_shape([batch, n, k])` to preserve batch structure
…sors

block_quant_aware_input_shape was chaining fact.shape + bqf.shape(), which was correct when the tensor was rank-0 (fact.shape = []) but now produces doubled dims (e.g. [n, k, n, k]) since the tensor carries its shape. Use fact.shape directly, matching the non-opaque path.

CudaTensor::to_host had a "scalar opaque" assertion assuming rank-0 tensors, and reconstructed the storage with flat m/k from bqf. Remove the assertion and use new_with_shape with the full bqf shape so the round-trip preserves the tensor's logical shape.
Two issues introduced when BlockQuantStorage::into_tensor() started using the full logical shape (e.g. [m, k]) instead of a rank-0 tensor:

1. Const::change_axes now succeeded on opaque tensors (the rank-2 shape allowed AxisOp::Rm to proceed), causing ChangeAxes to strip the m-axis from Q4_0 weight consts. inject_m_or_n_axis then re-added it via an AddAxis node, but AddAxis::output_facts does not propagate konst, so optimized_mat_mul saw a non-const A input and panicked on the PackedFormat downcast. Fix: block axis changes on opaque Const tensors.
2. OptSimpleMatMulPack::output_facts copied inputs[0].shape (now [m, k]) as its output shape, but eval() still emits a rank-0 tensor. The mismatch caused PropConst to fail with a "trying to substitute" error. Fix: always emit a scalar (rank-0) output fact, consistent with eval.
Three stale assumptions based on block-quant consts being rank-0 scalars:

1. rewrite_block_quant_const_to_scalar: the rule was previously a no-op for block-quant consts (rank-0 hit the len() == 0 early-return). Now that consts have shape [m, k] the rule fired and hit the assert that volume() == 1. Fix: add `!volume().is_one()` to the early-return guard so properly-shaped consts are left alone; the NNEF serializer already writes the correct shape via bqs.to_block_quant_fact().shape().
2. matmul deserializer: the two `ensure!(shape.volume().is_one())` guards were checking for the old rank-0 representation. With the new code the loaded tensor has shape [m, k], so volume > 1. The guards are simply wrong now; remove them.
Previously only the `required_rank > actual_rank` branch propagated the BlockQuantFact shape and the `konst` field through axis operations. With rank-0 BQ tensors every AxisOp fell into that branch. Now that BlockQuantStorage carries a full logical shape ([m, k]) the tensor rank equals the required rank, so the normal path is used instead, and it neither updated the BQF shape nor forwarded `konst`. Fix the normal path to:

1. Update the BlockQuantFact shape in sync with the tensor shape.
2. Propagate `konst` by applying the same axis transformation to the const tensor (falling back gracefully if the transform fails, e.g. for symbolic dims).

Also include the previously-derived fix for EinSum::axes_mapping that computes logical_rank - actual_rank correctly for the mixed BlockQuantFact / Packed* opaque-fact world.
SplitGroupBlockQuant::output_facts had a hardcoded rank == 0 check and produced a rank-0 output fact, both relics of the old rank-0 BQ representation. Remove the check and build the output fact with the correct rank-2 shape.

BlockQuantIntoShape::output_facts was cloning the input fact (including its old tensor shape) and only replacing the opaque metadata, so the tensor shape in the output fact was stale. Build a fresh fact using self.shape instead.

Conv::wire_kernel_as_g_o_ihw was also guarding with `fact.rank() == 0` when validating the kernel format for BQ inputs. Drop that check so rank-2 BQ kernels are accepted.
BlockQuantStorage goes back to plain `m: usize, k: usize` instead of `shape: TVec<usize>`. The full logical kernel shape (e.g. [OC, IC, kH]) is carried by the Tensor and by BlockQuantFact; the storage only needs the flat matmul dimensions. The NNEF reader now stores the full header shape on the Tensor while keeping the BQS flat, and the variable deserializer builds the BlockQuantFact from the NNEF-declared shape so that conv deserialization sees the correct kernel dimensions.

Conv wiring fixes:
- OptSimpleMatMulPack gains a `num_groups` field so output_facts returns rank-0 (scalar) for single-group BQ and rank-1 [G] for multi-group, matching what eval produces.
- wire_mm_weights_bias skips the group axis mapping when the packed kernel is a rank-0 scalar, and forces trivial_packing = true in that case to avoid navigating a rank-0 tensor via offset_axis.
- declutter_as_einsum bails out early for opaque (BQ) kernels whose wire_kernel_as_g_o_ihw output does not carry a leading group axis.
- consistent() relaxes the BQF/BQS cross-check to compare m(), k(), and format only, since the fact may carry a full logical shape while the storage is always flat.

All conv-q40 NNEF test cases and the Llama-3.2-1B q40ef16 LLM test pass.
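The relaxed `consistent()` cross-check can be sketched as below. The type and field layouts are assumptions; the point is that the fact may carry a full logical shape while the storage stays flat, so only the derived matmul dimensions and the format are compared:

```rust
// Hypothetical sketch of the relaxed BQF/BQS consistency check.
#[derive(PartialEq, Debug, Clone)]
struct Format(&'static str); // stand-in for the block-quant format

struct BlockQuantFact {
    format: Format,
    shape: Vec<usize>, // full logical shape, e.g. [OC, IC, kH]
}

struct BlockQuantStorage {
    format: Format,
    m: usize, // flat matmul rows
    k: usize, // flat matmul columns
}

impl BlockQuantFact {
    // Derived flat dims, assuming leading dims fold into m and the
    // trailing dim is k.
    fn m(&self) -> usize {
        self.shape[..self.shape.len() - 1].iter().product()
    }
    fn k(&self) -> usize {
        *self.shape.last().unwrap()
    }
}

// Compare only m(), k(), and format; the logical shape is allowed to
// differ in rank from the flat storage.
fn consistent(fact: &BlockQuantFact, storage: &BlockQuantStorage) -> bool {
    fact.m() == storage.m && fact.k() == storage.k && fact.format == storage.format
}

fn main() {
    let fact = BlockQuantFact { format: Format("Q4_0"), shape: vec![8, 4, 3] };
    let storage = BlockQuantStorage { format: Format("Q4_0"), m: 32, k: 3 };
    assert!(consistent(&fact, &storage));
    println!("consistent");
}
```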
…ghts

BlockQuantStorage::into_tensor() now produces [m, k]-shaped tensors, but GgmlGemm::output_shape expects the weights tensor to carry its full [b, n, k] shape to compute the correct output dimensions. Use Tensor::from_storage directly with the 3D shape in the test helper.
…eights

Same issue as the CUDA fix: BlockQuantStorage::into_tensor() now produces [m, k]-shaped tensors, but GemmImpl::output_shape expects weights to carry their full [b, n, k] shape. Use Tensor::from_storage with the 3D shape.
pad_q40_weights was calling padded_bqs.into_tensor() which produces a flat [m, k+pad] tensor, but GgmlGemm expects the weight tensor to carry its full logical shape (e.g. [b, n, k+pad]). Reconstruct the padded tensor from the original shape with only the last dimension updated. Fixes cuda-lovelace openelm-270M q40ef16 CI failure.
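The shape reconstruction can be sketched with a small helper (the helper name and signature are hypothetical; the fix itself lives in pad_q40_weights):

```rust
// Hypothetical sketch: keep the full logical shape of a padded weight
// tensor, updating only the trailing (k) dimension, instead of
// flattening to [m, k + pad] as into_tensor() would.
fn padded_shape(original: &[usize], padded_k: usize) -> Vec<usize> {
    let mut shape = original.to_vec();
    *shape.last_mut().unwrap() = padded_k;
    shape
}

fn main() {
    // e.g. [b, n, k] = [2, 8, 30] padded to k = 32.
    assert_eq!(padded_shape(&[2, 8, 30], 32), vec![2, 8, 32]);
    println!("shape ok");
}
```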