Add Int8Tensor for clearer interface #3038
base: main
Conversation
Introduce new tensor subclass API for int8 quantization with a clearer interface. The main change can be summarized as follows:
- Old: complex affine transform (AffineQuantizedTensor) with separate layout handling
- New: direct int8 tensor with qdata, scale, and zero_point attributes

Test plan: test/quantization/quantize_/workflows/int8/test_int8_tensor.py
Future plan: implement block-wise quantization using the `block_size` parameter
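A rough usage sketch of the new interface (the import path and the exact from_hp signature are assumptions based on this description; the per-row block_size default matches what appears later in the diff):

```python
import torch

from torchao.quantization import Int8Tensor  # import path assumed

# High-precision weight, quantized with one scale per output row
w = torch.randn(256, 512, dtype=torch.bfloat16)
w_int8 = Int8Tensor.from_hp(w, block_size=[1, w.shape[-1]])

# The subclass exposes its quantization state directly
print(w_int8.qdata.dtype)   # torch.int8
print(w_int8.scale.shape)   # roughly one scale per row
w_hp = w_int8.dequantize()  # back to a high-precision tensor
```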
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3038
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
can you add a version 2 and expose this tensor through:
ao/torchao/quantization/quant_api.py, Line 1497 in 8525185
ao/torchao/quantization/quant_api.py, Line 1752 in 8525185
    result = result.to(scale.dtype) * scale
    result = result.view(*input_tensor.shape[:-1], -1)
else:
    # FP × INT8 (static)
also this is the code for weight only quant I think:
ao/torchao/dtypes/uintx/plain_layout.py, Line 250 in 122b307:
def _linear_fp_act_int8_weight_impl(input_tensor, weight_tensor, bias):
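For reference, a weight-only path like _linear_fp_act_int8_weight_impl typically dequantizes the int8 weight and then runs a regular linear; a minimal illustrative sketch (not the actual plain_layout.py code):

```python
import torch
import torch.nn.functional as F

def _linear_fp_act_int8_weight_sketch(input_tensor, weight_tensor, bias):
    # weight_tensor is assumed to carry int8 qdata of shape [N, K] and a
    # per-row scale of shape [N, 1]; dequantize to the activation dtype,
    # then call a regular linear.
    w_fp = weight_tensor.qdata.to(input_tensor.dtype) * weight_tensor.scale
    return F.linear(input_tensor, w_fp, bias)
```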
Done at 9383550, thanks for pointing it out.
    raise ValueError("Expected 2D tensor and block_size length 2")

# Rounding function from high precision dtype
scale = w.abs().max(dim=-1, keepdim=True)[0] / 127.0
looks like block_size is not used? why is that?
you can check out:
ao/torchao/dtypes/uintx/plain_layout.py, Line 232 in 8c5c33e:
def _linear_fp_act_int8_weight_check(input_tensor, weight_tensor, bias):
also this should be using these quant primitive ops:
ao/torchao/quantization/quantize_/workflows/int4/int4_marlin_sparse_tensor.py, Lines 79 to 97 in 8c5c33e:

scale, zero_point = choose_qparams_affine(
    input=preprocessed_w,
    mapping_type=MappingType.SYMMETRIC,
    block_size=block_size,
    target_dtype=target_dtype,
    quant_min=quant_min,
    quant_max=quant_max,
    eps=1e-6,
)
wq = quantize_affine(
    input=preprocessed_w,
    block_size=block_size,
    scale=scale,
    zero_point=zero_point,
    output_dtype=target_dtype,
    quant_min=quant_min,
    quant_max=quant_max,
)
ao/torchao/quantization/quant_api.py, Line 1566 in 8c5c33e:
new_weight = to_affine_quantized_intx(
ao/torchao/dtypes/affine_quantized_tensor.py, Line 325 in 8c5c33e:
scale, zero_point = choose_qparams_affine(
this might require a bit too much context, let me know if you would like us to take over
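Adapted to int8, from_hp could follow the same pattern built on those primitives; a hedged sketch (the symmetric int8 range and the exact import path are assumptions):

```python
import torch

# import path assumed; these primitives are the ones referenced in the snippet above
from torchao.quantization.quant_primitives import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
)

def _int8_from_hp_sketch(w: torch.Tensor, block_size: list):
    scale, zero_point = choose_qparams_affine(
        input=w,
        mapping_type=MappingType.SYMMETRIC,
        block_size=block_size,
        target_dtype=torch.int8,
        quant_min=-128,
        quant_max=127,
    )
    qdata = quantize_affine(
        input=w,
        block_size=block_size,
        scale=scale,
        zero_point=zero_point,
        output_dtype=torch.int8,
        quant_min=-128,
        quant_max=127,
    )
    return qdata, scale, zero_point
```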
Thanks, I'd surely like to take this over! I drafted this PR for those updates, but will look into it today (in about 6 hours).
BTW, version 2 is updated at c53dad0 (version 1 is the default).
please rebase, and let me know when this is ready for review again @namgyu-youn
)
else:
    assert config.version == 2, f"Unexpected version: {config.version}"
    block_size = [weight.shape[0], weight.shape[1]]
this should be the same as L1393 I think; you can extract L1390-L1393 out of the first if branch and reuse that
Isn't dividing the logic much safer and easier for deprecating the old API in the future? Other APIs like _float8_weight_only_quant_tensor have also been used with this convention, without a common branch.
it's fine to duplicate I think, but the current code for block_size doesn't support 3d though
Okay then I will keep this branch and update the assert for 3D check.
Oh, actually we are already doing the 3D check in from_hp(), by using:

if w.dim() != 2 or len(block_size) != 2:
    raise ValueError("Expected 2D tensor and block_size length 2")
torchao/quantization/quant_api.py
else:
    quantized_weight = Int8Tensor.from_hp(
        weight,
        block_size=get_weight_block_size(weight),
nit: can calculate block_size outside of the if/else
elif isinstance(quant_kwargs, QuantizeTensorToInt8Kwargs):
    return Int8Tensor.from_hp(
        tensor,
        quant_kwargs.block_size or [1, tensor.shape[-1]],
nit: why not make block_size mandatory?
this one is still not resolved yet
        block_size (Optional[list[int]]): block size for quantization granularity
    """

    block_size: Optional[list[int]] = None
why is this optional?
It was a wrong type hint, because the API can't work without granularity; mandatory (non-optional) should be right.
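With block_size made mandatory, the kwargs container would look roughly like this (a sketch; the actual base class and full field set in the PR may differ, and static_scale is included only because it appears later in the discussion):

```python
from dataclasses import dataclass
from typing import List, Optional

import torch

@dataclass
class QuantizeTensorToInt8Kwargs:
    # required: the API cannot pick a quantization granularity on its own
    block_size: List[int]
    # optional pre-computed activation scale for the static quantization path
    static_scale: Optional[torch.Tensor] = None
```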
# Reshape 1D scale to [N, 1] for broadcasting with [N, K] qdata
if scale.ndim == 1:
    scale = scale.unsqueeze(1)
is this needed?
)

@implements(aten.transpose.int)
we don't need this yet I think, we can remove for now and add later when needed
Could you tell me why there is no need to support transposition for quantized tensors? I thought it was just a type of tensor. If we remove this, how can users transpose it like Tensor.transpose()?
just haven't seen people using it yet; I think we should implement as little as possible to keep the maintenance burden low
if dim == 0 and tensor.scale.ndim >= 1:
    sliced_scale = aten.slice.Tensor(tensor.scale, 0, start, end, step)

sliced_shape = list(
why not get the shape from sliced tensor directly?
can you check:
@implements(aten.slice.Tensor)
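In isolation, the earlier suggestion is that the sliced shape can be read straight off the sliced qdata rather than recomputed by hand; a minimal standalone sketch:

```python
import torch

aten = torch.ops.aten

qdata = torch.randint(-128, 127, (256, 512), dtype=torch.int8)
dim, start, end, step = 0, 0, 64, 1

sliced_qdata = aten.slice.Tensor(qdata, dim, start, end, step)
sliced_shape = list(sliced_qdata.shape)  # [64, 512], no manual bookkeeping needed
```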
return args[0].dequantize()

@implements([torch.nn.functional.linear, aten.linear.default])
nit: implements is refactored now: https://github.com/pytorch/ao/pull/2866/files
if not isinstance(aten_ops_or_torch_fns, (list, tuple)):
    aten_ops_or_torch_fns = [aten_ops_or_torch_fns]
def _implements_torch_function(cls, torch_fns):
why these changes? is there some issue with rebase?
There was no merge conflict, so I overwrote this file, but I regret this commit; let me know if 0a45f90 should be reverted.
if not isinstance(activation_tensor, Int8Tensor):
    if weight_tensor.act_quant_kwargs.static_scale is not None:
        # INT8 × INT8 (static): symmetric quantization only
        static_scale = weight_tensor.act_quant_kwargs.static_scale
OK, if this is needed I think it should be included in _choose_quant_func_and_quantize_tensor as well?
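A rough sketch of how the static-scale branch might fold into _choose_quant_func_and_quantize_tensor (names and the Int8Tensor constructor arguments are assumptions based on the surrounding diff):

```python
import torch

# Int8Tensor and QuantizeTensorToInt8Kwargs are the classes introduced in this PR
def _choose_quant_func_and_quantize_tensor_sketch(tensor, quant_kwargs):
    if isinstance(quant_kwargs, QuantizeTensorToInt8Kwargs):
        if quant_kwargs.static_scale is not None:
            # static path: reuse the provided scale with symmetric int8 rounding
            scale = quant_kwargs.static_scale
            qdata = torch.clamp(torch.round(tensor / scale), -128, 127).to(torch.int8)
            return Int8Tensor(qdata, scale, quant_kwargs.block_size)
        # dynamic path: compute qparams from the tensor itself
        return Int8Tensor.from_hp(tensor, quant_kwargs.block_size)
    raise NotImplementedError(f"Unsupported quant_kwargs: {type(quant_kwargs)}")
```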
implements_torch_function = Int8Tensor.implements_torch_function

@implements([aten.dequantize.self])
is this needed? if not we should remove for now
if scale.numel() > 1 and scale.shape != qdata_fp.shape:
    scale = scale.view(*scale.shape, *[1] * (qdata_fp.ndim - scale.ndim))
is this needed?
It is needed for block-level granularity. For example,
- Row-wise: If scale shape is (64, 1) and w_q (quantized weight shape) is (256, 512), we can naturally broadcast them
- Channel-wise: If scale shape is (512,) and w_q is (256, 512), we can naturally broadcast them
- Block-size granularity: If scale shape is (32, 64) and w_q is (256, 512), we have to rescale to broadcast them.
But we can also reuse _maybe_expand_scale_to_tensor_shape, similar to:
ao/torchao/quantization/quantize_/workflows/float8/float8_tensor.py, Lines 149 to 154 in 4b79f9e:

def dequantize(self, output_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    if output_dtype is None:
        output_dtype = self.dtype
    qdata, scale = self.qdata, self.scale
    return _dequantize_affine_float8(qdata, scale, output_dtype)
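For the block-granularity case, the rescaling amounts to expanding each per-block scale over its block before multiplying; a standalone sketch of that idea (not the _maybe_expand_scale_to_tensor_shape helper itself):

```python
import torch

def expand_block_scale(scale: torch.Tensor, block_size: list) -> torch.Tensor:
    # Repeat each per-block scale over its block along every dimension
    expanded = scale
    for dim, blk in enumerate(block_size):
        expanded = expanded.repeat_interleave(blk, dim=dim)
    return expanded

qdata = torch.randint(-128, 127, (256, 512), dtype=torch.int8)
scale = torch.rand(32, 64)  # one scale per 8x8 block of the 256x512 weight
w_hp = qdata.float() * expand_block_scale(scale, [8, 8])
print(w_hp.shape)  # torch.Size([256, 512])
```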
cls: type,
qdata: torch.Tensor,
scale: torch.Tensor,
block_size: list[int],
nit: I remember list has a higher Python version requirement, so it's probably better to change this to List from typing, I think
Thanks, is it only for List, not for Dict, Tuple, etc.?
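For context: the subscripted builtins need Python 3.9+ (PEP 585), and that applies to list, dict, tuple, set, etc. alike; the typing aliases work on older versions:

```python
from typing import Dict, List, Tuple

# Works on Python 3.8 and earlier 3.x releases
block_size: List[int] = [1, 512]
config: Dict[str, int] = {"bits": 8}
shape: Tuple[int, int] = (256, 512)

# Requires Python 3.9+ (PEP 585); the same applies to dict[...], tuple[...], set[...]
# block_size: list[int] = [1, 512]
```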
return module

def _unwrap_float8_linear(module: Float8Linear) -> nn.Linear:
some rebase issue?
Updated log:
To reviewers:
Summary:
Introduce new tensor subclass API for int8 quantization with clearer interface.
The main change can be summarized as follows:
- Old: complex affine transform (AffineQuantizedTensor) with separate layout handling
- New: direct int8 tensor with qdata, scale, and zero_point attributes

Related Issue/PR: #3012 (comment), #2752
Test plan:
test/quantization/quantize_/workflows/int8/test_int8_tensor.py
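A hedged usage sketch of exercising the new path once it is exposed through quant_api.py (the config name and the version argument are assumptions based on this PR discussion, not confirmed API):

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig  # config name assumed

model = torch.nn.Sequential(torch.nn.Linear(512, 256)).to(torch.bfloat16)

# version=2 is assumed to select the new Int8Tensor-based workflow from this PR
quantize_(model, Int8WeightOnlyConfig(version=2))

# the Linear weight should now be backed by the new Int8Tensor subclass
```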