[Sharktank] sliding_window out of ops.attention #2293
base: main
Conversation
mask = torch.triu(mask, diagonal=1)
return mask.to(device)

is_prefill = kv_size == n_tokens
Generating the mask should barely matter for prefill vs decode. The only thing that makes a difference is the offset.
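A minimal sketch of that point, with a hypothetical helper name and signature: the same mask builder covers both prefill and decode if the only varying input is the query offset into the KV cache.

import torch

def causal_mask(n_tokens: int, kv_size: int, offset: int,
                dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    # Prefill would use offset=0; decode would use offset = kv_size - n_tokens.
    q_pos = torch.arange(n_tokens, device=device) + offset   # absolute query positions
    k_pos = torch.arange(kv_size, device=device)             # absolute key positions
    allowed = k_pos.unsqueeze(0) <= q_pos.unsqueeze(1)       # [n_tokens, kv_size]
    mask = torch.zeros(n_tokens, kv_size, dtype=dtype, device=device)
    return mask.masked_fill(~allowed, float("-inf"))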
def build_mask(
    self,
    mask: Optional[torch.Tensor],
Your behavior for this mask is weird: sometimes you return it, other times you ignore it. Rethink which arguments are actually needed.
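A hypothetical sketch of one consistent behavior, assuming additive (-inf style) masks and reusing the causal_mask helper sketched above: always fold the caller's mask into the one being built, rather than sometimes returning it and sometimes dropping it.

from typing import Optional
import torch

def build_mask(self, mask: Optional[torch.Tensor], *, n_tokens: int, kv_size: int,
               offset: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    built = causal_mask(n_tokens, kv_size, offset, dtype, device)
    if mask is not None:
        # Additive masks compose; the caller's mask is never silently ignored.
        built = built + mask.to(dtype=dtype, device=device)
    return built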
    sink: torch.Tensor, bs: int, n_heads: int, n_tokens: int
) -> torch.Tensor:
    """Prepare sink tensor for attention: [sink_size, n_heads] -> [bs, n_heads, n_tokens, sink_size]"""
    if sink.dim() == 1:
Just make sure the input is dim == 2, and if it's the same per head, materialize a unary dimension.
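A short sketch of that contract (the helper and the call site are hypothetical): accept only a 2-D sink, and let the caller materialize the unary head dimension when one value is shared across heads.

import torch

def check_sink(sink: torch.Tensor) -> torch.Tensor:
    assert sink.dim() == 2, f"expected [sink_size, n_heads], got {tuple(sink.shape)}"
    return sink

shared = torch.zeros(4)              # one value per sink slot, shared across heads
check_sink(shared.unsqueeze(-1))     # caller materializes the unary head dim: [4, 1]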
    n_tokens: int,
    dtype: torch.dtype,
    device: torch.device,
):
This appears to just be a copy of the component above; replicated behavior doesn't really help for testing.
ab56847 to a7d6d53 (Compare)
Adding moe for gpt-oss. Also cleaned moe implementations and added numeric testing for pregather.
237e2ce to ff80fcb (Compare)
ff80fcb to 51b12ee (Compare)
    v_planes, quantizer=cache_quantizer, dtype=self.attn_dtype
)

effective_mask = self.build_mask(
There is already mask construction here, in a parent function down the call stack. I am curious what the reason is for splitting it like that. Considering that the other method is called paged_attention, it probably does make sense to move the mask construction out of there and put it all here, since the construction is not related to the paging details. That move can probably happen in another PR.
start_positions does not get propagated in the call chain. Maybe they should stay separate.
# Sink weight
sink_expanded = sink.reshape(1, 1, 1, 1).expand(1, 1, 2, 1)
attn_with_sink = torch.cat([attn_weights, sink_expanded], dim=-1)
sink_weights_full = torch.softmax(attn_with_sink, dim=-1)
Can we replace all the torch ops used in this test with sharktank ops? E.g. torch.softmax -> ops.softmax.
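A sketch of the requested substitution in the test, assuming sharktank's ops module exposes cat and softmax with torch-like signatures; the tensor names reuse the snippet above.

from sharktank import ops

attn_with_sink = ops.cat([attn_weights, sink_expanded], dim=-1)
sink_weights_full = ops.softmax(attn_with_sink, dim=-1)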
sliding_window_tensor = torch.full_like(global_q_pos, sliding_window - 1)
first_allowed_k_pos = (global_q_pos - sliding_window_tensor).clamp_min(0)
too_old = kv_positions.unsqueeze(0) < first_allowed_k_pos.unsqueeze(1)
invalid = future | too_old
Can we rename future, too_old to future_ctx, initial_ctx, previous_ctx accordingly?
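For illustration only, one possible reading of that rename (the exact mapping onto the three suggested names is a guess), applied to the last two lines of the snippet above:

previous_ctx = kv_positions.unsqueeze(0) < first_allowed_k_pos.unsqueeze(1)  # keys that fell out of the window
invalid = future_ctx | previous_ctx  # future_ctx: the renamed `future` computed earlier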
    fake_quant: Optional[bool],
    softcap: Optional[float] = None,
    scale: Optional[torch.Tensor | ReplicatedTensor] = None,
    mask: Optional[torch.Tensor | ReplicatedTensor] = None,
Shouldn't all these types be included in the mask type hint in build_mask?
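A sketch of the widened hint, assuming ReplicatedTensor is importable from sharktank.types; only the mask parameter is shown and the rest of the signature is elided.

from typing import Optional
import torch
from sharktank.types import ReplicatedTensor

def build_mask(
    self,
    mask: Optional[torch.Tensor | ReplicatedTensor],  # matches the attention signature above
    # ... remaining parameters unchanged ...
):
    ...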
if sink is not None:
    sink = sink.to(q.dtype)
    sink = sink.reshape(1, -1, 1, 1).expand(bs, -1, n_tokens, 1)
    sink = sink.to(q.dtype).to(q.device)
Shouldn't this be taken care of before sdpa, where the sink is generated?
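A sketch of that move, reusing the names from the snippet above: cast and expand the sink where it is generated, so sdpa receives it ready to use and needs no dtype/device fixups.

# Done once, at the point where the sink is created (before the sdpa call):
sink = sink.to(dtype=q.dtype, device=q.device)                # single cast
sink = sink.reshape(1, -1, 1, 1).expand(bs, -1, n_tokens, 1)  # [bs, n_heads, n_tokens, 1]
# ...then pass `sink` into sdpa unchanged.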
# Sink should match [bs, n_heads, n_tokens, sink_size] to concat with attn_weights [bs, n_heads, n_tokens, kv_size]
sink = sink.reshape(1, n_heads, 1, 1).expand(bs, n_heads, n_tokens, 1)
Same as above: if possible, this has to be done before sink is fed to sdpa.
# write(). With a fixed stride=16 and seqlen=8 we tried to unflatten an
# 8-length dimension into (1,16) causing the RuntimeError observed:
# unflatten: Provided sizes [1, 16] don't multiply up to size 8
# Setting stride=min(16, seqlen) ensures partial (short) sequences map
Is there a reason we mention this but don't enforce it here?
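A minimal sketch of enforcing what the comment describes; `stride` and `seqlen` mirror the comment's wording rather than the actual variable names in the code, and the divisibility check is an added assumption.

stride = min(16, seqlen)  # short sequences must not be unflattened into (1, 16)
assert seqlen % stride == 0, f"seqlen={seqlen} is not divisible by stride={stride}"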
)


def test__invoke_golden_mask_cases():
Double _ after "test".
Taking sliding_window out of ops.attention.