
Conversation


@kaix-nv kaix-nv commented Nov 7, 2025

What does this PR do?

Type of change: New feature

Overview:
This PR adds sparse attention support to ModelOpt, applying attention sparsity via the skip-softmax method to enable inference speedups for LLMs.

Key Features:

  • Skip softmax support
  • Sparse attention config
  • Extensible method registry for future sparse attention algorithms
  • HuggingFace Transformers integration
  • Phase-aware thresholds (separate prefill/decode)
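
To illustrate the phase-aware thresholds, a custom configuration might look roughly like the sketch below. This is a hypothetical illustration only: the key names (wildcard pattern, method name, prefill/decode threshold fields) are assumptions and may not match the exact schema added in config.py; see the design doc for the real interface.

# Hypothetical sketch only: key names are assumed, not the exact schema from this PR.
CUSTOM_SKIP_SOFTMAX = {
    "sparse_cfg": {
        "*attn*": {                      # assumed wildcard pattern matching attention modules
            "method": "flash_skip_softmax",
            "prefill_threshold": 1e-4,   # assumed name: threshold applied during prefill
            "decode_threshold": 1e-5,    # assumed name: threshold applied during decode
        },
    },
}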

Design doc

Usage

import torch
import modelopt.torch.sparsity.attention_sparsity as mts
from transformers import AutoModelForCausalLM

# Load model (must use eager attention for softmax patching)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="eager",  # Required!
    torch_dtype=torch.bfloat16,
)

# Use pre-defined configuration
from modelopt.torch.sparsity.attention_sparsity import SKIP_SOFTMAX_DEFAULT
model = mts.sparsify(model, SKIP_SOFTMAX_DEFAULT)
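
As a quick end-to-end sanity check, the sparsified model remains a drop-in replacement for the original and can be used with the standard Transformers generation API. The snippet below uses only stock Hugging Face calls and assumes the model loaded above; it is not part of this PR's API.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# Generate with the sparsified model exactly as with the original model.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))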

Testing

Unit, GPU, and example tests

pytest tests/unit/torch/sparsity/attention_sparsity -v
pytest tests/gpu/torch/sparsity/attention_sparsity -v
pytest tests/examples/llm_sparsity/attention_sparsity -v

ALL PASSED.

Accuracy

Benchmark: MMLU
Model: Qwen/Qwen3-4B
Cmd: python mmlu.py --model_name causal --model_path Qwen/Qwen3-4B --sparse_cfg SKIP_SOFTMAX_DEFAULT

Config                 MMLU
BF16                   69.96
SKIP_SOFTMAX_DEFAULT   69.86

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@kaix-nv kaix-nv requested a review from a team as a code owner November 7, 2025 07:53
@kaix-nv kaix-nv requested review from realAsma and removed request for realAsma November 7, 2025 07:53

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 89.67254% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.69%. Comparing base (e74a468) to head (5d027e0).
⚠️ Report is 6 commits behind head on main.

Files with missing lines                                   Patch %   Missing
...ch/sparsity/attention_sparsity/sparse_attention.py      71.42%    16 ⚠️
...pt/torch/sparsity/attention_sparsity/conversion.py      93.02%    9 ⚠️
...y/attention_sparsity/methods/flash_skip_softmax.py      90.90%    8 ⚠️
...ch/sparsity/attention_sparsity/methods/registry.py      88.88%    3 ⚠️
modelopt/torch/sparsity/attention_sparsity/mode.py         90.32%    3 ⚠️
...delopt/torch/sparsity/attention_sparsity/config.py      95.91%    2 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #527      +/-   ##
==========================================
+ Coverage   74.36%   74.69%   +0.33%     
==========================================
  Files         182      191       +9     
  Lines       18216    18616     +400     
==========================================
+ Hits        13547    13906     +359     
- Misses       4669     4710      +41     

@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_core branch 4 times, most recently from 54bfe2c to 0ce1376 Compare November 8, 2025 03:31
@kaix-nv kaix-nv requested a review from jy-yuan November 11, 2025 18:29
@kaix-nv kaix-nv force-pushed the kaix/sparse_attention_core branch from fc9d285 to 5d027e0 Compare November 11, 2025 23:44
@kaix-nv kaix-nv changed the title [2/n] Add Core Sparse Attention Infrastructure [OMNIML-2852][2/n] Add Core Sparse Attention Infrastructure Nov 12, 2025
@kaix-nv kaix-nv changed the title [OMNIML-2852][2/n] Add Core Sparse Attention Infrastructure [OMNIML-2852] [2/n] Add Core Sparse Attention Infrastructure Nov 12, 2025