@danielkorzekwa
## What does this PR do?

Adds the compress tutorial (PoC) and the compress CLI app.

danielkorzekwa and others added 30 commits October 27, 2025 11:50

using MIP-based NAS search algorithm.
…ntal/ folder to not be run by CICD yet.

Signed-off-by: Daniel Korzekwa <[email protected]>
@danielkorzekwa danielkorzekwa requested review from a team as code owners November 3, 2025 13:56
@danielkorzekwa danielkorzekwa requested review from kevalmorabia97 and removed request for a team November 3, 2025 13:56
@codecov

codecov bot commented Nov 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.40%. Comparing base (1c12fd8) to head (25b4aed).

Additional details and impacted files

```
@@                Coverage Diff                @@
##           feature/compress     #492   +/-   ##
=================================================
  Coverage             73.40%   73.40%           
=================================================
  Files                   180      180           
  Lines                 18127    18127           
=================================================
  Hits                  13306    13306           
  Misses                 4821     4821           
```


## What does this PR do?

**Type of change:** 
Documentation

**Overview:** 
Updated the tutorial with more details on how to choose the required
config parameters and added MMLU evaluation.

---------

Signed-off-by: Liana Mikaelyan <[email protected]>
@LianaMikael LianaMikael requested a review from a team as a code owner November 4, 2025 10:03
@@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
**Contributor:**

Why does this need to be its own file?

**Author:**

It is how it was designed; any suggestions?

**Collaborator:**

We can rename modelopt/torch/_compress/dataset/prepare_dataset.py to modelopt/torch/_compress/utils/dataset_utils.py and later unify it with https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/utils/dataset_utils.py

We already have nemotron-post-training-dataset-v2 supported in modelopt/torch/utils/dataset_utils.py, so ideally we should be able to just use that.
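For reference, reusing it could look roughly like this (import path and signature assumed from the PTQ examples, so treat it as a sketch rather than a verified call):

```python
from transformers import AutoTokenizer

# Assumption: get_dataset_dataloader lives in modelopt.torch.utils.dataset_utils
# and accepts these parameters, as in the PTQ examples; verify before relying on it.
from modelopt.torch.utils.dataset_utils import get_dataset_dataloader

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
dataloader = get_dataset_dataloader(
    dataset_name="nemotron-post-training-dataset-v2",  # already supported there
    tokenizer=tokenizer,
    batch_size=8,
    num_samples=512,
)
```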

**Contributor:**

It seems to be made for the Nemotron post-training dataset rather than being generic. Which file even uses this?

@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
**Contributor:**

@kevalmorabia97 Would we want to adopt such timestamped logger for ModelOpt globally?

**Author:**

The logging approach should be unified across modelopt. I made a small one for now; we should use a standard Python logging library. Created internal NVIDIA modelopt issue: issues/40

```python
config={}, # this is not used as the search space is defined in the hydra config
)

print(timestamped("Compress Progress 8/8: compression pipeline completed (multi-gpu)"))
```
**Collaborator:**

How about we just make the print part of the function: def print_timestamped(...)?
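Something like this minimal sketch (print_timestamped is a hypothetical helper name, not an existing modelopt API):

```python
import datetime


def print_timestamped(message: str) -> None:
    """Print a message prefixed with the current wall-clock time."""
    now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{now}] {message}")


# Call sites then shrink from print(timestamped(...)) to:
print_timestamped("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
```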

```python
)

# mip_and_realize_models (distributed processing)
# TODO: How to make it part of mnt.search() api, similarly to run_full_compress() API
```
**Collaborator:**

I think this can be improved once everything is self-contained in modelopt. We don't need a separate function for mip_only. We can re-run the same run_full_compress, but internally, for each sub-step, it should check whether a checkpoint already exists and skip that step.

This generic solution will also help in other cases where the whole compress pipeline takes too long and we want to resume from some intermediate step.
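A rough sketch of that checkpoint-and-skip idea (step names and the per-step runner are hypothetical, not taken from this PR):

```python
from pathlib import Path

# Hypothetical sub-steps of the compress pipeline; the real names live in the PR.
STEPS = ["prune_and_score", "build_library", "mip_and_realize_models"]


def run_full_compress(workdir: str, run_step) -> None:
    """Run each sub-step, skipping any whose completion marker already exists.

    Re-running the same entry point then doubles as resume: finished steps are
    no-ops, and execution continues from the first step without a marker.
    """
    for step in STEPS:
        marker = Path(workdir) / f"{step}.done"
        if marker.exists():
            print(f"Skipping {step}: found {marker}")
            continue
        run_step(step, workdir)  # assumed callable that executes one sub-step
        marker.touch()  # mark the step complete only after it succeeds
```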


```python
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")


def timestamped(message: str) -> str:
```
**Collaborator:**

This can be changed to print_timestamped() and moved to https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/utils/logging.py

Ideally, configuring the standard Python logger to do this would be better than a custom timestamped message. See this example of using logging: https://stackoverflow.com/a/44175370

**Contributor:**

@danielkorzekwa Can all the print(timestamped(...)) calls be refactored as logging.info(...), with the standard logger adjusted later?
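For reference, the standard library already supports this; configuring the root logger once makes every logging.info(...) call timestamped (per the Stack Overflow answer linked above):

```python
import logging

# One-time setup at the CLI entry point; the datefmt matches the custom
# "%Y-%m-%d %H:%M:%S" helper so the output format stays the same.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

logging.info("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
```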

The supported modifications are:

- `ffn_intermediate_size`: different FFN intermediate sizes
- `attention op/noop`: complete removal of attention layers
**@kevalmorabia97 (Collaborator), Nov 7, 2025:**

Didn't we decide to keep the PoC to just FFN pruning, with no attention module replacement?


**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes we wish to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, 30% of the original). Note that the values must be hardware-friendly (divisible by a suitable power of two) to avoid issues with tensor operations in subsequent steps.
**Collaborator:**

Let's recommend divisible by 64 instead? The FFN value is 14336, so having only multiples of 64 in the search space should be more than enough, no?
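To make that concrete, a small sketch of building such a candidate list (the prune fractions are illustrative; 14336 is the base size discussed above):

```python
BASE_FFN = 14336  # original FFN intermediate size


def round_to_multiple(value: float, multiple: int = 64) -> int:
    """Round a candidate size to the nearest hardware-friendly multiple."""
    return int(round(value / multiple)) * multiple


# Prune by e.g. 15%, 20%, 30%: keep 85%, 80%, 70% of the original width.
intermediate_size_list = [BASE_FFN] + [
    round_to_multiple(BASE_FFN * keep) for keep in (0.85, 0.80, 0.70)
]
print(intermediate_size_list)  # [14336, 12160, 11456, 10048]
```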


```bash
...
block_0: attention gqa_4 ffn intermediate_14336
```
**Collaborator:**

GQA4 will only work with TP4 if training in Megatron-fw. Maybe for deployment too, but I don't know for sure. Should we remove GQA pruning from the search space?

```bash
...
block_0: attention gqa_4 ffn intermediate_14336
block_1: attention gqa_4 ffn intermediate_14336
```
**Collaborator:**

Why is no FFN being pruned here? Is it because we use a memory target and attention takes more memory, so it's pruned first by Puzzle?

```bash
lm_eval --model hf \
    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
    --tasks mmlu_humanities \
    ...
```
**Collaborator:**

Why mmlu_humanities instead of the generic mmlu?

**Collaborator:**

Why do we need a Dockerfile? If Puzzle is self-contained in modelopt and its dependencies are in setup.py, then installing modelopt will install everything needed, and users can just use a TRT-LLM docker image without any custom Dockerfile or docker build step.

**@kevalmorabia97 (Collaborator):**

A bunch of code quality checks are also failing.
