# Compress tutorial (PoC) #492
base: feature/compress
## Conversation
…using MIP-based NAS search algorithm; experimental code placed under the experimental/ folder so it is not yet run by CICD.
Signed-off-by: Daniel Korzekwa <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Coverage diff against base `feature/compress` for #492:

| | feature/compress | #492 | +/- |
|---|---|---|---|
| Coverage | 73.40% | 73.40% | — |
| Files | 180 | 180 | — |
| Lines | 18127 | 18127 | — |
| Hits | 13306 | 13306 | — |
| Misses | 4821 | 4821 | — |

View full report in Codecov by Sentry.
## What does this PR do?

**Type of change:** Documentation

**Overview:** Updated the tutorial with more details on how to choose the required config parameters and added MMLU evaluation.

Signed-off-by: Liana Mikaelyan <[email protected]>
```
@@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
Why does this need to be its own file?
It is how it was designed. Any suggestions?
We can rename from modelopt/torch/_compress/dataset/prepare_dataset.py to modelopt/torch/_compress/utils/dataset_utils.py and later unify with https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/utils/dataset_utils.py
We already have nemotron-post-training-dataset-v2 supported in modelopt/torch/utils/dataset_utils.py, so ideally we should be able to just use that.
It seems made for the Nemotron post-training dataset rather than being generic. Which file even uses this?
```
@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
@kevalmorabia97 Would we want to adopt such timestamped logger for ModelOpt globally?
The logging approach should be unified across modelopt. I made a minimal one for now; we should use a standard Python logging library. Created internal NVIDIA modelopt issue: issues/40
```python
    config={},  # this is not used as the search space is defined in the hydra config
)

print(timestamped("Compress Progress 8/8: compression pipeline completed (multi-gpu)"))
```
How about we just make the print part of the function: def print_timestamped(...)?
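A minimal sketch of that suggestion, assuming a helper shaped like this (the name print_timestamped comes from the comment above; the body is illustrative, reusing the timestamp format from the diff further down):

```python
import datetime


def print_timestamped(message: str) -> None:
    """Print a message prefixed with the current timestamp (illustrative helper)."""
    now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"{now} {message}")


# Call sites would then shrink from print(timestamped(...)) to:
print_timestamped("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
```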
```python
)

# mip_and_realize_models (distributed processing)
# TODO: How to make it part of mnt.search() api, similarly to run_full_compress() API
```
I think this can be improved once everything is self-contained in modelopt. We don't need a separate function for mip_only. We can re-run the same run_full_compress, but internally, for each sub-step, it should check whether a checkpoint already exists and skip that step.
This generic solution will also help in other cases where the whole compress pipeline takes too long and we want to resume from some intermediate step.
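A rough sketch of that checkpoint-skip pattern, with hypothetical step names and marker files (not the actual run_full_compress internals):

```python
from pathlib import Path
from typing import Callable


def run_step_if_needed(step_name: str, checkpoint_dir: Path, step_fn: Callable[[], None]) -> None:
    """Run a pipeline sub-step unless its completion marker already exists."""
    marker = checkpoint_dir / f"{step_name}.done"
    if marker.exists():
        print(f"Skipping {step_name}: found {marker}")
        return
    step_fn()
    marker.touch()  # record completion so a later re-run resumes past this step


# Re-running the full pipeline then resumes automatically, e.g.:
# run_step_if_needed("mip_and_realize_models", ckpt_dir, mip_and_realize_models)
```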
```python
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")


def timestamped(message: str) -> str:
```
This can be changed to print_timestamped() and moved to https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/utils/logging.py
Ideally, configuring the standard Python logger to do this would be better than a custom timestamped message. See this example of using logging: https://stackoverflow.com/a/44175370
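For reference, a sketch of that standard-logging setup in the spirit of the linked answer (the format strings are assumptions, chosen to match the timestamp style used in this PR):

```python
import logging

# Configure the root logger once to prepend timestamps to every message.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

# print(timestamped("...")) call sites would then become:
logging.info("Compress Progress 8/8: compression pipeline completed (multi-gpu)")
```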
@danielkorzekwa Can all the print(timestamped(...)) calls be refactored as logging.info(...), with the standard logger adjusted later?
The supported modifications are:

- `ffn_intermediate_size`: different FFN intermediate sizes
- `attention op/noop`: complete removal of attention layers
Didn't we decide to keep the PoC to just FFN pruning, with no attention module replacement?
**_NOTE:_**
How to choose `intermediate_size_list`?
The list specifies the candidate FFN sizes that we wish to search over. It is recommended to choose several pruning sizes (e.g. 15%, 20%, 30% etc. of the original). Note that the values must be hardware-friendly (divisible by a multiple of 2) to avoid issues with tensor operations in subsequent steps.
Let's recommend divisible by 64 instead? The FFN value is 14336, so having only multiples of 64 in the search space should be more than enough, no?
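To illustrate, a hypothetical snippet (14336 is the FFN size mentioned above; the pruning fractions are the examples from the tutorial text, read here as the fraction pruned away):

```python
original_ffn = 14336
pruning_fractions = [0.15, 0.20, 0.30, 0.50]  # fraction of the FFN to prune away

# Round each pruned size down to a multiple of 64 for hardware-friendly shapes.
intermediate_size_list = [
    int(original_ffn * (1 - frac)) // 64 * 64 for frac in pruning_fractions
]
print(intermediate_size_list)  # [12160, 11456, 9984, 7168]
```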
```bash
...
block_0: attention gqa_4 ffn intermediate_14336
```
GQA4 will only work with TP4 if training in Megatron-fw. Maybe for deployment also, but I don't know for sure. Should we remove GQA pruning from the search space?
```bash
...
block_0: attention gqa_4 ffn intermediate_14336
block_1: attention gqa_4 ffn intermediate_14336
```
Why is no FFN being pruned here? Is it because we use a memory target and attention takes more memory, so it is pruned first by Puzzle?
```bash
lm_eval --model hf \
    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
    --tasks mmlu_humanities \
```
Why mmlu_humanities instead of generic mmlu?
Why do we need a Dockerfile? If Puzzle is self-contained in modelopt and its dependencies are in setup.py, then installing modelopt will install everything needed, and users just need a trtllm docker image without any custom Dockerfile or docker build step.
A bunch of code quality checks are also failing.
## What does this PR do?

Compress tutorial (PoC) + compress CLI app.