KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:

- **CPU Weights (`convert_cpu_weights.py`)**: Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- **GPU Weights (`convert_gpu_weights.py`)**: Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts

---

```bash
pip install accelerate transformers llmcompressor datasets
```

**Required packages:**
- `accelerate`: For distributed model loading and device mapping
- `transformers`: For model and tokenizer loading
- `llmcompressor`: For quantization (supports GPTQ and RTN methods)
- `datasets`: For calibration data loading (GPTQ only)

**Documentation:** This tool is based on llmcompressor. For more details, see the [llmcompressor quantization guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#select-a-quantization-method-and-scheme).

### Overview

Apply weight quantization to the GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with `convert_cpu_weights.py` to enable heterogeneous expert placement:

- **GPU-resident experts** ("hot" experts) use GPTQ/RTN quantization (this tool) for efficient GPU memory usage
- **CPU-resident experts** ("cold" experts) use AMX-optimized INT4/INT8 quantization (`convert_cpu_weights.py`)
- **Attention layers, gates, and shared experts** remain in higher precision

This approach maximizes throughput and resource utilization by intelligently distributing experts across CPUs and GPUs.

### Quantization Methods

#### 1. GPTQ (Calibration-based, Default)
**Pros:**
- Higher accuracy through calibration-based quantization
- Recommended for production deployments

**Cons:**
- Requires a calibration dataset
- Slower quantization process
- Higher memory requirements (needs Hessian matrices)

#### 2. RTN (Round-To-Nearest)
**Pros:**
- Fast quantization (no calibration needed)
- Lower memory requirements
- Good for quick testing and prototyping

**Cons:**
- Slightly lower accuracy compared to GPTQ
- No calibration-based optimization

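Under the hood, the two methods map naturally onto different llmcompressor recipes. The sketch below is illustrative only: it assumes a recent llmcompressor that exports `oneshot` at the package top level, uses placeholder paths, and is not necessarily how `convert_gpu_weights.py` is implemented.

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

MODEL_ID = "/path/to/model"  # placeholder, substitute a real checkpoint

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# RTN: data-free round-to-nearest quantization of Linear weights to 4 bits.
# No calibration data is passed, which is why it is fast and memory-light.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# GPTQ would use GPTQModifier with the same scheme, plus a calibration
# dataset handed to oneshot() so per-layer Hessians can be estimated:
# recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained("/path/to/output", save_compressed=True)
```
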
### Quantization Types

- **W4A16**: 4-bit weights, 16-bit activations (INT4)
- **W8A16**: 8-bit weights, 16-bit activations (INT8)

### Basic Usage

#### GPTQ Quantization (Recommended for Production)
```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16
```

#### RTN Quantization (Fast, for Testing)
```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method RTN \
    --quant_type W4A16
```

### Memory Requirements

Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between the RTN and GPTQ methods.

#### RTN Memory Requirements

Beyond the model weights themselves, RTN only needs extra memory for the quantization parameters (scales/zero-points):

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters |

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~22.4 GB (1 layer)

#### GPTQ Memory Requirements

GPTQ requires additional memory for Hessian matrices during calibration:

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters × 2 |

The Hessian matrices are roughly the same size as the layer weights and are used to improve accuracy recovery.

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~44.8 GB (1 layer × 2 for Hessian matrices)

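The figures above follow from simple parameter-count arithmetic, reproduced in the sketch below. `quant_memory_estimate` is a hypothetical helper, not part of KT-Kernel; the 61-layer count is an assumption based on DeepSeek-R1's config, and the 44.9 it prints matches the ~44.8 GB above up to rounding.

```python
def quant_memory_estimate(params_billion: float, num_layers: int,
                          bytes_per_param: float = 2.0, method: str = "GPTQ"):
    """Rough DRAM/VRAM needs (in GB) for offline quantization."""
    # DRAM must hold every weight: billions of params x bytes per param = GB.
    dram_gb = params_billion * bytes_per_param
    # VRAM must hold one layer at a time; GPTQ roughly doubles that because
    # its Hessian buffers are about the same size as the layer weights.
    layer_gb = dram_gb / num_layers
    vram_gb = layer_gb * (2 if method.upper() == "GPTQ" else 1)
    return round(dram_gb, 1), round(vram_gb, 1)

# DeepSeek-R1-0528-BF16: ~684B params, 61 layers (assumed), BF16 = 2 bytes.
print(quant_memory_estimate(684, 61, method="RTN"))   # (1368.0, 22.4)
print(quant_memory_estimate(684, 61, method="GPTQ"))  # (1368.0, 44.9)
```
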
#### Method Comparison

| Method | Speed | VRAM | Accuracy | Use Case |
|--------|-------|------|----------|----------|
| **RTN** | Fast | Low (~22GB) | Good | Testing, prototyping |
| **GPTQ** | Slow | High (~45GB) | Better | Production deployment |

### Advanced Options

#### Calibration Configuration (GPTQ Only)

For GPTQ quantization, you can tune the calibration process for better quantization quality:

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --dataset HuggingFaceH4/ultrachat_200k \
    --dataset_split train_sft
```

**Options (GPTQ only):**
- `--num_calibration_samples`: Number of samples for calibration (default: 512)
- `--max_sequence_length`: Maximum sequence length (default: 2048)
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
- `--dampening_frac`: Dampening fraction to reduce quantization noise (default: 0.1)

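For orientation, these options describe a fairly conventional calibration-set preparation. The sketch below shows how such a set could be assembled with `datasets` and `transformers`; the preprocessing details are an illustration, not necessarily the script's exact pipeline, and the model path is a placeholder.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "/path/to/model"   # placeholder
NUM_SAMPLES = 512             # --num_calibration_samples
MAX_SEQ_LEN = 2048            # --max_sequence_length

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# --dataset / --dataset_split: sample a small calibration subset.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

# Render chat messages to plain text, then tokenize with truncation.
def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_SEQ_LEN, truncation=True)

calibration = ds.map(preprocess, remove_columns=ds.column_names)
```
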
#### Memory Management

Use `--max_gpu_memory` to limit GPU memory usage and offload the remaining layers to CPU:

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB"
```

**Recommended settings for GPTQ:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |

**Recommended settings for RTN:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |

**Options:**
- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory (default: 1000GiB when `--max_gpu_memory` is set)

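These limits follow the `max_memory` convention that transformers/accelerate use when building an automatic device map. A minimal sketch of the equivalent loading call, assuming (not verified here) that the script forwards the flags this way; the path is a placeholder.

```python
from transformers import AutoModelForCausalLM

# Roughly what --max_gpu_memory "40GiB" --max_cpu_memory "1000GiB" expresses:
# cap GPU 0 at 40 GiB of weights and spill the remaining layers to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/model",                      # placeholder checkpoint path
    device_map="auto",
    max_memory={0: "40GiB", "cpu": "1000GiB"},
    torch_dtype="auto",
    trust_remote_code=True,
)
```
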
**Important:** llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:
1. Use RTN instead of GPTQ (requires less memory)
2. Reduce `--num_calibration_samples` (GPTQ only, e.g., 256)
3. Reduce `--max_sequence_length` (GPTQ only, e.g., 1024)
4. Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)

### Examples

#### Example 1: GPTQ Quantization for Production (Qwen3-Next-80B, W4A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
    --output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```

#### Example 2: RTN Quantization for Fast Testing (DeepSeek-R1, W4A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
    --output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
    --quant_method RTN \
    --quant_type W4A16 \
    --max_gpu_memory "70GiB" \
    --trust_remote_code
```

#### Example 3: GPTQ with Custom Calibration Dataset (GLM-4.5-Air, W8A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/GLM-4.5-Air \
    --output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
    --quant_method GPTQ \
    --quant_type W8A16 \
    --dataset "tatsu-lab/alpaca" \
    --dataset_split "train" \
    --num_calibration_samples 256 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```