
Commit fd78fe5

fix(scripts): resolve OOM when converting gpu weights and update README (#1640)
1 parent e637fed commit fd78fe5

2 files changed (+266, -73 lines)

kt-kernel/scripts/README.md

Lines changed: 138 additions & 25 deletions
@@ -3,7 +3,7 @@
KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:

- **CPU Weights (`convert_cpu_weights.py`)**: Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- **GPU Weights (`convert_gpu_weights.py`)**: Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts

---

@@ -165,97 +165,210 @@ pip install accelerate transformers llmcompressor datasets
**Required packages:**
- `accelerate`: For distributed model loading and device mapping
- `transformers`: For model and tokenizer loading
- `llmcompressor`: For quantization (supports GPTQ and RTN methods)
- `datasets`: For calibration data loading (GPTQ only)

**Documentation:** This tool is based on llmcompressor. For more details, see the [llmcompressor quantization guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#select-a-quantization-method-and-scheme).

### Overview

Apply weight quantization for GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with `convert_cpu_weights.py` to enable heterogeneous expert placement:

- **GPU-resident experts** ("hot" experts) use GPTQ/RTN quantization (this tool) for efficient GPU memory usage
- **CPU-resident experts** ("cold" experts) use AMX-optimized INT4/INT8 quantization (`convert_cpu_weights.py`)
- **Attention layers, gates, and shared experts** remain in higher precision

This approach maximizes throughput and resource utilization by intelligently distributing experts across CPUs and GPUs.

### Quantization Methods

#### 1. GPTQ (Calibration-based, Default)
**Pros:**
- Higher accuracy through calibration-based quantization
- Recommended for production deployments

**Cons:**
- Requires a calibration dataset
- Slower quantization process
- Higher memory requirements (needs a Hessian matrix)

#### 2. RTN (Round-To-Nearest)
**Pros:**
- Fast quantization (no calibration needed)
- Lower memory requirements
- Good for quick testing and prototyping

**Cons:**
- Slightly lower accuracy compared to GPTQ
- No calibration optimization

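For orientation, the two methods correspond roughly to llmcompressor's `GPTQModifier` and `QuantizationModifier` recipes. The sketch below is illustrative only; the paths, the calibration dataset, and the exact wiring inside `convert_gpu_weights.py` are assumptions, not the script's actual code.

```python
# Illustrative sketch of GPTQ vs. RTN recipes in llmcompressor.
# Paths and the calibration dataset are placeholders; the real
# convert_gpu_weights.py wiring may differ.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "/path/to/model"
OUTPUT_DIR = "/path/to/output"
USE_GPTQ = True  # flip to False for the data-free RTN-style path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

if USE_GPTQ:
    # GPTQ (W4A16): calibration-based; builds a Hessian per layer, so it
    # needs calibration data and roughly 2x the per-layer VRAM.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    oneshot(
        model=model,
        dataset="open_platypus",  # any supported calibration dataset
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
else:
    # RTN (W4A16): data-free round-to-nearest; no calibration set, no Hessian.
    recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    oneshot(model=model, recipe=recipe)

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```
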
### Quantization Types

- **W4A16**: 4-bit weights, 16-bit activations (INT4)
- **W8A16**: 8-bit weights, 16-bit activations (INT8)

### Basic Usage

#### GPTQ Quantization (Recommended for Production)
```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16
```

#### RTN Quantization (Fast, for Testing)
```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method RTN \
  --quant_type W4A16
```

### Memory Requirements

Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between RTN and GPTQ.

#### RTN Memory Requirements

Beyond the model weights themselves, RTN only needs memory for the quantization parameters (scales/zero-points):

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters |

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~22.4 GB (1 layer)

#### GPTQ Memory Requirements

GPTQ requires additional memory for Hessian matrices during calibration:

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters × 2 |

The Hessian matrix is approximately the same size as the layer weights and is used to improve accuracy recovery.

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~44.8 GB (1 layer × 2 for the Hessian matrix)
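
The example figures above follow directly from the per-layer parameter count. A back-of-the-envelope helper, assuming BF16 weights (2 bytes/param) and ~61 transformer layers for DeepSeek-R1 (both assumptions are for illustration only):

```python
# Rough conversion-memory estimate following the rules above:
# DRAM >= total parameters, VRAM >= one layer (x2 for GPTQ's Hessian).
def conversion_memory_gb(total_params_billion: float, num_layers: int,
                         bytes_per_param: int = 2, method: str = "GPTQ"):
    """Return (dram_gb, vram_gb) needed to quantize the model."""
    dram_gb = total_params_billion * bytes_per_param   # all weights in CPU RAM
    per_layer_gb = dram_gb / num_layers                # one layer resident on GPU
    vram_gb = per_layer_gb * (2 if method.upper() == "GPTQ" else 1)
    return dram_gb, vram_gb

# DeepSeek-R1-0528-BF16: ~684B params, ~61 layers (assumed for this example).
print(conversion_memory_gb(684, 61, method="RTN"))   # ~(1368, 22.4)
print(conversion_memory_gb(684, 61, method="GPTQ"))  # ~(1368, 44.9)
```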

#### Method Comparison

| Method | Speed | VRAM | Accuracy | Use Case |
|--------|-------|------|----------|----------|
| **RTN** | Fast | Low (~22 GB) | Good | Testing, prototyping |
| **GPTQ** | Slow | High (~45 GB) | Better | Production deployment |

### Advanced Options

#### Calibration Configuration (GPTQ Only)

For GPTQ quantization, control the calibration process for better quantization quality:

```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --num_calibration_samples 512 \
  --max_sequence_length 2048 \
  --dataset HuggingFaceH4/ultrachat_200k \
  --dataset_split train_sft
```

**Options (GPTQ only):**
- `--num_calibration_samples`: Number of samples for calibration (default: 512)
- `--max_sequence_length`: Maximum sequence length (default: 2048)
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
- `--dampening_frac`: Dampening fraction to reduce quantization noise (default: 0.1)
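
These options map onto standard Hugging Face `datasets`/tokenizer plumbing for building the GPTQ calibration set. A minimal preparation sketch, assuming ultrachat-style chat data; the preprocessing shown is illustrative and may differ from what `convert_gpu_weights.py` actually does:

```python
# Illustrative calibration-set preparation mirroring the CLI options above
# (--dataset, --dataset_split, --num_calibration_samples, --max_sequence_length).
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "/path/to/model"  # placeholder
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Take a small, shuffled slice of the split for calibration.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

def tokenize(sample):
    # ultrachat_200k stores chat turns under "messages": render, then tokenize.
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_SEQUENCE_LENGTH,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)
# `ds` is then used as the calibration dataset for the GPTQ pass.
```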

#### Memory Management

Use `--max_gpu_memory` to limit GPU memory usage and offload remaining layers to CPU:

```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --max_gpu_memory "40GiB"
```

**Recommended settings for GPTQ:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |

**Recommended settings for RTN:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |

**Options:**
- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory (default: 1000GiB when `--max_gpu_memory` is set)
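
Caps like these are conventionally expressed as an `accelerate`-style `max_memory` map when a model is loaded with `device_map="auto"`. A minimal sketch of that standard loading path (assumed here; the script's exact internals may differ):

```python
# Sketch: how GPU/CPU memory caps typically map onto transformers/accelerate
# model loading. Placeholder path; assumes the standard device_map="auto" flow.
from transformers import AutoModelForCausalLM

MODEL_ID = "/path/to/model"

# Equivalent of --max_gpu_memory "40GiB" --max_cpu_memory "1000GiB":
# layers that do not fit under the GPU cap are placed on the CPU.
max_memory = {0: "40GiB", "cpu": "1000GiB"}  # key 0 = CUDA device 0

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    max_memory=max_memory,
    trust_remote_code=True,
)

# Note: no "disk" entry on purpose; llmcompressor does not support disk
# offloading, so GPU + CPU together must hold the whole model.
```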

**Important:** llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:
1. Use RTN instead of GPTQ (requires less memory)
2. Reduce `--num_calibration_samples` (GPTQ only, e.g., 256)
3. Reduce `--max_sequence_length` (GPTQ only, e.g., 1024)
4. Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)

### Examples

#### Example 1: GPTQ Quantization for Production (Qwen3-Next-80B, W4A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
  --output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --num_calibration_samples 512 \
  --max_sequence_length 2048 \
  --max_gpu_memory "40GiB" \
  --trust_remote_code
```

#### Example 2: RTN Quantization for Fast Testing (DeepSeek-R1, W4A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
  --output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
  --quant_method RTN \
  --quant_type W4A16 \
  --max_gpu_memory "70GiB" \
  --trust_remote_code
```

#### Example 3: GPTQ with Custom Calibration Dataset (GLM-4.5-Air, W8A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/GLM-4.5-Air \
  --output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
  --quant_method GPTQ \
  --quant_type W8A16 \
  --dataset "tatsu-lab/alpaca" \
  --dataset_split "train" \
  --num_calibration_samples 256 \
  --max_gpu_memory "40GiB" \
  --trust_remote_code
```
