86 changes: 86 additions & 0 deletions docs/compression.md
@@ -0,0 +1,86 @@
# Model Compression

------------------------------------------------------------------------------------------

## **Introduction**

PaddleFleetX integrates the common compression methods from PaddleSlim: Quantization-Aware Training (QAT), Structured Pruning (SP), and Knowledge Distillation (KD). This document describes how to use these features in PaddleFleetX to compress a model and export the compressed result.

## **Features**

- <a href="https://github.com/PaddlePaddle/PaddleSlim/tree/release/2.4/demo/dygraph/quant">Quantization-aware training</a>: speeds up inference by replacing the floating-point matrix multiplications in fully connected layers with INT8 integer computation (a sketch of the `abs_max` idea follows this list);
- <a href="https://github.com/PaddlePaddle/PaddleSlim/tree/release/2.4/demo/dygraph/pruning">Structured pruning</a>: speeds up inference by pruning channels from the weights of fully connected layers;
- <a href="#知识蒸馏">Knowledge distillation</a>: improves the accuracy of a small, low-accuracy model (the student) by distilling knowledge from a large, high-accuracy model (the teacher).
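
To make the quantization idea concrete, here is a minimal, self-contained sketch of symmetric `abs_max` INT8 quantization. This is plain numpy for illustration only, not PaddleSlim's implementation, and the function names are ours:

```python
import numpy as np

# Illustrative sketch (not PaddleSlim code): symmetric 'abs_max' INT8
# quantization scales a float tensor by its maximum absolute value so that
# values land in [-127, 127], then rounds to integers.
def abs_max_quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit
    scale = max(np.abs(w).max(), 1e-8)          # per-tensor quantization scale
    q = np.clip(np.round(w / scale * qmax), -qmax, qmax).astype(np.int8)
    return q, scale

def abs_max_dequantize(q, scale, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return q.astype(np.float32) * scale / qmax

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = abs_max_quantize(w)
w_hat = abs_max_dequantize(q, scale)            # reconstruction used at inference
print("max quantization error:", np.abs(w - w_hat).max())
```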



## **Configuration**

Model compression is switched on through the Compress field, and the path to the pretrained model parameters is specified by pretrained. Nested under it are the technique-specific parameters for quantization-aware training, structured pruning, and knowledge distillation.

```yaml
Compress:
  pretrained: # path to the pretrained model parameters

  Quantization: # quantization-aware training parameters

  Prune: # structured pruning parameters

  Distillation: # knowledge distillation parameters
```

**Note**: Joint use of the three compression methods above is still under development; for now, please use each method on its own.

### **Quantization-Aware Training Parameters**

```yaml
Compress:
  pretrained:
  Quantization:
    enable: True
    weight_quantize_type: 'abs_max'
    activation_quantize_type: 'moving_average_abs_max'
    weight_preprocess_type: None
    activation_preprocess_type: 'PACT'
    weight_bits: 8
    activation_bits: 8
    quantizable_layer_type: ['Linear', 'ColumnParallelLinear', 'RowParallelLinear']
    onnx_format: True
```

Parameter descriptions:

| **Parameter** | **Description** |
|-----------------------------|-----------------------------------------|
| pretrained | Directory the pretrained model is loaded from. If set, the pretrained model is loaded before quantization; to load already-quantized parameters instead, set this to None and set Engine.save_load.ckpt_dir directly |
| enable | Whether to enable quantization-aware training |
| weight_quantize_type | Weight quantization method; defaults to `channel_wise_abs_max`, with `abs_max` also supported |
| activation_quantize_type | Activation quantization method; defaults to `moving_average_abs_max` |
| weight_preprocess_type | Weight preprocessing method; defaults to None, meaning no preprocessing; set to `PACT` to use the `PACT` method |
| activation_preprocess_type | Activation preprocessing method; defaults to None, meaning no preprocessing |
| weight_bits | Number of bits for weight quantization; defaults to 8 |
| activation_bits | Number of bits for activation quantization; defaults to 8 |
| quantizable_layer_type | Operator types to be quantized |
| onnx_format | Whether to use the new quantization format; defaults to False |

A more detailed description of the quantization-aware training parameters is available in the [PaddleSlim dygraph QAT API documentation](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/dygraph/quanter/qat.rst).
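
As a rough sketch of how the YAML fields above map onto PaddleSlim's dygraph interface, the example below builds a `QAT` quanter from an equivalent config dict and applies it to a small stand-in network. This assumes a PaddleSlim 2.x installation; the toy `paddle.nn.Sequential` model is a placeholder for the real GPT network, and the linked qat.rst remains the authoritative reference:

```python
import paddle
from paddleslim import QAT

# Quantization config mirroring the YAML fields documented above.
quant_config = {
    'weight_quantize_type': 'abs_max',
    'activation_quantize_type': 'moving_average_abs_max',
    'weight_preprocess_type': None,
    'activation_preprocess_type': 'PACT',
    'weight_bits': 8,
    'activation_bits': 8,
    'quantizable_layer_type': ['Linear'],
}

# Toy stand-in for the real network; PaddleFleetX builds this from the Model config.
model = paddle.nn.Sequential(
    paddle.nn.Linear(1024, 4096),
    paddle.nn.GELU(),
    paddle.nn.Linear(4096, 1024),
)

quanter = QAT(config=quant_config)
quanter.quantize(model)  # insert fake-quant ops; then run the usual training loop

# After training, export an inference model with the quantization ops folded in.
quanter.save_quantized_model(
    model, './qat_out/model',
    input_spec=[paddle.static.InputSpec(shape=[None, 1024], dtype='float32')])
```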

### **Structured Pruning Parameters**

```yaml
Compress:
  pretrained:
  Prune:
    enable: True
    criterion: l1_norm
    ratio: 0.125
```

Parameter descriptions:

| **Parameter** | **Description** |
|-----------------------------|-----------------------------------------|
| pretrained | Directory the pretrained model is loaded from |
| enable | Whether to enable structured-pruning training |
| criterion | Importance metric for weights; currently supports l1_norm and l2_norm |
| ratio | Fraction of weights to prune; for example, 0.125 means 12.5% of the weights are pruned away |
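
For intuition about `criterion: l1_norm` and `ratio: 0.125`, the illustrative sketch below (plain numpy, not the PaddleSlim implementation) scores the output channels of a single weight matrix by L1 norm and removes the lowest-scoring 12.5%:

```python
import numpy as np

# Illustrative sketch: rank output channels of a Linear weight by L1 norm
# and keep the (1 - ratio) most important ones, as structured pruning does.
def l1_channel_mask(w, ratio=0.125):
    scores = np.abs(w).sum(axis=0)           # L1 norm of each output channel
    n_prune = int(ratio * w.shape[1])        # number of channels to remove
    pruned = np.argsort(scores)[:n_prune]    # lowest-importance channels
    mask = np.ones(w.shape[1], dtype=bool)
    mask[pruned] = False
    return mask

w = np.random.randn(1024, 4096).astype(np.float32)
mask = l1_channel_mask(w, ratio=0.125)
w_pruned = w[:, mask]                        # 12.5% of columns removed
print(w.shape, '->', w_pruned.shape)         # (1024, 4096) -> (1024, 3584)
```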
29 changes: 29 additions & 0 deletions ppfleetx/configs/nlp/gpt/eval_pruned_gpt_345M_single_card.yaml
@@ -0,0 +1,29 @@
_base_: ./pretrain_gpt_345M_single_card.yaml


Engine:
  save_load:
    ckpt_dir: output/epoch_0_step_1000/


Model:
  module: GPTEvalModule
  hidden_dropout_prob: 0.1
  attention_probs_dropout_prob: 0.1


Compress:
  Prune:
    enable: True
    cal_sens: False
    criterion: l1_norm
    ratio: 0.125


Offline_Eval:
  eval_path: ./lambada_test.jsonl
  cloze_eval: True
  overlapping_eval: 32
  batch_size: 8
  max_seq_len: 1024
  logging_freq: 10
@@ -0,0 +1,18 @@
_base_: ./pretrain_gpt_345M_single_card.yaml

Model:
  module: GPTGenerationModule

Prune:
  enable: True
  criterion: l1_norm
  ratio: 0.125

Generation:
  top_k: 50
  top_p: 0.75
  temperature: 1.0
  min_dec_len: 1
  max_dec_len: 200
  num_return_sequences: 1
  decode_strategy: "sampling"
@@ -0,0 +1,57 @@
_base_: ./generation_gpt_345M_single_card.yaml


Inference:
  model_dir: output/epoch_7500
  mp_degree: 1

Engine:
  save_load:
    ckpt_dir: output/epoch_7500

Prune:
  enable: False
  criterion: l1_norm
  ratio: 0.125

Model:
  vocab_size: 50304
  hidden_size: 3584
  num_layers: 32
  num_attention_heads: 28
  ffn_hidden_size: 14336
  hidden_dropout_prob: 0.1
  attention_probs_dropout_prob: 0.1
  max_position_embeddings: 1024
  type_vocab_size: 16
  initializer_range: 0.02
  use_recompute: False
  recompute_granularity:
  no_recompute_layers:

Distributed:
  dp_degree:
  mp_degree: 1
  pp_degree: 1
  sharding:
    sharding_degree: 1
    sharding_stage: 1
    sharding_offload: False
    reduce_overlap: False
    broadcast_overlap: False

Data:
  Test:
    dataset:
      name: GPTDataset
      input_dir: ./data/
      split: [949, 50, 1]
      max_seq_len: 1024
    sampler:
      name: GPTBatchSampler
      shuffle: False
      drop_last: True
    loader:
      num_workers: 1
      return_list: False
      collate_fn: gpt_collate_fn
78 changes: 78 additions & 0 deletions ppfleetx/configs/nlp/gpt/prune_gpt_345M_single_card.yaml
@@ -0,0 +1,78 @@
_base_: ./pretrain_gpt_base.yaml

Global:
  global_batch_size:
  local_batch_size: 8
  micro_batch_size: 8


Model:
  vocab_size: 50304
  hidden_size: 1024
  num_layers: 24
  num_attention_heads: 16
  ffn_hidden_size: 4096
  hidden_dropout_prob: 0.0
  attention_probs_dropout_prob: 0.0
  max_position_embeddings: 1024
  type_vocab_size: 16
  initializer_range: 0.02
  use_recompute: False
  recompute_granularity:
  no_recompute_layers:


Data:
  Train:
    dataset:
      name: GPTDataset
      input_dir: ./data/
      split: [949, 50, 1]
      max_seq_len: 1024
    sampler:
      name: GPTBatchSampler
      shuffle: False
      drop_last: True
    loader:
      num_workers: 1
      return_list: False
      collate_fn: gpt_collate_fn

  Eval:
    dataset:
      name: GPTDataset
      input_dir: ./data/

Distributed:
  dp_degree: 1
  mp_degree: 1
  pp_degree: 1
  sharding:
    sharding_degree: 1
    sharding_stage: 1
    sharding_offload: False
    comm_overlap: False

Engine:
  max_steps: 100000
  save_load:
    save_steps: 1000
    save_epoch: 1
    output_dir: ./output
    ckpt_dir: ./PaddleFleetX_GPT_345M_220826

Optimizer:
  weight_decay: 0.0
  lr:
    decay_steps: 90000
    warmup_rate: 0.00
    max_lr: 2.5e-5
    min_lr: 5.0e-6

Compress:
  pretrained: ./PaddleFleetX_GPT_345M_220826
  Prune:
    enable: True
    cal_sens: False
    criterion: l1_norm
    ratio: 0.125
19 changes: 10 additions & 9 deletions ppfleetx/configs/nlp/gpt/qat_gpt_345M_mp8.yaml
@@ -34,12 +34,13 @@ Distributed:
    broadcast_overlap: False


-Quantization:
-  enable: True
-  pretrained:
-  weight_quantize_type: 'abs_max'
-  activation_quantize_type: 'moving_average_abs_max'
-  weight_bits: 8
-  activation_bits: 8
-  quantizable_layer_type: ['Conv2D', 'Linear', 'Conv2DTranspose', 'ColumnParallelLinear', 'RowParallelLinear']
-  onnx_format: True
+Compress:
+  pretrained:
+  Quantization:
+    enable: True
+    weight_quantize_type: 'abs_max'
+    activation_quantize_type: 'moving_average_abs_max'
+    weight_bits: 8
+    activation_bits: 8
+    quantizable_layer_type: ['Conv2D', 'Linear', 'Conv2DTranspose', 'ColumnParallelLinear', 'RowParallelLinear']
+    onnx_format: True
19 changes: 10 additions & 9 deletions ppfleetx/configs/nlp/gpt/qat_gpt_6.7B_sharding16.yaml
@@ -43,12 +43,13 @@ Optimizer:
  tensor_fusion: True


-Quantization:
-  enable: True
-  pretrained:
-  weight_quantize_type: 'abs_max'
-  activation_quantize_type: 'moving_average_abs_max'
-  weight_bits: 8
-  activation_bits: 8
-  quantizable_layer_type: ['Conv2D', 'Linear', 'Conv2DTranspose', 'ColumnParallelLinear', 'RowParallelLinear']
-  onnx_format: True
+Compress:
+  pretrained:
+  Quantization:
+    enable: True
+    weight_quantize_type: 'abs_max'
+    activation_quantize_type: 'moving_average_abs_max'
+    weight_bits: 8
+    activation_bits: 8
+    quantizable_layer_type: ['Conv2D', 'Linear', 'Conv2DTranspose', 'ColumnParallelLinear', 'RowParallelLinear']
+    onnx_format: True
@@ -123,11 +123,12 @@ Data:
      use_shared_memory: True


-Quantization:
-  enable: True
-  weight_quantize_type: 'abs_max'
-  activation_quantize_type: 'moving_average_abs_max'
-  activation_preprocess_type: 'PACT'
-  weight_bits: 8
-  activation_bits: 8
-  onnx_format: True
+Compress:
+  Quantization:
+    enable: True
+    weight_quantize_type: 'abs_max'
+    activation_quantize_type: 'moving_average_abs_max'
+    activation_preprocess_type: 'PACT'
+    weight_bits: 8
+    activation_bits: 8
+    onnx_format: True
@@ -122,11 +122,12 @@ Data:
      num_workers: 8
      use_shared_memory: True

-Quantization:
-  enable: True
-  weight_quantize_type: 'abs_max'
-  activation_quantize_type: 'moving_average_abs_max'
-  activation_preprocess_type: 'PACT'
-  weight_bits: 8
-  activation_bits: 8
-  onnx_format: True
+Compress:
+  Quantization:
+    enable: True
+    weight_quantize_type: 'abs_max'
+    activation_quantize_type: 'moving_average_abs_max'
+    activation_preprocess_type: 'PACT'
+    weight_bits: 8
+    activation_bits: 8
+    onnx_format: True