
Commit ab8ad0a

[docs]: update web doc (#1625)
1 parent be6db6f commit ab8ad0a


3 files changed: +27 -44 lines changed


doc/en/SFT/KTransformers-Fine-Tuning_Developer-Technical-Notes.md

Lines changed: 8 additions & 21 deletions
```diff
@@ -1,26 +1,13 @@
-- [KTransformers Fine-Tuning × LLaMA-Factory Integration – Developer Technical Notes](#ktransformers-fine-tuning-x-llama-factory-integration-–-developer-technical-notes)
 - [Introduction](#introduction)
-
 - [Overall View of the KT Fine-Tuning Framework](#overall-view-of-the-kt-fine-tuning-framework)
 - [Attention (LoRA + KT coexist)](#attention-lora--kt-coexist)
 - [MoE (operator encapsulation + backward)](#moe-operator-encapsulation--backward)
-- [Encapsulation](#encapsulation)
-- [Backward (CPU)](#backward-cpu)
 - [Multi-GPU Loading/Training: Placement strategy instead of DataParallel](#multi-gpu-loadingtraining-placement-strategy-instead-of-dataparallel)
-
 - [KT-LoRA Fine-Tuning Evaluation](#kt-lora-fine-tuning-evaluation)
 - [Setup](#setup)
 - [Results](#results)
-- [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
-- [Translational-Style benchmark (generative)](#translational-style-benchmark-generative)
-- [Medical Vertical Benchmark (AfriMed-SAQ/MCQ)](#medical-vertical-benchmark-afrimed-saqmcq)
-- [Limitations](#limitations)
-
-- [Speed Tests](#speed-tests)
-- [End-to-End Performance](#end-to-end-performance)
-- [MoE Compute (DeepSeek-V3-671B)](#moe-compute-deepseek-v3-671b)
+- [Speed Tests](#speed-tests)
 - [Memory Footprint](#memory-footprint)
-
 - [Conclusion](#conclusion)
 
 
```
```diff
@@ -36,7 +23,7 @@ This architecture bridges resource gaps, enabling **local fine-tuning of ultra-l
 
 Architecturally, LLaMA-Factory orchestrates data/config/training, LoRA insertion, and inference; KTransformers is a pluggable, high-performance operator backend that takes over Attention and MoE under the same training code, enabling **GPU+CPU heterogeneity** to accelerate training and reduce GPU memory.
 
-![image-20251011010558909](../assets/image-20251011010558909.png)
+![image-20251011010558909](../../assets/image-20251011010558909.png)
 
 We evaluated LoRA fine-tuning with HuggingFace default, Unsloth, and KTransformers backends (same settings and data). **KTransformers** is currently the only solution feasible on **2–4×24GB 4090s** for **671B-scale MoE**, and also shows higher throughput and lower GPU memory for 14B MoEs.
 
```
```diff
@@ -51,7 +38,7 @@ We evaluated LoRA fine-tuning with HuggingFace default, Unsloth, and KTransforme
 
 From the table above, it can be seen that for the 14B model, the KTransformers backend achieves approximately 75% higher throughput than the default HuggingFace solution, while using only about one-fifth of the GPU memory. For the 671B model, both HuggingFace and Unsloth fail to run on a single 4090 GPU, whereas KTransformers is able to perform LoRA fine-tuning at 40 tokens/s, keeping the GPU memory usage within 70 GB.
 
-![按照模型划分的对比图_02](../assets/image-compare_model.png)
+![按照模型划分的对比图_02](../../assets/image-compare_model.png)
 
 
 
```
```diff
@@ -68,11 +55,11 @@ KTransformers provides operator injection (`BaseInjectedModule`), and PEFT provi
 - **Inheritance:** `KTransformersLinearLora` retains KT’s high-performance paths (`prefill_linear`/`generate_linear`) while accepting LoRA parameters (`lora_A/lora_B`).
 - **Replacement:** During preparation, we replace original `KTransformersLinear` layers (Q/K/V/O) with `KTransformersLinearLora`, preserving KT optimizations while enabling LoRA trainability.
 
-![image-20251016182810716](../assets/image-20251016182810716.png)
+![image-20251016182810716](../../assets/image-20251016182810716.png)
 
 After replacement, LoRA is inserted at Q/K/V/O linear transforms (left), and `KTransformersLinearLora` contains both KT fast paths and LoRA matrices (right).
 
-![image-20251016182920722](../assets/image-20251016182920722.png)
+![image-20251016182920722](../../assets/image-20251016182920722.png)
 
 ### MoE (operator encapsulation + backward)
 
```
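For readers skimming the diff, the "inheritance + replacement" idea in the hunk above can be illustrated with a short, self-contained PyTorch sketch. The names below (`FastLinear`, `LoraWrappedLinear`, `replace_linears_with_lora`) are hypothetical stand-ins, not the actual `KTransformersLinear`/`KTransformersLinearLora` operators, and the KT fast paths (`prefill_linear`/`generate_linear`) are not modeled.

```python
# Minimal sketch of "keep the frozen fast path, add trainable LoRA matrices".
import torch
import torch.nn as nn


class FastLinear(nn.Module):
    """Stand-in for an optimized, frozen base linear (e.g. a KT-injected layer)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A real KT layer would dispatch to its prefill/generate fast paths here.
        return x @ self.weight.T


class LoraWrappedLinear(nn.Module):
    """Keeps the frozen fast path and adds trainable lora_A / lora_B matrices."""

    def __init__(self, base: FastLinear, rank: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        out_features, in_features = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # B = 0: no-op at init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen fast path plus the standard low-rank LoRA update.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


def replace_linears_with_lora(module: nn.Module, rank: int = 8, alpha: int = 32) -> None:
    """Recursively swap FastLinear children (e.g. Q/K/V/O) for LoRA-wrapped versions."""
    for name, child in module.named_children():
        if isinstance(child, FastLinear):
            setattr(module, name, LoraWrappedLinear(child, rank, alpha))
        else:
            replace_linears_with_lora(child, rank, alpha)
```

The point is only the pattern: the optimized forward stays untouched and frozen, while `lora_A`/`lora_B` are the only trainable parameters swapped in during preparation.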

```diff
@@ -83,13 +70,13 @@ Given large parameters and sparse compute, we encapsulate the expert computation
 - **Upstream (PyTorch graph):** we register a custom Autograd Function so the MoE layer appears as **a single node**. In the left figure (red box), only `KSFTExpertsCPU` is visible; on the right, the unencapsulated graph expands routing, dispatch, and FFN experts. Encapsulation makes the MoE layer behave like a standard `nn.Module` with gradients.
 - **Downstream (backend):** inside the Autograd Function, pybind11 calls C++ extensions for forward/backward. Multiple **pluggable backends** exist (AMX BF16/INT8; **llamafile**). The backend can be switched via YAML (e.g., `"backend": "AMXBF16"` vs. `"llamafile"`).
 
-![image-20250801174623919](../assets/image-20250801174623919.png)
+![image-20250801174623919](../../assets/image-20250801174623919.png)
 
 #### Backward (CPU)
 
 MoE backward frequently needs the transposed weights $W^\top$. To avoid repeated runtime transposes, we **precompute/cache** $W^\top$ at load time (blue box). We also **cache necessary intermediate activations** (e.g., expert projections, red box) to reuse in backward and reduce recomputation. We provide backward implementations for **llamafile** and **AMX (INT8/BF16)**, with NUMA-aware optimizations.
 
-<img src="../assets/image-20251016182942726.png" alt="image-20251016182942726" style="zoom:33%;" />
+<img src="../../assets/image-20251016182942726.png" alt="image-20251016182942726" style="zoom:33%;" />
 
 ### Multi-GPU Loading/Training: Placement strategy instead of DataParallel
 
```
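A pure-PyTorch sketch of the two mechanisms described in the hunk above: the expert computation wrapped as a single Autograd node, and a cached $W^\top$ reused in backward. `ExpertMatmul`/`FrozenExpert` are illustrative only; the real `KSFTExpertsCPU` dispatches to C++/AMX/llamafile kernels via pybind11, which is not modeled here.

```python
import torch


class ExpertMatmul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, w: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
        # Save the cached transpose for backward; the real backend also caches
        # intermediate expert activations at this point to avoid recomputation.
        ctx.save_for_backward(w_t)
        return x @ w                       # y = x W, using W as stored

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        (w_t,) = ctx.saved_tensors
        grad_x = grad_out @ w_t            # uses the precomputed W^T, no runtime transpose
        return grad_x, None, None          # expert weights stay frozen (only LoRA trains)


class FrozenExpert(torch.nn.Module):
    """Shows up as a single node in the autograd graph, like the encapsulated MoE layer."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        w = torch.randn(dim_in, dim_out) * 0.02
        self.register_buffer("w", w)
        self.register_buffer("w_t", w.T.contiguous())  # precompute/cache W^T at load time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return ExpertMatmul.apply(x, self.w, self.w_t)
```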

```diff
@@ -117,7 +104,7 @@ LLaMA-Factory orchestration, KTransformers backend, LoRA (rank=8, α=32, dropout
 
 Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). The fine-tuned model consistently exhibits the target style (red boxes) versus neutral/rational base (blue). This shows **KT-LoRA injects style features** into the generation distribution with low GPU cost.
 
-![image-20251016175848143](../assets/image-20251016175848143.png)
+![image-20251016175848143](../../assets/image-20251016175848143.png)
 
 #### Translational-Style benchmark (generative)
 
```
doc/en/SFT/KTransformers-Fine-Tuning_User-Guide.md

Lines changed: 6 additions & 16 deletions
```diff
@@ -1,23 +1,13 @@
-- [KTransformers Fine-Tuning × LLaMA-Factory Integration – User Guide](#ktransformers-fine-tuning-x-llama-factory-integration-–-user-guide)
 - [Introduction](#introduction)
-
-- [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
-- [Stylized Dialogue (CatGirl tone)](#stylized-dialogue-catgirl-tone)
-- [Benchmarks](#benchmarks)
-- [Translational-Style dataset](#translational-style-dataset)
-- [AfriMed-QA (short answer)](#afrimed-qa-short-answer)
-- [AfriMed-QA (multiple choice)](#afrimed-qa-multiple-choice)
-
+- [Fine-Tuning Results (Examples)](#fine-tuning-results-examples)
 - [Quick to Start](#quick-to-start)
 - [Environment Setup](#environment-setup)
 - [Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models](#core-feature-1-use-ktransformers-backend-to-fine-tune-ultra-large-moe-models)
 - [Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)](#core-feature-2-chat-with-the-fine-tuned-model-base--lora-adapter)
 - [Core Feature 3: Batch inference + metrics (base + LoRA adapter)](#core-feature-3-batch-inference--metrics-base--lora-adapter)
-
 - [KT Fine-Tuning Speed (User-Side View)](#kt-fine-tuning-speed-user-side-view)
 - [End-to-End Performance](#end-to-end-performance)
 - [GPU/CPU Memory Footprint](#gpucpu-memory-footprint)
-
 - [Conclusion](#conclusion)
 
 
```
```diff
@@ -33,7 +23,7 @@ Our goal is to give resource-constrained researchers a **local path to explore f
 
 As shown below, LLaMA-Factory is the unified orchestration/configuration layer for the whole fine-tuning workflow—handling data, training scheduling, LoRA injection, and inference interfaces. **KTransformers** acts as a pluggable high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**.
 
-![image-20251011010558909](../assets/image-20251011010558909.png)
+![image-20251011010558909](../../assets/image-20251011010558909.png)
 
 Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unsloth**, and **KTransformers** backends. KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher throughput and lower GPU memory on smaller MoE models (e.g., DeepSeek-14B).
 
```
```diff
@@ -46,7 +36,7 @@ Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unslo
 
 **1400 GB** is a **theoretical** FP16 full-parameter resident footprint (not runnable). **70 GB** is the **measured peak** with KT strategy (Attention on GPU + layered MoE offload).
 
-![按照模型划分的对比图_02](../assets/image-compare_model.png)
+![按照模型划分的对比图_02](../../assets/image-compare_model.png)
 
 ### Fine-Tuning Results (Examples)
 
```
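As a rough sanity check on the two figures in the hunk above (assuming 671B parameters at 2 bytes each for the FP16 weights alone; gradients, optimizer states, and activations would add more):

$$
671 \times 10^{9}\ \text{params} \times 2\ \text{bytes/param} \approx 1.34 \times 10^{12}\ \text{bytes} \approx 1340\ \text{GB},
$$

which is the order of the quoted ~1400 GB theoretical resident footprint, versus the ~70 GB measured GPU peak once the MoE experts are offloaded to CPU.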

```diff
@@ -56,7 +46,7 @@ Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). Goal: i
 
 The figure compares responses from the base vs. fine-tuned models. The fine-tuned model maintains the target tone and address terms more consistently (red boxes), validating the effectiveness of **style-transfer fine-tuning**.
 
-![image-20251016175046882](../assets/image-20251016175046882.png)
+![image-20251016175046882](../../assets/image-20251016175046882.png)
 
 #### Benchmarks
 
```
```diff
@@ -219,7 +209,7 @@ We recommend **AMX acceleration** where available (`lscpu | grep amx`). AMX supp
 
 Outputs go to `output_dir` in safetensors format plus adapter metadata for later loading.
 
-![image-20251016171537997](../assets/image-20251016171537997.png)
+![image-20251016171537997](../../assets/image-20251016171537997.png)
 
 ### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)
 
```
```diff
@@ -244,7 +234,7 @@ We also support **GGUF** adapters: for safetensors, set the **directory**; for G
 
 During loading, LLaMA-Factory maps layer names to KT’s naming. You’ll see logs like `Loaded adapter weight: XXX -> XXX`:
 
-![image-20251016171526210](../assets/image-20251016171526210.png)
+![image-20251016171526210](../../assets/image-20251016171526210.png)
 
 ### Core Feature 3: Batch inference + metrics (base + LoRA adapter)
 
```
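The name mapping mentioned in the hunk above amounts to rewriting adapter state-dict keys before they are matched against KT modules. A generic sketch of that kind of remapping is shown below; the regex patterns are hypothetical and do not reproduce the actual LLaMA-Factory-to-KT mapping table.

```python
import re
from typing import Dict

import torch

# Hypothetical rename rules: (regex on the adapter key, replacement for the backend key).
RENAME_RULES = [
    (r"^base_model\.model\.", ""),                            # strip a PEFT-style wrapper prefix
    (r"\.q_proj\.lora_(A|B)\.weight$", r".q_proj.lora_\1"),   # align LoRA tensor naming
]


def remap_adapter_keys(adapter_state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    remapped = {}
    for key, tensor in adapter_state.items():
        new_key = key
        for pattern, repl in RENAME_RULES:
            new_key = re.sub(pattern, repl, new_key)
        print(f"Loaded adapter weight: {key} -> {new_key}")  # mirrors the log line quoted above
        remapped[new_key] = tensor
    return remapped
```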

doc/en/SFT/injection_tutorial.md

Lines changed: 13 additions & 7 deletions
```diff
@@ -4,10 +4,16 @@
 
 ## TL;DR
 This tutorial will guide you through the process of injecting custom operators into a model using the KTransformers framework. We will use the DeepSeekV2-Chat model as an example to demonstrate how to inject custom operators into the model step by step. The tutorial will cover the following topics:
-* [How to write injection rules](#how-to-write-injection-rules)
-* [Understanding the structure of the model](#understanding-model-structure)
-* [Multi-GPU](#muti-gpu)
-* [How to write a new operator and inject it into the model](#how-to-write-a-new-operator-and-inject-into-the-model)
+- [TL;DR](#tldr)
+- [How to Write Injection Rules](#how-to-write-injection-rules)
+- [Understanding Model Structure](#understanding-model-structure)
+- [Matrix Absorption-based MLA Injection](#matrix-absorption-based-mla-injection)
+- [Injection of Routed Experts](#injection-of-routed-experts)
+- [Injection of Linear Layers](#injection-of-linear-layers)
+- [Injection of Modules with Pre-calculated Buffers](#injection-of-modules-with-pre-calculated-buffers)
+- [Specifying Running Devices for Modules](#specifying-running-devices-for-modules)
+- [Muti-GPU](#muti-gpu)
+- [How to Write a New Operator and Inject into the Model](#how-to-write-a-new-operator-and-inject-into-the-model)
 
 ## How to Write Injection Rules
 The basic form of the injection rules for the Inject framework is as follows:
```
```diff
@@ -38,7 +44,7 @@ Using [deepseek-ai/DeepSeek-V2-Lite-Chat](https://huggingface.co/deepseek-ai/Dee
 Fortunately, knowing the structure of a model is very simple. Open the file list on the [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat/tree/main) homepage, and you can see the following files:
 <p align="center">
 <picture>
-<img alt="Inject-Struction" src="../assets/model_structure_guild.png" width=60%>
+<img alt="Inject-Struction" src="../../assets/model_structure_guild.png" width=60%>
 </picture>
 </p>
 
```
```diff
@@ -48,7 +54,7 @@ From the `modeling_deepseek.py` file, we can see the specific implementation of
 The structure of the DeepSeekV2 model from the `.saftensors` and `modeling_deepseek.py` files is as follows:
 <p align="center">
 <picture>
-<img alt="Inject-Struction" src="../assets/deepseekv2_structure.png" width=60%>
+<img alt="Inject-Struction" src="../../assets/deepseekv2_structure.png" width=60%>
 </picture>
 </p>
 
```
```diff
@@ -171,7 +177,7 @@ DeepseekV2-Chat got 60 layers, if we got 2 GPUs, we can allocate 30 layers to ea
 
 <p align="center">
 <picture>
-<img alt="Inject-Struction" src="../assets/multi_gpu.png" width=60%>
+<img alt="Inject-Struction" src="../../assets/multi_gpu.png" width=60%>
 </picture>
 </p>
 
```
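The 60-layers-over-2-GPUs split referenced in the hunk header above is plain integer arithmetic over layer indices. The helper below is purely illustrative; in KTransformers the placement is expressed through YAML injection rules rather than code like this.

```python
# Map a decoder layer index to a device: layers 0-29 -> cuda:0, layers 30-59 -> cuda:1.
def device_for_layer(layer_idx: int, num_layers: int = 60, num_gpus: int = 2) -> str:
    layers_per_gpu = (num_layers + num_gpus - 1) // num_gpus  # ceil split, 30 here
    return f"cuda:{min(layer_idx // layers_per_gpu, num_gpus - 1)}"


assert device_for_layer(29) == "cuda:0"
assert device_for_layer(30) == "cuda:1"
```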
