@@ -36,7 +23,7 @@ This architecture bridges resource gaps, enabling **local fine-tuning of ultra-l
Architecturally, LLaMA-Factory orchestrates data/config/training, LoRA insertion, and inference; KTransformers is a pluggable, high-performance operator backend that takes over Attention and MoE under the same training code, enabling **GPU+CPU heterogeneity** to accelerate training and reduce GPU memory.
We evaluated LoRA fine-tuning with HuggingFace default, Unsloth, and KTransformers backends (same settings and data). **KTransformers** is currently the only solution feasible on **2–4×24GB 4090s** for **671B-scale MoE**, and also shows higher throughput and lower GPU memory for 14B MoEs.
@@ -51,7 +38,7 @@ We evaluated LoRA fine-tuning with HuggingFace default, Unsloth, and KTransforme
The table shows that, for the 14B model, the KTransformers backend achieves roughly 75% higher throughput than the default HuggingFace backend while using only about one-fifth of the GPU memory. For the 671B model, neither HuggingFace nor Unsloth can run on a single 4090 GPU, whereas KTransformers performs LoRA fine-tuning at about 40 tokens/s while keeping GPU memory usage within 70 GB.
-**Replacement:** During preparation, we replace the original `KTransformersLinear` layers (Q/K/V/O) with `KTransformersLinearLora`, preserving the KT optimizations while making the LoRA factors trainable.
After replacement, LoRA is inserted at Q/K/V/O linear transforms (left), and `KTransformersLinearLora` contains both KT fast paths and LoRA matrices (right).
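As a mental model only (not the project's actual implementation; `LinearWithLoRA`, `base_linear`, `lora_A`, and `lora_B` are illustrative names), the replaced module keeps the optimized KT linear as a frozen fast path and adds two small trainable factors on the side:

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    """Illustrative stand-in for a KTransformersLinearLora-style module:
    the optimized base projection stays frozen, only the low-rank factors train."""

    def __init__(self, base_linear: nn.Module, in_features: int,
                 out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base_linear = base_linear                 # KT fast path, kept frozen
        for p in self.base_linear.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Linear(in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)             # initial LoRA update is zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + (alpha/r) * B(A(x)); gradients flow only into A and B
        return self.base_linear(x) + self.lora_B(self.lora_A(x)) * self.scaling
```

In this picture, each Q/K/V/O projection is swapped for a wrapper of this shape, so the KT kernels keep serving the dense matmul while autograd only tracks the LoRA factors.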
@@ -83,13 +70,13 @@ Given large parameters and sparse compute, we encapsulate the expert computation
-**Upstream (PyTorch graph):** we register a custom Autograd Function so the MoE layer appears as **a single node**. In the left figure (red box), only `KSFTExpertsCPU` is visible; on the right, the unencapsulated graph expands routing, dispatch, and FFN experts. Encapsulation makes the MoE layer behave like a standard `nn.Module` with gradients.
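A minimal sketch of this encapsulation is shown below, assuming a hypothetical `backend` object standing in for the pybind11-bound C++ extension (the real `KSFTExpertsCPU` interface may differ): the custom `torch.autograd.Function` is the only thing autograd records, so the whole expert computation appears as one node.

```python
import torch

class ExpertsFunction(torch.autograd.Function):
    """Collapses routing + dispatch + expert FFNs into a single autograd node."""

    @staticmethod
    def forward(ctx, hidden_states, routing_weights, selected_experts, backend):
        # One opaque call into the CPU backend; autograd records only this node.
        output = backend.forward(hidden_states, routing_weights, selected_experts)
        ctx.backend = backend
        ctx.save_for_backward(hidden_states, routing_weights, selected_experts)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        hidden_states, routing_weights, selected_experts = ctx.saved_tensors
        grad_hidden, grad_routing = ctx.backend.backward(
            grad_output, hidden_states, routing_weights, selected_experts)
        # One gradient slot per forward input; the expert indices and the
        # backend handle are not differentiable.
        return grad_hidden, grad_routing, None, None
```

A thin `nn.Module` wrapper then calls `ExpertsFunction.apply(...)` in its forward, which is what makes the MoE layer behave like any other trainable module.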
-**Downstream (backend):** inside the Autograd Function, pybind11 calls C++ extensions for forward/backward. Multiple **pluggable backends** exist (AMX BF16/INT8; **llamafile**). The backend can be switched via YAML (e.g., `"backend": "AMXBF16"` vs. `"llamafile"`).
MoE backward frequently needs the transposed weights $W^\top$. To avoid repeated runtime transposes, we **precompute/cache** $W^\top$ at load time (blue box). We also **cache necessary intermediate activations** (e.g., expert projections, red box) to reuse in backward and reduce recomputation. We provide backward implementations for **llamafile** and **AMX (INT8/BF16)**, with NUMA-aware optimizations.
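The bookkeeping can be illustrated with a deliberately simplified single-expert projection in plain PyTorch (the real code lives in the llamafile/AMX C++ kernels; the names here are ours): for $y = xW$, backward needs $\partial L/\partial x = g W^\top$ and $\partial L/\partial W = x^\top g$, so caching $W^\top$ once at load time and keeping $x$ around removes transposes and recomputation from the training loop.

```python
import torch

class ExpertProjection:
    """Toy single-expert projection showing the two caches used for backward."""

    def __init__(self, weight: torch.Tensor):
        # weight: (in_features, out_features), the layout used by the forward matmul
        self.w = weight.contiguous()
        self.w_t = weight.t().contiguous()   # W^T, precomputed once at load time

    def forward(self, x: torch.Tensor):
        y = x @ self.w                       # (n, in) @ (in, out) -> (n, out)
        cache = x                            # intermediate activation kept for backward
        return y, cache

    def backward(self, grad_y: torch.Tensor, cache: torch.Tensor):
        grad_x = grad_y @ self.w_t           # uses the cached W^T, no runtime transpose
        grad_w = cache.t() @ grad_y          # (in, n) @ (n, out) -> (in, out)
        return grad_x, grad_w
```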
Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). The fine-tuned model consistently exhibits the target style (red boxes), versus the neutral/rational base model (blue). This shows that **KT-LoRA injects style features** into the generation distribution at low GPU cost.
@@ -33,7 +23,7 @@ Our goal is to give resource-constrained researchers a **local path to explore f
As shown below, LLaMA-Factory is the unified orchestration/configuration layer for the whole fine-tuning workflow—handling data, training scheduling, LoRA injection, and inference interfaces. **KTransformers** acts as a pluggable high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**.
Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unsloth**, and **KTransformers** backends. KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher throughput and lower GPU memory on smaller MoE models (e.g., DeepSeek-14B).
@@ -46,7 +36,7 @@ Within LLaMA-Factory, we compared LoRA fine-tuning with **HuggingFace**, **Unslo
† **1400 GB** is a **theoretical** FP16 full-parameter resident footprint (not runnable). **70 GB** is the **measured peak** with KT strategy (Attention on GPU + layered MoE offload).
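As a rough sanity check on that figure (our arithmetic, weights only):

$$
671 \times 10^{9}\ \text{params} \times 2\ \text{bytes (FP16)} \approx 1.34 \times 10^{12}\ \text{bytes} \approx 1340\ \text{GB},
$$

so resident FP16 weights alone already land in the ~1.3–1.4 TB range before counting activations, gradients, or framework overhead, which is why full-parameter residency is treated as purely theoretical here.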
@@ -56,7 +46,7 @@ Dataset: [NekoQA-10K](https://zhuanlan.zhihu.com/p/1934983798233231689). Goal: i
The figure compares responses from the base vs. fine-tuned models. The fine-tuned model maintains the target tone and address terms more consistently (red boxes), validating the effectiveness of **style-transfer fine-tuning**.
doc/en/SFT/injection_tutorial.md (13 additions, 7 deletions)
@@ -4,10 +4,16 @@
## TL;DR
This tutorial walks through injecting custom operators into a model with the KTransformers framework, using the DeepSeekV2-Chat model as a step-by-step example. It covers the following topics:
-* [How to write injection rules](#how-to-write-injection-rules)
-* [Understanding the structure of the model](#understanding-model-structure)
-* [Multi-GPU](#muti-gpu)
-* [How to write a new operator and inject it into the model](#how-to-write-a-new-operator-and-inject-into-the-model)
+- [TL;DR](#tldr)
+- [How to Write Injection Rules](#how-to-write-injection-rules)
+- [Understanding Model Structure](#understanding-model-structure)
+- [Injection of Routed Experts](#injection-of-routed-experts)
+- [Injection of Linear Layers](#injection-of-linear-layers)
+- [Injection of Modules with Pre-calculated Buffers](#injection-of-modules-with-pre-calculated-buffers)
+- [Specifying Running Devices for Modules](#specifying-running-devices-for-modules)
+- [Muti-GPU](#muti-gpu)
+- [How to Write a New Operator and Inject into the Model](#how-to-write-a-new-operator-and-inject-into-the-model)
## How to Write Injection Rules
The basic form of the injection rules for the Inject framework is as follows:
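The YAML itself is elided in this diff view; as a rough sketch of the shape such a rule takes (the regex, operator path, and kwargs below are illustrative placeholders, not the tutorial's exact example), each rule pairs a `match` pattern with a `replace` target:

```python
import yaml  # requires pyyaml

# Illustrative placeholder rule: match modules by name (regex) and/or class,
# then replace them with an injected operator plus its constructor kwargs.
rule_text = r"""
- match:
    name: '^model\.layers\..*\.self_attn\.q_proj$'   # regex over module paths
    class: torch.nn.Linear                            # optional class filter
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # injected operator
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
"""

rules = yaml.safe_load(rule_text)
print(rules[0]["match"]["name"], "->", rules[0]["replace"]["class"])
```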
@@ -38,7 +44,7 @@ Using [deepseek-ai/DeepSeek-V2-Lite-Chat](https://huggingface.co/deepseek-ai/Dee
Fortunately, learning a model's structure is straightforward. Open the file list on the [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat/tree/main) homepage, and you will see the following files: