
Commit 718a71b

Merge pull request #316 from KMSorSMS/main
📝 update V0.2.1 Doc
2 parents f9f9f74 + 13382f8 commit 718a71b

2 files changed: +67, -7 lines


README.md

Lines changed: 2 additions & 1 deletion
@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 <h2 id="Updates">🔥 Updates</h2>
+* **Feb 15, 2025**: KTransformers V0.2.1: longer context (from 4K to 8K for 24GB VRAM) and slightly faster speed (+15%, up to 16 tokens/s); updated docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single GPU (24GB VRAM)/multi-GPU and 382G DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
 * **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -159,7 +160,7 @@ If you are interested in our design principles and the implementation of the inj
 <h2 id="ack">Acknowledgment and Contributors</h2>
-The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, and Marlin. We are planning to contribute back to the community by upstreaming our modifications.
+The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
 KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.

doc/en/DeepseekR1_V3_tutorial.md

Lines changed: 65 additions & 6 deletions
@@ -3,21 +3,28 @@
 - [SUMMARY](#summary)
 - [Show Case Environment](#show-case-environment)
 - [Bench Result](#bench-result)
-- [V0.2](#v02)
-- [Settings](#settings)
+- [V0.2.1](#v021)
 - [Memory consumption:](#memory-consumption)
+- [Change Log](#change-log)
 - [Benchmark Results](#benchmark-results)
+- [V0.2](#v02)
+- [Settings](#settings)
+- [Memory consumption:](#memory-consumption-1)
+- [Benchmark Results](#benchmark-results-1)
 - [V0.3-Preview](#v03-preview)
 - [Settings](#settings-1)
 - [Memory consumptions:](#memory-consumptions)
-- [Benchmark results](#benchmark-results-1)
+- [Benchmark results](#benchmark-results-2)
 - [How to Run](#how-to-run)
-- [V0.2 Showcase](#v02-showcase)
+- [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
 - [Single socket version (32 cores)](#single-socket-version-32-cores)
 - [Dual socket version (64 cores)](#dual-socket-version-64-cores)
 - [V0.3 Showcase](#v03-showcase)
 - [Dual socket version (64 cores)](#dual-socket-version-64-cores-1)
 - [Some Explanations](#some-explanations)
+- [Next](#next)
+- [Faster](#faster)
+- [Easier](#easier)
 - [FAQ](#faq)
 - [R1 No Thinking](#r1-no-thinking)
 - [More FAQ](#more-faq)
@@ -49,13 +56,54 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
 We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **28× faster than llama.cpp** for local inference.
 The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)

+> **Feb 15, 2025**: KTransformers V0.2.1: longer context (from 4K to 8K for 24GB VRAM) and slightly faster speed (+15%, up to 16 tokens/s); updated docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
+
+We speed up decode and prefill slightly. The performance gain is limited mainly because inference is still constrained by the CPU's compute speed and memory bandwidth; the MLA part handled by the GPU accounts for a relatively small share of the total.
+
+Besides the speed improvements, we have also significantly updated the documentation to enhance usability, including:<br>
+- Added a Multi-GPU configuration tutorial.
+- Consolidated the installation guide.
+- Added a detailed tutorial on registering extra GPU memory with ExpertMarlin.

 ## Show Case Environment
 We run our best performance tests (V0.2) on <br>
 CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes) <br>
 GPU: 4090D 24G VRAM <br>
-Memory: standard DDR5-4800 server DRAM (1 TB)
+Memory: standard DDR5-4800 server DRAM (1 TB), each socket with 8×DDR5-4800
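A quick way to confirm the dual-socket, 2-NUMA-node layout described above before benchmarking (an editorial suggestion, not part of the original tutorial; it only assumes the standard `numactl` and `lscpu` Linux tools):

``` shell
# Show NUMA nodes and per-node memory; the showcase machine should report 2 nodes
numactl --hardware
# Show sockets, cores per socket, and NUMA node count
lscpu | grep -E "Socket\(s\)|Core\(s\) per socket|NUMA node\(s\)"
```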
 ## Bench Result
+### V0.2.1
+- Model: DeepseekV3-q4km (int4)<br>
+- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
+- GPU: 4090 24G VRAM
+- We test after sufficient warm-up
+#### Memory consumption:
+- Single socket: 382G DRAM, at least 14GB VRAM
+- Dual socket: 1T DRAM, at least 14GB VRAM
+#### Change Log
+- Longer context (from 4K to 8K for 24GB VRAM) and slightly faster speed (+15%):<br>
+Integrated the highly efficient Triton MLA kernel from the fantastic sglang project, enabling a much longer context length and slightly faster prefill/decode speed.
+- We suspect the impressive improvement comes from the change of hardware platform (4090D -> 4090).
+#### Benchmark Results
+
+The "6 experts" case is part of V0.3's preview.
+
+| Prompt | hi (2) | 1K (969) | 2K (1930) | 4K (3846) | llama.cpp (8 experts) |
+| --- | --- | --- | --- | --- | --- |
+| Output length | 10 tokens | 300 tokens | 300 tokens | 300 tokens | 300 tokens |
+| **6 experts V0.2.0** | | | | | |
+| Prefill token/s | 13 | 105 | 102 | 88 | CUDA OOM |
+| Decode token/s | 16.8 | 15.4 | 14.2 | 13.0 | CUDA OOM |
+| **6 experts V0.2.1** | | | | | |
+| Prefill token/s | 13 | 111 | 112.5 | 102 **(1.16x speedup)** | 101 |
+| Decode token/s | 16.8 | 15.9 | 15.4 | 14.9 **(1.15x speedup)** | 13.9 |
+| **8 experts V0.2.1** | | | | | |
+| Prefill token/s | 12.2 | 88.2 | 88.5 | 81.9 | 80 |
+| Decode token/s | 13.4 | 13.5 | 13.4 | 13.2 | 12.4 |
+
 ### V0.2
 #### Settings
 - Model: DeepseekV3-q4km (int4)<br>
@@ -106,7 +154,7 @@ the output quality doesn't change. But the speed of decoding and prefill
 is sped up, which is inspiring. So our showcase makes use of this finding*

 ## How to Run
-### V0.2 Showcase
+### V0.2 & V0.2.1 Showcase
 #### Single socket version (32 cores)
 Our local_chat test command is:
 ``` shell
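# The exact command is truncated at this hunk boundary; the lines below are a
# hedged sketch only, assuming the ktransformers local_chat.py entry point and
# the --model_path / --gguf_path / --cpu_infer / --max_new_tokens flags, with
# <...> placeholders to fill in (not verbatim from the tutorial).
python ./ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-V3 \
  --gguf_path <path-to-DeepSeek-V3-q4km-gguf-dir> \
  --cpu_infer 33 \
  --max_new_tokens 1000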
@@ -170,6 +218,17 @@ DeepSeek's MLA operators are highly computationally intensive. While running eve

 5. Why Intel CPUs?
 Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance compared to AVX-only alternatives.
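To check whether a CPU actually advertises AMX before expecting these numbers (an editorial aside, not part of the original explanation), the instruction flags can be read from `lscpu`:

``` shell
# List the AMX-related flags advertised by the CPU
# (4th-gen Xeon Scalable parts such as the Gold 6454S report amx_bf16, amx_tile, amx_int8)
lscpu | grep -o 'amx[^ ]*' | sort -u
```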
+## Next
+### Faster
+* The FlashInfer (https://github.com/flashinfer-ai/flashinfer) project is releasing an even more efficient fused MLA operator, promising further speedups.
+* vLLM has explored multi-token prediction in DeepSeek-V3, and support is on our roadmap for even better performance.
+* We are collaborating with Intel to enhance the AMX kernel (v0.3) and optimize for Xeon6/MRDIMM.
+### Easier
+* Official Docker images to simplify installation.
+* Fix the server integration for web API access.
+* Support for more quantization types, including the highly requested dynamic quantization from unsloth.
+
+Stay tuned for more updates!
 ## FAQ
 ### R1 No Thinking
 Attention! If you are testing R1, it may skip the thinking stage; you can add the arg `--force_think true`. Details are in the [FAQ](./FAQ.md). <br>
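For example (a hedged sketch reusing the assumed local_chat.py invocation from above; only the `--force_think true` flag is taken from the text, the rest is placeholder):

``` shell
# Force R1 to emit its thinking section instead of skipping it
python ./ktransformers/local_chat.py \
  --model_path <your-model-path> \
  --gguf_path <your-gguf-path> \
  --force_think true
```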
