|
3 | 3 | - [SUMMARY](#summary) |
4 | 4 | - [Show Case Environment](#show-case-environment) |
5 | 5 | - [Bench Result](#bench-result) |
6 | | - - [V0.2](#v02) |
7 | | - - [Settings](#settings) |
| 6 | + - [V0.2.1](#v021) |
8 | 7 | - [Memory consumption:](#memory-consumption) |
| 8 | + - [Change Log](#change-log) |
9 | 9 | - [Benchmark Results](#benchmark-results) |
| 10 | + - [V0.2](#v02) |
| 11 | + - [Settings](#settings) |
| 12 | + - [Memory consumption:](#memory-consumption-1) |
| 13 | + - [Benchmark Results](#benchmark-results-1) |
10 | 14 | - [V0.3-Preview](#v03-preview) |
11 | 15 | - [Settings](#settings-1) |
12 | 16 | - [Memory consumptions:](#memory-consumptions) |
13 | | - - [Benchmark results](#benchmark-results-1) |
| 17 | + - [Benchmark results](#benchmark-results-2) |
14 | 18 | - [How to Run](#how-to-run) |
15 | | - - [V0.2 Showcase](#v02-showcase) |
| 19 | + - [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase) |
16 | 20 | - [Single socket version (32 cores)](#single-socket-version-32-cores) |
17 | 21 | - [Dual socket version (64 cores)](#dual-socket-version-64-cores) |
18 | 22 | - [V0.3 Showcase](#v03-showcase) |
19 | 23 | - [Dual socket version (64 cores)](#dual-socket-version-64-cores-1) |
20 | 24 | - [Some Explanations](#some-explanations) |
| 25 | + - [Next](#next) |
| 26 | + - [Faster](#faster) |
| 27 | + - [Easier](#easier) |
21 | 28 | - [FAQ](#faq) |
22 | 29 | - [R1 No Thinking](#r1-no-thinking) |
23 | 30 | - [More FAQ](#more-faq) |
@@ -49,13 +56,54 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285 |
49 | 56 | We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **28× faster than llama.cpp** for local inference. |
50 | 57 | The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl) |
51 | 58 |
|
| 59 | +> **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%) (Up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/). |
| 60 | +
|
| 61 | +We sped up the decode and prefill speed a little bit. The performance improvement is limited mainly because inference is still constrained by the CPU's computational speed and memory bandwidth; the MLA part handled by the GPU accounts for a relatively small proportion of the work. |
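For intuition, here is a rough back-of-envelope sketch (our own estimate, not a project benchmark) of the memory-bandwidth ceiling on decode. It assumes DeepSeek-V3 activates roughly 37B parameters per token, q4km weights take roughly half a byte per parameter, and every activated weight is streamed from DRAM once per token:

```python
# Assumption-laden estimate of the DRAM-bandwidth ceiling for CPU-side decode.
channel_bw_gb_s = 4.8 * 8             # DDR5-4800: 4800 MT/s * 8 bytes ~= 38.4 GB/s per channel
socket_bw_gb_s = channel_bw_gb_s * 8  # 8 channels per socket ~= 307 GB/s

active_params = 37e9                  # activated (not total) parameters per token for DeepSeek-V3
bytes_per_param = 0.5                 # ~4-bit quantized weights (q4km)
gb_read_per_token = active_params * bytes_per_param / 1e9  # ~= 18.5 GB streamed per token

print(f"decode ceiling ~= {socket_bw_gb_s / gb_read_per_token:.1f} tokens/s")  # ~= 16-17 tokens/s
```

That ceiling is close to the decode numbers in the tables below, which is why further gains have to come from more memory bandwidth (e.g. MRDIMM) or from reading fewer experts per token, rather than from the GPU side.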
| 62 | + |
| 63 | +Besides the improvements in speed, we've also significantly updated the documentation to enhance usability, including:<br> |
| 64 | +- Added a Multi-GPU configuration tutorial. |
| 65 | +- Consolidated the installation guide. |
| 66 | +- Added a detailed tutorial on registering extra GPU memory with ExpertMarlin. |
| 67 | + |
52 | 68 |
|
53 | 69 | ## Show Case Environment |
54 | 70 | We run our best performance tests (V0.2) on <br> |
55 | 71 | CPU: Intel (R) Xeon (R) Gold 6454S, 1T DRAM (2 NUMA nodes) <br>
56 | 72 | GPU: 4090D 24G VRAM <br> |
57 | | -Memory: standard DDR5-4800 server DRAM (1 TB) |
| 73 | +Memory: standard DDR5-4800 server DRAM (1 TB), each socket with 8×DDR5-4800 |
58 | 74 | ## Bench Result |
| 75 | +### V0.2.1 |
| 76 | +- Model: DeepseekV3-q4km (int4)<br> |
| 77 | +- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes |
| 78 | +- GPU: 4090 24G VRAM |
| 79 | +- We test after sufficient warm-up |
| 80 | +#### Memory consumption: |
| 81 | + - Single socket: 382G DRAM, at least 14GB VRAM |
| 82 | + - Dual socket: 1T DRAM, at least 14GB VRAM |
| 83 | +#### Change Log |
| 84 | +- Longer Context (from 4K to 8K for 24GB VRAM) and Slightly Faster Speed (+15%):<br> |
| 85 | +Integrated the highly efficient Triton MLA kernel from the fantastic sglang project, enabling a much longer context length and slightly faster prefill/decode speed |
| 86 | +- We suspect the impressive improvement comes from the change of hardware platform (4090D -> 4090) |
| 87 | +#### Benchmark Results |
| 88 | + |
| 89 | + |
| 90 | +"6 experts" case is part of V0.3's preview |
| 91 | + |
| 92 | + |
| 93 | +| Prompt | hi (2) | 1K (969) | 2K (1930) | 4K (3846) | 8K |
| 94 | +| --- | --- | --- | --- | --- | --- | |
| 95 | +| Output length | 10tokens | 300tokens | 300tokens | 300tokens | 300tokens | |
| 96 | +| **6 experts V0.2.0** | | | | | | |
| 97 | +| Prefill token/s | 13 | 105 | 102 | 88 | CUDA OOM | |
| 98 | +| decode token/s | 16.8 | 15.4 | 14.2 | 13.0 | CUDA OOM | |
| 99 | +| **6 experts V0.2.1** | | | | | | |
| 100 | +| Prefill token/s | 13 | 111 | 112.5 | 102 **(1.16x speedup)** | 101 | |
| 101 | +| decode token/s | 16.8 | 15.9 | 15.4 | 14.9 **(1.15x speedup)** | 13.9 | |
| 102 | +| **8 experts V0.2.1** | | | | | | |
| 103 | +| Prefill token/s | 12.2 | 88.2 | 88.5 | 81.9 | 80 | |
| 104 | +| Decode token/s | 13.4 | 13.5 | 13.4 | 13.2 | 12.4 | |
| 105 | + |
| 106 | + |
59 | 107 | ### V0.2 |
60 | 108 | #### Settings |
61 | 109 | - Model: DeepseekV3-q4km (int4)<br> |
@@ -106,7 +154,7 @@ the output quality doesn't change. But the speed of decoding and prefill |
106 | 154 | is sped up, which is inspiring. So our showcase makes use of this finding* |
107 | 155 |
|
108 | 156 | ## How to Run |
109 | | -### V0.2 Showcase |
| 157 | +### V0.2 & V0.2.1 Showcase |
110 | 158 | #### Single socket version (32 cores) |
111 | 159 | Our local_chat test command is: |
112 | 160 | ``` shell |
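# Illustrative sketch only (not part of this diff): the flags below follow the
# project's DeepSeek-R1/V3 tutorial; the paths are placeholders and values such
# as --cpu_infer may need tuning for your machine.
numactl -N 1 -m 1 python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --prompt_file <your prompt txt file> \
  --cpu_infer 33 --max_new_tokens 1000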
@@ -170,6 +218,17 @@ DeepSeek's MLA operators are highly computationally intensive. While running eve |
170 | 218 |
|
171 | 219 | 5. Why Intel CPUs? |
172 | 220 | Intel is currently the only CPU vendor that supports AMX-like instructions, which delivers significantly better performance compared to AVX-only alternatives. |
| 221 | +## Next |
| 222 | +### Faster |
| 223 | +* The FlashInfer (https://github.com/flashinfer-ai/flashinfer) project is releasing an even more efficient fused MLA operator, promising further speedups |
| 224 | +* vLLM has explored multi-token prediction in DeepSeek-V3, and support is on our roadmap for even better performance |
| 225 | +* We are collaborating with Intel to enhance the AMX kernel (v0.3) and optimize for Xeon6/MRDIMM |
| 226 | +### Easier |
| 227 | +* Official Docker images to simplify installation |
| 228 | +* Fix the server integration for web API access |
| 229 | +* Support for more quantization types, including the highly requested dynamic quantization from unsloth |
| 230 | + |
| 231 | +Stay tuned for more updates! |
173 | 232 | ## FAQ |
174 | 233 | ### R1 No Thinking |
175 | 234 | Attention! If you are testing R1, it may skip thinking; you can add the arg `--force_think true`. The details are in the [FAQ](./FAQ.md) section <br>
|