README.md: 217 additions & 8 deletions
@@ -23,13 +23,14 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs, with 382GB DRAM, for up to a 3~28x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepSeek-V2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update the detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models: Mixtral 8\*7B and 8\*22B; support q2k, q3k, q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows native.
* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full-attention approach of llama.cpp.
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>
<h3>Preparation</h3>
Some preparation:
- CUDA 12.1 or above. If you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
```sh
conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run 'conda init' and reopen the shell first
```
- Make sure that PyTorch, packaging, and ninja are installed; a minimal install sketch is shown below.
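A minimal sketch of installing these build prerequisites (the exact torch build you need depends on your CUDA version, so adjust the command to your setup):

```sh
# Prerequisites for building ktransformers; choose the torch build matching your CUDA toolkit.
pip install torch packaging ninja
```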
1. Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)
2. You can install using PyPI (for Linux):
```
pip install ktransformers --no-build-isolation
```
For Windows we provide a pre-compiled whl package: [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which requires CUDA 12.5, torch 2.4, and Python 3.12 (note the cp312 tag). More pre-compiled packages are being produced.
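Installing a downloaded wheel follows the usual pip flow; as a sketch (the filename must match the release asset and your local Python/torch/CUDA versions):

```sh
# Hypothetical example: install the downloaded pre-compiled Windows wheel.
pip install ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl
```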
- [Optional] If you want to run with the website, please [compile the website](./doc/en/api/server/website.md) before executing ```bash install.sh```.
- Compile and install (for Linux)
```
bash install.sh
```
- Compile and install (for Windows)
```
install.bat
```
4. If you are a developer, you can make use of the Makefile to compile and format the code; a sketch is shown below. <br> The detailed usage of the Makefile is [here](./doc/en/makefile_usage.md).
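For instance, a developer-mode build might look like the following sketch (`make dev_install` is the target referenced elsewhere in these docs; see makefile_usage.md for the authoritative target list):

```sh
# Developer install via the Makefile; other targets (e.g. for formatting) are documented in makefile_usage.md.
make dev_install
```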
<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.
> Note that this is a very simple test tool that only supports one-round chat, without any memory of the previous input. If you want to try the full ability of the model, you may go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here, but we also support other models; you can replace it with any other model that you want to test.
<h4>Run Example</h4>
```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!!
# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
```
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). If you already have local files, you may use that path directly to initialize the model.
> Note: <strong>.safetensors</strong> files are not required in the directory. We only need the config files to build the model and tokenizer.
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contain the GGUF files of the current model, which means you need one separate directory per model.
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing optimize rules. There are two pre-written rule files in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14B, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPU cores used for inference; it should ideally be set to (total number of cores - 2). A combined example is sketched below.
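Putting these flags together, a hypothetical invocation could look like the following (the entry point and paths are illustrative placeholders; use the actual command from the Run Example above):

```sh
# Illustrative only: combine the documented flags; replace the entry point and paths with your own.
python -m ktransformers.local_chat \
  --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
  --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
  --max_new_tokens 1000 \
  --cpu_infer 30
```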
<h3 id="suggested-model"> Suggested Model</h3>
| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
| ---------- | ---------- | ---- | ------------ | ---------------- |
More will come soon. Please let us know which models you are most interested in.
Be aware that you are subject to the corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [QWen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).
<details>
<summary>Click to show how to run other examples</summary>
More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
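As a rough sketch of what a client request to such a server might look like — assuming an OpenAI-style chat completions endpoint and port, neither of which is confirmed here (see server.md for the actual routes and defaults):

```sh
# Hypothetical request shape; the endpoint path, port, and payload fields depend on the actual server implementation.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-V2-Lite-Chat", "messages": [{"role": "user", "content": "Hello"}]}'
```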
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.
doc/en/DeepseekR1_V3_tutorial.md: 17 additions & 17 deletions
@@ -1,7 +1,7 @@
<!-- omit in toc -->
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
- [Prerequisites](#prerequisites)
- [Bench Result](#bench-result)
- [V0.2](#v02)
- [Settings](#settings)
@@ -50,7 +50,7 @@ We also give our upcoming optimizations previews, including an Intel AMX-acceler
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)
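Installing that preview wheel would presumably follow the usual pip pattern — a sketch, assuming your Python/CUDA/torch versions match the cp311/cu126/torch26 tags in the filename:

```shell
# Hypothetical: install the preview wheel directly from the release URL.
pip install https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
```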
## Prerequisites
We run our best performance tests (V0.2) on <br>
CPU: Intel(R) Xeon(R) Gold 6454S with 1TB DRAM (2 NUMA nodes) <br>
GPU: 4090D with 24GB VRAM <br>
@@ -110,35 +110,35 @@ is speed up which is inspiring. So our showcase makes use of this finding*
<when you see chat, then press enter to load the text prompt_file>
```
`<your model path>` can be a local path or an online Hugging Face path such as deepseek-ai/DeepSeek-V3. If the online download runs into connection problems, try using the mirror (hf-mirror.com). <br>
`<your gguf path>` can also be online, but since it is large we recommend you download it and quantize the model to whatever you want (note that this is the directory path). <br>
`--max_new_tokens 1000` is the max output token length. If you find the answer is truncated, you can increase the number for a longer answer (but be aware of OOM; increasing it will also slow down the generation rate).
<br>
The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.<br>
Attention! If you are testing R1, it may skip thinking. You can add the arg `--force_think true`; this is explained in the [FAQ](#faq) section.
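For instance, a single-socket launch combining these hints might be sketched as follows (the entry point and placeholder paths are illustrative; take the real command from the block above):

```shell
# Sketch: pin execution and memory to NUMA node 1 and force R1 to emit its thinking.
numactl -N 1 -m 1 python -m ktransformers.local_chat \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --max_new_tokens 1000 \
  --force_think true
```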
#### Dual socket version (64 cores)
Make sure, before you install (using install.sh or `make dev_install`), to set the env var `USE_NUMA=1` with `export USE_NUMA=1` (if already installed, reinstall with this env var set). <br>
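A minimal sketch of that install flow, assuming you are in the repo root:

```shell
# Enable NUMA support at build time, then (re)install.
export USE_NUMA=1
bash install.sh   # or: make dev_install
```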
Our local_chat test command is:
```shell
# ---For those who have not installed ktransformers---
```
doc/en/deepseek-v2-injection.md: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
# Tutorial: Heterogeneous and Local DeepSeek-V2 Inference
DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4.