<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs with 382GB of DRAM, achieving up to a 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepSeek-V2's required VRAM from 21GB to 11GB.
* **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPUs; support new models: Mixtral 8\*7B and 8\*22B; support q2k, q3k, and q5k dequantization on GPU.
* **Aug 9, 2024**: Support Windows native.
<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->
* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than llama.cpp's full-attention approach.
* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. Further information is available [here](./doc/en/long_context_introduction.md).
<strong>More advanced features are coming soon, so stay tuned!</strong>
<h2 id="quick-start">🚀 Quick Start</h2>
<h3>Preparation</h3>
Some preparation:
- CUDA 12.1 or above; if you don't have it yet, you can install it from [here](https://developer.nvidia.com/cuda-downloads).
- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
```sh
conda create --name ktransformers python=3.11
conda activate ktransformers  # you may need to run 'conda init' and reopen the shell first
```
- Make sure that PyTorch, packaging, and ninja are installed.
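If any of these are missing, they can be installed with pip. The snippet below is a minimal sketch, not the project's verbatim instructions; the exact package set you need may differ, and you may prefer the official PyTorch install command for your CUDA version:

```sh
# build prerequisites named above; choose the torch build that matches your CUDA install
pip install torch packaging ninja
```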
1. Use a Docker image; see the [documentation for Docker](./doc/en/Docker.md).
2. You can install using PyPI (for Linux):
```
pip install ktransformers --no-build-isolation
```
For Windows, we provide a pre-compiled whl package at [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which requires CUDA 12.5, torch 2.4, and Python 3.12; more pre-compiled packages are being produced.
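After downloading the wheel on Windows, installing it is a plain `pip install` of the file. A minimal sketch (the file name mirrors the release asset linked above; adjust it if you downloaded a different build):

```sh
# install the pre-compiled Windows wheel downloaded from the releases page
pip install ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl
```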
- [Optional] If you want to run with the website, please [compile the website](./doc/en/api/server/website.md) before executing ```bash install.sh```
- Compile and install (for Linux)
```
bash install.sh
```
- Compile and install (for Windows)
```
install.bat
```
4. If you are a developer, you can make use of the Makefile to compile and format the code. <br> The detailed usage of the Makefile is [here](./doc/en/makefile_usage.md)
<h3>Local Chat</h3>
We provide a simple command-line local chat Python script that you can run for testing.
> Note that this is a very simple test tool that only supports one round of chat without any memory of the previous input; if you want to try the full ability of the model, you may go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here, but we also support other models; you can replace it with any other model that you want to test.
<h4>Run Example</h4>
```shell
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!!

# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
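# The commands below are an illustrative sketch rather than the repo's verbatim example:
# the download tool, the GGUF file name, and the entry-point module are assumptions here,
# so adjust them to your own setup.
huggingface-cli download mzwing/DeepSeek-V2-Lite-Chat-GGUF \
    DeepSeek-V2-Lite-Chat.Q4_K_M.gguf --local-dir ./DeepSeek-V2-Lite-Chat-GGUF

# Run the local chat script, pointing --gguf_path at the directory holding the GGUF file
python -m ktransformers.local_chat \
    --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
    --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
```

The script accepts the following arguments: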
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). If you already have local files, you may use that path directly to initialize the model.
> Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build the model and tokenizer.
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should contain only the GGUF files of the current model, which means you need a separate directory for each model.
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of the YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPU cores used for inference. Ideally, set this to (total number of cores - 2).
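Putting these flags together, a hypothetical invocation could look like the sketch below; the entry-point module and the paths are illustrative placeholders rather than a verbatim example from the project:

```sh
# run the local chat script with an explicit token budget and CPU core count
python -m ktransformers.local_chat \
    --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
    --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
    --max_new_tokens 1000 \
    --cpu_infer 30  # e.g. on a 32-core machine
```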
<h3 id="suggested-model">Suggested Model</h3>
| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
More will come soon. Please let us know which models you are most interested in.
Be aware that you are subject to the corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [Qwen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).
<details>
<summary>Click to show how to run other examples</summary>
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/).
More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).
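As a rough illustration of how such a server is usually queried, the request below assumes an OpenAI-style chat-completions endpoint; the host, port, route, and payload fields are assumptions here, so consult the server documentation linked above for the actual interface:

```sh
# hypothetical request against a locally running ktransformers server
curl http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "DeepSeek-V2-Lite-Chat",
          "messages": [{"role": "user", "content": "Hello from KTransformers!"}]
        }'
```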
<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.