
Commit ef89b15

* Reorganize documentation/README
* Consolidate the installation section, as it's currently too cluttered
* Move the Multi-GPU section to the top-level structure
* Add a **detailed** tutorial on registering extra GPU memory with Marlin
1 parent b0b9027 commit ef89b15

File tree

7 files changed (+420, -241 lines)


README.md

Lines changed: 8 additions & 217 deletions
@@ -23,14 +23,13 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>

-* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
-* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
+* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi-GPU and 382G DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
-* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
+* **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPUs; Support new model: mixtral 8\*7B and 8\*22B; Support q2k, q3k, q5k dequant on GPU.
* **Aug 9, 2024**: Support Windows native.
-
+<!-- * **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md). -->

<h2 id="show-cases">🌟 Show Cases</h2>

<div>
@@ -69,7 +68,7 @@ https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
</p>

-<h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
+<!-- <h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">

https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
@@ -91,228 +90,20 @@ https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full-attention approach of llama.cpp.

* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
-
+-->

<strong>More advanced features will be coming soon, so stay tuned!</strong>

<h2 id="quick-start">🚀 Quick Start</h2>

-<h3>Preparation</h3>
-Some preparation:
-
-- CUDA 12.1 and above, if you didn't have it yet, you may install from [here](https://developer.nvidia.com/cuda-downloads).
-
-```sh
-# Adding CUDA to PATH
-export PATH=/usr/local/cuda/bin:$PATH
-export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
-export CUDA_PATH=/usr/local/cuda
-```
-
-- Linux-x86_64 with gcc, g++ and cmake
-
-```sh
-sudo apt-get update
-sudo apt-get install gcc g++ cmake ninja-build
-```
-
-- We recommend using [Conda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh) to create a virtual environment with Python=3.11 to run our program.
-
-```sh
-conda create --name ktransformers python=3.11
-conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
-```
-
-- Make sure that PyTorch, packaging, ninja is installed
-
-```
-pip install torch packaging ninja cpufeature numpy
-```
-
-<h3>Installation</h3>
-
-1. Use a Docker image, see [documentation for Docker](./doc/en/Docker.md)
-
-2. You can install using Pypi (for linux):
-
-```
-pip install ktransformers --no-build-isolation
-```
-
-for windows we prepare a pre compiled whl package on [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced.
-
-3. Or you can download source code and compile:
-
-- init source code
-
-```sh
-git clone https://github.com/kvcache-ai/ktransformers.git
-cd ktransformers
-git submodule init
-git submodule update
-```
-
-- [Optional] If you want to run with website, please [compile the website](./doc/en/api/server/website.md) before execute ```bash install.sh```
-
-- Compile and install (for Linux)
-
-```
-bash install.sh
-```
-
-- Compile and install(for Windows)
-
-```
-install.bat
-```
-4. If you are developer, you can make use of the makefile to compile and format the code. <br> the detailed usage of makefile is [here](./doc/en/makefile_usage.md)
-<h3>Local Chat</h3>
-We provide a simple command-line local chat Python script that you can run for testing.
-
-> Note that this is a very simple test tool only support one round chat without any memory about last input, if you want to try full ability of the model, you may go to [RESTful API and Web UI](#id_666). We use the DeepSeek-V2-Lite-Chat-GGUF model as an example here. But we also support other models, you can replace it with any other model that you want to test.
-
-<h4>Run Example</h4>
-
-```shell
-# Begin from root of your cloned repo!
-# Begin from root of your cloned repo!!
-# Begin from root of your cloned repo!!!
-
-# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
-mkdir DeepSeek-V2-Lite-Chat-GGUF
-cd DeepSeek-V2-Lite-Chat-GGUF
-
-wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
-
-cd .. # Move to repo's root dir
-
-# Start local chat
-python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-
-# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
-# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
-# python ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-```
-
-It features the following arguments:
-
-- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat" which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or if you already got local files you may directly use that path to initialize the model.
-
-> Note: <strong>.safetensors</strong> files are not required in the directory. We only need config files to build model and tokenizer.
-
-- `--gguf_path` (required): Path of a directory containing GGUF files which could that can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should only contains GGUF of current model, which means you need one separate directory for each model.
-
-- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
-
-- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
-
-- `--cpu_infer`: Int (default=10). The number of CPUs used for inference. Should ideally be set to the (total number of cores - 2).
-
-<h3 id="suggested-model"> Suggested Model</h3>
-
-| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
-| ------------------------------ | ---------- | ----- | --------------- | ----------------- |
-| DeepSeek-R1-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V3-q4_k_m | 377G | 14G | 382G | 512G |
-| DeepSeek-V2-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-q4_k_m | 133G | 11G | 136G | 192G |
-| DeepSeek-V2.5-IQ4_XS | 117G | 10G | 107G | 128G |
-| Qwen2-57B-A14B-Instruct-q4_k_m | 33G | 8G | 34G | 64G |
-| DeepSeek-V2-Lite-q4_k_m | 9.7G | 3G | 13G | 16G |
-| Mixtral-8x7B-q4_k_m | 25G | 1.6G | 51G | 64G |
-| Mixtral-8x22B-q4_k_m | 80G | 4G | 86.1G | 96G |
-| InternLM2.5-7B-Chat-1M | 15.5G | 15.5G | 8G(32K context) | 150G (1M context) |
-
-
-More will come soon. Please let us know which models you are most interested in.
-
-Be aware that you need to be subject to their corresponding model licenses when using [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/LICENSE) and [QWen](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE).
-
-<details>
-<summary>Click To Show how to run other examples</summary>
-
-* Qwen2-57B
-
-```sh
-pip install flash_attn # For Qwen2
-
-mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF
-
-wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf
-
-cd ..
-
-python -m ktransformers.local_chat --model_name Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
-
-# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
-# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
-# python ktransformers/local_chat.py --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
-```
-
-* DeepseekV2
-
-```sh
-mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF
-# Download weights
-wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
-wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
-wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
-wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -o DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf

-cd ..
+Getting started with KTransformers is simple! Follow the steps below to set up and start using it.

-python -m ktransformers.local_chat --model_name deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
+### 📥 Installation

-# If you see “OSError: We couldn't connect to 'https://huggingface.co' to load this file”, try:
-
-# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
-
-# python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
-```
-
-| model name | weights download link |
-|----------|----------|
-| Qwen2-57B | [Qwen2-57B-A14B-gguf-Q4K-M](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/tree/main) |
-| DeepseekV2-coder |[DeepSeek-Coder-V2-Instruct-gguf-Q4K-M](https://huggingface.co/LoneStriker/DeepSeek-Coder-V2-Instruct-GGUF/tree/main) |
-| DeepseekV2-chat |[DeepSeek-V2-Chat-gguf-Q4K-M](https://huggingface.co/bullerwins/DeepSeek-V2-Chat-0628-GGUF/tree/main) |
-| DeepseekV2-lite | [DeepSeek-V2-Lite-Chat-GGUF-Q4K-M](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main) |
-
-</details>
-
-<!-- pin block for jump -->
-<span id='id_666'>
-
-<h3>RESTful API and Web UI</h3>
-
-
-Start without website:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
-```
-
-Start with website:
-
-```sh
-ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
-```
-
-Or you want to start server with transformers, the model_path should include safetensors
-
-```bash
-ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
-```
-
-Access website with url [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat) :
-
-<p align="center">
-<picture>
-<img alt="Web UI" src="https://github.com/user-attachments/assets/615dca9b-a08c-4183-bbd3-ad1362680faf" width=90%>
-</picture>
-</p>
+To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/).

-More information about the RESTful API server can be found [here](doc/en/api/server/server.md). You can also find an example of integrating with Tabby [here](doc/en/api/server/tabby.md).

<h2 id="tutorial">📃 Brief Injection Tutorial</h2>
At the heart of KTransformers is a user-friendly, template-based injection framework.
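Note: the new README defers to the hosted [Installation Guide](https://kvcache-ai.github.io/ktransformers/) above. As a quick reference, here is a minimal source-install sketch distilled from the steps this commit removes from the README; the hosted guide remains the authoritative, up-to-date source.

```sh
# Minimal source install, condensed from the removed README steps above
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
bash install.sh   # Linux; on Windows run install.bat instead
```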

doc/SUMMARY.md

Lines changed: 8 additions & 5 deletions
@@ -1,12 +1,15 @@
# Ktransformer

[Introduction](./README.md)
-# DeepSeek
-- [Deepseek-R1/V3 Tutorial](en/DeepseekR1_V3_tutorial.md)
-- [Deepseek-V2 Injection](en/deepseek-v2-injection.md)
-- [Injection Tutorial](en/injection_tutorial.md)
+# Install
+- [Installation Guide](en/install.md)

-# Server
+# Tutorial
+- [Deepseek-R1/V3 Show Case](en/DeepseekR1_V3_tutorial.md)
+- [Why KTransformers So Fast](en/deepseek-v2-injection.md)
+- [Injection Tutorial](en/injection_tutorial.md)
+- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
+# Server (Temporarily Deprecated)
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
- [Tabby](en/api/server/tabby.md)
Binary file not shown (-1.23 MB).

doc/en/DeepseekR1_V3_tutorial.md

Lines changed: 17 additions & 17 deletions
@@ -1,7 +1,7 @@
<!-- omit in toc -->
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
- [SUMMARY](#summary)
-- [Prerequisites](#prerequisites)
+- [Show Case Environment](#show-case-environment)
- [Bench Result](#bench-result)
- [V0.2](#v02)
- [Settings](#settings)
@@ -50,7 +50,7 @@ We also give our upcoming optimizations previews, including an Intel AMX-acceler
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl)

-## Prerequisites
+## Show Case Environment
We run our best performance tests (V0.2) on <br>
CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
@@ -110,35 +110,35 @@ is speed up which is inspiring. So our showcase makes use of this finding*
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
-git clone https://github.com/kvcache-ai/ktransformers.git
-cd ktransformers
-git submodule init
-git submodule update
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --max_new_tokens 1000
<when you see chat, then press enter to load the text prompt_file>
```
`<your model path>` can be local or set from online Hugging Face like deepseek-ai/DeepSeek-V3. If the online connection has problems, try using the mirror (hf-mirror.com) <br>
`<your gguf path>` can also be online, but as it is large we recommend you download it and quantize the model to what you want (notice it's the dir path) <br>
`--max_new_tokens 1000` is the max output token length. If you find the answer is truncated, you
can increase the number for a longer answer (but be aware of OOM; increasing it will slow down the generation rate).
-<br>
-The command numactl -N 1 -m 1 aims to advoid data transfer between numa nodes<br>
+
+The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes<br>
Attention! If you are testing R1, it may skip thinking; you can add the arg `--force_think true`. This is explained in the [FAQ](#faq) part

#### Dual socket version (64 cores)
-Make suer before you install (use install.sh or `make dev_install`), setting the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set) <br>
-Our local_chat test command is:
+
+Make sure, before you install (using install.sh or `make dev_install`), that you set the env var `USE_NUMA=1` via `export USE_NUMA=1` (if already installed, reinstall it with this env var set). You may check the doc [here](./install.md) for install details. <br>
+
+Test Command:
``` shell
-git clone https://github.com/kvcache-ai/ktransformers.git
-cd ktransformers
-git submodule init
-git submodule update
-export USE_NUMA=1
-make dev_install # or sh ./install.sh
+# ---For those who have not installed ktransformers---
+# git clone https://github.com/kvcache-ai/ktransformers.git
+# cd ktransformers
+# git submodule init
+# git submodule update
+# export USE_NUMA=1
+# make dev_install # or sh ./install.sh
+# ----------------------------------------------------
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --max_new_tokens 1000
<when you see chat, then press enter to load the text prompt_file>
```
-The parameters' meaning is the same. But As we use dual socket, we set cpu_infer to 65
+The parameters' meaning is the same, but as we use dual sockets, we set cpu_infer to 65.

### V0.3 Showcase
#### Dual socket version (64 cores)
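When reproducing the single- and dual-socket runs above, it can help to confirm the machine's NUMA layout before choosing the `numactl -N`/`-m` values and `--cpu_infer`. The commands below use standard Linux utilities (`numactl`, `lscpu`), which are not part of this commit and are assumed to be installed.

```sh
# Inspect NUMA nodes and the CPUs/memory attached to each
numactl --hardware
lscpu | grep -i numa
```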

doc/en/deepseek-v2-injection.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-# Tutorial: Heterogeneous and Local DeepSeek-V2 Inference
+# Tutorial: Heterogeneous and Local MoE Inference

-DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4.
+DeepSeek-(Code)-V2 is a series of strong mixture-of-experts (MoE) models, featuring a total of 236 billion parameters, with 21 billion parameters activated per token. This model has demonstrated remarkable reasoning capabilities across various benchmarks, positioning it as one of the SOTA open models and nearly comparable in performance to GPT-4. DeepSeek-R1 uses a similar architecture to DeepSeek-V2, but with a bigger number of parameters.

<p align="center">
<picture>
