
Commit 99f6e42

Merge pull request #668 from KMSorSMS/main
📝 update benchmark.md
2 parents: 31bc990 + 3ad1275

File tree

2 files changed: +30 -12 lines changed


doc/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -21,3 +21,5 @@
 - [FAQ](en/FAQ.md)
 # V3 Reproduction
 - [Success List](en/V3-success.md)
+# Benchmark
+- [Benchmark](en/benchmark.md)

doc/en/benchmark.md

Lines changed: 28 additions & 12 deletions
@@ -12,11 +12,11 @@ We set the argument `temperature=0.6`, and to simplify the test process, we skip
 
 Given that we have only tested 1,000 cases, which provides only a preliminary judgment, some fluctuations in the results are reasonable. We selected all datasets and shuffled them with a fixed random seed to ensure consistency.
 
-## Some Detail
+## Some Details
 
 - The bf16 model of DeepSeek-V3 is available [here](https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main) (you may convert it to gguf by llama.cpp). The q4km model can be found [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
 
-- The optimization YAML file is located [here](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/optimize/optimize_rules). For the Matrix MUL Kernel, you can change `KLinearMarlin` to `KLinearTorch`.
+- The optimization YAML file is located [here](https://github.com/kvcache-ai/ktransformers/tree/main/ktransformers/optimize/optimize_rules). For the GEMM Kernel, you can change `KLinearMarlin` to `KLinearTorch`.
 
 - To switch the MLA Kernel from Triton to Torch, you can check and modify [this file](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/attention.py), specifically by using the `forward_windows` method.
 
@@ -29,15 +29,31 @@ Given that we have only tested 1,000 cases, which provides only a preliminary ju
 
 | | | | | | | | |
 | ------------------------ | ----------------- | ---------- | ----------------- | ------- | ---------- | ------------------------------------------------------ | ------------ |
-| DataSet | CPU Weight Format | CPU Kernel | GPU Weight Format | GEMM | MLA Kernel | [Siliconflow](https://cloud.siliconflow.cn/models)<br> | Ktrans Point |
-| MMLU<br><br>(shuffle 1k) | bf16 | cpuinfer | bf16 | torch | torch | 81.6 | 81.9 |
-| | int8 | cpuinfer | bf16 | torch | torch | 81.6 | 83.1 |
-| | q4km | cpuinfer | bf16 | torch | torch | 81.6 | 82.8 |
-| | q4km | cpuinfer | bf16 | torch | triton | 81.6 | 81.4 |
-| | q4km | cpuinfer | q4km->marlin 8 | marlin | triton | 81.6 | 81.1 |
-| | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 81.6 | 81 |
-| | q4km | cpuinfer | fp8 | marlin | triton | 81.6 | 81.5 |
-| MMLU-pro | q4km | cpuinfer | fp8 | fp8gemm | triton | 57.7 | 57.6 |
-| MMLU-pro | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 57.7 | 57.5 |
+| DataSet | CPU Weight Format | CPU Kernel | GPU Weight Format | GEMM Kernel | MLA Kernel | [Siliconflow](https://cloud.siliconflow.cn/models)<br> | Ktrans Point |
+| MMLU<br><br>(shuffle 1k) | | | | | | | |
+| 1 | bf16 | cpuinfer | bf16 | torch | torch | 81.6 | 81.9 |
+| 2 | q8_0 | cpuinfer | bf16 | torch | torch | 81.6 | 83.1 |
+| 3 | q4km | cpuinfer | bf16 | torch | triton | 81.6 | 81.4 |
+| 4 | q4km | cpuinfer | q4km->marlin 8 | marlin | triton | 81.6 | 81.1 |
+| 5 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 81.6 | 81 |
+| 6 | q4km | cpuinfer | fp8 | fp8gemm | triton | 81.6 | 81.5 |
+| MMLU-pro | | | | | | | |
+| 1 | q4km | cpuinfer | fp8 | fp8gemm | triton | 57.7 | 57.6 |
+| 2 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 57.7 | 57.5 |
 | HumanEval | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
 | GSM8K | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
+
+**The details for each case are listed below**:
+
+By default, the MLA kernel uses Triton on Linux and Torch on Windows. But we need to test Torch on Linux, so we manually modify [this file](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/attention.py#L592): remove the if branches and force it to use `self.forward_windows`.
+
+- MMLU test
+    1. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml): change every `KLinearMarlin` to `KLinearTorch` (find all usages in this file; see the `KLinearTorch` sketch after this list). The source weights come from [here](https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16) (use llama.cpp to convert them to gguf).
+    2. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml). You need to modify the code to load the CPU expert weights separately. We left comments at these places: [1](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L122), [2](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L136), [3](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L137) (in 3, change the path to your local weight file path). The q8_0 weight file is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q8_0).
+    3. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml). You need to modify the code to load the CPU expert weights separately. We left comments at these places: [1](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L122), [2](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L136), [3](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/operators/experts.py#L137) (in 3, change the path to your local weight file path). The q4km weight file is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
+    4. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml). You don't need to change the source code, as both use q4km. But in the YAML file, below [this line](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml#L29) and [this line](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml#L18) you need to add `num_bits: 8` (in other words, add this kwarg to every entry that uses `KLinearMarlin`; see the `num_bits: 8` sketch after this list). The q4km weight file is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
+    5. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml). No need to change the YAML; just use the default. The q4km weight file is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
+    6. Check the [doc](./fp8_kernel.md) to learn how to test this case; it uses mixed tensor formats.
+- MMLU-pro test
+    1. Check the [doc](./fp8_kernel.md) to learn how to test this case; it uses mixed tensor formats.
+    2. [v3-chat_yaml](https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml). No need to change the YAML; just use the default. The q4km weight file is [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M).
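As a concrete illustration of the `KLinearMarlin` → `KLinearTorch` swap used in MMLU case 1 (and mentioned under "Some Details"), a minimal sketch of one optimize-rule entry is shown below. It assumes the usual `match`/`replace`/`kwargs` rule layout of the optimize_rules files; the `name` regex here is a placeholder, so copy the real patterns from DeepSeek-V3-Chat.yaml rather than this sketch.

```yaml
# Sketch of one linear-replacement rule with the GEMM kernel switched to Torch.
# The match pattern is illustrative; take the real one from DeepSeek-V3-Chat.yaml.
- match:
    name: "^model\\.layers\\..*$"   # which modules this rule rewrites (regex)
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearTorch"   # was "KLinearMarlin"
      prefill_op: "KLinearTorch"
```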
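For MMLU case 4, the `num_bits: 8` addition sits inside the `kwargs` of each rule that uses `KLinearMarlin`. This is a sketch under the same assumptions as above (placeholder match pattern, generic field values rather than the exact linked lines of the YAML file).

```yaml
# Sketch for MMLU case 4: keep KLinearMarlin on the GPU but repack to 8 bits.
# Add `num_bits: 8` to every rule that uses KLinearMarlin; other fields are illustrative.
- match:
    name: "^model\\.layers\\..*$"   # placeholder pattern
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
      num_bits: 8                   # the "q4km->marlin 8" configuration in the table
```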
