- **Llamafile CPU Backend**: AVX2/AVX512-based MoE backend built on Llamafile for universal CPU deployment.
- **NUMA-Aware Execution**: Thread pool and memory layout designed for multi-socket / multi-NUMA machines.

## Installation
You can install using the same script, either in two explicit steps or all at once.

Option A: Two-step (install dependencies and build separately)

```bash
# 1) Install system prerequisites (cmake, hwloc, pkg-config)
./install.sh deps
# 2) Build and install kt-kernel (auto-detects CPU instruction set)
# By default, the script cleans the local ./build directory before compiling
./install.sh build
```

Option B: One-step

```bash
./install.sh
```

The install script will:

- AMX CPU detected → `NATIVE + AMX=ON`
- No AMX detected → `NATIVE + AMX=OFF`
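
To see in advance which case applies to your machine, you can check the kernel's CPU feature flags (a quick Linux-only check; the `amx_*` flag names come from the kernel, everything else here is standard tooling):

```bash
# Look for AMX feature flags (amx_tile, amx_int8, amx_bf16) on Linux
grep -o 'amx[a-z0-9_]*' /proc/cpuinfo | sort -u
```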
⚠️ **Important for LLAMAFILE backend users:** If you have an AMX-capable CPU but plan to use the LLAMAFILE backend, do NOT use the default auto-detection build. Use manual mode with `CPUINFER_CPU_INSTRUCT` set to `AVX512` or `AVX2` instead of `NATIVE` to avoid compilation issues (see below).
### Manual Configuration (Advanced)
LLAMAFILE uses pre-quantized **GGUF** weights on the CPU side directly, without running `convert_cpu_weights.py`. You need to:
- Download a GGUF model directly from the web (e.g., GGUF repos on Hugging Face / Modelscope);
- In SGLang integration, use that GGUF directory as `--kt-weight-path`.
KT-Kernel supports multiple GGUF quantization formats such as `Q4_KM`, `Q4_K`, `Q5_K`, etc. Choose based on your latency and accuracy requirements.
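
As a concrete sketch of this flow (the repository name and paths below are illustrative; substitute the GGUF repo you actually want):

```bash
# Download a GGUF model from Hugging Face (illustrative repo and paths)
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models/qwen3-30b-a3b-gguf

# Then pass the directory to SGLang:
#   --kt-weight-path ./models/qwen3-30b-a3b-gguf
```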
#### 3. Launch SGLang Server
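A minimal launch sketch (model and weight paths are illustrative; `--model-path`, `--host`, and `--port` are standard SGLang flags, while `--kt-weight-path` is the kt-kernel option described below):

```bash
# Sketch: launch SGLang with kt-kernel CPU-offloaded experts (illustrative paths)
python -m sglang.launch_server \
  --model-path ./models/Qwen3-30B-A3B \
  --kt-weight-path ./models/qwen3-30b-a3b-cpu-weights \
  --host 0.0.0.0 \
  --port 30000
```
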
See the [KT-Kernel Parameters](#kt-kernel-parameters) section below for detailed parameter descriptions.
### Complete Example: Qwen3-30B-A3B
This example demonstrates the full workflow from downloading weights to launching the server, showing both **AMX backend** and **LLAMAFILE backend** options.
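
In outline, the two paths differ only in how the CPU-side expert weights are prepared (a sketch with illustrative paths and assumed argument names; the detailed commands appear elsewhere in this README):

```bash
# Path 1: AMX backend — convert CPU experts first, then launch
python scripts/convert_cpu_weights.py \
  --input ./models/Qwen3-30B-A3B \
  --output ./models/qwen3-30b-a3b-cpu-weights   # argument names are assumptions
python -m sglang.launch_server --model-path ./models/Qwen3-30B-A3B \
  --kt-weight-path ./models/qwen3-30b-a3b-cpu-weights

# Path 2: LLAMAFILE backend — point directly at a downloaded GGUF directory
python -m sglang.launch_server --model-path ./models/Qwen3-30B-A3B \
  --kt-weight-path ./models/qwen3-30b-a3b-gguf
```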

---

KT-Kernel provides weight quantization tools for CPU-GPU hybrid inference (e.g., integrating with SGLang). Both tools work together to enable heterogeneous expert placement across CPUs and GPUs.

### CPU Weights (for "cold" experts on CPU)

For AMX backends (`AMXINT4` / `AMXINT8`), CPU-side experts must be converted to AMX-friendly INT4/INT8 format using the provided script:
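
The invocation below is a sketch: the script name `convert_cpu_weights.py` appears above, but the argument names are assumptions; see [scripts/README.md](scripts/README.md) for the real interface.

```bash
# Sketch: convert CPU-side experts to AMX INT4 format
# (argument names are assumptions; see scripts/README.md)
python scripts/convert_cpu_weights.py \
  --input ./models/Qwen3-30B-A3B \
  --output ./models/qwen3-30b-a3b-cpu-weights \
  --quant-format int4
```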
For LLAMAFILE backend (`LLAMAFILE`), CPU-side experts are loaded directly from **GGUF** weights. You do **not** need to run the AMX conversion script; instead, download a GGUF model from the web (e.g., a GGUF repo on Hugging Face) and point `weight_path` / SGLang `--kt-weight-path` (or `--model` when appropriate) to that GGUF directory. KT-Kernel supports multiple GGUF quantization types such as `Q4_KM`, `Q4_K`, `Q5_K`, etc.
---
For detailed documentation, advanced options, and low-memory mode, see [scripts/README.md](scripts/README.md).
## Before You Commit
Commit messages should follow the Conventional Commits specification: https://www.conventionalcommits.org/
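
For example (illustrative messages in the `type(scope): description` form):

```
feat(kernel): add AVX512 fallback path for LLAMAFILE backend
fix(build): detect hwloc via pkg-config
```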
Please format your code before committing:
```shell
cmake -B build
cd build
make format
```
You may need a newer clang-format (at least version 18). In a conda environment:
```shell
conda install -c conda-forge clang-format=18
rm -rf build
```
It's also recommended to install black for Python code formatting:
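
```shell
pip install black
```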