Commit 4e7450d

adds llamacpp benchmarking support (#263)
1 parent ed30c98 · commit 4e7450d

File tree

5 files changed (+678 −99 lines)


docs/llamacpp.md

Lines changed: 105 additions & 27 deletions
@@ -1,48 +1,126 @@
 # LLAMA.CPP
 
-Run transformer models using a Llama.cpp binary and checkpoint. This model can then be used with chatting or benchmarks such as MMLU.
+Run transformer models using llama.cpp. This integration allows you to:
+1. Load and run llama.cpp models
+2. Benchmark model performance
+3. Use the models with other tools like chat or MMLU accuracy testing
 
 ## Prerequisites
 
-This flow has been verified with a generic Llama.cpp model.
+You need:
+1. A compiled llama.cpp executable (llama-cli or llama-cli.exe)
+2. A GGUF model file
 
-These instructions are only for linux or Windows with wsl. It may be necessary to be running WSL in an Administrator command prompt.
+### Building llama.cpp (if needed)
 
-These instructions also assumes that lemonade has been installed.
+#### Linux
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+make
+```
+
+#### Windows
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+cmake -B build
+cmake --build build --config Release
+```
 
+The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the root directory on Linux.
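
If you do not already have a checkpoint, GGUF files can be fetched directly from Hugging Face. A minimal sketch for downloading the checkpoint used in the examples further down, assuming the repository follows the usual `resolve/main` file layout (adjust the filename to the quantization you want):

```bash
mkdir -p models
# Filename is taken from the example commands below; other quantizations
# live in the same Qwen/Qwen2.5-0.5B-Instruct-GGUF repository.
wget -P models \
    https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-fp16.gguf
```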
 
-### Set up Environment (Assumes TurnkeyML is already installed)
+## Usage
 
-Build or obtain the Llama.cpp model and desired checkpoint.
-For example (see the [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md
-) source for more details):
-1. cd ~
-1. git clone https://github.com/ggerganov/llama.cpp
-1. cd llama.cpp
-1. make
-1. cd models
-1. wget https://huggingface.co/TheBloke/Dolphin-Llama2-7B-GGUF/resolve/main/dolphin-llama2-7b.Q5_K_M.gguf
+### Loading a Model
 
+Use the `load-llama-cpp` tool to load a model:
 
-## Usage
+```bash
+lemonade -i MODEL_NAME load-llama-cpp \
+    --executable PATH_TO_EXECUTABLE \
+    --model-binary PATH_TO_GGUF_FILE
+```
 
-The Llama.cpp tool currently supports the following parameters
+Parameters:
+| Parameter     | Required | Default | Description                          |
+|---------------|----------|---------|--------------------------------------|
+| executable    | Yes      | -       | Path to llama-cli/llama-cli.exe      |
+| model-binary  | Yes      | -       | Path to .gguf model file             |
+| threads       | No       | 1       | Number of threads for generation     |
+| context-size  | No       | 512     | Context window size                  |
+| output-tokens | No       | 512     | Maximum number of tokens to generate |
 
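The optional flags stack onto the same invocation; for example, a sketch that raises the defaults (the flag names come from the table above, the values are illustrative):

```bash
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 2048 \
    --output-tokens 256
```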

-| Parameter | Definition | Default |
-| --------- | ---------------------------------------------------- | ------- |
-| executable | Path to the Llama.cpp-generated application binary | None |
-| model-binary | Model checkpoint (do not use if --input is passed to lemonade) | None |
-| threads | Number of threads to use for computation | 1 |
-| context-size | Maximum context length | 512 |
-| temp | Temperature to use for inference (leave out to use the application default) | None |
+### Benchmarking
 
-### Example (assuming Llama.cpp built and a checkpoint loaded as above)
+After loading a model, you can benchmark it using `llama-cpp-bench`:
 
 ```bash
-lemonade --input ~/llama.cpp/models/dolphin-llama2-7b.Q5_K_M.gguf load-llama-cpp --executable ~/llama.cpp/llama-cli accuracy-mmlu --ntrain 5
+lemonade -i MODEL_NAME \
+    load-llama-cpp \
+    --executable PATH_TO_EXECUTABLE \
+    --model-binary PATH_TO_GGUF_FILE \
+    llama-cpp-bench
 ```
 
-On windows, the llama.cpp binary might be in a different location (such as llama.cpp\build\bin\Release\), in which case the command mgiht be something like:
+Benchmark parameters:
+| Parameter         | Default                     | Description                               |
+|-------------------|-----------------------------|-------------------------------------------|
+| prompt            | "Hello, I am conscious and" | Input prompt for benchmarking             |
+| context-size      | 512                         | Context window size                       |
+| output-tokens     | 512                         | Number of tokens to generate              |
+| iterations        | 1                           | Number of benchmark iterations            |
+| warmup-iterations | 0                           | Number of warmup iterations (not counted) |
+
+The benchmark will measure and report:
+- Time to first token (prompt evaluation time)
+- Token generation speed (tokens per second)
+
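For intuition, here is a minimal sketch of how these two metrics could be derived by hand from llama-cli's timing report. It assumes the classic `llama_print_timings` stderr format of older llama.cpp builds (newer builds word their timings differently), so the regex and field names are illustrative rather than lemonade's actual implementation:

```python
import re
import subprocess

# Matches lines such as:
#   llama_print_timings: prompt eval time =  123.45 ms /  10 tokens (...)
#   llama_print_timings:        eval time = 2345.67 ms / 511 runs   (...)
TIMINGS = re.compile(
    r"llama_print_timings:\s+(prompt eval|eval) time\s*=\s*"
    r"([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:tokens|runs)"
)

def bench_once(executable: str, model_binary: str, prompt: str, n: int = 512) -> dict:
    """Run one generation and pull the two headline metrics from stderr."""
    proc = subprocess.run(
        [executable, "-m", model_binary, "-p", prompt, "-n", str(n)],
        capture_output=True, text=True,
    )
    metrics = {}
    for phase, ms, count in TIMINGS.findall(proc.stderr):
        if phase == "prompt eval":
            # Time to first token ~= prompt evaluation time
            metrics["time_to_first_token_ms"] = float(ms)
        else:
            # Token generation speed in tokens per second
            metrics["tokens_per_second"] = int(count) / (float(ms) / 1000.0)
    return metrics
```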
+### Example Commands
+
+#### Windows Example
 ```bash
-lemonade --input ~\llama.cpp\models\dolphin-llama2-7b.Q5_K_M.gguf load-llama-cpp --executable ~\llama.cpp\build\bin\Release\llama-cli accuracy-mmlu --ntrain 5
+# Load and benchmark a model
+lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
+    load-llama-cpp \
+    --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
+    --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
+    llama-cpp-bench \
+    --iterations 3 \
+    --warmup-iterations 1
+
+# Run MMLU accuracy test
+lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
+    load-llama-cpp \
+    --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
+    --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
+    accuracy-mmlu \
+    --tests management \
+    --max-evals 2
 ```
+
+#### Linux Example
+```bash
+# Load and benchmark a model
+lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
+    load-llama-cpp \
+    --executable "./llama-cli" \
+    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
+    llama-cpp-bench \
+    --iterations 3 \
+    --warmup-iterations 1
+```
+
+## Integration with Other Tools
+
+After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface, including:
+- accuracy-mmlu
+- llm-prompt
+- accuracy-humaneval
+- and more
+
+The integration provides:
+- Platform-independent path handling (works on both Windows and Linux)
+- Proper error handling with detailed messages
+- Performance metrics collection
+- Configurable generation parameters (temperature, top_p, top_k)
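
The commit does not spell out the ModelAdapter contract itself; as a rough sketch of the idea (the class name, constructor, and `generate` signature below are hypothetical illustrations, not lemonade's actual API), the adapter boils down to wrapping the llama-cli subprocess behind a single text-in/text-out call that tools such as accuracy-mmlu can drive:

```python
import subprocess

class LlamaCppAdapter:
    """Hypothetical sketch of a llama.cpp model adapter.

    Class and method names are illustrative; lemonade's real
    ModelAdapter interface may differ.
    """

    def __init__(self, executable, model_binary, context_size=512, threads=1):
        self.executable = executable
        self.model_binary = model_binary
        self.context_size = context_size
        self.threads = threads

    def generate(self, prompt, max_new_tokens=512):
        # Tools like accuracy-mmlu only need prompt-in/text-out, so the
        # adapter simply shells out to llama-cli with the stored settings.
        proc = subprocess.run(
            [
                self.executable,
                "-m", self.model_binary,
                "-p", prompt,
                "-n", str(max_new_tokens),
                "-c", str(self.context_size),
                "-t", str(self.threads),
            ],
            capture_output=True, text=True, check=True,
        )
        return proc.stdout
```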

src/lemonade/cli.py

Lines changed: 2 additions & 1 deletion
@@ -14,7 +14,7 @@
 
 from lemonade.tools.huggingface_bench import HuggingfaceBench
 from lemonade.tools.ort_genai.oga_bench import OgaBench
-
+from lemonade.tools.llamacpp_bench import LlamaCppBench
 from lemonade.tools.llamacpp import LoadLlamaCpp
 
 import lemonade.cache as cache
@@ -30,6 +30,7 @@ def main():
     tools = [
         HuggingfaceLoad,
         LoadLlamaCpp,
+        LlamaCppBench,
         AccuracyMMLU,
         AccuracyHumaneval,
         AccuracyPerplexity,
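
Once `LlamaCppBench` is registered in the `tools` list, it should surface in the CLI help alongside the other tools; assuming the standard argparse wiring, a quick sanity check is:

```bash
lemonade -h   # the tool listing should now include llama-cpp-bench
```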
