# LLAMA.CPP

Run transformer models using llama.cpp. This integration allows you to:
1. Load and run llama.cpp models
2. Benchmark model performance
3. Use the models with other tools, such as chat or MMLU accuracy testing

## Prerequisites

These instructions assume that lemonade is already installed. In addition, you need:
1. A compiled llama.cpp executable (`llama-cli` or `llama-cli.exe`)
2. A GGUF model file

### Building llama.cpp (if needed)

#### Linux
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

#### Windows
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the repository root on Linux.
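
If you do not already have a GGUF file, one option is to download a checkpoint directly from Hugging Face. The snippet below is only an example; it fetches the Qwen2.5-0.5B-Instruct GGUF file used in the example commands later on this page, and you can substitute any other GGUF repository and filename:

```bash
# Example only: download a GGUF checkpoint into llama.cpp's models directory
cd llama.cpp/models
wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-fp16.gguf
```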

## Usage

### Loading a Model

Use the `load-llama-cpp` tool to load a model:

```bash
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE
```

Parameters:

| Parameter     | Required | Default | Description                           |
|---------------|----------|---------|---------------------------------------|
| executable    | Yes      | -       | Path to `llama-cli`/`llama-cli.exe`   |
| model-binary  | Yes      | -       | Path to the `.gguf` model file        |
| threads       | No       | 1       | Number of threads for generation      |
| context-size  | No       | 512     | Context window size                   |
| output-tokens | No       | 512     | Maximum number of tokens to generate  |
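
The optional parameters are passed as flags of the same name. For example, a load command that raises the thread count and context window might look like the following sketch (the paths are placeholders):

```bash
# Sketch: load-llama-cpp with the optional parameters from the table above
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 1024 \
    --output-tokens 256
```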

### Benchmarking

After loading a model, you can benchmark it using `llama-cpp-bench`:

```bash
lemonade -i MODEL_NAME \
    load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench
```

Benchmark parameters:

| Parameter         | Default                      | Description                               |
|-------------------|------------------------------|-------------------------------------------|
| prompt            | "Hello, I am conscious and"  | Input prompt for benchmarking             |
| context-size      | 512                          | Context window size                       |
| output-tokens     | 512                          | Number of tokens to generate              |
| iterations        | 1                            | Number of benchmark iterations            |
| warmup-iterations | 0                            | Number of warmup iterations (not counted) |

The benchmark will measure and report:
- Time to first token (prompt evaluation time)
- Token generation speed (tokens per second)
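
For example, to time generation on a custom prompt over several iterations, the parameters above can be passed as flags. The `--prompt` spelling is assumed from the parameter name, while `--iterations` and `--warmup-iterations` also appear in the example commands below:

```bash
# Sketch: benchmark with a custom prompt and multiple timed iterations
lemonade -i MODEL_NAME \
    load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench \
    --prompt "Write one sentence about the ocean." \
    --iterations 5 \
    --warmup-iterations 1
```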

### Example Commands

#### Windows Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
    --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
    llama-cpp-bench \
    --iterations 3 \
    --warmup-iterations 1

# Run an MMLU accuracy test
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
    --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
    accuracy-mmlu \
    --tests management \
    --max-evals 2
```

#### Linux Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "./llama-cli" \
    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llama-cpp-bench \
    --iterations 3 \
    --warmup-iterations 1
```

## Integration with Other Tools

After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface (see the sketch after this list), including:
- accuracy-mmlu
- llm-prompt
- accuracy-humaneval
- and more
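
For instance, a loaded model can be handed straight to `llm-prompt` in the same invocation. This is only a sketch: the `--prompt` argument shown for `llm-prompt` is an assumption rather than something documented on this page, so check lemonade's help output for the exact flags.

```bash
# Sketch: chain load-llama-cpp into llm-prompt (the --prompt flag is assumed)
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "./llama-cli" \
    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llm-prompt \
    --prompt "What is the capital of France?"
```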

The integration provides:
- Platform-independent path handling (works on both Windows and Linux)
- Proper error handling with detailed messages
- Performance metrics collection
- Configurable generation parameters (temperature, top_p, top_k)