Commit ff12b67

Add NPU OGA support for RyzenAI 1.3 EA (#232)
* Add NPU OGA support for RyzenAI 1.3 EA
* retrigger checks
1 parent 3f97d74 commit ff12b67

3 files changed: +181 -33 lines changed

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# Introduction

onnxruntime-genai (aka OGA) is a new framework created by Microsoft for running ONNX LLMs: https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file
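
For orientation, the snippet below is a minimal sketch of what loading and prompting an ONNX LLM directly through OGA looks like. It assumes the 0.5.x-era `onnxruntime_genai` Python API (the same `og.Model` / `set_search_options` calls used by `oga.py` in this commit) and a placeholder model folder; lemonade wraps these calls for you, so this is only for context.

```
# Minimal OGA sketch (assumes the onnxruntime-genai 0.5.x API; "my_model_folder"
# is a placeholder, not part of this repo).
import onnxruntime_genai as og

model = og.Model("my_model_folder")  # folder containing genai_config.json + weights
tokenizer = og.Tokenizer(model)

prompt = "hello whats your name?"
params = og.GeneratorParams(model)
params.set_search_options(max_length=64, do_sample=False)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```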

## NPU instructions

### Warnings

- Users have experienced inconsistent results across models and machines. If one model isn't working well on your laptop, try one of the other models.
- The OGA wheels need to be installed in a specific order or you will end up with the wrong packages in your environment. If you see pip dependency errors, please delete your conda env and start over with a fresh environment.

### Installation

1. NOTE: ⚠️ DO THESE STEPS IN EXACTLY THIS ORDER ⚠️
1. Install `lemonade`:
    1. Create a conda environment: `conda create -n oga-npu python=3.10` (Python 3.10 is required)
    1. Activate: `conda activate oga-npu`
    1. `cd REPO_ROOT`
    1. `pip install -e .[oga-npu]`
1. Download the required OGA packages:
    1. Access the [AMD RyzenAI EA Lounge](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `amd_oga_Oct4_2024.zip` from `Ryzen AI 1.3 Preview Release`.
    1. Unzip `amd_oga_Oct4_2024.zip`
1. Set up your folder structure:
    1. Copy all of the content inside `amd_oga` to lemonade's `REPO_ROOT\src\lemonade\tools\ort_genai\models\`
    1. Move all DLLs from `REPO_ROOT\src\lemonade\tools\ort_genai\models\libs` to `REPO_ROOT\src\lemonade\tools\ort_genai\models\`
1. Install the wheels:
    1. `cd amd_oga\wheels`
    1. `pip install onnxruntime_genai-0.5.0.dev0-cp310-cp310-win_amd64.whl`
    1. `pip install onnxruntime_vitisai-1.20.0-cp310-cp310-win_amd64.whl`
    1. `pip install voe-1.2.0-cp310-cp310-win_amd64.whl`
1. Ensure you have access to the models on Hugging Face:
    1. Ensure you can access the models under [quark-quantized-onnx-llms-for-ryzen-ai-13-ea](https://huggingface.co/collections/amd/quark-quantized-onnx-llms-for-ryzen-ai-13-ea-66fc8e24927ec45504381902) on Hugging Face. The models are gated, so you may have to request access.
    1. Create a Hugging Face access token [here](https://huggingface.co/settings/tokens). If you create a fine-grained token, make sure to select `Read access to contents of all public gated repos you can access`.
    1. Set your Hugging Face token as an environment variable: `set HF_TOKEN=<your token>` (a sketch of how this token is used to download the gated models follows this list).
1. Install the driver:
    1. Access the [AMD RyzenAI EA Lounge](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `Win24AIDriver.zip` from `Ryzen AI 1.3 Preview Release`.
    1. Unzip `Win24AIDriver.zip`
    1. Right click `kipudrv.inf` and select `Install`
    1. Check under `Device Manager` to ensure that `NPU Compute Accelerator` is using version `32.0.203.219`.
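
Before running lemonade, you can sanity-check the token and your gated-model access. The sketch below mirrors the `login` + `snapshot_download` calls that the OGA loader in this commit performs; the local folder name is only an example.

```
import os

from huggingface_hub import login, snapshot_download

# Log in with the token from the environment; this fails fast if the token is
# missing or does not have access to the gated collection.
login(os.getenv("HF_TOKEN"))

# Download one of the gated NPU checkpoints (the local folder name is just an example).
snapshot_download(
    repo_id="amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix",
    local_dir="llama_2_7b_hf_awq_g128",
    ignore_patterns=["*.md", "*.txt"],
)
```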

### Runtime

To test basic functionality, point lemonade to any of the models under [quark-quantized-onnx-llms-for-ryzen-ai-13-ea](https://huggingface.co/collections/amd/quark-quantized-onnx-llms-for-ryzen-ai-13-ea-66fc8e24927ec45504381902):

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 llm-prompt -p "hello whats your name?" --max-new-tokens 15
```

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
✓ Loading OnnxRuntime-GenAI model
✓ Prompting LLM

amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix:
<built-in function input> (executed 1x)
Build dir: C:\Users\danie/.cache/lemonade\amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix
Status: Successful build!
Dtype: int4
Device: npu
Response: hello whats your name?
Hi, I'm a 21 year old male from the
```

To test/use the websocket server:

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 serve --max-new-tokens 50
```

Then open the address (http://localhost:8000) in a browser and chat with it. A typical server log is shown below, followed by a minimal client sketch.

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
✓ Loading OnnxRuntime-GenAI model


INFO: Started server process [27752]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO: ::1:54973 - "GET / HTTP/1.1" 200 OK
INFO: ('::1', 54975) - "WebSocket /ws" [accepted]
INFO: connection open
I'm a newbie here. I'm looking for a good place to buy a domain name. I've been looking around and i've found a few good places.
```
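
If you would rather script against the server than use the browser page, the sketch below is a hypothetical client: it assumes the `/ws` endpoint shown in the log accepts a plain-text prompt and streams plain-text chunks back (the exact message format is not documented here), and it uses the third-party `websockets` package.

```
# Hypothetical client for the websocket server above. Assumes /ws accepts a
# plain-text prompt and streams plain-text chunks back; adjust if the actual
# message format differs.
import asyncio
import websockets

async def chat(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(prompt)
        try:
            # Print streamed chunks until the server goes quiet.
            while True:
                chunk = await asyncio.wait_for(ws.recv(), timeout=10)
                print(chunk, end="", flush=True)
        except asyncio.TimeoutError:
            print()

asyncio.run(chat("hello whats your name?"))
```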

To run a single MMLU test:

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 accuracy-mmlu --tests management
```

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
✓ Loading OnnxRuntime-GenAI model
✓ Measuring accuracy with MMLU

amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix:
<built-in function input> (executed 1x)
Build dir: C:\Users\danie/.cache/lemonade\amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix
Status: Successful build!
Dtype: int4
Device: npu
Mmlu Management Accuracy: 56.31 %
```

src/turnkeyml/llm/leap.py

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ def from_pretrained(
 
         return state.model, state.tokenizer
 
-    if recipe == "hf-dgpu":
+    elif recipe == "hf-dgpu":
         # Huggingface Transformers recipe for discrete GPU (Nvidia, Instinct, Radeon)
 
         import torch
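
For context, `from_pretrained` dispatches on the `recipe` string and returns a model/tokenizer pair. A minimal usage sketch, assuming the `hf-dgpu` recipe touched by this hunk and an example checkpoint:

```
# Usage sketch for the recipe dispatch shown above (hypothetical values;
# "hf-dgpu" targets a discrete GPU per the comment in this hunk).
from turnkeyml.llm import leap

model, tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # example checkpoint
    recipe="hf-dgpu",
)
# `model` and `tokenizer` can then be used like their Hugging Face counterparts.
```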

src/turnkeyml/llm/tools/ort_genai/oga.py

Lines changed: 69 additions & 32 deletions
@@ -6,7 +6,9 @@
 import os
 import time
 import json
+from fnmatch import fnmatch
 from queue import Queue
+from huggingface_hub import snapshot_download, login
 import onnxruntime_genai as og
 from turnkeyml.state import State
 from turnkeyml.tools import FirstTool
@@ -18,7 +20,6 @@
 )
 from turnkeyml.llm.cache import Keys
 
-
 class OrtGenaiTokenizer(TokenizerAdapter):
     def __init__(self, model: og.Model):
         # Initialize the tokenizer and produce the initial tokens.
@@ -74,9 +75,9 @@ def __init__(self, input_folder):
         self.config = self.load_config(input_folder)
 
     def load_config(self, input_folder):
-        config_path = os.path.join(input_folder, 'genai_config.json')
+        config_path = os.path.join(input_folder, "genai_config.json")
         if os.path.exists(config_path):
-            with open(config_path, 'r', encoding='utf-8') as f:
+            with open(config_path, "r", encoding="utf-8") as f:
                 return json.load(f)
         return None
 
@@ -99,21 +100,23 @@ def generate(
         max_length = len(input_ids) + max_new_tokens
 
         params.input_ids = input_ids
-        if self.config and 'search' in self.config:
-            search_config = self.config['search']
+        if self.config and "search" in self.config:
+            search_config = self.config["search"]
             params.set_search_options(
-                do_sample=search_config.get('do_sample', do_sample),
-                top_k=search_config.get('top_k', top_k),
-                top_p=search_config.get('top_p', top_p),
-                temperature=search_config.get('temperature', temperature),
+                do_sample=search_config.get("do_sample", do_sample),
+                top_k=search_config.get("top_k", top_k),
+                top_p=search_config.get("top_p", top_p),
+                temperature=search_config.get("temperature", temperature),
                 max_length=max_length,
                 min_length=0,
-                early_stopping=search_config.get('early_stopping', False),
-                length_penalty=search_config.get('length_penalty', 1.0),
-                num_beams=search_config.get('num_beams', 1),
-                num_return_sequences=search_config.get('num_return_sequences', 1),
-                repetition_penalty=search_config.get('repetition_penalty', 1.0),
-                past_present_share_buffer=search_config.get('past_present_share_buffer', True),
+                early_stopping=search_config.get("early_stopping", False),
+                length_penalty=search_config.get("length_penalty", 1.0),
+                num_beams=search_config.get("num_beams", 1),
+                num_return_sequences=search_config.get("num_return_sequences", 1),
+                repetition_penalty=search_config.get("repetition_penalty", 1.0),
+                past_present_share_buffer=search_config.get(
+                    "past_present_share_buffer", True
+                ),
                 # Not currently supported by OGA
                 # diversity_penalty=search_config.get('diversity_penalty', 0.0),
                 # no_repeat_ngram_size=search_config.get('no_repeat_ngram_size', 0),
@@ -192,6 +195,7 @@ class OgaLoad(FirstTool):
         llama_2 = "meta-llama/Llama-2-7b-chat-hf"
         phi_3_mini_4k = "microsoft/Phi-3-mini-4k-instruct"
         phi_3_mini_128k = "microsoft/Phi-3-mini-128k-instruct"
+        And models on Hugging Face that follow the "amd/**-onnx-ryzen-strix" pattern
 
     Output:
         state.model: handle to a Huggingface-style LLM loaded on DirectML device
@@ -244,7 +248,7 @@ def run(
         checkpoint = input
 
         # Map of models[device][dtype][checkpoint] to the name of the model folder on disk
-        supported_models = {
+        local_supported_models = {
             "igpu": {
                 "int4": {
                     phi_3_mini_128k: os.path.join(
@@ -261,6 +265,7 @@
             },
             "npu": {
                 "int4": {
+                    # Legacy RyzenAI 1.2 models for NPU
                     llama_2: "llama2-7b-int4",
                     llama_3: "llama3-8b-int4",
                     qwen_1dot5: "qwen1.5-7b-int4",
@@ -277,28 +282,60 @@
             },
         }
 
+        hf_supported_models = {"npu": {"int4": "amd/**-onnx-ryzen-strix"}}
+
+        supported_locally = True
         try:
-            dir_name = supported_models[device][dtype][checkpoint]
+            dir_name = local_supported_models[device][dtype][checkpoint]
         except KeyError as e:
-            raise ValueError(
-                "The device;dtype;checkpoint combination is not supported: "
-                f"{device};{dtype};{checkpoint}. The supported combinations "
-                f"are: {supported_models}"
-            ) from e
-
-        model_dir = os.path.join(
-            os.path.dirname(os.path.realpath(__file__)),
-            "models",
-            dir_name,
-        )
+            supported_locally = False
+            hf_supported = (
+                device in hf_supported_models
+                and dtype in hf_supported_models[device]
+                and fnmatch(checkpoint, hf_supported_models[device][dtype])
+            )
+            if not hf_supported:
+                raise ValueError(
+                    "The device;dtype;checkpoint combination is not supported: "
+                    f"{device};{dtype};{checkpoint}. The supported combinations "
+                    f"are: {local_supported_models} for local models and {hf_supported_models}"
+                    " for models on Hugging Face."
+                ) from e
+
+        # Create models dir if it doesn't exist
+        models_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "models")
+        if not os.path.exists(models_dir):
+            os.makedirs(models_dir)
+
+        # If the model is supported though Hugging Face, download it
+        if not supported_locally:
+            hf_model_name = checkpoint.split("amd/")[1]
+            dir_name = "_".join(hf_model_name.split("-")[:6]).lower()
+            api_key = os.getenv("HF_TOKEN")
+            login(api_key)
+            snapshot_download(
+                repo_id=checkpoint,
+                local_dir=os.path.join(models_dir, dir_name),
+                ignore_patterns=["*.md", "*.txt"],
+            )
 
-        # The NPU requires the CWD to be in the model folder
         current_cwd = os.getcwd()
         if device == "npu":
-            os.chdir(model_dir)
-            # Required environment variable for NPU
-            os.environ["DOD_ROOT"] = ".\\bins"
+            # Change to the models directory
+            os.chdir(models_dir)
+
+            # Common environment variables for all NPU models
+            os.environ["DD_ROOT"] = ".\\bins"
+            os.environ["DEVICE"] = "stx"
+            os.environ["XLNX_ENABLE_CACHE"] = "0"
+
+            # Phi models require USE_AIE_RoPE=0
+            if "phi-" in checkpoint.lower():
+                os.environ["USE_AIE_RoPE"] = "0"
+            else:
+                os.environ["USE_AIE_RoPE"] = "1"
 
+        model_dir = os.path.join(models_dir, dir_name)
         state.model = OrtGenaiModel(model_dir)
         state.tokenizer = OrtGenaiTokenizer(state.model.model)
         state.dtype = dtype
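
The behavioral change in this file is the Hugging Face fallback: a checkpoint that is not in the local table is still accepted when the device/dtype pair has a registered pattern that the checkpoint matches via `fnmatch`, and it is then downloaded on demand. A small standalone sketch of that check, with the pattern and checkpoint taken from the diff above:

```
from fnmatch import fnmatch

# Values copied from the diff above.
hf_supported_models = {"npu": {"int4": "amd/**-onnx-ryzen-strix"}}
device, dtype = "npu", "int4"
checkpoint = "amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"

hf_supported = (
    device in hf_supported_models
    and dtype in hf_supported_models[device]
    and fnmatch(checkpoint, hf_supported_models[device][dtype])
)
print(hf_supported)  # True

# The local folder name is derived from the checkpoint the same way oga.py does it:
hf_model_name = checkpoint.split("amd/")[1]
print("_".join(hf_model_name.split("-")[:6]).lower())  # llama_2_7b_hf_awq_g128
```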