|
1 | | -# Evaluate LLMs with OpenCompass |
| 1 | +# Model Evaluation Guide |
2 | 2 |
|
3 | | -The LLMs accelerated by lmdeploy can be evaluated with OpenCompass. |
| 3 | +This document describes how to evaluate a model's capabilities on academic datasets using OpenCompass and LMDeploy. The complete evaluation process consists of two main stages: the inference stage and the evaluation stage.
4 | 4 |
|
5 | | -## Setup |
| 5 | +During the inference stage, the target model is first deployed as an inference service using LMDeploy. OpenCompass then sends dataset content as requests to this service and collects the generated responses. |
6 | 6 |
|
7 | | -In this part, we are going to setup the environment for evaluation. |
| 7 | +In the evaluation stage, the OpenCompass evaluation model `opencompass/CompassVerifier-32B` is deployed as a service via LMDeploy. OpenCompass subsequently submits the inference results to this service to obtain final evaluation scores. |
8 | 8 |
|
9 | | -### Install lmdeploy |
| 9 | +If sufficient computational resources are available, please refer to the [End-to-End Evaluation](#end-to-end-evaluation) section to run the complete workflow in a single pass. Otherwise, we recommend following the [Step-by-Step Evaluation](#step-by-step-evaluation) section and executing the two stages sequentially.
10 | 10 |
|
11 | | -Please follow the [installation guide](../get_started/installation.md) to install lmdeploy. |
12 | | - |
13 | | -### Install OpenCompass |
14 | | - |
15 | | -Install OpenCompass from source. Refer to [installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) for more information. |
| 11 | +## Environment Setup |
16 | 12 |
|
17 | 13 | ```shell |
18 | | -git clone https://github.com/open-compass/opencompass.git |
19 | | -cd opencompass |
20 | | -pip install -e . |
| 14 | +pip install lmdeploy |
| 15 | +pip install "opencompass[full]" |
| 16 | + |
| 17 | +# Download the lmdeploy source code, which will be used in subsequent steps to access the eval script and its configuration
| 18 | +git clone --depth=1 https://github.com/InternLM/lmdeploy.git |
21 | 19 | ``` |
22 | 20 |
|
23 | | -At present, you can check the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#) |
24 | | -to get to know the basic usage of OpenCompass. |
| 21 | +It is recommended to install LMDeploy and OpenCompass in separate Python virtual environments to avoid potential dependency conflicts. |
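As a minimal sketch of such an isolated setup (using the standard `venv` module; conda or any other environment manager works just as well, and the paths are placeholders):

```shell
# One environment per tool keeps their dependency trees from colliding
python -m venv ~/venvs/lmdeploy
~/venvs/lmdeploy/bin/pip install lmdeploy

python -m venv ~/venvs/opencompass
~/venvs/opencompass/bin/pip install "opencompass[full]"

# Quick check that both packages were installed into their own environments
~/venvs/lmdeploy/bin/pip show lmdeploy
~/venvs/opencompass/bin/pip show opencompass
```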
25 | 22 |
|
26 | | -### Download datasets |
| 23 | +## End-to-End Evaluation |
27 | 24 |
|
28 | | -Download the core datasets |
| 25 | +1. **Deploy Target Model** |
29 | 26 |
|
30 | 27 | ```shell |
31 | | -# Run in the OpenCompass directory |
32 | | -cd opencompass |
33 | | -wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip |
34 | | -unzip OpenCompassData-core-20231110.zip |
| 28 | +lmdeploy serve api_server <model_path> --server-port 10000 <--other-options> |
35 | 29 | ``` |
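For instance, a single-GPU deployment of `internlm/internlm2_5-7b-chat` (an example model; substitute your own model path and serving options) could look like:

```shell
# Example only: one GPU; adjust --tp to shard the model across more GPUs
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 10000 --tp 1
```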
36 | 30 |
|
37 | | -## Prepare Evaluation Config |
| 31 | +2. **Deploy Evaluation Model (Judger)** |
38 | 32 |
|
39 | | -OpenCompass uses the configuration files as the OpenMMLab style. One can define a python config and start evaluating at ease. |
40 | | -OpenCompass has supported the evaluation for lmdeploy's TurboMind engine using python API. |
| 33 | +```shell |
| 34 | +lmdeploy serve api_server opencompass/CompassVerifier-32B --server-port 20000 --tp 2 |
| 35 | +``` |
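Before launching the evaluation, it may be worth confirming that both services are reachable. A minimal sketch, assuming LMDeploy's OpenAI-compatible REST interface is listening on the ports used above:

```shell
# Each request should return a JSON list of served models if the server is up
curl http://{api-server-ip}:10000/v1/models
curl http://{judger-server-ip}:20000/v1/models
```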
41 | 36 |
|
42 | | -### Dataset Config |
| 37 | +3. **Generate Evaluation Configuration and Execute** |
43 | 38 |
|
44 | | -In the home directory of OpenCompass, we are writing the config file `$OPENCOMPASS_DIR/configs/eval_lmdeploy.py`. |
45 | | -We select multiple predefined datasets and import them from OpenCompass base dataset configs as `datasets`. |
| 39 | +```shell |
| 40 | + |
| 41 | +cd {the/root/path/of/lmdeploy/repo} |
| 42 | + |
| 43 | +## Specify the dataset cache paths. OpenCompass will download the datasets automatically
| 44 | +## if they are not found in these paths
| 45 | +export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets |
| 46 | +export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache |
| 47 | +python eval/eval.py {task_name} \ |
| 48 | + --mode all \ |
| 49 | + --api-server http://{api-server-ip}:10000 \ |
| 50 | + --judger-server http://{judger-server-ip}:20000 \ |
| 51 | + -w {oc_output_dir} |
| 52 | +``` |
46 | 53 |
|
47 | | -```python |
48 | | -from mmengine.config import read_base |
| 54 | +For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`. |
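As a concrete illustration, a hypothetical run with both services on the local host might look like the following (task name, cache paths, and output directory are placeholders):

```shell
export HF_DATASETS_CACHE=/data/hf_datasets
export COMPASS_DATA_CACHE=/data/opencompass_cache
python eval/eval.py my_eval_task \
    --mode all \
    --api-server http://127.0.0.1:10000 \
    --judger-server http://127.0.0.1:20000 \
    -w ./oc_out
```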
49 | 55 |
|
| 56 | +After evaluation completion, results are saved in `{oc_output_dir}/{yyyymmdd_hhmmss}`, where `{yyyymmdd_hhmmss}` represents the task timestamp. |
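The exact contents depend on the OpenCompass version, but a finished run typically leaves a layout along these lines (the timestamp shown is a placeholder):

```shell
ls {oc_output_dir}/20250101_120000
# configs/  logs/  predictions/  results/  summary/
# summary/ contains the aggregated scores (e.g. summary_*.csv / summary_*.txt)
```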
50 | 57 |
|
51 | | -with read_base(): |
52 | | - # choose a list of datasets |
53 | | - from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets |
54 | | - from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets |
55 | | - from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets |
56 | | - from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets |
57 | | - from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets |
58 | | - from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets |
59 | | - from .datasets.race.race_gen_69ee4f import race_datasets |
60 | | - from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets |
61 | | - # and output the results in a chosen format |
62 | | - from .summarizers.medium import summarizer |
| 58 | +## Step-by-Step Evaluation |
63 | 59 |
|
64 | | -datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), []) |
65 | | -``` |
| 60 | +### Inference Stage |
66 | 61 |
|
67 | | -### Model Config |
68 | | - |
69 | | -This part shows how to setup model config for LLMs. Let's check some examples: |
70 | | - |
71 | | -`````{tabs} |
72 | | -````{tab} internlm-20b |
73 | | -
|
74 | | -```python |
75 | | -from opencompass.models.turbomind import TurboMindModel |
76 | | -
|
77 | | -internlm_20b = dict( |
78 | | - type=TurboMindModel, |
79 | | - abbr='internlm-20b-turbomind', |
80 | | - path="internlm/internlm-20b", # this path should be same as in huggingface |
81 | | - engine_config=dict(session_len=2048, |
82 | | - max_batch_size=8, |
83 | | - rope_scaling_factor=1.0), |
84 | | - gen_config=dict(top_k=1, top_p=0.8, |
85 | | - temperature=1.0, |
86 | | - max_new_tokens=100), |
87 | | - max_out_len=100, |
88 | | - max_seq_len=2048, |
89 | | - batch_size=8, |
90 | | - concurrency=8, |
91 | | - run_cfg=dict(num_gpus=1, num_procs=1), |
92 | | - ) |
93 | | -
|
94 | | -models = [internlm_20b] |
| 62 | +This stage generates model responses for the dataset. |
| 63 | + |
| 64 | +1. **Deploy Target Model** |
| 65 | + |
| 66 | +```shell |
| 67 | +lmdeploy serve api_server <model_path> --server-port 10000 <--other-options> |
95 | 68 | ``` |
96 | 69 |
|
97 | | -```` |
98 | | -
|
99 | | -````{tab} internlm-chat-20b |
100 | | -
|
101 | | -For Chat models, you have to pass `meta_template` for chat models. Different Chat models may have different `meta_template` and it's important |
102 | | -to keep it the same as in training settings. You can read [meta_template](https://opencompass.readthedocs.io/en/latest/prompt/meta_template.html) for more information. |
103 | | -
|
104 | | -
|
105 | | -```python |
106 | | -from opencompass.models.turbomind import TurboMindModel |
107 | | -
|
108 | | -internlm_meta_template = dict(round=[ |
109 | | - dict(role='HUMAN', begin='<|User|>:', end='\n'), |
110 | | - dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True), |
111 | | -], |
112 | | - eos_token_id=103028) |
113 | | -
|
114 | | -internlm_chat_20b = dict( |
115 | | - type=TurboMindModel, |
116 | | - abbr='internlm-chat-20b-turbomind', |
117 | | - path='internlm/internlm-chat-20b', |
118 | | - engine_config=dict(session_len=2048, |
119 | | - max_batch_size=8, |
120 | | - rope_scaling_factor=1.0), |
121 | | - gen_config=dict(top_k=1, |
122 | | - top_p=0.8, |
123 | | - temperature=1.0, |
124 | | - max_new_tokens=100), |
125 | | - max_out_len=100, |
126 | | - max_seq_len=2048, |
127 | | - batch_size=8, |
128 | | - concurrency=8, |
129 | | - meta_template=internlm_meta_template, |
130 | | - run_cfg=dict(num_gpus=1, num_procs=1), |
131 | | - end_str='<eoa>' |
132 | | -) |
133 | | -
|
134 | | -models = [internlm_chat_20b] |
| 70 | +2. **Generate Inference Configuration and Execute** |
135 | 71 |
|
| 72 | +```shell |
| 73 | +cd {the/root/path/of/lmdeploy/repo} |
| 74 | + |
| 75 | +## Specify the dataset cache paths. OpenCompass will download the datasets automatically
| 76 | +## if they are not found in these paths
| 77 | +export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache |
| 78 | +export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets |
| 79 | +# Run inference task |
| 80 | +python eval/eval.py {task_name} \ |
| 81 | + --mode infer \ |
| 82 | + --api-server http://{api-server-ip}:10000 \ |
| 83 | + -w {oc_output_dir} |
136 | 84 | ``` |
137 | 85 |
|
138 | | -```` |
| 86 | +For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`. |
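Before moving on to the evaluation stage, note the timestamp directory created by this run; it is required for the `-r` option later. A quick sanity check, assuming the usual OpenCompass work-directory layout:

```shell
ls {oc_output_dir}
# e.g. 20250101_120000
ls {oc_output_dir}/20250101_120000/predictions
# one subdirectory per evaluated model, holding the raw model responses
```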
139 | 87 |
|
140 | | -````` |
| 88 | +### Evaluation Stage |
141 | 89 |
|
142 | | -**Note** |
| 90 | +This stage uses the evaluation model (Judger) to assess the quality of inference results. |
143 | 91 |
|
144 | | -- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L114)
145 | | - and [EngineGenerationConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L56) |
| 92 | +1. **Deploy Evaluation Model (Judger)** |
146 | 93 |
|
147 | | -## Execute Evaluation Task |
| 94 | +```shell |
| 95 | +lmdeploy serve api_server opencompass/CompassVerifier-32B --server-port 20000 --tp 2 --session-len 65536 |
| 96 | +``` |
148 | 97 |
|
149 | | -After defining the evaluation config, we can run the following command to start evaluating models. |
150 | | -You can check [Execution Task](https://opencompass.readthedocs.io/en/latest/user_guides/experimentation.html#task-execution-and-monitoring) |
151 | | -for more arguments of `run.py`. |
| 98 | +2. **Generate Evaluation Configuration and Execute** |
152 | 99 |
|
153 | 100 | ```shell |
154 | | -# in the root directory of opencompass |
155 | | -python3 run.py configs/eval_lmdeploy.py --work-dir ./workdir |
| 101 | +cd {the/root/path/of/lmdeploy/repo} |
| 102 | + |
| 103 | +## Specify the dataset cache paths. OpenCompass will download the datasets automatically
| 104 | +## if they are not found in these paths
| 105 | +export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache |
| 106 | +export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets |
| 107 | +# Run evaluation task |
| 108 | +opencompass /path/to/judger_config.py -m eval -w {oc_output_dir} -r {yyyymmdd_hhmmss} |
156 | 109 | ``` |
| 110 | + |
| 111 | +Important Notes: |
| 112 | + |
| 113 | +- `task_name` must be identical to the one used in the inference stage |
| 114 | +- The `oc_output_dir` specified with `-w` must match the directory used in the inference stage |
| 115 | +- The `-r` parameter indicates "previous outputs & results" and should specify the timestamp directory generated during the inference stage (the subdirectory under `{oc_output_dir}`) |
| 116 | + |
| 117 | +For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`. |
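To illustrate the notes above with hypothetical values: if the inference stage was run with `-w ./oc_out` and produced `./oc_out/20250101_120000`, the evaluation stage reuses both of those values:

```shell
opencompass /path/to/judger_config.py -m eval -w ./oc_out -r 20250101_120000
```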