
Commit 96e998a

Update model evaluation guide (#4094)
* commit gen_config.py
* update
* update
* update
* update
* update mmlu_pro
* update
* simplify
* update doc
* update config
* set session_len 64k when serving the judger model
* missing a doc
1 parent cf75374 commit 96e998a

6 files changed: +569 -236 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -26,6 +26,7 @@ wheels/
 .installed.cfg
 *.egg
 MANIFEST
+tmp/
 
 # PyInstaller
 # Usually these files are written by a python script from a template
```

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ repos:
   rev: 5.0.4
   hooks:
   - id: flake8
-    args: ["--max-line-length=120"]
+    args: ['--extend-ignore=E231', "--max-line-length=120"]
 - repo: https://github.com/PyCQA/isort
   rev: 5.11.5
   hooks:
```
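The amended flake8 hook can be exercised locally with pre-commit's standard CLI; a minimal sketch, assuming `pre-commit` is installed in the development environment:

```shell
# Install the git hooks defined in .pre-commit-config.yaml
pip install pre-commit
pre-commit install

# Run only the flake8 hook (now carrying --extend-ignore=E231) against the whole repository
pre-commit run flake8 --all-files
```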
Lines changed: 81 additions & 120 deletions

````diff
@@ -1,156 +1,117 @@
-# Evaluate LLMs with OpenCompass
+# Model Evaluation Guide
 
-The LLMs accelerated by lmdeploy can be evaluated with OpenCompass.
+This document describes how to evaluate a model's capabilities on academic datasets using OpenCompass and LMDeploy. The complete evaluation process consists of two main stages: inference stage and evaluation stage.
 
-## Setup
+During the inference stage, the target model is first deployed as an inference service using LMDeploy. OpenCompass then sends dataset content as requests to this service and collects the generated responses.
 
-In this part, we are going to setup the environment for evaluation.
+In the evaluation stage, the OpenCompass evaluation model `opencompass/CompassVerifier-32B` is deployed as a service via LMDeploy. OpenCompass subsequently submits the inference results to this service to obtain final evaluation scores.
 
-### Install lmdeploy
+If sufficient computational resources are available, please refer to the [End-to-End Evaluation](#end-to-end-evaluation) section for complete workflow execution. Otherwise, we recommend following the [Step-by-Step Evaluation](#step-by-step-evaluation) section to execute both stages sequentially.
 
-Please follow the [installation guide](../get_started/installation.md) to install lmdeploy.
-
-### Install OpenCompass
-
-Install OpenCompass from source. Refer to [installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) for more information.
+## Environment Setup
 
 ```shell
-git clone https://github.com/open-compass/opencompass.git
-cd opencompass
-pip install -e .
+pip install lmdeploy
+pip install "opencompass[full]"
+
+# Download the lmdeploy source code, which will be used in subsequent steps to access eval script and configuration
+git clone --depth=1 https://github.com/InternLM/lmdeploy.git
 ```
 
-At present, you can check the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#)
-to get to know the basic usage of OpenCompass.
+It is recommended to install LMDeploy and OpenCompass in separate Python virtual environments to avoid potential dependency conflicts.
 
-### Download datasets
+## End-to-End Evaluation
 
-Download the core datasets
+1. **Deploy Target Model**
 
 ```shell
-# Run in the OpenCompass directory
-cd opencompass
-wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip
-unzip OpenCompassData-core-20231110.zip
+lmdeploy serve api_server <model_path> --server-port 10000 <--other-options>
 ```
 
-## Prepare Evaluation Config
+2. **Deploy Evaluation Model (Judger)**
 
-OpenCompass uses the configuration files as the OpenMMLab style. One can define a python config and start evaluating at ease.
-OpenCompass has supported the evaluation for lmdeploy's TurboMind engine using python API.
+```shell
+lmdeploy serve api_server opencompass/CompassVerifier-32B --server-port 20000 --tp 2
+```
 
-### Dataset Config
+3. **Generate Evaluation Configuration and Execute**
 
-In the home directory of OpenCompass, we are writing the config file `$OPENCOMPASS_DIR/configs/eval_lmdeploy.py`.
-We select multiple predefined datasets and import them from OpenCompass base dataset configs as `datasets`.
+```shell
+
+cd {the/root/path/of/lmdeploy/repo}
+
+## Specify the dataset path. OC will download the datasets automatically if they are
+## not found in the path
+export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets
+export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache
+python eval/eval.py {task_name} \
+    --mode all \
+    --api-server http://{api-server-ip}:10000 \
+    --judger-server http://{judger-server-ip}:20000 \
+    -w {oc_output_dir}
+```
 
-```python
-from mmengine.config import read_base
+For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`.
 
+After evaluation completion, results are saved in `{oc_output_dir}/{yyyymmdd_hhmmss}`, where `{yyyymmdd_hhmmss}` represents the task timestamp.
 
-with read_base():
-    # choose a list of datasets
-    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
-    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
-    from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
-    from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
-    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
-    from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
-    from .datasets.race.race_gen_69ee4f import race_datasets
-    from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
-    # and output the results in a chosen format
-    from .summarizers.medium import summarizer
+## Step-by-Step Evaluation
 
-datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
-```
+### Inference Stage
 
-### Model Config
-
-This part shows how to setup model config for LLMs. Let's check some examples:
-
-`````{tabs}
-````{tab} internlm-20b
-
-```python
-from opencompass.models.turbomind import TurboMindModel
-
-internlm_20b = dict(
-    type=TurboMindModel,
-    abbr='internlm-20b-turbomind',
-    path="internlm/internlm-20b", # this path should be same as in huggingface
-    engine_config=dict(session_len=2048,
-                       max_batch_size=8,
-                       rope_scaling_factor=1.0),
-    gen_config=dict(top_k=1, top_p=0.8,
-                    temperature=1.0,
-                    max_new_tokens=100),
-    max_out_len=100,
-    max_seq_len=2048,
-    batch_size=8,
-    concurrency=8,
-    run_cfg=dict(num_gpus=1, num_procs=1),
-)
-
-models = [internlm_20b]
+This stage generates model responses for the dataset.
+
+1. **Deploy Target Model**
+
+```shell
+lmdeploy serve api_server <model_path> --server-port 10000 <--other-options>
 ```
 
-````
-
-````{tab} internlm-chat-20b
-
-For Chat models, you have to pass `meta_template` for chat models. Different Chat models may have different `meta_template` and it's important
-to keep it the same as in training settings. You can read [meta_template](https://opencompass.readthedocs.io/en/latest/prompt/meta_template.html) for more information.
-
-
-```python
-from opencompass.models.turbomind import TurboMindModel
-
-internlm_meta_template = dict(round=[
-    dict(role='HUMAN', begin='<|User|>:', end='\n'),
-    dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
-],
-    eos_token_id=103028)
-
-internlm_chat_20b = dict(
-    type=TurboMindModel,
-    abbr='internlm-chat-20b-turbomind',
-    path='internlm/internlm-chat-20b',
-    engine_config=dict(session_len=2048,
-                       max_batch_size=8,
-                       rope_scaling_factor=1.0),
-    gen_config=dict(top_k=1,
-                    top_p=0.8,
-                    temperature=1.0,
-                    max_new_tokens=100),
-    max_out_len=100,
-    max_seq_len=2048,
-    batch_size=8,
-    concurrency=8,
-    meta_template=internlm_meta_template,
-    run_cfg=dict(num_gpus=1, num_procs=1),
-    end_str='<eoa>'
-)
-
-models = [internlm_chat_20b]
+2. **Generate Inference Configuration and Execute**
 
+```shell
+cd {the/root/path/of/lmdeploy/repo}
+
+## Specify the dataset path. OC will download the datasets automatically if they are
+## not found in the path
+export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache
+export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets
+# Run inference task
+python eval/eval.py {task_name} \
+    --mode infer \
+    --api-server http://{api-server-ip}:10000 \
+    -w {oc_output_dir}
 ```
 
-````
+For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`.
 
-`````
+### Evaluation Stage
 
-**Note**
+This stage uses the evaluation model (Judger) to assess the quality of inference results.
 
-- If you want to pass more arguments for `engine_config``gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L114)
-  and [EngineGenerationConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L56)
+1. **Deploy Evaluation Model (Judger)**
 
-## Execute Evaluation Task
+```shell
+lmdeploy serve api_server opencompass/CompassVerifier-32B --server-port 20000 --tp 2 --session-len 65536
+```
 
-After defining the evaluation config, we can run the following command to start evaluating models.
-You can check [Execution Task](https://opencompass.readthedocs.io/en/latest/user_guides/experimentation.html#task-execution-and-monitoring)
-for more arguments of `run.py`.
+2. **Generate Evaluation Configuration and Execute**
 
 ```shell
-# in the root directory of opencompass
-python3 run.py configs/eval_lmdeploy.py --work-dir ./workdir
+cd {the/root/path/of/lmdeploy/repo}
+
+## Specify the dataset path. OC will download the datasets automatically if they are
+## not found in the path
+export COMPASS_DATA_CACHE=/nvme1/shared/opencompass/.cache
+export HF_DATASETS_CACHE=/nvme4/huggingface_hub/datasets
+# Run evaluation task
+opencompass /path/to/judger_config.py -m eval -w {oc_output_dir} -r {yyyymmdd_hhmmss}
 ```
+
+Important Notes:
+
+- `task_name` must be identical to the one used in the inference stage
+- The `oc_output_dir` specified with `-w` must match the directory used in the inference stage
+- The `-r` parameter indicates "previous outputs & results" and should specify the timestamp directory generated during the inference stage (the subdirectory under `{oc_output_dir}`)
+
+For detailed usage instructions about `eval.py`, such as specifying evaluation datasets, please run `python eval/eval.py --help`.
````
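The updated guide recommends keeping LMDeploy and OpenCompass in separate Python virtual environments. A minimal sketch of such a setup, assuming `conda` is available; the environment names are illustrative:

```shell
# Serving environment: used to launch `lmdeploy serve api_server`
conda create -n lmdeploy-serve python=3.10 -y
conda activate lmdeploy-serve
pip install lmdeploy

# Evaluation environment: used to run OpenCompass and the eval/eval.py script
conda create -n oc-eval python=3.10 -y
conda activate oc-eval
pip install "opencompass[full]"
git clone --depth=1 https://github.com/InternLM/lmdeploy.git
```

Activate `lmdeploy-serve` for the two `api_server` commands and `oc-eval` for the `eval/eval.py` and `opencompass` commands.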

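Before launching the inference or evaluation tasks, it can help to confirm that both services are reachable. A hedged check, assuming the OpenAI-compatible endpoints that `lmdeploy serve api_server` exposes, with the hosts and ports from the commands above:

```shell
# Target model service (inference stage)
curl http://{api-server-ip}:10000/v1/models

# Judger service (evaluation stage); should list opencompass/CompassVerifier-32B
curl http://{judger-server-ip}:20000/v1/models
```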
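For the end-to-end mode, the commands from the diff can also be wired into one driver script. This is only a sketch built from the commands shown above: it assumes both servers run on the local host, ignores the separate-environment recommendation for brevity, requires the `{...}` placeholders and cache paths to be replaced, and reuses the `/v1/models` readiness check assumed earlier:

```shell
#!/bin/bash
set -e

# Launch the target model and the judger model in the background
lmdeploy serve api_server {model_path} --server-port 10000 > target.log 2>&1 &
lmdeploy serve api_server opencompass/CompassVerifier-32B \
    --server-port 20000 --tp 2 --session-len 65536 > judger.log 2>&1 &

# Wait until both services answer
for port in 10000 20000; do
    until curl -sf "http://127.0.0.1:${port}/v1/models" > /dev/null; do
        sleep 10
    done
done

# Run inference and evaluation in one pass
cd {the/root/path/of/lmdeploy/repo}
export COMPASS_DATA_CACHE={path/to/opencompass/data/cache}
export HF_DATASETS_CACHE={path/to/huggingface/datasets/cache}
python eval/eval.py {task_name} \
    --mode all \
    --api-server http://127.0.0.1:10000 \
    --judger-server http://127.0.0.1:20000 \
    -w {oc_output_dir}
```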