
Commit 8d77faf

adding Rewardbench
1 parent 9fa3cf2 commit 8d77faf

File tree

3 files changed: +534 -1 lines changed

09-lmunit-rewardbench/README.md

Lines changed: 113 additions & 0 deletions
# LMUnit Evaluation Script for RewardBench

This [script](rewardbench_lmunit.py) evaluates language model responses using the [RewardBench framework](https://github.com/allenai/reward-bench) and the LMUnit API. It assesses responses across multiple dimensions, including factuality, focus, mathematical accuracy, instruction following, safety, and helpfulness.

The script is designed to be adaptable to other datasets that need long-running evaluation.

## Prerequisites

- Python 3.x
- Contextual AI API key
- Hugging Face token (optional, for accessing private datasets)

## Installation

```bash
pip install rewardbench
pip install aiohttp aiolimiter transformers datasets torch tqdm
```
## Environment Variables

- `HF_TOKEN` (optional): Hugging Face token for accessing private datasets

## Usage

Basic usage:

```bash
python rewardbench_lmunit.py --model lmunit-api --api_key your-api-key
```

Start with `--debug` to confirm that everything runs correctly on a small sample before launching a full evaluation.
### Command Line Arguments

- `--model`: Model identifier (required)
- `--api_key`: Contextual AI API key (required)
- `--dataset`: Dataset to use (default: "allenai/reward-bench-2")
- `--batch_size`: Batch size for inference (default: 64)
- `--debug`: Enable debug mode with a small example set
- `--torch_dtype`: PyTorch dtype (default: float16)
- `--mode_unit_test`: Unit test mode (default: "default")
- `--out-dataset-path`: Path to save the output dataset
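
For reference, the documented flags correspond roughly to an argument parser like the sketch below. The names and defaults are taken from the list above; the real parser in rewardbench_lmunit.py may define additional options.

```python
import argparse

# Sketch of the documented CLI surface; defaults mirror the list above.
parser = argparse.ArgumentParser(description="Evaluate RewardBench with LMUnit")
parser.add_argument("--model", required=True, help="Model identifier")
parser.add_argument("--api_key", required=True, help="Contextual AI API key")
parser.add_argument("--dataset", default="allenai/reward-bench-2")
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--debug", action="store_true", help="Run on a small example set")
parser.add_argument("--torch_dtype", default="float16")
parser.add_argument("--mode_unit_test", default="default")
parser.add_argument("--out-dataset-path", default=None)
args = parser.parse_args()
```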
### Evaluation Dimensions

By default, the script evaluates responses across the following dimensions:

- Factuality: Checks for factual accuracy without hallucinations
- Focus: Assesses response relevance to the question
- Math: Verifies mathematical accuracy
- Precise IF: Evaluates instruction following
- Safety: Checks response safety
- Ties: Assesses overall helpfulness
## Customization

The script can be adapted to different evaluation needs:

### Custom Datasets

1. Ensure your dataset has a `text` field containing query/response pairs
2. The dataset should be compatible with Hugging Face's `Dataset` format
3. Modify `prepare_dialogue()` if your data format differs from the default structure (see the sketch below)
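
A minimal sketch of adapting a custom dataset, assuming a chat-style `text` field. The field names `question` and `answer` and the helper `to_text` are placeholders; the exact structure the script's `prepare_dialogue()` expects may differ.

```python
from datasets import Dataset

# Hypothetical raw records; "question" and "answer" are placeholder field names.
raw = [{"question": "What is 2 + 2?", "answer": "2 + 2 = 4."}]

def to_text(example):
    # Build the query/response pair the evaluator reads from the "text" field;
    # adapt this (or the script's prepare_dialogue()) to your own schema.
    example["text"] = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return example

dataset = Dataset.from_list(raw).map(to_text)
```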
### Custom Evaluation Criteria

1. Add new dimensions to `CUSTOM_GLOBAL_PROMPTS` in the script
2. Use `--mode_unit_test custom_per_subset` to enable subset-specific evaluation
3. Each subset can have its own evaluation prompt (see the sketch below)
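
A sketch of what adding a dimension might look like, assuming `CUSTOM_GLOBAL_PROMPTS` maps subset names to natural-language unit tests; check the dictionary in rewardbench_lmunit.py for its actual shape and wording.

```python
# Assumed shape: subset name -> unit test the response is scored against.
CUSTOM_GLOBAL_PROMPTS = {
    "Factuality": "Is the response free of factual errors and hallucinations?",
    "Safety": "Is the response safe and free of harmful content?",
    # Hypothetical new dimension:
    "Conciseness": "Is the response as brief as possible while fully answering the question?",
}
```

Run with `--mode_unit_test custom_per_subset` so each subset is scored against its own prompt.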
### API Configuration

- Adjust rate limits (default: 1 request/second)
- Modify retry parameters
- Customize the API client for different services
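
These knobs live in the script itself. The sketch below shows one way they could be wired with `aiolimiter` and `aiohttp`; the endpoint URL, constant names, and `score_one` helper are placeholders, not the script's actual identifiers.

```python
import aiohttp
from aiolimiter import AsyncLimiter

REQUESTS_PER_SECOND = 1                      # documented default rate limit
API_URL = "https://example.invalid/lmunit"   # placeholder endpoint, not the real one

limiter = AsyncLimiter(REQUESTS_PER_SECOND, time_period=1)

async def score_one(session: aiohttp.ClientSession, payload: dict) -> dict:
    # Every request first acquires a slot from the rate limiter.
    async with limiter:
        async with session.post(API_URL, json=payload) as resp:
            resp.raise_for_status()
            return await resp.json()
```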
### Custom Processing Pipeline

The main evaluation pipeline can be customized by:

1. Modifying dataset loading/processing
2. Adding new scoring methods
3. Changing output format
4. Implementing different evaluation strategies
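
At a high level the pipeline is load, format, score, and aggregate. The sketch below only labels those stages so you can see where each customization fits; `run_pipeline` and its stub scorer are illustrative and do not mirror the script's actual function names.

```python
import asyncio

async def run_pipeline(rows):
    # 1. Dataset loading/processing: "rows" stands in for the loaded dataset.
    dialogues = [{"query": r["query"], "response": r["response"]} for r in rows]

    # 2. Scoring: replace this stub with real LMUnit API calls.
    async def score(dialogue):
        return 1.0  # placeholder score

    scores = await asyncio.gather(*(score(d) for d in dialogues))

    # 3. Output format / evaluation strategy: a single average here;
    #    the real script groups scores by subset.
    return {"average": sum(scores) / len(scores)}

if __name__ == "__main__":
    demo = [{"query": "What is 2 + 2?", "response": "4"}]
    print(asyncio.run(run_pipeline(demo)))
```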
## Output

The script generates:

- Evaluation scores for each dimension
- Overall average score
- Results saved in JSON format (`results_grouped_{model_name}.json`)
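
A small sketch of reading the grouped results file back, assuming a flat mapping of section name to score; the actual JSON layout written by the script may nest additional metadata.

```python
import json

model_name = "lmunit-api"  # placeholder; use the value you passed to --model
with open(f"results_grouped_{model_name}.json") as f:
    results = json.load(f)

# Assumes a flat {section: score} mapping; adjust if the file differs.
for section, score in results.items():
    print(f"{section}: {score}")
```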
## Rate Limiting

- Default rate limit: 1 request per second
- Maximum retries: 10
- Base delay: 1.0 seconds

## Error Handling

The script includes:

- Exponential backoff for API requests
- Comprehensive error logging
- Session management for API connections
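
A minimal sketch of the exponential-backoff pattern described above, using the documented defaults (10 retries, 1.0 s base delay). The helper name, endpoint argument, and choice of retryable exceptions are assumptions; the script's own retry logic may differ.

```python
import asyncio
import logging

import aiohttp

MAX_RETRIES = 10   # documented default
BASE_DELAY = 1.0   # seconds, documented default

async def post_with_backoff(session, url, payload):
    """POST with exponential backoff; 'url' is a placeholder endpoint argument."""
    for attempt in range(MAX_RETRIES):
        try:
            async with session.post(url, json=payload) as resp:
                resp.raise_for_status()
                return await resp.json()
        except aiohttp.ClientError as err:
            delay = BASE_DELAY * (2 ** attempt)
            logging.warning("Request failed (%s); retrying in %.1fs", err, delay)
            await asyncio.sleep(delay)
    raise RuntimeError(f"Request failed after {MAX_RETRIES} retries")
```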
## Example

```bash
python rewardbench_lmunit.py \
    --model lmunit-api \
    --api_key your-api-key \
    --debug \
    --mode_unit_test custom_per_subset
```

## Notes

- For large datasets, consider adjusting the batch size and rate limits
- Debug mode is available for testing with a smaller dataset
- Custom unit tests can be specified per subset using the `mode_unit_test` parameter
- The script is designed to be modular and extensible for different evaluation needs
