Add custom GEval support with criteria, rubric, and evaluation_steps #37
justaddcoffee wants to merge 2 commits into literature-eval-enhancements
Conversation
- Add MetricConfig class with support for custom criteria, rubric, and evaluation_steps
- Add RubricItem class for structured rubric definitions
- Implement make_custom_geval() to create GEval instances from MetricConfig
- Add model_validator to enforce mutual exclusivity of criteria/evaluation_steps
- Update EvalCase to accept both string and MetricConfig metric specifications
- Fix make_geval() bug: remove duplicate criteria parameter (keep evaluation_steps only)
- Improve error handling in Claude coder (warning instead of exception)
- Add comprehensive test suite for MetricConfig validation

Addresses limitations with generic evaluation criteria not suitable for:
- Exact text extraction tasks
- Specific metadata field retrieval
- Binary decision evaluations

The custom GEval feature allows test-specific evaluation criteria for better accuracy and reliability. evaluation_steps provides more control than criteria according to the DeepEval documentation.

Tests: Added test_custom_geval.py with 7 test cases covering all validation scenarios.
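For orientation, here is a minimal sketch of the validation behaviour described above, assuming pydantic v2 and the field names listed in this PR; the actual implementation lives in eval_model.py.

```python
# Hypothetical sketch of the MetricConfig / RubricItem models described above,
# assuming pydantic v2; field names follow the PR description.
from pydantic import BaseModel, model_validator


class RubricItem(BaseModel):
    score: float
    criteria: str


class MetricConfig(BaseModel):
    name: str
    criteria: str | None = None
    evaluation_steps: list[str] | None = None
    rubric: list[RubricItem] | None = None

    @model_validator(mode="after")
    def check_fields(self):
        # criteria and evaluation_steps are mutually exclusive
        if self.criteria and self.evaluation_steps:
            raise ValueError("Provide either criteria or evaluation_steps, not both")
        # at least one of criteria, evaluation_steps, or rubric is required
        if not (self.criteria or self.evaluation_steps or self.rubric):
            raise ValueError("Provide at least one of criteria, evaluation_steps, or rubric")
        return self
```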
Claude finished @justaddcoffee's task

Code Review Complete ✅

I've thoroughly reviewed the custom GEval enhancements PR. Overall, this is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my detailed analysis:
✅ Strengths
- Excellent architecture and design
- Robust input validation
- Comprehensive test coverage
- Important bug fixes
Claude finished @justaddcoffee's task: review of the custom GEval enhancements PR
✅ Overall Assessment: APPROVE

This is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my comprehensive review:

Strengths
- Excellent architecture and design
- Robust input validation
- Comprehensive test coverage
- Important bug fixes
Areas for Improvement

1. Rubric Score Range Validation

```python
# src/metacoder/evals/runner.py:107-118
if not 0.0 <= item.score <= 1.0:
    raise ValueError(f"Rubric score {item.score} must be between 0.0 and 1.0")
```

Suggestion: Add validation that rubric scores are within a valid range (e.g., 0.0-1.0); see the sketch after this list.

2. Default Criteria Enhancement

```python
# src/metacoder/evals/runner.py:137-139
kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."
```

Suggestion: This fallback criteria could be more descriptive, or users could be allowed to specify custom criteria even when a rubric is provided.

3. Minor Performance Optimization

```python
# src/metacoder/evals/runner.py:357-360
if metric_config and (metric_config.rubric or metric_config.criteria or metric_config.evaluation_steps):
```

Suggestion: Can be simplified to just …
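A minimal sketch of how the range check suggested in item 1 could live on the model itself, assuming pydantic v2 and the RubricItem fields described in this PR:

```python
# Hypothetical sketch: constraining RubricItem.score to [0.0, 1.0] at the model
# level, assuming pydantic v2; the real class is defined in eval_model.py.
from pydantic import BaseModel, field_validator


class RubricItem(BaseModel):
    score: float
    criteria: str

    @field_validator("score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        # Reject scores outside the 0.0-1.0 range suggested by the review.
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Rubric score {v} must be between 0.0 and 1.0")
        return v
```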
Security Assessment 🔒

Performance Assessment 🚀
Test Quality Assessment 📋

Excellent test coverage with comprehensive scenarios:
Final Recommendation

This implementation is production-ready and addresses a real need for more flexible evaluation criteria. The code follows best practices, includes comprehensive testing, and maintains backward compatibility while adding powerful new functionality.

Key Benefits:
The minor improvement suggestions don't block approval and can be addressed in future iterations.
Summary
This PR adds comprehensive support for custom GEval metrics with three configuration options: criteria, evaluation_steps, and rubric.
Motivation
The default CorrectnessMetric uses generic evaluation criteria that don't work well for specialized tasks:
- Exact text extraction tasks
- Specific metadata field retrieval
- Binary decision evaluations
Custom metrics allow test-specific evaluation criteria for better accuracy and reliability.
Changes
Core Features
1. Add MetricConfig class (eval_model.py)
   - Fields: name, criteria, evaluation_steps, rubric
   - @model_validator enforces mutual exclusivity of criteria and evaluation_steps
   - Requires at least one of criteria, evaluation_steps, or rubric

2. Add RubricItem class (eval_model.py)
   - score and criteria fields

3. Implement make_custom_geval() (runner.py)
   - Creates GEval instances from a MetricConfig (see the sketch after this list)

4. Update EvalCase.metrics (eval_model.py)
   - Changed from List[str] to List[Union[str, MetricConfig]]

5. Update runner logic (runner.py)
   - Dispatches to make_custom_geval() based on the presence of criteria, evaluation_steps, or rubric

Bug Fixes

6. Fix make_geval() bug (runner.py)
   - Remove duplicate criteria parameter (was specifying both criteria AND evaluation_steps)
   - Keep evaluation_steps for more reliable scoring

7. Improve error handling (claude.py)
   - Change raise ValueError to logger.warning for non-auth errors

Tests

8. Add comprehensive test suite (tests/test_custom_geval.py)
   - 7 test cases covering all validation scenarios
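A rough sketch of what the make_custom_geval() factory in item 3 might look like for the criteria and evaluation_steps cases, using DeepEval's GEval and LLMTestCaseParams; the rubric path and the exact evaluation_params chosen by the PR are not shown here.

```python
# Hypothetical sketch of a custom GEval factory. MetricConfig comes from this PR
# (module path assumed); GEval and LLMTestCaseParams are DeepEval APIs.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

from metacoder.evals.eval_model import MetricConfig  # module path assumed


def make_custom_geval_sketch(config: MetricConfig) -> GEval:
    # Rubric handling is part of the PR but omitted from this sketch.
    if config.criteria is None and config.evaluation_steps is None:
        raise NotImplementedError("rubric-only configs are not covered by this sketch")
    return GEval(
        name=config.name,
        criteria=config.criteria,
        evaluation_steps=config.evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    )
```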
Usage Examples

Custom Criteria
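A minimal usage sketch for a criteria-only metric, assuming MetricConfig is importable from metacoder.evals.eval_model (module path assumed); the metric name and criteria text are illustrative.

```python
# Hypothetical usage sketch: a metric defined with custom criteria only.
from metacoder.evals.eval_model import MetricConfig  # module path assumed

exact_title_metric = MetricConfig(
    name="exact_title_extraction",
    criteria=(
        "The actual output must contain the paper title exactly as it appears "
        "in the expected output, with no paraphrasing."
    ),
)
# The object can be listed in EvalCase.metrics alongside plain string metric
# names, e.g. metrics=["correctness", exact_title_metric].
```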
Custom Rubric
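A minimal usage sketch for a rubric-based metric, assuming the same import path; the score bands and wording are illustrative.

```python
# Hypothetical usage sketch: a rubric built from RubricItem entries.
from metacoder.evals.eval_model import MetricConfig, RubricItem  # module path assumed

metadata_rubric_metric = MetricConfig(
    name="metadata_completeness",
    rubric=[
        RubricItem(score=1.0, criteria="All requested metadata fields are present and correct."),
        RubricItem(score=0.5, criteria="Most fields are present; minor values are missing or wrong."),
        RubricItem(score=0.0, criteria="Required metadata fields are missing or incorrect."),
    ],
)
```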
Custom Evaluation Steps
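A minimal usage sketch for an evaluation_steps-driven metric (mutually exclusive with criteria), again with an assumed import path and illustrative content.

```python
# Hypothetical usage sketch: explicit evaluation_steps for a binary decision.
from metacoder.evals.eval_model import MetricConfig  # module path assumed

binary_decision_metric = MetricConfig(
    name="retraction_flag",
    evaluation_steps=[
        "Check whether the actual output states if the paper was retracted.",
        "Compare that yes/no decision with the expected output.",
        "Score 1 if the decisions match exactly, otherwise score 0.",
    ],
)
```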
Testing
New test suite added (pytest tests/test_custom_geval.py) with 7 test cases covering all validation scenarios.
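One of the covered validation scenarios could look roughly like this sketch; the test name and body are assumptions, not copied from tests/test_custom_geval.py.

```python
# Hypothetical test sketch: the mutual-exclusivity rule should reject configs
# that set both criteria and evaluation_steps.
import pytest
from pydantic import ValidationError

from metacoder.evals.eval_model import MetricConfig  # module path assumed


def test_criteria_and_steps_are_mutually_exclusive():
    with pytest.raises(ValidationError):
        MetricConfig(
            name="invalid_metric",
            criteria="Judge correctness.",
            evaluation_steps=["Compare actual output with expected output."],
        )
```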
Compatibility
References
Generated with Claude Code