Humor-Bench is a benchmark for evaluating humor understanding in AI models. This research project includes an autograder system and a comprehensive performance analysis across a range of models.
The Humor-Bench dataset consists of hand-annotated cartoons from the New Yorker Caption Contest. We sourced the original cartoons and captions from the Nextml Caption Contest Data, the jmhessel/newyorker_caption_contest collection, and CartoonStock.
For each cartoon, we provide:
- Detailed image descriptions, annotated by our team
- Multiple captions sourced from professional humor contests
- A structured "element" field, annotated by us, identifying the specific comedic device or concept
This high-quality ground truth signal allows us to objectively evaluate AI models on their ability to identify and explain humor, making it a robust benchmark for measuring humor comprehension capabilities.
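For reference, a single annotated record might look like the sketch below. The field names and values are illustrative only; consult the released dataset for the exact schema.

```python
# Illustrative example of one Humor-Bench record.
# Field names and content are hypothetical, not the dataset's exact schema.
example_record = {
    "contest_id": 512,  # New Yorker Caption Contest number (hypothetical)
    "description": (
        "A man in a business suit stands in the middle of a desert, "
        "holding a briefcase, while a camel looks on."
    ),
    "caption": "I was told there would be a corner office.",
    "element": (
        "Incongruity between corporate workplace expectations "
        "and the barren desert setting."
    ),
}
```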
To automate the evaluation of generated humor explanations, Humor-Bench includes an LLM-based autograder (autograder.py). This autograder uses a separate, powerful language model (e.g., GPT-4o) to assess whether a generated explanation correctly identifies the specific humor element (comedic device or concept) defined in our ground truth annotations.
The autograder is designed to:
- Accept an explanation, cartoon description, caption, and the target humor element.
- Provide a PASS or FAIL judgment based on whether the explanation addresses the target element.
- Output reasoning for its judgment.
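As a rough sketch of how such a grading call could be structured (assuming the OpenAI Python client; the prompt wording and the grade_explanation helper are illustrative, not the exact implementation in autograder.py):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_explanation(description: str, caption: str, element: str, explanation: str) -> str:
    """Ask a grader model whether the explanation covers the target humor element."""
    prompt = (
        "You are grading an explanation of a cartoon caption's humor.\n"
        f"Cartoon description: {description}\n"
        f"Caption: {caption}\n"
        f"Target humor element: {element}\n"
        f"Candidate explanation: {explanation}\n\n"
        "Does the explanation correctly identify the target humor element? "
        "Reply with PASS or FAIL on the first line, followed by brief reasoning."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```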
We also provide scripts to evaluate the autograder itself against human judgments (autograder_eval.py) and analyze its performance (autograder_eval_analysis.py). This allows for measuring the autograder's accuracy, false positive rate (FPR), and false negative rate (FNR) compared to human annotations, ensuring its reliability as an evaluation tool.
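The comparison against human labels reduces to standard confusion-matrix arithmetic. A minimal sketch (variable names are illustrative, not the exact code in autograder_eval_analysis.py):

```python
def autograder_metrics(human_labels, autograder_labels):
    """Compare autograder judgments to human judgments (both lists of 'PASS'/'FAIL')."""
    pairs = list(zip(human_labels, autograder_labels))
    tp = sum(h == "PASS" and a == "PASS" for h, a in pairs)
    tn = sum(h == "FAIL" and a == "FAIL" for h, a in pairs)
    fp = sum(h == "FAIL" and a == "PASS" for h, a in pairs)  # autograder too lenient
    fn = sum(h == "PASS" and a == "FAIL" for h, a in pairs)  # autograder too strict

    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of human FAILs the autograder passed
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of human PASSes the autograder failed
    return {"accuracy": accuracy, "fpr": fpr, "fnr": fnr}
```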
Our evaluation rubric consists of 100 distinct explanation elements, each with human judgments (PASS/FAIL) for explanations generated by four different models: GPT-4o, Gemini 2.5 Pro, Llama 4 Maverick, and Claude 3.7 Sonnet (totaling 400 human-annotated data points). Using GPT-4o as the autograder model, we found it achieved an average accuracy of 87% against these human labels. Notably, the autograder exhibited a higher false positive rate than false negative rate, indicating a bias towards leniency. This means that while the autograder might occasionally pass a subpar explanation, a FAIL judgment is a strong indicator of an inadequate explanation.
