Deadlines
- Abstract: Fri 7 Mar 2025 23:59
- Submission: 14 Mar 2025 23:59
Contributions:
- We detect ambiguity better than ClarifyGPT because we have a better, more semantic measure
- evaluation with a human study
- We are the first to repair requirements fully automatically (based only on public tests)
- cross-model Pass@K evaluation
Additionally, we can think about:
- We also incorporate human feedback (simulated humans) better than ClarifyGPT (fewer questions, better final outcome), but we need a more conceptual formulation
Remaining tasks
-
Contribution 1
- Challenge: some requirements are ambiguous but not detected as such, and vice versa.
- @feixiangdejiahao tune hyperparameters such as threshold & temperature based on our pilot dataset
- @feixiangdejiahao with the help of a user study, construct an ambiguity detection confusion matrix
- design user study
- need to be sure that our method works well (user study is for confirmation, not testing)
- formulate precise questions (without bias), and get verifiable answers, etc.
- show that the D-measure is better than ClarifyGPT's measure (one possible formulation of D is sketched at the end of this list)
- @mechtaev maybe we can improve here?
- low-priority: investigate how temperature affects D
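For the threshold/temperature tuning and the comparison against ClarifyGPT's measure, here is a minimal sketch of one way D could be computed, assuming D is a semantic-disagreement score over sampled programs. `sample_model` and `execute` are placeholders, not our actual implementation:

```python
# Hypothetical sketch of a semantic disagreement measure D (not the final definition).
# Idea: sample N programs for a requirement R, group them by their behaviour on the
# public tests, and score the spread of those groups.
import math
from collections import Counter

def semantic_signature(program, public_tests):
    """Tuple of the program's outputs on the public test inputs; two programs with
    the same signature are treated as semantically equivalent w.r.t. those tests."""
    outputs = []
    for test_input in public_tests:
        try:
            outputs.append(repr(execute(program, test_input)))  # placeholder sandboxed runner
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)

def ambiguity_D(programs, public_tests):
    """Normalized entropy over semantic equivalence classes:
    0.0 = all samples agree (likely unambiguous), 1.0 = maximal disagreement."""
    counts = Counter(semantic_signature(p, public_tests) for p in programs)
    if len(counts) <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

# Hyperparameters from the tasks above (threshold, temperature, sample size N):
# programs = sample_model("model-A", requirement, n=20, temperature=TEMPERATURE)
# is_ambiguous = ambiguity_D(programs, public_tests) > THRESHOLD
```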
-
Contribution 2
- Challenge: some requirements are not ambiguous. We repair only a subset of the dataset.
- identify the part of our HumanEval/MBPP/Taco dataset that we aim to repair, using D with the hyperparameters from Contribution 1 (see the filtering sketch at the end of this list)
- What should we call this subset?
- @robbiebmorris add support for our datasets in ClarifyGPT, and make a ClarifyGPT modification that uses only public tests
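A small sketch of how the "to-repair" subset could be selected, reusing the hypothetical `ambiguity_D` and `sample_model` from the Contribution 1 sketch (hyperparameter values below are placeholders, not tuned values):

```python
# Hypothetical filtering step: keep only the tasks whose requirement is flagged
# as ambiguous under the Contribution-1 hyperparameters.
THRESHOLD = 0.3     # to be tuned on the pilot dataset
TEMPERATURE = 0.8   # likewise
N_SAMPLES = 20      # likewise

def select_repair_subset(tasks):
    """tasks: iterable of dicts with 'requirement' and 'public_tests' fields."""
    subset = []
    for task in tasks:
        programs = sample_model("model-A", task["requirement"],
                                n=N_SAMPLES, temperature=TEMPERATURE)
        if ambiguity_D(programs, task["public_tests"]) > THRESHOLD:
            subset.append(task)
    return subset
```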
-
How do we know that our requirements actually become better? The current workflow is: Model A: D(R) > threshold, Model A: R -> R', Model A: D(R') < threshold. In other words, the repair is model-specific.
- compare model A's Pass@k on the hidden tests for R and R' (see the Pass@k sketch after this list)
- compare a different model B's Pass@k on the hidden tests for R and R'
- @mechtaev when describing the motivation of SpecFix, we need to discuss whether our repairs are for humans or for LLMs, and whether each fix is for a specific LLM or for LLMs in general.
- The baseline for auto-repair is ClarifyGPT's simulated user feedback prompts with public tests. We can call it "ClarifyGPT's User Feedback Prompt" (CUFP)
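For the R vs. R' comparison, the standard unbiased Pass@k estimator (n samples per task, c of which pass the hidden tests) can be used as-is; the cross-model usage at the bottom is only a sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples
    (drawn without replacement from n generated programs, c of which pass
    the hidden tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Sketch of the cross-model check: the repair generalizes beyond model A if both
# A and an unrelated model B improve on the repaired requirement R'.
# delta_A = mean(pass_at_k(n, c_A_Rprime[t], k) - pass_at_k(n, c_A_R[t], k) for t in tasks)
# delta_B = mean(pass_at_k(n, c_B_Rprime[t], k) - pass_at_k(n, c_B_R[t], k) for t in tasks)
```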
-
@feixiangdejiahao discuss tasks with @ScooterStuff