
ICSE 2026 submission plan #16

@mechtaev

Description

Deadlines

  • Abstract: Fri 7 Mar 2025 23:59
  • Submission: Fri 14 Mar 2025 23:59

Contributions:

  1. We detect ambiguity better than ClarifyGPT because we use a more semantic measure, the D-measure (one possible formulation is sketched below)
    • evaluation with a human study
  2. We are the first to repair requirements fully automatically (based only on public tests)
    • cross-model Pass@K evaluation (the standard estimator is sketched right after this list)
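
For the cross-model Pass@K evaluation, the standard unbiased estimator from the Codex paper (Chen et al., 2021) is a natural fit. A minimal sketch; the function names are ours:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), for n samples of which c pass."""
    if n - c < k:
        return 1.0
    # Product form avoids overflow in the binomial coefficients
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark, given (n, c) per problem."""
    return float(np.mean([pass_at_k(n, c, k) for n, c in per_problem]))
```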

Additionally, we can think about:

  • We also incorporate human feedback (simulated humans) better than ClarifyGPT (fewer questions, a better final outcome), but we need a more conceptual formulation
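
The issue does not pin down how D is computed, so the following is only one plausible formulation, chosen to be consistent with the threshold and temperature hyperparameters mentioned under Contribution 1: sample several programs from the requirement at temperature T, cluster them by their behavior on the public test inputs, and take the normalized entropy of the cluster sizes. All names here are ours and purely illustrative:

```python
import math
from collections import Counter
from typing import Any, Callable

def d_measure(programs: list[str], inputs: list[Any],
              run: Callable[[str, Any], Any]) -> float:
    """Hypothetical semantic ambiguity measure: normalized entropy of
    behavioral clusters among programs sampled from one requirement.
    run(program, x) executes a candidate on input x; outputs must be
    hashable so behaviors can be grouped."""
    signatures = [tuple(run(p, x) for x in inputs) for p in programs]
    counts = Counter(signatures)
    n = len(programs)
    if n <= 1:
        return 0.0
    # 0 = all sampled programs agree on every input, 1 = total disagreement
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)
```

With a tuned threshold, `d_measure(...) > threshold` gives a binary prediction per requirement, so the confusion matrix for the user study in Contribution 1 can be computed directly, e.g. with `sklearn.metrics.confusion_matrix(human_labels, predictions)`.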

Remaining tasks

  • Contribution 1

    • Challenge: some requirements are ambiguous but not detected as such, and vice versa.
    • @feixiangdejiahao tune hyperparameters such as threshold & temperature based on our pilot dataset
    • @feixiangdejiahao construct the ambiguity-detection confusion matrix with the help of a user study
      • design the user study
      • we need to be sure that our method works well beforehand (the user study is for confirmation, not testing)
      • formulate precise, unbiased questions, collect verifiable answers, etc.
      • show that the D-measure is better than ClarifyGPT's measure
      • @mechtaev maybe we can improve here?
      • low-priority: investigate how temperature affects D
  • Contribution 2

    • Challenge: some requirements are not ambiguous, so we repair only a subset of the dataset.
      • identify the part of our HumanEval/MBPP/Taco dataset that we aim to repair, using D with the hyperparameters from Contribution 1
        • What should we call this subset?
      • @robbiebmorris add support for our datasets in ClarifyGPT, and build a ClarifyGPT variant that uses only public tests
  • How do we know that our requirements actually become better? The current workflow is: Model A checks `D(R) > threshold`, Model A repairs `R -> R'`, Model A verifies `D(R') < threshold`. In other words, the repair is model-specific (a sketch of this loop appears after this list).

    • compare Model A's Pass@k on hidden tests between R and R'
    • compare the Pass@k of a different Model B on hidden tests between R and R'
    • @mechtaev when describing the motivation of SpecFix, we need to discuss whether our repairs are for humans or for LLMs, and whether each fix targets a specific LLM or LLMs in general.
    • The baseline for automatic repair is ClarifyGPT's simulated-user-feedback prompting with public tests; we can call it "ClarifyGPT's User Feedback Prompt" (CUFP).
  • @feixiangdejiahao discuss tasks with @ScooterStuff
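
Below is a minimal sketch of the repair-and-validate loop described above. Everything in it is a placeholder reflecting our reading of the issue: `d` and `repair` stand for Model A's D-computation and repair step and are passed in as callables rather than implemented:

```python
from typing import Callable

def repair_requirement(
    req: str,
    d: Callable[[str], float],     # D computed with Model A
    repair: Callable[[str], str],  # Model A's requirement-repair step
    threshold: float,
    max_rounds: int = 1,
) -> str:
    """The workflow from above: if D(R) > threshold, Model A repairs
    R -> R', and we then verify that D(R') < threshold."""
    for _ in range(max_rounds):
        if d(req) <= threshold:
            return req         # already (or now) below the threshold
        req = repair(req)      # R -> R'
    if d(req) > threshold:
        raise RuntimeError("repair did not bring D below the threshold")
    return req
```

Because `d` and `repair` are both instantiated with Model A, the cross-model check above is what matters: comparing Model B's Pass@k on hidden tests between R and R' tells us whether the repairs generalize beyond the repairing model.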
