Deadlines
- Abstract: Fri 7 Mar 2025 23:59
- Submission: 14 Mar 2025 23:59
Contributions:
- We detect ambiguity better than ClarifyGPT because we have a better, more semantic measure
- evaluation with a human study
- We are the first to repair requirements fully automatically (based only on public tests)
- cross-model Pass@K evaluation
Additionally, we can think about:
- We also incorporate human feedback (simulated humans) better than ClarifyGPT (fewer questions, better final outcome), but we need a more conceptual formulation
Remaining tasks
-
Contribution 1
- Challenge: some requirements are ambiguous but not detected as such, and vice versa.
- @feixiangdejiahao tune hyperparameters such as threshold & temperature based on our pilot dataset
- @feixiangdejiahao with the help of a user study, construct an ambiguity detection confusion matrix
- design user study
- need to be sure that our method works well (user study is for confirmation, not testing)
- formulate precise questions (without bias), and get verifiable answers, etc.
- show that the D-measure is better than ClarifyGPT's measure (one possible formulation of D is sketched at the end of this list)
- @mechtaev maybe we can improve here?
- low-priority: investigate how temperature affects D
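For the threshold/temperature tuning and the comparison against ClarifyGPT's measure, here is a minimal sketch of one way D could be computed, assuming D is a semantic-disagreement score over sampled programs. `sample_model` and `execute` are placeholders, not our actual implementation:

```python
# Hypothetical sketch of a semantic disagreement measure D (not the final definition).
# Idea: sample N programs for a requirement R, group them by their behaviour on the
# public tests, and score the spread of those groups.
import math
from collections import Counter

def semantic_signature(program, public_tests):
    """Tuple of the program's outputs on the public test inputs; two programs with
    the same signature are treated as semantically equivalent w.r.t. those tests."""
    outputs = []
    for test_input in public_tests:
        try:
            outputs.append(repr(execute(program, test_input)))  # placeholder sandboxed runner
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)

def ambiguity_D(programs, public_tests):
    """Normalized entropy over semantic equivalence classes:
    0.0 = all samples agree (likely unambiguous), 1.0 = maximal disagreement."""
    counts = Counter(semantic_signature(p, public_tests) for p in programs)
    if len(counts) <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

# Hyperparameters from the tasks above (threshold, temperature, sample size N):
# programs = sample_model("model-A", requirement, n=20, temperature=TEMPERATURE)
# is_ambiguous = ambiguity_D(programs, public_tests) > THRESHOLD
```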
-
Contribution 2
- Challenge: some requirements are not ambiguous. We repair only a subset of the dataset.
- identify the part of our HumanEval/MBPP/Taco dataset that we aim to repair, using D with the hyperparameters from Contribution 1 (see the filtering sketch at the end of this list)
- What should we call this subset?
- @robbiebmorris add support for our datasets in ClarifyGPT, and make a ClarifyGPT modification that uses only public tests
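A small sketch of how the "to-repair" subset could be selected, reusing the hypothetical `ambiguity_D` and `sample_model` from the Contribution 1 sketch (hyperparameter values below are placeholders, not tuned values):

```python
# Hypothetical filtering step: keep only the tasks whose requirement is flagged
# as ambiguous under the Contribution-1 hyperparameters.
THRESHOLD = 0.3     # to be tuned on the pilot dataset
TEMPERATURE = 0.8   # likewise
N_SAMPLES = 20      # likewise

def select_repair_subset(tasks):
    """tasks: iterable of dicts with 'requirement' and 'public_tests' fields."""
    subset = []
    for task in tasks:
        programs = sample_model("model-A", task["requirement"],
                                n=N_SAMPLES, temperature=TEMPERATURE)
        if ambiguity_D(programs, task["public_tests"]) > THRESHOLD:
            subset.append(task)
    return subset
```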
-
How do we know that our requirements actually become better? The current workflow is: Model A: D(R) > threshold, Model A: R -> R', Model A: D(R') < threshold. In other words, the repair is model-specific.
- compare model A's Pass@k on the hidden tests for R and R' (see the Pass@k sketch after this list)
- compare a different model B's Pass@k on the hidden tests for R and R'
- @mechtaev when describing the motivation of SpecFix, we need to discuss whether our repairs are for humans or for LLMs, and whether each fix is for a specific LLM or for LLMs in general.
- The baseline for auto-repair is ClarifyGPT's simulated user feedback prompts with public tests. We can call it "ClarifyGPT's User Feedback Prompt" (CUFP)
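For the R vs. R' comparison, the standard unbiased Pass@k estimator (n samples per task, c of which pass the hidden tests) can be used as-is; the cross-model usage at the bottom is only a sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples
    (drawn without replacement from n generated programs, c of which pass
    the hidden tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Sketch of the cross-model check: the repair generalizes beyond model A if both
# A and an unrelated model B improve on the repaired requirement R'.
# delta_A = mean(pass_at_k(n, c_A_Rprime[t], k) - pass_at_k(n, c_A_R[t], k) for t in tasks)
# delta_B = mean(pass_at_k(n, c_B_Rprime[t], k) - pass_at_k(n, c_B_R[t], k) for t in tasks)
```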
-
@feixiangdejiahao discuss tasks with @ScooterStuff