Skip to content

Conversation

@cpjet64
Copy link
Contributor

@cpjet64 cpjet64 commented Oct 24, 2025

Preserve CRLF/LF newline consistency on Windows edits

Problem & intent

Edits on Windows sometimes converted CRLF files to LF or produced mixed endings, creating noisy diffs and IDE warnings. This PR ensures newline preservation during file writes.


Before / After

Before
The text writer emitted LF regardless of a file’s original newline convention.
CRLF files could flip to LF or end up mixed.

After

  • Writer preserves the original file’s EOL.
  • If the on‑disk file used CRLF → edits keep CRLF.
  • If LF → edits keep LF.
  • New files follow the existing global default (unchanged).

Scope of change

codex-rs/apply-patch/src/lib.rs

  • Detect original EOL from file contents.
  • Normalize final text to match at write point.
  • Introduce tiny helper functions: Eol, detect_eol, normalize_to_eol.
  • Single call at the final text write site.

No changes to binary paths, new-file adds, TUI paste logic, or PATH/PATHEXT/env behavior.


Security effects

None. Only affects how edited text is serialized to disk.


Testing evidence

Static

  • cargo fmt
  • cargo clippy -p codex-apply-patch -- -D warningsclean
  • cargo test -p codex-apply-patch → added CRLF/LF preservation tests pass; full suite green.

Manual

  • CRLF originals remain CRLF after edits.
  • LF originals remain LF.

Notes

  • Diff/preview logic remains LF to avoid snapshot churn; preservation occurs only at final write.
  • New files keep default EOL behavior by design.

Checklist

@dylan-hurd-oai dylan-hurd-oai self-requested a review October 28, 2025 05:34
Copy link
Collaborator

@dylan-hurd-oai dylan-hurd-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpjet64 thanks for digging into this! Really appreciate the investigation here. My primary concern is whether this change will be sufficiently deterministic.

If you're interested in implementing follow-ups to the review, we're happy to continue the discussion as you iterate, or we can just build off the work you've contributed so far. But we're eager to land a fix here to improve the Windows experience!

Crlf,
}

fn detect_eol(s: &str) -> Eol {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will sufficiently cover all cases, e.g. how do we expect to handle newly created files in a repo that uses CRLF?

We might want to consider a more deterministic approach for detect_eol. We could check git configuration.

Curious if you have thoughts on adding logic to collect_git_info and passing using a cli flag to inform this setting instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Thanks for the review. I kept this PR intentionally small to limit the review surface. If you’re okay with a little more LOC I’ll add a follow-up commit on this same branch (fix/win-crlf-preserve) to make new-file EOL behavior deterministic while leaving edits to existing files unchanged.

Proposal (follow-up commit)

Behavior

  • Existing files: unchanged — detect EOL from on-disk contents and preserve (fast, backward compatible).

  • New files (deterministic precedence):

    1. .gitattributes for the path (use git check-attr eol -- <path> where available)
    2. git config (core.eol then core.autocrlf)
    3. CLI/env override: --assume-eol=(git|crlf|lf|detect) or APPLY_PATCH_ASSUME_EOL
    4. Fallback to the current default behavior (what this writer already uses)

Wiring

  • Add a decide_eol() helper that implements the precedence and call it at the final write site before normalizing text.
  • Add a --assume-eol CLI flag + optional env var for explicit overrides.
  • Keep diff/preview logic unchanged (previews remain LF); normalization only affects final disk writes.

Common edge cases & how this follow-up handles them

  1. Repo not a git repo / git unavailable

    • Fall back to existing behavior. git is best-effort; failures are non-fatal and logged at debug level.
  2. .gitattributes pattern matching complexity

    • Prefer git check-attr to avoid reimplementing matching. If git is missing, fall back to a best-effort .gitattributes parse (documented).
  3. core.autocrlf semantics / cross-machine variance

    • Precedence: .gitattributes > core.eol > core.autocrlf.
    • Interpret core.autocrlf=true → CRLF, input → LF, false → unknown. .gitattributes is the repo-level guarantee.
  4. Binary files / -text attribute

    • If git check-attr reports binary or -text, skip normalization — don’t rewrite binaries.
  5. Mixed line endings in existing files

    • Preserve current behavior: detect dominant EOL and preserve. No auto-fixing of mixed files.
  6. Large files / performance

    • We only read content for existing files (current behavior). A streaming heuristic can be added later if needed.
  7. Paths with spaces / special characters

    • Use std::process::Command (no shell) and pass path args directly to git to avoid escaping issues.
  8. Files outside repo / submodules / worktrees

    • If file sits outside the repo root passed to the writer, skip repo checks and fall back. Submodule/worktree behavior depends on repo root provided; failures fall back safely.
  9. CI vs local dev differences

    • .gitattributes is authoritative for CI; that’s why it’s first in precedence.
  10. Old git without check-attr

    • Fall back to parsing .gitattributes if git check-attr isn’t available; document this as weaker behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough proposal and consideration here! At first glance this feels reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also considered adding a TUI slash command to auto-detect CRLF/LF, force CRLF, or force LF, but it would have been a much larger addition. I think most users would never use it, so it risked becoming bloat.

Context:
* Windows edits sometimes flipped CRLF files to LF or created mixed endings.

Change:
* Detect the original file EOL (treat any \r\n as CRLF; else LF) and normalize the final text to match only for Update File writes.
* Added small helper (detect_eol/normalize_to_eol) and a single call at the final write site.
* Added focused tests to verify CRLF/LF preservation.

Risk:
* Minimal and localized. Only affects the text update write path; binary paths and new-file adds are unchanged.

Tests:
* cargo fmt
* cargo clippy -p codex-apply-patch -- -D warnings (clean)
* cargo test -p codex-apply-patch (all tests passed; includes new EOL tests)

Verification:
* Verified on local runs that updating CRLF files preserves CRLF and LF files preserve LF. New-file behavior remains unchanged.
@cpjet64 cpjet64 force-pushed the fix/win-crlf-preserve branch from 1822f8e to 0c56804 Compare October 28, 2025 14:40
Copy link
Contributor Author

@cpjet64 cpjet64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Commit pushed. Please let me know if you have any questions or want any changes! Super happy to help resolve any and all Windows issues!

Copy link
Collaborator

@dylan-hurd-oai dylan-hurd-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the direction here! 1 strong opinion on the precedence, otherwise mostly structural requestsx

AssumeEol::Lf => return Eol::Lf,
AssumeEol::Crlf => return Eol::Crlf,
_ => {}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO cli args should take precedence, even for existing files. If the client of this command explicitly sets their preferene, we should consistently adhere to it. e.g. if we wanted to use this option in core, we would also want it to flow through and ignore the split on existing files.

// Existing files: detect from on-disk bytes
if !is_new_file {
if let Ok(bytes) = std::fs::read(path) {
let e = detect_eol_from_bytes(&bytes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this fallback should just be based on the patch contents here (i.e. use the previous logic). Re-reading the entire file is going to be too much of a performance hit, and if a file is already mixed then adding more based on the line endings near the edit seems reasonable / equivalently deterministic.

let _argv0 = args.next();
// Allow optional flags, then a single positional PATCH argument; otherwise read from stdin.
let args = std::env::args().skip(1).collect::<Vec<String>>();
let mut patch_arg: Option<String> = None;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have more than 1 arg, I would prefer to switch to using clap for arg parsing here

.arg("check-attr")
.arg("eol")
.arg("text")
.arg("binary")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function call here suggests it's generic, but it parses out specific attributes for line endings. Can we make either move this into git_check_attr_eol to simplify, or make this logic generic?


fn os_native_eol() -> Eol {
if cfg!(windows) { Eol::Crlf } else { Eol::Lf }
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.rs has now crossed 2000 lines. We should have done this already, but can we start splitting this up? Let's move the eol functions into their own eol mod, please

_ => {}
}

// 2) .gitattributes, core.eol, core.autocrlf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting that I'm a bit conflicted about whether this config or patch contents should take precedence for existing files. I think the performance of not executing git commands on every patch is likely worth it, but given how core apply_patch is I'd like to be exceedingly thoughtful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Would you have a issue with some simple caching? Heres what I am thinking:

Precedence and rationale:
For existing files, CLI/env wins; otherwise we use the original file’s EOL from the patch/original buffer (no git), which preserves intent and avoids I/O.
For new files, CLI/env → .gitattributes (path‑specific) → core.eol → core.autocrlf → OS native → detect; this matches repo policy and CI while staying deterministic.

Performance:
Hot path for existing files does zero git calls and no extra file re‑reads (we infer from bytes already in memory or the hunk’s adjacent lines).
New‑file policy does at most two cheap git queries per path (attrs + config), memoized per run.

Edge handling:
Mixed EOL in existing files keeps the dominant EOL (we don’t auto‑“fix” mixed files); binaries or -text skip normalization; trailing newline presence is preserved exactly; previews stay LF‑normalized—only the final write normalizes.

Caching safety:
Per‑process, in‑memory caches keyed by (repo_root, rel_path) with invalidation if the patch touches any .gitattributes; on any git error we fall back by precedence (no risk of stale/corrupt state).

If this works I can easily roll it in.

…ted caching

Context:
* Address Dylan’s review on EOL handling: extract logic into eol.rs, add a clear CLI/env override, and ensure precedence covers .gitattributes, core.eol, core.autocrlf, and OS native. Preserve trailing newline presence exactly and skip normalization for binary/-text. Add an opt‑in caching layer to reduce repeated git calls per run.

Change:
* Extracted EOL types + helpers into src/eol.rs (detect, normalize, git helpers, OS native).
* Implemented precedence: CLI/env > .gitattributes (typed; binary/-text -> Unknown) > core.eol > core.autocrlf > OS native; plus detect‑from‑patch for new files in non‑git dirs or when --assume-eol=detect.
* Standalone binary now uses clap; new -E/--assume-eol {lf|crlf|git|detect}; CLI wins over env.
* Added process‑local, feature‑gated caches for git lookups (core.eol, core.autocrlf, check‑attr) with invalidation on .gitattributes Add/Update/Delete.
* Strengthened Windows handling: repo root canonicalization, forward‑slash rel paths, and resilient rel calculation; ensured attrs win absent overrides.
* Tests: added/updated unit + integration tests for detection, normalization, precedence, detect fallback, CLI/env interaction, Windows path behavior, and caching hit/miss + invalidation.

Risk:
* Low; behavior identical except where explicitly specified (detect fallback for new files). Caches are in‑memory and feature‑gated (default on). No sandbox or shared crates touched.

Tests:
* cargo test -p codex-apply-patch (with cache): 54 passed.
* cargo test -p codex-apply-patch --no-default-features (cache off): 52 passed.
* Workspace run shows an existing app-server user-agent test failure; unrelated to apply‑patch.

Verification:
* Ran cargo fmt, cargo clippy (scoped), and tests as above. Confirmed invalidation triggers on .gitattributes changes and that CLI ‘detect’ overrides repo policy as intended.
@cpjet64
Copy link
Contributor Author

cpjet64 commented Oct 28, 2025

@dylan-hurd-oai Commit pushed with requested changes. Please let me know if I am on the right track or if theres anything else you would prefer.

EOL Policy Implementation

  • Single source of truth: All EOL policy and git logic consolidated into eol.rs, centralizing decisions for easier testing and evolution.

  • Explicit, test-backed precedence:

    • CLI/env.gitattributes (typed; binary/-text → skip) → core.eolcore.autocrlf → OS native
    • Narrow detect-from-patch fallback only for new files (non-git or when explicitly requested)
  • Attribute parsing kept specific: .gitattributes handling is localized and only interprets eol/text/binary, per the "make it specific" guidance.

  • No re-reads for existing files: EOL is inferred from original in-memory bytes, preserving trailing newline presence exactly.

  • Binary/-text respected: When attributes indicate binary or -text, normalization is skipped entirely.

  • Windows resilience:

    • Normalize relative paths (forward slashes, lowercased on Windows)
    • Canonicalize repo roots
    • Ensure .gitattributes wins absent overrides
  • Performance without behavior change:

    • Cache is behind a compile-time feature gate for opt-in/out flexibility
    • Memory-only with scoped invalidation on .gitattributes changes
    • Does not alter precedence or outputs
    • Enabled by default for this branch; disable via --no-default-features to compare behavior/performance side-by-side
    • Highly recommend implementing the cache functionality as during my local testing on my large rappct project using cargo run it was a significant reduction in process time though it wasnt as noticeable on any of my small projects

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Patched files have mixed line endings on Windows

2 participants