Skip to content

fix: disable LTO on mods lib and synchronize entity processing#125

Closed
Guffawaffle wants to merge 1 commit intonetniV:mainfrom
Guffawaffle:fix/lto-and-sync-crashes
Closed

fix: disable LTO on mods lib and synchronize entity processing#125
Guffawaffle wants to merge 1 commit intonetniV:mainfrom
Guffawaffle:fix/lto-and-sync-crashes

Conversation

@Guffawaffle
Copy link
Copy Markdown

Summary

Two latent crash bugs found during hotkey feature development. Both cause intermittent crashes that are extremely difficult to reproduce because they depend on code layout, timing, and optimization decisions.

Bug 1: LTO/LTCG breaks SPUD detour hooks

Symptom: Intermittent C0000005 (access violation) at detour call sites. Adding or removing any struct member changes whether it crashes because it shifts code layout.

Root Cause: The mods static library compiles with /GL (Whole Program Optimization) by default, emitting MSIL bytecode. When the linker performs LTCG on the final DLL, it re-optimizes and re-lays-out function bodies — but SPUD's detour trampolines were patched against pre-LTCG addresses. The trampoline jumps into relocated/rewritten instructions.

Fix: set_policy("build.optimization.lto", false) on the mods target in mods/xmake.lua.

Why this wasn't caught before: LTO is a link-time concern. Everything compiled and linked fine. Crashes only manifested at runtime in code paths that got relocated, and the crash location moved depending on what code was added/removed — making it look like data corruption rather than codegen.

Bug 2: Detached threads cause stack corruption

Symptom: Intermittent C0000409 (GS cookie / stack buffer overrun) in submit_async or RTC handler.

Root Cause: std::thread().detach() in submit_async creates fire-and-forget threads accessing shared state without synchronization. When the OS reclaims a detached thread's stack while it's still executing, or multiple threads race on shared containers, the GS security cookie fails.

Fix: Replace std::thread().detach() with synchronous execution. Entity processing is fast enough that blocking is not a problem.

Other changes

  • .vscode/settings.json removed from tracking (personal editor config)
  • .gitignore updated to exclude .smartergpt/

Testing

  • Mod loads, hooks install, hotkeys function with both fixes
  • Stress-tested config struct modifications that previously triggered Bug 1 — no crashes
  • Zero behavioral changes

- Disable LTO (build.optimization.lto) on mods static lib to prevent
  LTCG codegen from breaking SPUD detour hooks at runtime (C0000005)
- Replace detached std::thread in submit_async and RTC handler with
  synchronous execution to fix GS cookie stack corruption (C0000409)
- Remove .vscode/settings.json from tracking (personal editor config)
- Add .smartergpt/ to .gitignore
@Guffawaffle
Copy link
Copy Markdown
Author

Why disable LTO instead of fixing the root cause?

The /GL + LTCG pipeline works by deferring final code generation to link time. Instead of emitting machine code per translation unit, MSVC emits an intermediate representation, then at link time LTCG sees all code at once and re-optimizes globally — inlining across translation units, merging identical function bodies, eliminating "unused" functions, and reshaping prologues.

The problem: SPUD installs detours by patching function prologues at runtime, after linking is long finished. LTCG has no visibility into these runtime patches. When it inlines or eliminates a function that SPUD later tries to detour, the patch either lands on the wrong code or targets an address that no longer exists. The compiler is making correct optimizations based on incomplete information.

Alternatives considered

Approach Drawback
__declspec(noinline) on detour targets Fragile — miss one and you get a mystery crash. Must be maintained as new hooks are added.
LTO-aware detour library SPUD doesn't support this; would require upstream changes or a library swap.
Disable LTO on the mods target only Chosen. Single policy line, prevents the entire class of breakage, negligible perf impact for a small injected DLL.

This is a pragmatic workaround, not a root-cause fix. If SPUD gains LTO compatibility in the future, the set_policy("build.optimization.lto", false) line can be removed.

@Guffawaffle
Copy link
Copy Markdown
Author

QA Checklist

Prerequisites

  • Built with xmake build stfc-community-patch (releasedbg mode)
  • Verified no /GL or /LTCG in compiler/linker flags (xmake show -t stfc-community-patch)

Bug 1: LTO Disabled

  • set_policy("build.optimization.lto", false) present in mods/xmake.lua
  • Game loads past 20% splash screen without crashing
  • No C0000005 (access violation) at detour call sites
  • Repeat launch 3+ times to confirm stability

Bug 2: Synchronous Entity Processing

  • std::thread().detach() removed from submit_async lambda in HandleEntityGroup
  • Entity processing runs synchronously (inline in the hook)
  • No C0000409 (GS cookie / stack buffer overrun) crashes
  • Game remains responsive — no noticeable frame drops during entity processing

Sync Functionality

  • Sync hooks install successfully (check log for "Patching 14 of 15 (SyncPatches)")
  • ProcessResultInternal detour fires (entity groups logged)
  • Data uploads to configured sync target return 200 OK

Regression

  • All hotkeys still function (section navigation, ship select, toggles)
  • Config loads correctly from TOML
  • community_patch_runtime.vars generated correctly
  • Game shutdown is clean (no hung threads, no crash on exit)

@Guffawaffle
Copy link
Copy Markdown
Author

Dev Note: Why entity processing must be synchronous

During testing, we explored reducing the game-thread footprint of the synchronous entity processing fix. Documenting findings here for future reference.

What was tried

Replaced the synchronous submit_async calls in HandleEntityGroup with a single persistent worker thread using a queue (same architecture as TargetWorker):

  • Game thread: null check → byte copy → mutex lock → queue push → notify CV (~microseconds)
  • Worker thread: protobuf parse → diff → JSON serialize → queue_data() (the heavy work)
  • All Config::Get() checks remained on the game thread before queueing

This was a middle ground between the original per-call std::thread().detach() (crash-prone) and fully-synchronous processing (blocks game thread).

Result

Same C0000005 crash at ~20% game load. Identical to the original detached-thread crash.

Conclusion

The issue is not specific to detached threads or Config::Get() access patterns. Any processing on a separate thread stack — whether detached, persistent worker, or joined — triggers the crash. The SPUD detour trampoline is sensitive to the call stack layout in a way that requires entity processing to execute on the same stack frame as the original hooked function.

Synchronous inline processing is the only working approach. The actual frame cost is negligible in practice — it's just protobuf parse + diff + JSON serialize + queue push (no I/O). HTTP sending is already async via TargetWorker threads, which are unaffected because they don't run inside a SPUD detour call chain.

@netniV
Copy link
Copy Markdown
Owner

netniV commented Apr 14, 2026

We do NOT want to run the processing synchronously, that is a bad idea.

@netniV
Copy link
Copy Markdown
Owner

netniV commented Apr 14, 2026

Also, you're PRs are NOT clean... you have mixed up various changes and bled them into other PRs

@Guffawaffle Guffawaffle reopened this Apr 14, 2026
@Guffawaffle
Copy link
Copy Markdown
Author

Guffawaffle commented Apr 14, 2026

Yeah this one cascades into others because, without it, I crash on init on pretty much anything I touch so this one does get pulled in. I add a "depends on" to kinda point towards this but it's not clean when looking at a PR.

I'll step back and examine more about this init crash tomorrow. I can clearly say, without this PR, any other changes I make become a crap shoot on if it will crash on init (20% load) or not. [edit: this might also be fixed by the thin hook > dispatcher > handlers method on the other ticket]

I'll step back to this with the lessons I picked up today and rework it. I've learned a lot. Likely shouldn't have tossed any of these as prs yet but, in the dev cycle I work on, this is just normal to get a pr open and then folks can start reviewing as you iterate. Helps to get early feedback imo.

@Guffawaffle
Copy link
Copy Markdown
Author

Closing — This did not full resolve the crash issues, or even noticably improve the experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants