Fix occasional hangs when MacOS tries to invoke some callbacks on a dying thread #20
This seems to happen when MacOS tries to invoke an 'autocompletion' which causes Emacs to read from the current buffer. The problem with this is that MacOS can try to do this at random times, including on a thread that is currently dying. Perhaps it's more appropriate to fix these handlers to defer to the main thread somehow, but this fix appears to work.
|
I'm intrigued by your fork! Thanks very much for contributing. Apologies for all the questions here; there's much about emacs-mac I do not know. I haven't seen any such hangs in months of using this experimental branch. What MacOS version are you on? Is there any sort of reproducer possible? Does it happen with I have indeed noticed "ghost" predictive text in apps like Mail, Messages, and Safari lately, so perhaps this is related to the random I did just yesterday notice the setting in |
I'm on the 26 beta, and the issues seem to have started there, but I think only because these random calls to
I can reproduce this from HEAD of this repo :)
Sure: ( Process 46390 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x90)
frame #0: 0x00000001001f98c4 Emacs`mac_ax_selected_text_range [inlined] BUF_BEGV(buf=0x0000000bf1445800) at buffer.h:882:18 [opt]
879 INLINE ptrdiff_t
880 BUF_BEGV (struct buffer *buf)
881 {
-> 882 return (buf == current_buffer ? BEGV
883 : NILP (BVAR (buf, begv_marker)) ? buf->begv
884 : marker_position (BVAR (buf, begv_marker)));
885 }
Target 0: (Emacs) stopped.
warning: Emacs was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x90)
* frame #0: 0x00000001001f98c4 Emacs`mac_ax_selected_text_range [inlined] BUF_BEGV(buf=0x0000000bf1445800) at buffer.h:882:18 [opt]
frame #1: 0x00000001001f98b8 Emacs`mac_ax_selected_text_range [inlined] mac_get_selected_range(w=<unavailable>, range=<unavailable>) at macterm.c:4973:20 [opt]
frame #2: 0x00000001001f98b0 Emacs`mac_ax_selected_text_range(f=0x0000000bfe4eab98, range=location=51544614720 length=51539828224) at macterm.c:5283:3 [opt]
frame #3: 0x0000000100236ec4 Emacs`-[EmacsMainView selectedRange](self=<unavailable>, _cmd=<unavailable>) at macappkit.m:7312:3 [opt]
frame #4: 0x00000001918fc790 AppKit`-[NSTextInputContext(NSInputContext_WithCompletion) selectedRangeWithCompletionHandler:] + 80
frame #5: 0x0000000190c0b954 AppKit`-[NSTextInputContext selectedRange] + 148
frame #6: 0x0000000191673988 AppKit`-[NSAutoFillHeuristicController _isAbleToHandleAutoFillForTextInputContext:] + 60
frame #7: 0x00000001916704f8 AppKit`-[NSAutoFillHeuristicController showOrHideAutoFillForCurrentTextInputContextIfAppropriate] + 204
frame #8: 0x000000018ded0620 Foundation`__NSFireDelayedPerform + 372
frame #9: 0x000000018c7642b8 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 32
frame #10: 0x000000018c763f78 CoreFoundation`__CFRunLoopDoTimer + 980
frame #11: 0x000000018c763af0 CoreFoundation`__CFRunLoopDoTimers + 280
frame #12: 0x000000018c75489c CoreFoundation`__CFRunLoopRun + 1840
frame #13: 0x000000018c812b78 CoreFoundation`_CFRunLoopRunSpecificWithOptions + 564
frame #14: 0x000000018e91dca4 Foundation`-[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 212
frame #15: 0x000000010024a238 Emacs`__mac_select_block_invoke_2.1424(.block_descriptor=<unavailable>) at macappkit.m:16987:5 [opt]
frame #16: 0x0000000100226d48 Emacs`-[NSApplication(self=0x0000000bff084000, _cmd=<unavailable>, block=<unavailable>) stopAfterCallingBlock:] at macappkit.m:501:3 [opt]
frame #17: 0x000000018df08d10 Foundation`__NSFirePerformWithOrder + 296
frame #18: 0x000000018c754f2c CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 36
frame #19: 0x000000018c754e14 CoreFoundation`__CFRunLoopDoObservers + 548
frame #20: 0x000000018c754480 CoreFoundation`__CFRunLoopRun + 788
frame #21: 0x000000018c812b78 CoreFoundation`_CFRunLoopRunSpecificWithOptions + 564
frame #22: 0x00000001990e9874 HIToolbox`RunCurrentEventLoopInMode + 316
frame #23: 0x00000001990ecaf4 HIToolbox`ReceiveNextEventCommon + 456
frame #24: 0x0000000199276318 HIToolbox`_BlockUntilNextEventMatchingListInMode + 48
frame #25: 0x0000000190f811a0 AppKit`_DPSBlockUntilNextEventMatchingListInMode + 236
frame #26: 0x0000000190addf9c AppKit`_DPSNextEvent + 588
frame #27: 0x00000001914e385c AppKit`-[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 688
frame #28: 0x00000001914e3568 AppKit`-[NSApplication(NSEventRouting) nextEventMatchingMask:untilDate:inMode:dequeue:] + 72
frame #29: 0x0000000190ad6810 AppKit`-[NSApplication run] + 368
frame #30: 0x0000000100226df0 Emacs`-[NSApplication(self=0x0000000bff084000, _cmd=<unavailable>, block=<unavailable>) runTemporarilyWithBlock:] at macappkit.m:521:3 [opt]
frame #31: 0x000000010024a030 Emacs`__mac_select_block_invoke.1423 [inlined] mac_within_app(block=<unavailable>) at macappkit.m:532:5 [opt]
frame #32: 0x000000010024a024 Emacs`__mac_select_block_invoke.1423(.block_descriptor=0x0000000170795498) at macappkit.m:16968:4 [opt]
frame #33: 0x000000010024aa80 Emacs`mac_gui_loop at macappkit.m:16591:2 [opt]
frame #34: 0x0000000100262de0 Emacs`main.cold.1 at macappkit.m:17223:3 [opt]
frame #35: 0x000000010024a788 Emacs`main(argc=<unavailable>, argv=0x000000016fdfeb58) at macappkit.m:17221:3 [opt]
frame #36: 0x000000018c301854 dyld`start + 6256
That could be done too, but I'd guess there are other reasons for which macos might randomly call
Screen.Recording.2025-06-14.at.16.23.26.mov |
|
I forgot to follow this one up. Is the fix already active in the |
|
This isn't in the |
|
I think we need to dig a bit deeper. It looks like there's an explicit flag Let's see if we can determine why To me it looks like nothing in
(progn
  (message "Commencing with threads: %S" (all-threads))
  (let* ((var 0)
         (thread (make-thread
                  (lambda ()
                    (dotimes (i 3)
                      (message
                       "Waiting %d %S (%S)"
                       i (current-thread) (all-threads))
                      (dotimes (j 10000000)
                        (cl-incf var))))
                  "Test Thread")))
    (thread-join thread)
    (message "Complete with current thread: %S (%S)"
             (current-thread) (all-threads)))
  nil)
It's possible you've identified a race condition in |
|
It does seem This flag is only set when inside I think this race condition is impossible to trigger from elisp/ within emacs. Only one N.B. There are two 'global locks':
I'll have a look into what NS Emacs does, it might be the case that emacs-mac is the only build which has this behaviour of buffer access at arbitrary times, but NS Emacs might have a more elegant solution. |
|
Sorry I missed this last week. Thanks for your thoughts. It's quite a convoluted system. This may be the key comment:
/* Restrict/unrestrict buffer and glyph matrix access from the GUI
   thread to the case that no Lisp thread is running. */
I see this is done during I do wonder why we have a special Key insight: I guess the idea is that the GUI thread is "special" and simply tries to coordinate with all the lisp threads, but never wants to be the What do you think the value of
To me it looks like there is in fact just one
if (mac_try_buffer_and_glyph_matrix_access ())
  ...
mac_end_buffer_and_glyph_matrix_access ();
BTW, is it your understanding that the main (lisp) thread gets the lock back in To find my way back to the topic of this PR, I'm wondering whether we should instead update We'd need to bless it with a really good comment so people know why we did that... |
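For readers following the try/end bracket discussed above, here is a minimal sketch of the general pattern, not the emacs-mac implementation. It assumes the GUI side only ever attempts a non-blocking acquire and skips its work when the lock is busy. All names (`global_lock_sketch`, `gui_try_access`, `gui_end_access`, `gui_callback_sketch`) are hypothetical, and a C11 `atomic_flag` stands in for whatever lock the real code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock that Lisp threads hold while running. */
atomic_flag global_lock_sketch = ATOMIC_FLAG_INIT;

/* True while the GUI side is inside a try/end bracket. */
bool gui_holding_sketch = false;

/* Counts how often the GUI side actually got to read buffer state. */
int gui_reads = 0;

/* Non-blocking attempt: returns false immediately if the lock is taken. */
bool gui_try_access(void)
{
  if (atomic_flag_test_and_set(&global_lock_sketch))
    return false;
  gui_holding_sketch = true;
  return true;
}

/* Must only be called after a successful gui_try_access. */
void gui_end_access(void)
{
  assert(gui_holding_sketch);
  gui_holding_sketch = false;
  atomic_flag_clear(&global_lock_sketch);
}

/* A GUI callback that skips its work instead of waiting for the lock. */
bool gui_callback_sketch(void)
{
  if (!gui_try_access())
    return false;   /* busy: report "nothing to say" rather than block */
  gui_reads++;      /* buffer and glyph matrix reads would go here */
  gui_end_access();
  return true;
}
```

The point of the design is that the GUI thread never waits: if a Lisp thread holds the lock, the callback returns early and the caller falls back to a harmless default.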
|
To be concrete, can you try out the proposed change I just pushed to this PR branch? I think it should do the same as your fix. |
4cd5934 to 8932839
8932839 to f905adc
Ben, did you get a chance to test my probably equivalent fix pushed to this PR? |
|
Heya, sorry I'd forgotten about this. I've just tested this now and it also appears to work.
Ah yeah, I see this now.
I think you're completely correct here.
Yeah, I'm pretty sure the flow is:
|
|
Great, thanks. This is a useful table. We should expand it to include what the real main thread (GUI thread) is doing: trying to find windows where the global lock is available so it can hop in. I hope our fix isn't too heavy-handed. I think the only reason the GUI thread tries (lightly) to acquire the global lock is to do things with buffers, etc. So it makes sense that it needs a I'll merge this and you can pull this into your master branch. BTW, I presume where we differed in how we implemented the initial merge with upstream |
I've noticed some hangs recently where Emacs was becoming stuck in a signal handling loop - it would segfault, and then the signal handler would segfault...
I traced this to the following sequence of events:
1. macOS calls NSView::selectedRange at random times, seemingly for some sort of text completion UI that never appears.
2. This calls BUF_BEGV in mac_get_selected_range, at which point it would segfault when attempting to access current_thread->m_current_buffer, due to current_thread being NULL.
The cause for this is that when an Emacs thread finishes its lisp function, it performs some cleanup, part of which is setting current_thread = NULL here (current_thread will then be updated the next time another thread takes the global lock).
I've fixed this by simply updating mac_try_buffer_and_glyph_matrix_access such that it returns false when current_thread is NULL. This seems to be okay, as selectedRange already seems to have some edge case handling where it returns an empty range.
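To make the shape of the fix concrete, here is a rough sketch under stated assumptions; it is not the actual patch. The struct layouts, the `_sketch` names, the `gui_access_allowed` flag, and the use of `{0, 0}` as the "empty range" are all hypothetical stand-ins. The only behaviour taken from the description above is that the access check returns false while `current_thread` is NULL, and that the caller then returns an empty range instead of touching the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for Emacs internals; the real structures differ. */
struct buffer_sketch { long begv; };
struct thread_sketch { struct buffer_sketch *m_current_buffer; };
struct range_sketch { long location; long length; };

/* NULL between a Lisp thread's cleanup and the next thread taking the
   global lock, which is the window in which the crash occurred. */
struct thread_sketch *current_thread_sketch = NULL;

/* Assumed flag: true only while no Lisp thread is running. */
bool gui_access_allowed = true;

/* Refuse access outright when there is no current thread to read from. */
bool try_buffer_access_sketch(void)
{
  if (current_thread_sketch == NULL)
    return false;
  return gui_access_allowed;
}

/* Caller side: fall back to an empty range when access is refused. */
struct range_sketch selected_range_sketch(void)
{
  struct range_sketch empty = { 0, 0 };
  if (!try_buffer_access_sketch())
    return empty;
  assert(current_thread_sketch->m_current_buffer != NULL);
  struct range_sketch r = { current_thread_sketch->m_current_buffer->begv, 0 };
  return r;
}
```

With the guard in place, a callback that arrives while `current_thread_sketch` is NULL never dereferences it; the cost is that macOS occasionally gets an empty answer.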