Skip to content

Fix occasional hangs when MacOS tries to invoke some callbacks on a dying thread#20

Merged
jdtsmith merged 2 commits intojdtsmith:emacs-mac-30_1_expfrom
simmsb:simmsb/fix-hang-in-dying-thread
Jul 9, 2025
Merged

Fix occasional hangs when MacOS tries to invoke some callbacks on a dying thread#20
jdtsmith merged 2 commits intojdtsmith:emacs-mac-30_1_expfrom
simmsb:simmsb/fix-hang-in-dying-thread

Conversation

@simmsb
Copy link
Collaborator

@simmsb simmsb commented Jun 14, 2025

I've noticed some hangs recently where Emacs was becoming stuck in a signal handling loop - it would segfault, and then the signal handler would segfault...

I traced this to the following sequence of events:

  1. emacs-mac implements the buffer as a subclass of NSView, which seems to be what allows using the 'press and hold' function on words. MacOS seems to often call NSView::selectedRange at random times, seemingly for some sort of text completion UI that never appears.
  2. Emacs would be fine, up until calling BUF_BEGV in mac_get_selected_range, at which point it would segfault when attempting to access current_thread->m_current_buffer, due to current_thread being NULL.
  3. Emacs' segfault handler would then run, but then segfault itself, leading to a loop.

The cause for this is that when an Emacs thread finishes its lisp function, it performs some cleanup, part of which is setting current_thread = NULL here (current_thread will then be updated the next time another thread takes the global lock).

I've fixed this by simply updating mac_try_buffer_and_glyph_matrix_access such that it returns false when current_thread is null, this seems to be okay as selectedRange seems to already have some edge case handling where it returns an empty range

This seems to happen when MacOS tries to invoke an 'autocompletion'
which causes Emacs to read from the current buffer. The problem with
this is that MacOS can try to do this at random times, including on a
thread that is currently dying.

Perhaps it's more appropriate to fix these handlers to defer to the main
thread somehow, but this fix appears to work.
@jdtsmith
Copy link
Owner

I'm intrigued by your fork! Thanks very much for contributing. Apologies for all the questions here; there's much about emacs-mac I do not know.

I haven't seen any such hangs in months of using this experimental branch. What MacOS version are you on? Is there any sort of reproducer possible? Does it happen with Emacs -Q? At a glance I don't see that BUF_BEGV should access a thread. But the threading model between Emacs' green threads, and "real" Lisp/MainGUI threads is fairly unclear to me. Do you happen to have a traceback?

I have indeed noticed "ghost" predictive text in apps like Mail, Messages, and Safari lately, so perhaps this is related to the random selectedRange calls. If you disable Setting>Keyboard>TextInput/Region(U.S. for me)>Edit...>Show inline predictive text, does the bug go away (I have it enabled)? I'm wondering if we should instead (or in addition) disable the "ghost text" capability in our NSView, since it doesn't function anyway (and would no doubt conflict with Emacs' native auto-completion). If we can figure out how to do that, I'd like to (optionally) disable "Add period with double-spaces" setting, since it often gets in the way. I don't know how to do either.

I did just yesterday noticed the setting in EmacsMainView of ApplePressAndHoldEnabled=NO, and wondered what it was. Not sure how that would be related to the predictive text feature though.

@simmsb
Copy link
Collaborator Author

simmsb commented Jun 14, 2025

I haven't seen any such hangs in months of using this experimental branch. What MacOS version are you on?

I'm on the 26 beta, and the issues seem to have started there, but I think only because these random calls to selectedRange don't happen in prior MacOS versions.
Additionally there needs to be something creating short lived threads (those of the make-thread kind, and not forking 'threads'). One package that does this is diff-hl with async updates enabled.

What MacOS version are you on

I can reproduce this from HEAD of this repo :)

Do you happen to have a traceback?

Sure:

(current_buffer is a macro which expands to current_thread->m_current_buffer)

Process 46390 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x90)
    frame #0: 0x00000001001f98c4 Emacs`mac_ax_selected_text_range [inlined] BUF_BEGV(buf=0x0000000bf1445800) at buffer.h:882:18 [opt]
   879  INLINE ptrdiff_t
   880  BUF_BEGV (struct buffer *buf)
   881  {
-> 882    return (buf == current_buffer ? BEGV
   883            : NILP (BVAR (buf, begv_marker)) ? buf->begv
   884            : marker_position (BVAR (buf, begv_marker)));
   885  }
Target 0: (Emacs) stopped.
warning: Emacs was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x90)
  * frame #0: 0x00000001001f98c4 Emacs`mac_ax_selected_text_range [inlined] BUF_BEGV(buf=0x0000000bf1445800) at buffer.h:882:18 [opt]
    frame #1: 0x00000001001f98b8 Emacs`mac_ax_selected_text_range [inlined] mac_get_selected_range(w=<unavailable>, range=<unavailable>) at macterm.c:4973:20 [opt]
    frame #2: 0x00000001001f98b0 Emacs`mac_ax_selected_text_range(f=0x0000000bfe4eab98, range=location=51544614720 length=51539828224) at macterm.c:5283:3 [opt]
    frame #3: 0x0000000100236ec4 Emacs`-[EmacsMainView selectedRange](self=<unavailable>, _cmd=<unavailable>) at macappkit.m:7312:3 [opt]
    frame #4: 0x00000001918fc790 AppKit`-[NSTextInputContext(NSInputContext_WithCompletion) selectedRangeWithCompletionHandler:] + 80
    frame #5: 0x0000000190c0b954 AppKit`-[NSTextInputContext selectedRange] + 148
    frame #6: 0x0000000191673988 AppKit`-[NSAutoFillHeuristicController _isAbleToHandleAutoFillForTextInputContext:] + 60
    frame #7: 0x00000001916704f8 AppKit`-[NSAutoFillHeuristicController showOrHideAutoFillForCurrentTextInputContextIfAppropriate] + 204
    frame #8: 0x000000018ded0620 Foundation`__NSFireDelayedPerform + 372
    frame #9: 0x000000018c7642b8 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 32
    frame #10: 0x000000018c763f78 CoreFoundation`__CFRunLoopDoTimer + 980
    frame #11: 0x000000018c763af0 CoreFoundation`__CFRunLoopDoTimers + 280
    frame #12: 0x000000018c75489c CoreFoundation`__CFRunLoopRun + 1840
    frame #13: 0x000000018c812b78 CoreFoundation`_CFRunLoopRunSpecificWithOptions + 564
    frame #14: 0x000000018e91dca4 Foundation`-[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 212
    frame #15: 0x000000010024a238 Emacs`__mac_select_block_invoke_2.1424(.block_descriptor=<unavailable>) at macappkit.m:16987:5 [opt]
    frame #16: 0x0000000100226d48 Emacs`-[NSApplication(self=0x0000000bff084000, _cmd=<unavailable>, block=<unavailable>) stopAfterCallingBlock:] at macappkit.m:501:3 [opt]
    frame #17: 0x000000018df08d10 Foundation`__NSFirePerformWithOrder + 296
    frame #18: 0x000000018c754f2c CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 36
    frame #19: 0x000000018c754e14 CoreFoundation`__CFRunLoopDoObservers + 548
    frame #20: 0x000000018c754480 CoreFoundation`__CFRunLoopRun + 788
    frame #21: 0x000000018c812b78 CoreFoundation`_CFRunLoopRunSpecificWithOptions + 564
    frame #22: 0x00000001990e9874 HIToolbox`RunCurrentEventLoopInMode + 316
    frame #23: 0x00000001990ecaf4 HIToolbox`ReceiveNextEventCommon + 456
    frame #24: 0x0000000199276318 HIToolbox`_BlockUntilNextEventMatchingListInMode + 48
    frame #25: 0x0000000190f811a0 AppKit`_DPSBlockUntilNextEventMatchingListInMode + 236
    frame #26: 0x0000000190addf9c AppKit`_DPSNextEvent + 588
    frame #27: 0x00000001914e385c AppKit`-[NSApplication(NSEventRouting) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 688
    frame #28: 0x00000001914e3568 AppKit`-[NSApplication(NSEventRouting) nextEventMatchingMask:untilDate:inMode:dequeue:] + 72
    frame #29: 0x0000000190ad6810 AppKit`-[NSApplication run] + 368
    frame #30: 0x0000000100226df0 Emacs`-[NSApplication(self=0x0000000bff084000, _cmd=<unavailable>, block=<unavailable>) runTemporarilyWithBlock:] at macappkit.m:521:3 [opt]
    frame #31: 0x000000010024a030 Emacs`__mac_select_block_invoke.1423 [inlined] mac_within_app(block=<unavailable>) at macappkit.m:532:5 [opt]
    frame #32: 0x000000010024a024 Emacs`__mac_select_block_invoke.1423(.block_descriptor=0x0000000170795498) at macappkit.m:16968:4 [opt]
    frame #33: 0x000000010024aa80 Emacs`mac_gui_loop at macappkit.m:16591:2 [opt]
    frame #34: 0x0000000100262de0 Emacs`main.cold.1 at macappkit.m:17223:3 [opt]
    frame #35: 0x000000010024a788 Emacs`main(argc=<unavailable>, argv=0x000000016fdfeb58) at macappkit.m:17221:3 [opt]
    frame #36: 0x000000018c301854 dyld`start + 6256

I'm wondering if we should instead (or in addition) disable the "ghost text" capability in our NSView

That could be done too, but I'd guess there's other reasons for which macos might randomly call selectedRange, and these completions do actually work, you can see them if you use the accessibility keyboard:

Screen.Recording.2025-06-14.at.16.23.26.mov

@jdtsmith
Copy link
Owner

jdtsmith commented Jun 18, 2025

I forgot to follow this one up. Is the fix already active in the master_exp branch? If so we should probably merge it here too.

@simmsb
Copy link
Collaborator Author

simmsb commented Jun 22, 2025

This isn't in the master_exp branch yet, if you're happy with this change we can merge it into both

@jdtsmith
Copy link
Owner

jdtsmith commented Jun 22, 2025

I think we need to dig a bit deeper. It looks like there's an explicit flag mac_set_buffer_and_glyph_matrix_access_restricted which is designed to block GUI access while a lisp thread is running. What is happening on the lisp (real) thread while the main GUI thread is asking for the current selection, during the crash?

Let's see if we can determine why mac_set_buffer_and_glyph_matrix_access_restricted (true); is not getting called? Or perhaps it is, but then mac_try_buffer_and_glyph_matrix_access acquires the global lock anyway? In which case, why? And why has the main thread not reacquired the global_lock?

To me it looks like nothing in thread.c worries about current_thread == NULL anywhere, so it's worth understanding what (current-thread) and (all-threads) say when the fatal GUI selection request comes in. I just made a small test and I can't ever get current_thread == NULL; the main thread is always restored.

(progn
  (message "Commencing with threads: %S" (all-threads))
  (let* ((var 0)
         (thread (make-thread
                  (lambda ()
                    (dotimes (i 3)
                      (message
                       "Waiting %d %S (%S)"
                       i (current-thread) (all-threads))
                      (dotimes (j 10000000)
                        (cl-incf var))))
                  "Test Thread")))
    (thread-join thread)
    (message "Complete with current thread: %S (%S)"
             (current-thread) (all-threads)))
  nil)

It's possible you've identified a race condition in thread.c.

@simmsb
Copy link
Collaborator Author

simmsb commented Jun 23, 2025

It does seem mac_set_buffer_and_glyph_matrix_access_restricted is called, it actually seems this race condition can only occur when mac_buffer_and_glyph_matrix_access_restricted_p is true.

This flag is only set when inside mac_select when the gui thread is listening for input events, and there are running threads ((all-threads) returns non-nil).

I think this race condition is impossible to trigger from elisp/ within emacs. Only one make-thread thread can run at any given time, as a thread will acquire the global lock when it starts (only releasing it when it dies, or yields). The act of acquiring the global lock sets the current-thread global here.

N.B. There are two 'global locks':

  • One only present in emacs-mac: https://github.com/simmsb/emacs-mac-31/blob/978493630ada3d6a58e1658f9bd870399a3128e0/src/thread.c#L1151-L1154
    This is the lock mac_try_buffer_and_glyph_matrix_access acquires. It's a simple pthread mutex and there's nothing else happening during acquisition. As it's only used by mac_try_buffer_and_glyph_matrix_access when mac_buffer_and_glyph_matrix_access_restricted_p is set (which is only when the main thread is in mac_select), I think its main use is to synchronize buffer access coming from these macos callbacks.
  • The other is part of GNU Emacs, after acquiring it, current-thread is set to the current thread holding the lock.

run_thread only releases the global lock after finishing all its work, which normally results in a very small dead time before another thread acquires it (likely the main thread if there were no other make-thread threads). However if the main thread is currently in select it will not pick up the main lock until it finishes, leaving quite a long time where current_thread == NULL.

I'll have a look into what NS Emacs does, it might be the case that emacs-mac is the only build which has this behaviour of buffer access at arbitrary times, but NS Emacs might have a more elegant solution.

@jdtsmith
Copy link
Owner

Sorry I missed this last week. Thanks for your thoughts. It's quite a convoluted system. This may be the key comment:

/* Restrict/unrestrict buffer and glyph matrix access from the GUI
   thread to the case that no Lisp thread is running.  */

I see this is done during mac_select, like you say. I gather the idea is to prevent the GUI from looking at the contents of buffers/etc. while they are being updated in a lisp thread, which makes sense. I also gather our main GUI thread does not really care about the global lock some/much of the time, i.e. it can run independently except for certain times (I think that's different from how NS does it, but not entirely sure).

I do wonder why we have a special thread_try_acquire_global_lock just for us, which does not set current_thread. I understand that mutex_trylock does not block/wait in the current thread, whereas mutex_lock (which acquire_global_lock calls) does. So that seems valuable. But if we are trying to acquire the lock, and we do acquire the lock, shouldn't we update current_thread?

Key insight: I guess the idea is that the GUI thread is "special" and simply tries to coordinate with all the lisp threads, but never wants to be the current_thread (despite sometimes acquiring the lock).

What do you think the value of pthread_main_np is (as called here)? I think it's true only in what we call the main GUI thread. So our GUI thread is indeed a "real" thread (in fact the main one), at the same level as the main lisp thread, as well as all make-thread lisp threads. The difference is, the main GUI thread doesn't completely care about the global_lock, and doesn't pretend to have the capabilities of a lisp thread (buffer state and so on). So it never sets current_thread itself.

N.B. There are two 'global locks'

To me it looks like there is in fact just one global_lock mutex variable, but on mac we just try to acquire it differently, without blocking, which is what leaves current_thread unset, if we do so during the thin slice of time between when the (run_thread) current thread unlocks, and the main lisp thread picks the lock back up. Then we can run in the GUI thread fully independently of the current LISP thread, except when we can't. For those special times, we wrap code in

if (mac_try_buffer_and_glyph_matrix_access ())
  ...
  mac_end_buffer_and_glyph_matrix_access ();

BTW, is it your understanding that the main (lisp) thread gets the lock back in really_call_select?

To find my way back to the topic of this PR, I'm wondering whether we should instead update thread_try_acquire_global_lock itself, to not bother trying to acquire the global lock, and just return EBUSY if current_thread is NULL. I think that amounts to the same thing as your proposed fix, but I think the situation is we want to acquire the lock only from some current LISP thread.

We'd need to bless it with a really good comment so people know why we did that...

@jdtsmith
Copy link
Owner

Related question. Shouldn't this be gated by HAVE_MACGUI too?

emacs-mac/src/thread.c

Lines 35 to 39 in 2ef2f75

#if defined HAVE_GLIB && ! defined (HAVE_NS)
#include <xgselect.h>
#else
#define release_select_lock() do { } while (0)
#endif

This is relevant for really_call_select, which is what calls mac_select. Update: wrong, in fact the thread_select call gets replaced, here:

emacs-mac/src/xgselect.c

Lines 188 to 195 in 30901a2

#ifdef HAVE_MACGUI
mac_select (
#else
thread_select (pselect,
#endif
fds_lim,
&all_rfds, have_wfds ? &all_wfds : NULL, efds,
tmop, sigmask);

So nevermind.

@jdtsmith
Copy link
Owner

jdtsmith commented Jul 2, 2025

To be concrete, can you try out the proposed change I just pushed to this PR branch? I think it should do the same as your fix.

@jdtsmith jdtsmith force-pushed the simmsb/fix-hang-in-dying-thread branch from 4cd5934 to 8932839 Compare July 2, 2025 14:42
@jdtsmith jdtsmith force-pushed the simmsb/fix-hang-in-dying-thread branch from 8932839 to f905adc Compare July 2, 2025 21:32
@jdtsmith
Copy link
Owner

jdtsmith commented Jul 9, 2025

Ben, did you get a chance to test my probably equivalent fix pushed to this PR?

@simmsb
Copy link
Collaborator Author

simmsb commented Jul 9, 2025

Heya, sorry I'd forgot about this. I've just tested this now and it looks to also work.

To me it looks like there is in fact just one global_lock mutex variable, but on mac we just try to acquire it differently

Ah yeah, I see this now.

What do you think the value of pthread_main_np is (as called here)? I think it's true only in what we call the main GUI thread. So our GUI thread is indeed a "real" thread (in fact the main one), at the same level as the main lisp thread, as well as all make-thread lisp threads. The difference is, the main GUI thread doesn't completely care about the global_lock, and doesn't pretend to have the capabilities of a lisp thread (buffer state and so on). So it never sets current_thread itself.

I think you're completely correct here.

BTW, is it your understanding that the main (lisp) thread gets the lock back in really_call_select?

Yeah, I'm pretty sure the flow is:

Main thread other lisp thread <current thread>
really_call_select() Main thread
release_global_lock (); Main thread
acquire_global_lock (self); Other thread
<dies>: current_thread = NULL; NULL
release_global_lock(); NULL
<still in select> NULL
acquire_global_lock (self); Main thread

@jdtsmith
Copy link
Owner

jdtsmith commented Jul 9, 2025

Great, thanks. This is a useful table. We should expand it to include what the real main thread (GUI thread) is doing, trying to find windows where the global lock is available it can hop in. I hope our fix isn't too heavy-handed. I think the only reason the GUI thread tries (lightly) to acquire the global lock is to do things with buffers, etc. So it makes sense that it needs a current_thread to do so.

I'll merge this and you can pull this into your master branch. BTW, I presume where we differed in how we implemented the initial merge with upstream work branch — build flags, etc. — you standardized on the choice I made? If there are differences we should iron them out to avoid future merge difficulties.

@jdtsmith jdtsmith merged commit 821c110 into jdtsmith:emacs-mac-30_1_exp Jul 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants