Problem: about 1000 nodes of El Capitan got into a state where the rank 0 broker considered them disconnected, while from the nodes' point of view the rank 0 broker was simply unresponsive.
Specifically, on rank 0:
[root@elcap1:conf.d]# flux overlay status |grep elcap1124
├─ 820 elcap1124: lost lost connection
but on the node, everything seemed OK, except that flux commands that needed to contact rank 0 would hang.
The actual TCP connection between the node and rank 0 appeared to be in the ESTABLISHED state on both ends. Here is elcap12119 (another node in that state):
tcp 0 138488 eelcap12119:46878 eelcap1:8050 ESTABLISHED 188356/broker
and the same connection on elcap1:
tcp 246536 0 eelcap1:8050 eelcap12119:46878 ESTABLISHED 3180474/broker
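Assuming the two lines above are netstat output in the usual column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name), the node has 138488 bytes sitting in its send queue and elcap1 has 246536 bytes sitting unread in its receive queue for the same connection. The following is a minimal sketch, not part of flux, for picking out such connections from saved netstat output. The names `TcpConn`, `parse_netstat_line`, and `has_backlog` are hypothetical helpers, and a nonzero queue at a single instant is only a hint that the peer is not draining the socket, not proof.

```python
from typing import NamedTuple, Optional


class TcpConn(NamedTuple):
    recv_q: int   # bytes received by the kernel but not yet read by the process
    send_q: int   # bytes queued to send and not yet acknowledged by the peer
    local: str
    remote: str
    state: str
    owner: str    # "PID/Program name" column


def parse_netstat_line(line: str) -> Optional[TcpConn]:
    """Parse one TCP line, assuming the standard netstat column order."""
    fields = line.split()
    if len(fields) < 7 or not fields[0].startswith("tcp"):
        return None
    try:
        recv_q, send_q = int(fields[1]), int(fields[2])
    except ValueError:
        return None  # header line or unexpected format
    return TcpConn(recv_q, send_q, fields[3], fields[4], fields[5], fields[6])


def has_backlog(conn: TcpConn) -> bool:
    """True for an ESTABLISHED connection with data stuck in either queue."""
    return conn.state == "ESTABLISHED" and (conn.recv_q > 0 or conn.send_q > 0)


if __name__ == "__main__":
    # The two lines quoted in this report.
    samples = [
        "tcp 0 138488 eelcap12119:46878 eelcap1:8050 ESTABLISHED 188356/broker",
        "tcp 246536 0 eelcap1:8050 eelcap12119:46878 ESTABLISHED 3180474/broker",
    ]
    for line in samples:
        conn = parse_netstat_line(line)
        if conn is not None and has_backlog(conn):
            print(f"{conn.local} -> {conn.remote}: "
                  f"recv_q={conn.recv_q} send_q={conn.send_q} ({conn.owner})")
```

Running this over netstat output collected on rank 0 would give a rough list of peers whose data the broker has not yet read, which could be compared against the nodes that `flux overlay status` reports as lost.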
Stopping flux on the compute node ran into the systemd stop timeout, but the problems went away immediately after the restart.