1000 elcap nodes are shown as "lost connection" on rank 0 but compute node thinks it is still connected #6626

@garlick

Description

Problem: about 1000 nodes of El Capitan got into a state where the rank 0 broker thought they were disconnected, but from the nodes' point of view, the rank 0 broker was simply unresponsive.

Specifically, on rank 0:

[root@elcap1:conf.d]# flux overlay status |grep elcap1124
├─ 820 elcap1124: lost lost connection

but on the node everything seemed OK, except that flux commands which needed to contact rank 0 would hang.
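
A quick way to see the split view from an affected compute node (a sketch; the prompt, the 30 second timeout, and the comments are illustrative rather than taken from the incident) is to compare a request served by the local broker with one that must be routed through rank 0:

[root@elcap12119:~]# flux getattr rank                  # local request; returns promptly
[root@elcap12119:~]# timeout 30 flux ping --count=1 0   # routed through rank 0; hangs

The first command answers normally, while the second hangs until timeout(1) kills it, matching the behavior described above.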

The actual TCP connection to the node appeared to be in the ESTABLISHED state. Here is elcap12119 (another node in the same state):

tcp        0 138488 eelcap12119:46878       eelcap1:8050            ESTABLISHED 188356/broker

and the same connection as seen from elcap1:

tcp   246536      0 eelcap1:8050            eelcap12119:46878       ESTABLISHED 3180474/broker
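
Note the nonzero queues: the 246536 bytes in rank 0's Recv-Q are data that have reached the kernel on elcap1 but have not been read by the rank 0 broker, and the 138488 bytes in the compute node's Send-Q are data backed up behind them, which is consistent with the rank 0 broker no longer servicing this peer even though the TCP connection itself is healthy. As an additional check (a sketch; assumes iproute2's ss is available, and 46878 is just the peer port from the output above), the same socket can be inspected on elcap1 with its timer state included:

[root@elcap1:conf.d]# ss -tnpo 'sport = :8050' | grep 46878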

Stopping flux on the compute node ran into the systemd stop timeout, but the problems went away immediately once flux was restarted.
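
For the record, recovery amounted to restarting the broker on the affected node (a sketch, assuming the compute node broker runs under the flux systemd unit, which the mention of a systemd timeout implies; the follow-up ping is a hypothetical sanity check):

[root@elcap12119:~]# systemctl restart flux      # the stop phase is where the systemd timeout was hit
[root@elcap12119:~]# flux ping --count=1 0       # should now receive a reply from rank 0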
