Commit 755fabf

rfc21: add offline job state
1 parent 379966b commit 755fabf

3 files changed: +130 -89 lines changed

data/spec_21/states.dot

Lines changed: 4 additions & 1 deletion
@@ -12,7 +12,7 @@ digraph states {
 DEPEND;
 PRIORITY;
 SCHED;
-RUN;
+{rank=same; RUN; OFFLINE;}
 CLEANUP;
 }

@@ -25,6 +25,9 @@ digraph states {

 SCHED -> PRIORITY [label="flux-restart"]

+RUN -> OFFLINE [xlabel="disconnect"]
+OFFLINE -> RUN [xlabel="reconnect"]
+
 edge [weight=0 color="red"];

 DEPEND -> CLEANUP [label="exception"];
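
The edges in this graph can be read as a transition table keyed by state and edge label. The following is a minimal Python sketch, not part of the commit: it lists only the four edges visible in the hunk above, so the table is deliberately partial, and the names ``TRANSITIONS`` and ``next_state`` are hypothetical.

```python
# Partial transition table: only the edges visible in the diff hunk above.
# The real graph in states.dot contains more edges than are shown here.
TRANSITIONS = {
    ("SCHED", "flux-restart"): "PRIORITY",
    ("RUN", "disconnect"): "OFFLINE",
    ("OFFLINE", "reconnect"): "RUN",
    ("DEPEND", "exception"): "CLEANUP",
}


def next_state(state: str, label: str) -> str:
    """Return the state reached from `state` along the edge named `label`.

    Raises ValueError for (state, label) pairs missing from the partial table.
    """
    try:
        return TRANSITIONS[(state, label)]
    except KeyError:
        raise ValueError(f"no known edge from {state} labeled {label!r}") from None


if __name__ == "__main__":
    state = next_state("RUN", "disconnect")
    print(state)
    print(next_state(state, "reconnect"))
```

Keying on the (state, label) pair keeps the new RUN/OFFLINE round trip explicit: ``disconnect`` is only defined from RUN, and ``reconnect`` only from OFFLINE.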

data/spec_21/states.svg

Lines changed: 85 additions & 86 deletions

spec_21.rst

Lines changed: 41 additions & 2 deletions
@@ -110,6 +110,14 @@ RUN
   job shells have been started, and a ``finish`` event once all the job shells
   have exited. The state transitions to CLEANUP.

+OFFLINE
+  The job was started, but the job manager has lost track of it due
+  to an error (for example, a system crash). The job manager is
+  attempting to reconnect to the running job. A ``disconnect``
+  event is logged to indicate transition into this state.
+  A ``reconnect`` event is logged when tracking has been
+  reestablished and the job re-enters the RUN state.
+
 CLEANUP
   The job has completed or an exception has occurred. Under normal termination,
   the job manager waits for notification from the exec service that job
@@ -133,10 +141,10 @@ PENDING
   The job is in DEPEND, PRIORITY, or SCHED states.

 RUNNING
-  The job is in RUN or CLEANUP states.
+  The job is in RUN, OFFLINE, or CLEANUP states.

 ACTIVE
-  The job is in DEPEND, PRIORITY, SCHED, RUN, or CLEANUP states.
+  The job is in DEPEND, PRIORITY, SCHED, RUN, OFFLINE, or CLEANUP states.


 Exceptions
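
The updated virtual state definitions in this hunk amount to set membership. The following Python sketch, not part of the commit, encodes the three virtual states as they read after the change; the names ``VIRTUAL_STATES`` and ``virtual_states`` are hypothetical and chosen only for illustration.

```python
# Virtual states as defined after this change (OFFLINE added to RUNNING and ACTIVE).
PENDING = frozenset({"DEPEND", "PRIORITY", "SCHED"})
RUNNING = frozenset({"RUN", "OFFLINE", "CLEANUP"})
# ACTIVE lists exactly the members of PENDING and RUNNING combined.
ACTIVE = PENDING | RUNNING

VIRTUAL_STATES = {"PENDING": PENDING, "RUNNING": RUNNING, "ACTIVE": ACTIVE}


def virtual_states(state: str) -> set:
    """Return the names of all virtual states that contain `state`."""
    return {name for name, members in VIRTUAL_STATES.items() if state in members}


if __name__ == "__main__":
    print(sorted(virtual_states("OFFLINE")))
```

Under these definitions a job in OFFLINE is reported as both RUNNING and ACTIVE, the same as a job in RUN.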
@@ -391,6 +399,37 @@ status
     {"timestamp":1552594348.0,"name":"epilog-finish","context":{"description":"/usr/sbin/job-epilog.sh", "status":0}}


+Disconnect Event
+^^^^^^^^^^^^^^^^
+
+The job manager has lost track of a running job.
+
+The following keys are OPTIONAL in the event context object:
+
+id
+  (long long) job ID
+
+Example:
+
+.. code:: json
+
+    {"timestamp":1636747761.5495925,"name":"disconnect","context":{"id":341835776000}}
+
+
+Reconnect Event
+^^^^^^^^^^^^^^^
+
+The job manager has reconnected to the job shells.
+
+The context SHALL be empty.
+
+Example:
+
+.. code:: json
+
+    {"timestamp":1636747761.827836,"name":"reconnect"}
+
+
 Free Event
 ^^^^^^^^^^
