Skip to content

Commit 841ebc3

Browse files
grondogarlick
authored andcommitted
rfc33: add new RFC for job execution module protocol
Problem: The distributed protocol between Flux job execution modules is not designed or documented. Add RFC 33 to cover a high-level design of a distributed job execution protocol, used by the job execution system to launch, monitor, and control the job shells of a Flux job.
1 parent b451bfe commit 841ebc3

File tree

3 files changed

+346
-0
lines changed

3 files changed

+346
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ Table of Contents
4343
- [30/Job Urgency](spec_30.rst)
4444
- [31/Job Constraints Specification](spec_31.rst)
4545
- [32/Flux Job Execution Protocol Version 1](spec_32.rst)
46+
- [33/Distributed Job Control Protocol](spec_33.rst)
4647

4748
Build Instructions
4849
------------------

index.rst

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -228,6 +228,12 @@ job constraints.
228228
This specification describes Version 1 of the Flux Execution Protocol
229229
implemented by the job manager and job execution system.
230230

231+
:doc:`33/Distributed Job Control Protocol<spec_33>`
232+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
233+
234+
This specification describes the Distributed Job Protocol,
235+
a distributed protocol used within the job execution broker modules.
236+
231237
.. Each file must appear in a toctree
232238
.. toctree::
233239
:hidden:
@@ -263,3 +269,4 @@ implemented by the job manager and job execution system.
263269
spec_30
264270
spec_31
265271
spec_32
272+
spec_33

spec_33.rst

Lines changed: 338 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,338 @@
1+
.. github display
2+
GitHub is NOT the preferred viewer for this file. Please visit
3+
https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_33.html
4+
5+
33/Distributed Job Control Protocol
6+
===================================
7+
8+
This specification describes the distributed protocol that the job
9+
execution service uses to launch, monitor, and control a set of Flux job
10+
shells.
11+
12+
- Name: github.com/flux-framework/rfc/spec_33.rst
13+
14+
- Editor: Mark A. Grondona <[email protected]>
15+
16+
- State: raw
17+
18+
19+
Language
20+
--------
21+
22+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
23+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
24+
be interpreted as described in `RFC 2119 <https://tools.ietf.org/html/rfc2119>`__.
25+
26+
27+
Related Standards
28+
-----------------
29+
30+
- :doc:`21/Job States and Events <spec_21>`
31+
32+
- :doc:`22/Idset String Representation <spec_22>`
33+
34+
- :doc:`32/Flux Job Execution Protocol Version 1 <spec_32>`
35+
36+
Background
37+
----------
38+
39+
RFC 32 describes the protocol between the execution service and job manager
40+
used to initiate and control jobs during the execution phase. Upon receipt
41+
of a ``start`` request, the execution service is responsible for the launch,
42+
monitoring, and control of job shells on all execution targets involved
43+
in the job.
44+
45+
The Distributed Job Control Protocol describes how a set of execution service
46+
broker modules interact in distributed fashion to meet the requirements of
47+
executing job shells on behalf of the job manager.
48+
49+
The initial execution service was a minimum viable implementation concentrated
50+
on rank 0, launching remote processes using the simple ``broker.rexec``
51+
service. In contrast, the Distributed Job Control Protocol sets the stage for
52+
an implementation that
53+
54+
- takes advantage of the tree based overlay network structure to optimize
55+
performance
56+
57+
- is structured to allow running jobs to be recovered upon Flux system
58+
instance restart
59+
60+
- incorporates design insights from the early WRECK prototype execution system
61+
62+
Design Criteria
63+
---------------
64+
65+
The Distributed Job Control Protocol must adhere to the following criteria:
66+
67+
- Avoid global distributed operations which would require all ranks to
68+
be online before the service is ready to execute work.
69+
70+
- Avoid presenting obstacles to the scaling of job size, the number of jobs
71+
running concurrently, or job throughput.
72+
73+
- Support recovery of running jobs after instance restart or execution
74+
module reload.
75+
76+
- Support execution of a job prolog and/or epilog.
77+
78+
- Support for collecting stdout and stderr from IMP and/or job shells
79+
80+
- Support for a barrier implementation used by the job shells, so that
81+
the execution service may determine if shells exit early due to error.
82+
83+
- Support partial release of allocated resources.
84+
85+
- Support for job termination on job exceptions, job time limit, and other
86+
error conditions.
87+
88+
- Support delivery of signals to jobs.
89+
90+
Implementation
91+
--------------
92+
93+
Job execution modules SHALL be loaded on all ranks in an instance, and are
94+
organized in a hierarchy with rank 0 at the root. Each module SHALL track
95+
the state of all running jobs for itself and all of its children. This state
96+
SHALL include at a minimum the jobid, userid, job state, and the idset of
97+
execution targets on which the job has an allocation.
98+
99+
All job execution modules register a ``job-exec.hello`` service method.
100+
Downstream execution modules send a ``hello`` request to their upstream
101+
peer. An execution module SHALL wait to send a ``hello`` response to its
102+
downstream peers until an initial ``hello`` response from upstream has been
103+
received. In the case of rank 0, the job execution module SHALL wait to send
104+
``hello`` responses until the initial RFC 32 ``hello`` response is received
105+
from the job manager.
106+
107+
Responses to the ``job-exec.hello`` request are used to distribute job state
108+
and other events downstream through the job execution module hierarchy.
109+
These responses have a JSON object payload including the REQUIRED keys
110+
``type``, ``idset``, and ``data``.
111+
112+
Supported types of ``job-exec.hello`` responses SHALL include at a minimum
113+
the following:
114+
115+
state-update
116+
A ``state-update`` response is used to update the distributed state of
117+
jobs. The ``data`` object SHALL have a single key, ``jobs``, which SHALL
118+
be an array of (id, userid, type, idset) tuples. The ``type`` entry of the
119+
tuple SHALL indicate how the state is to be resolved on ranks in ``idset``.
120+
Possible values for ``type`` are described below.
121+
122+
When a job execution module receives a ``state-update`` response from
123+
upstream, it SHALL take the following actions, depending on the value of
124+
the ``type`` key:
125+
126+
add
127+
If the jobid already exists in the local module's state, then do nothing.
128+
129+
Otherwise, if the provided ``idset`` intersects any child idset, then
130+
the module SHALL send a ``state`` response to matching children of type
131+
``add``. Then, the local module SHALL determine if the provided ``idset``
132+
contains its rank, and if so, the module SHALL execute the job locally
133+
using the currently selected execution implementation.
134+
135+
remove
136+
If the provided ``idset`` intersects any child idset, then the job exec
137+
module SHALL send a ``state`` response to matching children with type
138+
``remove``. Then, the the referenced ``jobid`` SHALL be purged from the
139+
local module's state.
140+
141+
check
142+
If the provided ``idset`` intersects any child idset, then the job exec
143+
module SHALL send a ``state`` response to matching children with type
144+
``check``.
145+
146+
If the provided ``idset`` contains the local module's rank, then the
147+
module SHALL check if the referenced ``jobid`` exists locally. If not,
148+
then a job exception SHALL be raised.
149+
150+
The first response to ``job-exec.hello`` SHALL be of type ``state-update``.
151+
The included ``jobs`` tuples SHALL all be of ``type=check`` and MUST
152+
include the entire set of jobs which are expected to be currently running
153+
on the execution targets of the current module and its children. If a job
154+
execution module discovers a locally running job which is not in the initial
155+
``state-update`` list, then the module SHALL terminate the job processes
156+
and log an error.
157+
158+
When the rank 0 job execution module receives an RFC 32 ``start`` request
159+
from the job manager, it SHALL determine the idset associated with the
160+
job from *R*, and then locally issue a state update of type ``add``,
161+
following the specification for ``add`` listed above.
162+
163+
While job execution is in progress, execution modules SHALL update their
164+
upstream peer with the following status changes:
165+
166+
start
167+
when the local job shell has started
168+
barrier
169+
the local job shell has entered a barrier
170+
finish
171+
the local job shell has exited
172+
exception
173+
a job exception has occurred
174+
release
175+
all local work is completed, the resources on this rank may be released
176+
(e.g. after job epilog is complete)
177+
178+
Upon receiving one of the requests above, a job execution module MAY
179+
attempt a reduction and SHALL forward the request upstream. On rank 0, the
180+
job exec module SHALL collect and translate job execution module requests
181+
to job-manager ``start`` response payloads including:
182+
183+
start
184+
after job exec ``start`` has been received from all ranks
185+
finish
186+
after all job exec ``finish`` requests have been received from all ranks
187+
exception
188+
forwarded immediately to job-manager
189+
release
190+
release requests may be aggregated and forwarded in chunks to the job
191+
manager to allow for partial release.
192+
193+
Each job exec module SHALL subscribe to ``job-exception`` events and MUST
194+
handle exceptions locally. For fatal job exceptions, the default behavior
195+
SHALL be to kill the local job shell and its children.
196+
197+
After receiving the final ``release`` request from a downstream module,
198+
the rank 0 job execution module SHALL perform the following final steps:
199+
200+
- post a terminating event to the exec eventlog
201+
- copy guest namespace to primary namespace
202+
- issue a ``release`` response with final=true to the job manager
203+
- remove local state entry for the job
204+
- update distributed state so job is removed from all children
205+
206+
Job-Exec Hello Request
207+
^^^^^^^^^^^^^^^^^^^^^^
208+
The ``job-exec.hello`` request has no payload.
209+
210+
Job-Exec Hello Response
211+
^^^^^^^^^^^^^^^^^^^^^^^
212+
213+
A ``job-exec.hello`` response payload SHALL be a JSON object containing
214+
the following REQUIRED keys:
215+
216+
type
217+
(string) The response type
218+
219+
idset
220+
(string) RFC 22 Idset string indicating the ranks to which this response
221+
should be delivered
222+
223+
data
224+
(object) type-specific data
225+
226+
State-update
227+
~~~~~~~~~~~~
228+
229+
The ``state-update`` ``hello`` response ``data`` object SHALL contain the
230+
following REQUIRED keys:
231+
232+
jobs
233+
A list of job tuples where a tuple is an array ``[ id, userid, type, idset]``.
234+
235+
Where
236+
237+
id
238+
(integer) the job ID
239+
240+
userid
241+
(integer) the job user ID
242+
243+
idset
244+
(string) An RFC 22 idset string denoting all ranks which are included
245+
in the assigned resources for job ``id``.
246+
247+
type
248+
(string) The type of state update. One of ``add``, ``remove``, or ``check``.
249+
250+
Job-Exec Start Request
251+
^^^^^^^^^^^^^^^^^^^^^^
252+
253+
A ``job-exec.start`` request SHALL be sent upstream by an execution module
254+
once the job shell or IMP has been started. The payload SHALL be a JSON
255+
object containing the following REQUIRED keys:
256+
257+
id
258+
(integer) the job ID
259+
260+
ranks
261+
(string) an RFC 22 Idset string of ranks on which the job shell has started
262+
263+
264+
Job-Exec Barrier Request
265+
^^^^^^^^^^^^^^^^^^^^^^^^
266+
267+
A ``job-exec.barrier`` request SHALL be sent upstream from a execution
268+
module when the locally executed job shell enters a barrier. The payload
269+
SHALL be a JSON object containing the following REQUIRED keys:
270+
271+
id
272+
(integer) the job ID
273+
274+
ranks
275+
(string) an RFC 22 Idset string of execution targets on which the shell
276+
barrier has been started.
277+
278+
seq
279+
(integer) a shell barrier sequence number
280+
281+
The upstream module SHALL respond to a ``job-exec.barrier`` request
282+
once all job shells have entered the barrier with a matching sequence
283+
number.
284+
285+
286+
Job-Exec Finish Request
287+
^^^^^^^^^^^^^^^^^^^^^^^
288+
289+
A ``job-exec.finish`` request SHALL be sent upstream by an execution
290+
module once the job shell has exited. The payload SHALL be a JSON object
291+
containing the following REQUIRED keys:
292+
293+
id
294+
(integer) the job ID
295+
296+
ranks
297+
(string) an RFC 22 idset string of execution targets on which the job
298+
shell has exited.
299+
300+
status
301+
(integer) the greatest job shell wait status among ``ranks``
302+
303+
304+
Job-Exec Exception Request
305+
^^^^^^^^^^^^^^^^^^^^^^^^^^
306+
307+
A ``job-exec.execption`` request SHALL be sent upstream by an execution
308+
module when the module wishes to raise a execution related job exception. The
309+
payload SHALL be a JSON object containing the following REQUIRED keys:
310+
311+
id
312+
(integer) the job ID
313+
314+
severity
315+
(integer) the exception severity
316+
317+
type
318+
(string) the exception type
319+
320+
note
321+
(string) a human readable description of the job exception
322+
323+
324+
Job-Exec Release Request
325+
^^^^^^^^^^^^^^^^^^^^^^^^
326+
327+
A ``job-exec.release`` request SHALL be sent upstream by an execution
328+
module after the job shell has exited and any job epilog or other work
329+
associated with the job has completed. The payload SHALL be a JSON object
330+
with the following REQUIRED keys:
331+
332+
id
333+
(integer) the job ID
334+
335+
ranks
336+
(string) an RFC 22 Idset including the execution target ranks on which
337+
resources should be released
338+

0 commit comments

Comments
 (0)