Skip to content

Commit 24fc237

Browse files
committed
rfc33: add new RFC for job execution module protocol
Problem: The distributed protocol between Flux job execution modules is not designed or documented. Add RFC 33 to cover a high-level design of a distributed job execution protocol, used by the job execution system to launch, monitor, and control the job shells of a Flux job.
1 parent b451bfe commit 24fc237

File tree

3 files changed

+336
-0
lines changed

3 files changed

+336
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ Table of Contents
4343
- [30/Job Urgency](spec_30.rst)
4444
- [31/Job Constraints Specification](spec_31.rst)
4545
- [32/Flux Job Execution Protocol Version 1](spec_32.rst)
46+
- [33/Flux Job Execution Module Protocol Version 1](spec_33.rst)
4647

4748
Build Instructions
4849
------------------

index.rst

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -228,6 +228,12 @@ job constraints.
228228
This specification describes Version 1 of the Flux Execution Protocol
229229
implemented by the job manager and job execution system.
230230

231+
:doc:`33/Flux Job Execution Module Protocol Version 1 <spec_33>`
232+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
233+
234+
This specification describes Version 1 of the Flux Execution Module Protocol,
235+
a distributed protocol used by job execution broker modules.
236+
231237
.. Each file must appear in a toctree
232238
.. toctree::
233239
:hidden:
@@ -263,3 +269,4 @@ implemented by the job manager and job execution system.
263269
spec_30
264270
spec_31
265271
spec_32
272+
spec_33

spec_33.rst

Lines changed: 328 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,328 @@
1+
.. github display
2+
GitHub is NOT the preferred viewer for this file. Please visit
3+
https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_32.html
4+
5+
33/Flux Job Execution Module Protocol Version 1
6+
===============================================
7+
8+
This specification describes the distributed protocol that the job
9+
execution service uses to launch, monitor, and control job shells
10+
in a Flux job.
11+
12+
- Name: github.com/flux-framework/rfc/spec_32.rst
13+
14+
- Editor: Mark A. Grondona <[email protected]>
15+
16+
- State: raw
17+
18+
19+
Language
20+
--------
21+
22+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
23+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
24+
be interpreted as described in `RFC 2119 <https://tools.ietf.org/html/rfc2119>`__.
25+
26+
27+
Related Standards
28+
-----------------
29+
30+
- :doc:`21/Job States and Events <spec_21>`
31+
32+
- :doc:`22/Idset String Representation <spec_22>`
33+
34+
- :doc:`32/Flux Job Execution Protocol Version 1 <spec_32>`
35+
36+
Background
37+
----------
38+
39+
RFC 32 describes the protocol between the execution service and job manager
40+
used to initiate and control jobs during the execution phase. Upon receipt
41+
of a ``start`` request, the execution service is responsible for the launch,
42+
monitoring, and control of job shells on all execution targets involved
43+
in the job. Therefore, the execution service is necessarily distributed
44+
among all ranks of a Flux instance.
45+
46+
The Flux Job Execution Module Protocol Version 1 describes how a set of
47+
execution service broker modules interact in distributed fashion to meet
48+
the requirements of executing job shells on behalf of the job manager.
49+
50+
Design Criteria
51+
---------------
52+
53+
The job execution module protocol must adhere to the following criteria:
54+
55+
- Avoid global distributed operations which would require all ranks to
56+
be online before the service is ready to execute work.
57+
58+
- Avoid presenting obstacles to the scaling of job size, the number of jobs
59+
running concurrently, or job throughput.
60+
61+
- Support execution module reload.
62+
63+
- Support recovery of running jobs after instance restart or execution
64+
module reload.
65+
66+
- Support execution of a job prolog and/or epilog.
67+
68+
- Support for collecting stdout and stderr from IMP and/or job shells
69+
70+
- Support for a barrier implementation used by the job shells, so that
71+
the execution service may determine if shells exit early due to error.
72+
73+
- Support partial release of allocated resources.
74+
75+
- Support for job termination on job exceptions, job time limit, and other
76+
error conditions.
77+
78+
- Support delivery of signals to jobs.
79+
80+
Implementation
81+
--------------
82+
83+
Job execution modules SHALL be loaded on all ranks in an instance, and are
84+
organized in a hierarchy with rank 0 at the root. Each module SHALL track
85+
the set of all running jobs for itself and all of its children. This state
86+
SHALL include at a minimum the jobid, userid, job state, and the idset of
87+
execution targets on which the job has an allocation.
88+
89+
All job execution modules register a ``job-exec.hello`` service endpoint.
90+
Downstream execution modules send a ``hello`` request to their upstream
91+
peer to initiate the execution module protocol. An execution module SHALL
92+
wait to send a ``hello`` response to its downstream peers until an initial
93+
``hello`` response from upstream has been received. In the case of rank 0,
94+
the job execution module SHALL wait to send ``hello`` responses until the
95+
initial RFC 32 ``hello`` response is received from the job manager.
96+
97+
Responses to the ``job-exec.hello`` request are used to distribute job state
98+
and other events downstream through the job execution module hierarchy.
99+
These responses have a JSON object payload including the REQUIRED keys
100+
``type``, ``idset``, and ``data``.
101+
102+
Supported types of ``job-exec.hello`` responses SHALL include at a minimum
103+
the following:
104+
105+
state-update
106+
A ``state-update`` response is used to update the distributed state of
107+
jobs. The ``data`` object SHALL have a single key, ``jobs``, which SHALL
108+
be an array of (id, userid, type, idset) tuples. The ``type`` entry of the
109+
tuple SHALL indicate how the state is to be resolved on ranks in ``idset``.
110+
Possible values for ``type`` are described below.
111+
112+
When a job execution module receives a ``state-update`` response from
113+
upstream, it SHALL take the following actions, depending on the value of
114+
the ``type`` key:
115+
116+
add
117+
If the jobid already exists in the local module's state, then do nothing.
118+
119+
Otherwise, if the provided ``idset`` intersects any child idset, then
120+
the module SHALL send a ``state`` response to matching children of type
121+
``add``. Then, the local module SHALL determine if the provided ``idset``
122+
contains its rank, and if so, the module SHALL execute the job locally
123+
using the currently selected execution implementation.
124+
125+
remove
126+
If the provided ``idset`` intersects any child idset, then the job exec
127+
module SHALL send a ``state`` response to matching children with type
128+
``remove``. Then, the the referenced ``jobid`` SHALL be purged from the
129+
local module's state.
130+
131+
check
132+
If the provided ``idset`` intersects any child idset, then the job exec
133+
module SHALL send a ``state`` response to matching children with type
134+
``check``.
135+
136+
If the provided ``idset`` contains the local module's rank, then the
137+
module SHALL check if the referenced ``jobid`` exists locally. If not,
138+
then a job exception SHALL be raised.
139+
140+
The first response to ``job-exec.hello`` SHALL be of type ``state-update``.
141+
The included ``jobs`` tuples SHALL all be of ``type=check`` and MUST
142+
include the entire set of jobs which are expected to be currently running
143+
on the execution targets of the current module and its children. If a job
144+
execution module discovers a locally running job which is not in the initial
145+
``state-update`` list, then the module SHALL terminate the job processes
146+
and log an error.
147+
148+
When the rank 0 job execution module receives an RFC 32 ``start`` request
149+
from the job manager, it SHALL determine the idset associated with the
150+
job from *R*, and then locally issue a state update of type ``add``,
151+
following the specification for ``add`` listed above.
152+
153+
While job execution is in progress, execution modules SHALL update their
154+
upstream peer with the following status changes:
155+
156+
start
157+
when the local job shell has started
158+
barrier
159+
the local job shell has entered a barrier
160+
finish
161+
the local job shell has exited
162+
exception
163+
a job exception has occurred
164+
release
165+
all local work is completed, the resources on this rank may be released
166+
(e.g. after job epilog is complete)
167+
168+
Upon receiving one of the requests above, a job execution module MAY
169+
attempt a reduction and SHALL forward the request upstream. On rank 0, the
170+
job exec module SHALL collect and translate job execution module requests
171+
to job-manager ``start`` response payloads including:
172+
173+
start
174+
after job exec ``start`` has been received from all ranks
175+
finish
176+
after all job exec ``finish`` requests have been received from all ranks
177+
exception
178+
forwarded immediately to job-manager
179+
release
180+
release requests may be aggregated and forwarded in chunks to the job
181+
manager to allow for partial release.
182+
183+
Each job exec module SHALL subscribe to ``job-exception`` events and MUST
184+
handle exceptions locally. For fatal job exceptions, the default behavior
185+
SHALL be to kill the local job shell and its children.
186+
187+
After receiving the final ``release`` request from a downstream module,
188+
the rank 0 job execution module SHALL perform the following final steps:
189+
190+
- post a terminating event to the exec eventlog
191+
- copy guest namespace to primary namespace
192+
- issue a ``release`` response with final=true to the job manager
193+
- remove local state entry for the job
194+
- update distributed state so job is removed from all children
195+
196+
Job-Exec Hello Request
197+
^^^^^^^^^^^^^^^^^^^^^^
198+
The ``job-exec.hello`` request has no payload.
199+
200+
Job-Exec Hello Response
201+
^^^^^^^^^^^^^^^^^^^^^^^
202+
203+
A ``job-exec.hello`` response payload SHALL be a JSON object containing
204+
the following REQUIRED keys:
205+
206+
type
207+
(string) The response type
208+
209+
idset
210+
(string) RFC 22 Idset string indicating the ranks to which this response
211+
should be delivered
212+
213+
data
214+
(object) type-specific data
215+
216+
State-update
217+
~~~~~~~~~~~~
218+
219+
The ``state-update`` ``hello`` response ``data`` object SHALL contain the
220+
following REQUIRED keys:
221+
222+
jobs
223+
A list of job tuples where a tuple is an array ``[ id, userid, type, idset]``.
224+
225+
Where
226+
227+
id
228+
(integer) the job ID
229+
230+
userid
231+
(integer) the job user ID
232+
233+
idset
234+
(string) An RFC 22 idset string denoting all ranks which are included
235+
in the assigned resources for job ``id``.
236+
237+
type
238+
(string) The type of state update. One of ``add``, ``remove``, or ``check``.
239+
240+
Job-Exec Start Request
241+
^^^^^^^^^^^^^^^^^^^^^^
242+
243+
A ``job-exec.start`` request SHALL be sent upstream by an execution module
244+
once the job shell or IMP has been started. The payload SHALL be a JSON
245+
object containing the following REQUIRED keys:
246+
247+
id
248+
(integer) the job ID
249+
250+
ranks
251+
(string) an RFC 22 Idset string of ranks on which the job shell has started
252+
253+
254+
Job-Exec Barrier Request
255+
^^^^^^^^^^^^^^^^^^^^^^^^
256+
257+
A ``job-exec.barrier`` request SHALL be sent upstream from a execution
258+
module when the locally executed job shell enters a barrier. The payload
259+
SHALL be a JSON object containing the following REQUIRED keys:
260+
261+
id
262+
(integer) the job ID
263+
264+
ranks
265+
(string) an RFC 22 Idset string of execution targets on which the shell
266+
barrier has been started.
267+
268+
seq
269+
(integer) a shell barrier sequence number
270+
271+
The upstream module SHALL respond to a ``job-exec.barrier`` request
272+
once all job shells have entered the barrier with a matching sequence
273+
number.
274+
275+
276+
Job-Exec Finish Request
277+
^^^^^^^^^^^^^^^^^^^^^^^
278+
279+
A ``job-exec.finish`` request SHALL be sent upstream by an execution
280+
module once the job shell has exited. The payload SHALL be a JSON object
281+
containing the following REQUIRED keys:
282+
283+
id
284+
(integer) the job ID
285+
286+
ranks
287+
(string) an RFC 22 idset string of execution targets on which the job
288+
shell has exited.
289+
290+
status
291+
(integer) the greatest job shell wait status among ``ranks``
292+
293+
294+
Job-Exec Exception Request
295+
^^^^^^^^^^^^^^^^^^^^^^^^^^
296+
297+
A ``job-exec.execption`` request SHALL be sent upstream by an execution
298+
module when the module wishes to raise a execution related job exception. The
299+
payload SHALL be a JSON object containing the following REQUIRED keys:
300+
301+
id
302+
(integer) the job ID
303+
304+
severity
305+
(integer) the exception severity
306+
307+
type
308+
(string) the exception type
309+
310+
note
311+
(string) a human readable description of the job exception
312+
313+
314+
Job-Exec Release Request
315+
^^^^^^^^^^^^^^^^^^^^^^^^
316+
317+
A ``job-exec.release`` request SHALL be sent upstream by an execution
318+
module after the job shell has exited and any job epilog or other work
319+
associated with the job has completed. The payload SHALL be a JSON object
320+
with the following REQUIRED keys:
321+
322+
id
323+
(integer) the job ID
324+
325+
ranks
326+
(string) an RFC 22 Idset including the execution target ranks on which
327+
resources should be released
328+

0 commit comments

Comments
 (0)