|
| 1 | +.. github display |
| 2 | + GitHub is NOT the preferred viewer for this file. Please visit |
| 3 | + https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_33.html |
| 4 | +
|
| 5 | +33/Distributed Job Control Protocol |
| 6 | +=================================== |
| 7 | + |
| 8 | +This specification describes the distributed protocol that the job |
| 9 | +execution service uses to launch, monitor, and control a set of Flux job |
| 10 | +shells. |
| 11 | + |
| 12 | +- Name: github.com/flux-framework/rfc/spec_33.rst |
| 13 | + |
| 14 | +- Editor: Mark A. Grondona < [email protected]> |
| 15 | + |
| 16 | +- State: raw |
| 17 | + |
| 18 | + |
| 19 | +Language |
| 20 | +-------- |
| 21 | + |
| 22 | +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", |
| 23 | +"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to |
| 24 | +be interpreted as described in `RFC 2119 <https://tools.ietf.org/html/rfc2119>`__. |
| 25 | + |
| 26 | + |
| 27 | +Related Standards |
| 28 | +----------------- |
| 29 | + |
| 30 | +- :doc:`21/Job States and Events <spec_21>` |
| 31 | + |
| 32 | +- :doc:`22/Idset String Representation <spec_22>` |
| 33 | + |
| 34 | +- :doc:`32/Flux Job Execution Protocol Version 1 <spec_32>` |
| 35 | + |
| 36 | +Background |
| 37 | +---------- |
| 38 | + |
| 39 | +RFC 32 describes the protocol between the execution service and job manager |
| 40 | +used to initiate and control jobs during the execution phase. Upon receipt |
| 41 | +of a ``start`` request, the execution service is responsible for the launch, |
| 42 | +monitoring, and control of job shells on all execution targets involved |
| 43 | +in the job. |
| 44 | + |
| 45 | +The Distributed Job Control Protocol describes how a set of execution service |
| 46 | +broker modules interact in distributed fashion to meet the requirements of |
| 47 | +executing job shells on behalf of the job manager. |
| 48 | + |
| 49 | +The initial execution service was a minimum viable implementation concentrated |
| 50 | +on rank 0, launching remote processes using the simple ``broker.rexec`` |
| 51 | +service. In contrast, the Distributed Job Control Protocol sets the stage for |
| 52 | +an implementation that |
| 53 | + |
| 54 | + - takes advantage of the tree based overlay network structure to optimize |
| 55 | + performance |
| 56 | + |
| 57 | + - is structured to allow running jobs to be recovered upon Flux system |
| 58 | + instance restart |
| 59 | + |
| 60 | + - incorporates design insights from the early WRECK prototype execution system |
| 61 | + |
| 62 | +Design Criteria |
| 63 | +--------------- |
| 64 | + |
| 65 | +The Distributed Job Control Protocol must adhere to the following criteria: |
| 66 | + |
| 67 | + - Avoid global distributed operations which would require all ranks to |
| 68 | + be online before the service is ready to execute work. |
| 69 | + |
| 70 | + - Avoid presenting obstacles to the scaling of job size, the number of jobs |
| 71 | + running concurrently, or job throughput. |
| 72 | + |
| 73 | + - Support recovery of running jobs after instance restart or execution |
| 74 | + module reload. |
| 75 | + |
| 76 | + - Support execution of a job prolog and/or epilog. |
| 77 | + |
| 78 | + - Support for collecting stdout and stderr from IMP and/or job shells |
| 79 | + |
| 80 | + - Support for a barrier implementation used by the job shells, so that |
| 81 | + the execution service may determine if shells exit early due to error. |
| 82 | + |
| 83 | + - Support partial release of allocated resources. |
| 84 | + |
| 85 | + - Support for job termination on job exceptions, job time limit, and other |
| 86 | + error conditions. |
| 87 | + |
| 88 | + - Support delivery of signals to jobs. |
| 89 | + |
| 90 | +Implementation |
| 91 | +-------------- |
| 92 | + |
| 93 | +Job execution modules SHALL be loaded on all ranks in an instance, and are |
| 94 | +organized in a hierarchy with rank 0 at the root. Each module SHALL track |
| 95 | +the state of all running jobs for itself and all of its children. This state |
| 96 | +SHALL include at a minimum the jobid, userid, job state, and the idset of |
| 97 | +execution targets on which the job has an allocation. |
| 98 | + |
| 99 | +All job execution modules register a ``job-exec.hello`` service method. |
| 100 | +Downstream execution modules send a ``hello`` request to their upstream |
| 101 | +peer. An execution module SHALL wait to send a ``hello`` response to its |
| 102 | +downstream peers until an initial ``hello`` response from upstream has been |
| 103 | +received. In the case of rank 0, the job execution module SHALL wait to send |
| 104 | +``hello`` responses until the initial RFC 32 ``hello`` response is received |
| 105 | +from the job manager. |
| 106 | + |
| 107 | +Responses to the ``job-exec.hello`` request are used to distribute job state |
| 108 | +and other events downstream through the job execution module hierarchy. |
| 109 | +These responses have a JSON object payload including the REQUIRED keys |
| 110 | +``type``, ``idset``, and ``data``. |
| 111 | + |
| 112 | +Supported types of ``job-exec.hello`` responses SHALL include at a minimum |
| 113 | +the following: |
| 114 | + |
| 115 | +state-update |
| 116 | + A ``state-update`` response is used to update the distributed state of |
| 117 | + jobs. The ``data`` object SHALL have a single key, ``jobs``, which SHALL |
| 118 | + be an array of (id, userid, type, idset) tuples. The ``type`` entry of the |
| 119 | + tuple SHALL indicate how the state is to be resolved on ranks in ``idset``. |
| 120 | + Possible values for ``type`` are described below. |
| 121 | + |
| 122 | +When a job execution module receives a ``state-update`` response from |
| 123 | +upstream, it SHALL take the following actions, depending on the value of |
| 124 | +the ``type`` key: |
| 125 | + |
| 126 | +add |
| 127 | + If the jobid already exists in the local module's state, then do nothing. |
| 128 | + |
| 129 | + Otherwise, if the provided ``idset`` intersects any child idset, then |
| 130 | + the module SHALL send a ``state`` response to matching children of type |
| 131 | + ``add``. Then, the local module SHALL determine if the provided ``idset`` |
| 132 | + contains its rank, and if so, the module SHALL execute the job locally |
| 133 | + using the currently selected execution implementation. |
| 134 | + |
| 135 | +remove |
| 136 | + If the provided ``idset`` intersects any child idset, then the job exec |
| 137 | + module SHALL send a ``state`` response to matching children with type |
| 138 | + ``remove``. Then, the the referenced ``jobid`` SHALL be purged from the |
| 139 | + local module's state. |
| 140 | + |
| 141 | +check |
| 142 | + If the provided ``idset`` intersects any child idset, then the job exec |
| 143 | + module SHALL send a ``state`` response to matching children with type |
| 144 | + ``check``. |
| 145 | + |
| 146 | + If the provided ``idset`` contains the local module's rank, then the |
| 147 | + module SHALL check if the referenced ``jobid`` exists locally. If not, |
| 148 | + then a job exception SHALL be raised. |
| 149 | + |
| 150 | +The first response to ``job-exec.hello`` SHALL be of type ``state-update``. |
| 151 | +The included ``jobs`` tuples SHALL all be of ``type=check`` and MUST |
| 152 | +include the entire set of jobs which are expected to be currently running |
| 153 | +on the execution targets of the current module and its children. If a job |
| 154 | +execution module discovers a locally running job which is not in the initial |
| 155 | +``state-update`` list, then the module SHALL terminate the job processes |
| 156 | +and log an error. |
| 157 | + |
| 158 | +When the rank 0 job execution module receives an RFC 32 ``start`` request |
| 159 | +from the job manager, it SHALL determine the idset associated with the |
| 160 | +job from *R*, and then locally issue a state update of type ``add``, |
| 161 | +following the specification for ``add`` listed above. |
| 162 | + |
| 163 | +While job execution is in progress, execution modules SHALL update their |
| 164 | +upstream peer with the following status changes: |
| 165 | + |
| 166 | +start |
| 167 | + when the local job shell has started |
| 168 | +barrier |
| 169 | + the local job shell has entered a barrier |
| 170 | +finish |
| 171 | + the local job shell has exited |
| 172 | +exception |
| 173 | + a job exception has occurred |
| 174 | +release |
| 175 | + all local work is completed, the resources on this rank may be released |
| 176 | + (e.g. after job epilog is complete) |
| 177 | + |
| 178 | +Upon receiving one of the requests above, a job execution module MAY |
| 179 | +attempt a reduction and SHALL forward the request upstream. On rank 0, the |
| 180 | +job exec module SHALL collect and translate job execution module requests |
| 181 | +to job-manager ``start`` response payloads including: |
| 182 | + |
| 183 | +start |
| 184 | + after job exec ``start`` has been received from all ranks |
| 185 | +finish |
| 186 | + after all job exec ``finish`` requests have been received from all ranks |
| 187 | +exception |
| 188 | + forwarded immediately to job-manager |
| 189 | +release |
| 190 | + release requests may be aggregated and forwarded in chunks to the job |
| 191 | + manager to allow for partial release. |
| 192 | + |
| 193 | +Each job exec module SHALL subscribe to ``job-exception`` events and MUST |
| 194 | +handle exceptions locally. For fatal job exceptions, the default behavior |
| 195 | +SHALL be to kill the local job shell and its children. |
| 196 | + |
| 197 | +After receiving the final ``release`` request from a downstream module, |
| 198 | +the rank 0 job execution module SHALL perform the following final steps: |
| 199 | + |
| 200 | + - post a terminating event to the exec eventlog |
| 201 | + - copy guest namespace to primary namespace |
| 202 | + - issue a ``release`` response with final=true to the job manager |
| 203 | + - remove local state entry for the job |
| 204 | + - update distributed state so job is removed from all children |
| 205 | + |
| 206 | +Job-Exec Hello Request |
| 207 | +^^^^^^^^^^^^^^^^^^^^^^ |
| 208 | +The ``job-exec.hello`` request has no payload. |
| 209 | + |
| 210 | +Job-Exec Hello Response |
| 211 | +^^^^^^^^^^^^^^^^^^^^^^^ |
| 212 | + |
| 213 | +A ``job-exec.hello`` response payload SHALL be a JSON object containing |
| 214 | +the following REQUIRED keys: |
| 215 | + |
| 216 | +type |
| 217 | + (string) The response type |
| 218 | + |
| 219 | +idset |
| 220 | + (string) RFC 22 Idset string indicating the ranks to which this response |
| 221 | + should be delivered |
| 222 | + |
| 223 | +data |
| 224 | + (object) type-specific data |
| 225 | + |
| 226 | +State-update |
| 227 | +~~~~~~~~~~~~ |
| 228 | + |
| 229 | +The ``state-update`` ``hello`` response ``data`` object SHALL contain the |
| 230 | +following REQUIRED keys: |
| 231 | + |
| 232 | +jobs |
| 233 | + A list of job tuples where a tuple is an array ``[ id, userid, type, idset]``. |
| 234 | + |
| 235 | +Where |
| 236 | + |
| 237 | +id |
| 238 | + (integer) the job ID |
| 239 | + |
| 240 | +userid |
| 241 | + (integer) the job user ID |
| 242 | + |
| 243 | +idset |
| 244 | + (string) An RFC 22 idset string denoting all ranks which are included |
| 245 | + in the assigned resources for job ``id``. |
| 246 | + |
| 247 | +type |
| 248 | + (string) The type of state update. One of ``add``, ``remove``, or ``check``. |
| 249 | + |
| 250 | +Job-Exec Start Request |
| 251 | +^^^^^^^^^^^^^^^^^^^^^^ |
| 252 | + |
| 253 | +A ``job-exec.start`` request SHALL be sent upstream by an execution module |
| 254 | +once the job shell or IMP has been started. The payload SHALL be a JSON |
| 255 | +object containing the following REQUIRED keys: |
| 256 | + |
| 257 | +id |
| 258 | + (integer) the job ID |
| 259 | + |
| 260 | +ranks |
| 261 | + (string) an RFC 22 Idset string of ranks on which the job shell has started |
| 262 | + |
| 263 | + |
| 264 | +Job-Exec Barrier Request |
| 265 | +^^^^^^^^^^^^^^^^^^^^^^^^ |
| 266 | + |
| 267 | +A ``job-exec.barrier`` request SHALL be sent upstream from a execution |
| 268 | +module when the locally executed job shell enters a barrier. The payload |
| 269 | +SHALL be a JSON object containing the following REQUIRED keys: |
| 270 | + |
| 271 | +id |
| 272 | + (integer) the job ID |
| 273 | + |
| 274 | +ranks |
| 275 | + (string) an RFC 22 Idset string of execution targets on which the shell |
| 276 | + barrier has been started. |
| 277 | + |
| 278 | +seq |
| 279 | + (integer) a shell barrier sequence number |
| 280 | + |
| 281 | +The upstream module SHALL respond to a ``job-exec.barrier`` request |
| 282 | +once all job shells have entered the barrier with a matching sequence |
| 283 | +number. |
| 284 | + |
| 285 | + |
| 286 | +Job-Exec Finish Request |
| 287 | +^^^^^^^^^^^^^^^^^^^^^^^ |
| 288 | + |
| 289 | +A ``job-exec.finish`` request SHALL be sent upstream by an execution |
| 290 | +module once the job shell has exited. The payload SHALL be a JSON object |
| 291 | +containing the following REQUIRED keys: |
| 292 | + |
| 293 | +id |
| 294 | + (integer) the job ID |
| 295 | + |
| 296 | +ranks |
| 297 | + (string) an RFC 22 idset string of execution targets on which the job |
| 298 | + shell has exited. |
| 299 | + |
| 300 | +status |
| 301 | + (integer) the greatest job shell wait status among ``ranks`` |
| 302 | + |
| 303 | + |
| 304 | +Job-Exec Exception Request |
| 305 | +^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 306 | + |
| 307 | +A ``job-exec.execption`` request SHALL be sent upstream by an execution |
| 308 | +module when the module wishes to raise a execution related job exception. The |
| 309 | +payload SHALL be a JSON object containing the following REQUIRED keys: |
| 310 | + |
| 311 | +id |
| 312 | + (integer) the job ID |
| 313 | + |
| 314 | +severity |
| 315 | + (integer) the exception severity |
| 316 | + |
| 317 | +type |
| 318 | + (string) the exception type |
| 319 | + |
| 320 | +note |
| 321 | + (string) a human readable description of the job exception |
| 322 | + |
| 323 | + |
| 324 | +Job-Exec Release Request |
| 325 | +^^^^^^^^^^^^^^^^^^^^^^^^ |
| 326 | + |
| 327 | +A ``job-exec.release`` request SHALL be sent upstream by an execution |
| 328 | +module after the job shell has exited and any job epilog or other work |
| 329 | +associated with the job has completed. The payload SHALL be a JSON object |
| 330 | +with the following REQUIRED keys: |
| 331 | + |
| 332 | +id |
| 333 | + (integer) the job ID |
| 334 | + |
| 335 | +ranks |
| 336 | + (string) an RFC 22 Idset including the execution target ranks on which |
| 337 | + resources should be released |
| 338 | + |
0 commit comments