WIP: experimental: add new RFC for job execution module protocol #329

grondo · 2022-04-20T16:54:57Z

This PR is a WIP definition of a distributed protocol for the Flux job execution module, and defines an initial design for distributing the job execution module across a Flux instance.

A prototype will be developed using this RFC as a starting point, with the idea that any or all of the specification detailed here may change along the way. However, having something proposed in writing will help capture the design constraints, as well as the discussion and changes that follow.

chu11

awesome to see how this is coming together. just saw two nits when reading through it. no biggie, I know this is WIP

chu11 · 2022-04-20T21:53:05Z

spec_33.rst

+All job execution modules register a ``job-exec.hello`` service endpoint.
+Downstream execution modules send a ``hello`` request to their upstream
+peer to initiate the execution module protocol. An execution module SHALL
+wait to send a ``hello`` response to its downstream peers until n an initial


typo here n an

chu11 · 2022-04-20T21:58:23Z

spec_33.rst

+ Possible values for ``type`` MAY be *add*, *remove*, or *check*, to *add*
+ a new job, *remove* an inactive job, or *check* that an existing job is
+ active as expected.


can replace the last sentence with "described below" (or equivalent)
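
For readers following along, here is a minimal sketch of how the three ``type`` values in the quoted hunk (*add*, *remove*, *check*) might be applied to a table of known jobs. The `JobTable` class, the `apply_update` method, and the `id` and `userid` keys are hypothetical names chosen for illustration; the RFC draft quoted here does not define them, and the real message contents are expected to change during prototyping.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobTable:
    """Hypothetical per-module table of known jobs, keyed by jobid."""

    jobs: Dict[int, dict] = field(default_factory=dict)

    def apply_update(self, update: dict) -> bool:
        """Apply one update; return True if it was consistent with the table."""
        kind = update["type"]
        jobid = update["id"]  # assumed key name, not taken from the RFC
        if kind == "add":
            if jobid in self.jobs:
                return False  # job already known
            self.jobs[jobid] = {"userid": update.get("userid")}
            return True
        if kind == "remove":
            # True only if the job was present and has now been dropped
            return self.jobs.pop(jobid, None) is not None
        if kind == "check":
            # True if the job is present, as the sender expects
            return jobid in self.jobs
        raise ValueError(f"unknown update type: {kind!r}")
```

A caller would feed each received update through `apply_update` and treat a `False` result as a mismatch to be reconciled; how a mismatch is handled is left open here.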

grondo · 2022-04-21T16:07:18Z

Thanks @chu11! I fixed those two issues you noted and force-pushed the result.

garlick

I think actually this is great as is. I had a couple of nit picky comments at the beginning but as I went on I realized that this is probably going to change as we get experience prototyping and so getting too focused on those details is likely a distraction. I left my initial comments in anyway, in case they are useful.

garlick · 2022-04-21T17:10:27Z

README.md

 - [30/Job Urgency](spec_30.rst)
 - [31/Job Constraints Specification](spec_31.rst)
 - [32/Flux Job Execution Protocol Version 1](spec_32.rst)
+- [33/Flux Job Execution Module Protocol Version 1](spec_33.rst)


Suggestion: title the document something like "Distributed Job Control Protocol" or similar.

Main thought is to call out "distributed" to distinguish it from what we currently have, and to choose words that don't make it look too much like it is a replacement for RFC 32.

garlick · 2022-04-21T17:17:48Z

spec_33.rst

+in the job. Therefore, the execution service is necessarily distributed
+among all ranks of a Flux instance.


Since we have an execution service that isn't currently distributed (at least, not like this one), maybe this last sentence should be dropped or replaced with something like:

> The initial execution service was a minimum viable implementation concentrated on rank 0, launching remote processes using the simple broker.rexec protocol. In contrast, the distributed job protocol takes advantage of the tree-based overlay network structure to optimize performance, and is structured to allow jobs to be recovered upon Flux instance restart. It also leverages some design insights from the early wreck execution system.

(This is background after all so it doesn't hurt to add some back story here)

garlick · 2022-04-21T17:21:08Z

spec_33.rst

+ - Avoid presenting obstacles to the scaling of job size, the number of jobs
+   running concurrently, or job throughput.
+
+ - Support execution module reload.


Would it be too much detail to add "assuming modules are loaded in root to leaves order"?

Oh, I meant to remove this bullet, but yeah if we keep it then your suggestion should be added.

garlick · 2022-04-21T17:24:21Z

spec_33.rst

+the set of all running jobs for itself and all of its children. This state
+SHALL include at a minimum the jobid, userid, job state, and the idset of


One sentence refers to "the set" while the next refers to "this state". Maybe use "state" for the first one too?
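
As an illustration of the "state" being discussed, the sketch below models one record per job with the four minimum fields named in the quoted text (jobid, userid, job state, and the set of execution targets) and shows one way a module could combine its own records with those reported by its children. The names `JobRecord` and `merge_state`, the use of a Python `frozenset` in place of an idset, and the rule of taking the union of targets when the same jobid appears twice are all assumptions made for this example, not part of the RFC.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical minimum per-job state held by an execution module."""

    jobid: int
    userid: int
    state: str
    targets: FrozenSet[int]  # stand-in for the idset of execution targets


def merge_state(
    local: Iterable[JobRecord],
    children: Iterable[Iterable[JobRecord]],
) -> Dict[int, JobRecord]:
    """Combine local records with each child's records, keyed by jobid."""
    merged = {rec.jobid: rec for rec in local}
    for child in children:
        for rec in child:
            prev = merged.get(rec.jobid)
            if prev is None:
                merged[rec.jobid] = rec
            else:
                # Assumed rule: the same job seen twice contributes more targets
                merged[rec.jobid] = replace(prev, targets=prev.targets | rec.targets)
    return merged
```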

garlick · 2022-04-21T17:27:32Z

spec_33.rst

+SHALL include at a minimum the jobid, userid, job state, and the idset of
+execution targets on which the job has an allocation.
+
+All job execution modules register a ``job-exec.hello`` service endpoint.


In other RFCs I think we've referred to the first word in the topic string as the "service" and the second as the "method". So maybe s/service endpoint/service method/ ?
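
To make the naming convention concrete, here is a small sketch that splits a topic string such as ``job-exec.hello`` into its "service" and "method" words. The helper `split_topic` is hypothetical, and splitting at the first dot is an assumption for topic strings with more than two words; the comment above only describes the two-word case.

```python
from typing import Tuple


def split_topic(topic: str) -> Tuple[str, str]:
    """Split a topic string into (service, method) at the first dot."""
    service, sep, method = topic.partition(".")
    if not sep or not service or not method:
        raise ValueError(f"expected 'service.method', got {topic!r}")
    return service, method


print(split_topic("job-exec.hello"))
```

Under this reading, ``job-exec`` is the service and ``hello`` is the method, which is why "service method" fits better than "service endpoint".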

garlick · 2022-04-21T17:50:30Z

spec_33.rst

+execution targets on which the job has an allocation.
+
+All job execution modules register a ``job-exec.hello`` service endpoint.
+Downstream execution modules send a ``hello`` request to their upstream


I'm not sure how it would break down exactly, but subheadings might be useful to make the document easier to scan. There's a transition from hello types to state update types that is a little run together here.

(other subheading comments deleted as "too soon")

grondo · 2022-04-21T18:12:54Z

> I think actually this is great as is. I had a couple of nit picky comments at the beginning but as I went on I realized that this is probably going to change as we get experience prototyping and so getting too focused on those details is likely a distraction. I left my initial comments in anyway, in case they are useful.

Feel free to push up any comments directly, I agree with them all. Whether you push individual commits or squash is up to you. I also imagine the details of the message types and contents will change as I try to implement a prototype.

Edit: and I apologize for the sloppy naming, this first draft was written and rewritten several times already

garlick · 2022-04-21T18:53:16Z

No worries, yeah I'll go ahead and push some changes if you're not working on it right now.

Problem: The distributed protocol between Flux job execution modules is not designed or documented. Add RFC 33 to cover a high-level design of a distributed job execution protocol, used by the job execution system to launch, monitor, and control the job shells of a Flux job.

garlick · 2022-04-21T19:39:33Z

> Feel free to push up any comments directly, I agree with them all. Whether you push individual commits or squash is up to you.

I made the changes I had suggested including the title change and other changes mostly confined to the introductory sections. I didn't try to do any subheading reorg. I squashed my changes with yours, and then reread your comment and I realize you probably wanted me to just squash my changes together. Sorry?

grondo · 2022-04-21T21:09:35Z

> and then reread your comment and I realize you probably wanted me to just squash my changes together. Sorry?

No, I meant do whatever you preferred.

chu11 reviewed Apr 20, 2022

grondo force-pushed the job-exec branch from 4fa491e to 24fc237 Compare April 21, 2022 16:06

garlick approved these changes Apr 21, 2022

garlick force-pushed the job-exec branch from 24fc237 to 841ebc3 Compare April 21, 2022 19:37


3 participants
