faq: add entry about resilient batch jobs #294
@trws - there was no entry in the FAQ for running resilient batch/alloc jobs, so here's a first stab at one. There are still some outstanding issues with running in this manner. See, for example, flux-framework/flux-core#6692
wihobbs
left a comment
Looks clear and helpful to me!
For more help and examples, see :core:man1:`flux-bulksubmit`.
.. _resilient_batch_jobs:
Commit message nit: resilient
cf2896e to 106074e
Thanks @wihobbs! I fixed your commit message nit and also a couple of other issues that were preventing CI from passing, and now I'll set MWP.
faqs.rst
Outdated
instance can no longer properly function, triggering a fatal ``node-failure``
job exception. If the node is not critical, a non-fatal job exception is
raised and the leader job shell is notified. After the ``exit-timeout``
period (if set to a value other than none ``none``), a fatal job exception
typo here? "other than none none"
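For readers following the thread, the behavior described in the quoted faqs.rst excerpt can be sketched as a small decision function. This is only a toy model, not Flux code: the names `JobEvent` and `on_node_loss` and the event labels are hypothetical, `exit_timeout=None` stands in for the value ``none``, and because the excerpt is cut off after "a fatal job exception", the sketch assumes that the exception is raised once the timeout expires.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobEvent:
    label: str    # descriptive label only (hypothetical, not a Flux exception type)
    fatal: bool   # True if the exception terminates the job
    delay: float  # seconds after the node loss at which the event occurs


def on_node_loss(critical: bool, exit_timeout: Optional[float]) -> List[JobEvent]:
    """Toy model of the excerpt: which job exceptions follow a lost node."""
    if critical:
        # Critical node: the instance cannot function, so the exception is fatal.
        return [JobEvent("node-failure", fatal=True, delay=0.0)]

    # Non-critical node: a non-fatal exception is raised and the leader
    # job shell is notified.
    events = [JobEvent("non-fatal exception", fatal=False, delay=0.0)]

    # Assumption: unless exit-timeout is "none" (modeled as None), a fatal
    # exception follows once the timeout period has elapsed.
    if exit_timeout is not None:
        events.append(JobEvent("fatal after exit-timeout", fatal=True, delay=exit_timeout))
    return events
```

Under this model, a job survives the loss of a non-critical node only when `exit_timeout` is `None`, which is why the excerpt singles out the ``none`` setting.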
Problem: There's no advice for how to run batch jobs that are resilient to node failure. Add an entry in the FAQ.
Problem: The workflow examples repo has been removed, but it still has a ref configured in flux-docs conf.py. Remove it.
garlick
left a comment
LGTM!
Problem: There's no advice for how to run batch jobs that are resilient to node failure.
Add an entry in the FAQ.