@grondo grondo commented Mar 7, 2025

Problem: There's no advice for how to run batch jobs that are resilient to node failure.

Add an entry in the FAQ.

grondo commented Mar 7, 2025

@trws - there was not an entry in the FAQ for running resilient batch/alloc jobs, so here's a first stab at one.

There are still some outstanding issues with running in this manner. See, for example, flux-framework/flux-core#6692.

@wihobbs wihobbs left a comment

Looks clear and helpful to me!

For more help and examples, see :core:man1:`flux-bulksubmit`.
.. _resilient_batch_jobs:

Commit message nit: resilient

@grondo grondo force-pushed the faq-resilience branch 2 times, most recently from cf2896e to 106074e Compare March 7, 2025 18:09

grondo commented Mar 7, 2025

Thanks @wihobbs! I fixed your commit message nit and also a couple other issues preventing CI from passing and now I'll set MWP.

@grondo grondo added the merge-when-passing mark PR for auto-merging by mergify.io bot label Mar 7, 2025
faqs.rst Outdated
instance can no longer properly function, triggering a fatal ``node-failure``
job exception. If the node is not critical, a non-fatal job exception is
raised and the leader job shell is notified. After the ``exit-timeout``
period (if set to a value other than none ``none``), a fatal job exception

typo here? "other than none none"

@grondo grondo removed the merge-when-passing mark PR for auto-merging by mergify.io bot label Mar 7, 2025
grondo added 2 commits March 7, 2025 10:53
Problem: There's no advice for how to run batch jobs that are resilient
to node failure.

Add an entry in the FAQ.

Problem: The workflow examples repo has been removed, but it still has
a ref configured in flux-docs conf.py.

Remove it.
@grondo grondo added the merge-when-passing mark PR for auto-merging by mergify.io bot label Mar 7, 2025

@garlick garlick left a comment

LGTM!

@mergify mergify bot merged commit 7fe10ec into flux-framework:master Mar 11, 2025
7 checks passed
@grondo grondo deleted the faq-resilience branch March 11, 2025 21:18