
Commit cf2896e

faq: add entry about resilient batch jobs
Problem: There's no advice for how to run batch jobs that are resilient to node failure. Add an entry in the FAQ.
1 parent 920fafe commit cf2896e

File tree

1 file changed: +40 −0 lines changed


faqs.rst

Lines changed: 40 additions & 0 deletions
@@ -522,6 +522,46 @@ see what would be submitted to Flux without actually running any jobs

For more help and examples, see :core:man1:`flux-bulksubmit`.

.. _resilient_batch_jobs:

How do I run a batch or alloc job that is resilient to node failures?
=====================================================================

TL;DR: Use:

.. code-block:: console

   $ flux batch --conf=tbon.topo=kary:0 -o exit-timeout=none ...

.. note::

   In future versions of Flux ``-o exit-timeout=none`` may become the
   default for :core:man1:`flux batch` and :core:man1:`flux alloc`. Check
   with your version to see if ``-o exit-timeout=none`` is necessary.

When a Flux instance running as a job loses a node, what happens next
depends on two factors: whether the lost node is critical, and the value
of the ``exit-timeout`` job shell option. If the lost node is critical,
the instance can no longer function properly, triggering a fatal
``node-failure`` job exception. If the node is not critical, a non-fatal
job exception is raised and the leader job shell is notified. After the
``exit-timeout`` period (if set to a value other than ``none``), a fatal
job exception is raised and the job is terminated.
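
For example, rather than disabling the timeout entirely, a finite grace
period can be set; ``exit-timeout`` accepts Flux Standard Duration values
(the ``2m`` value, node count, and script name below are illustrative,
not defaults):

.. code-block:: console

   $ flux batch -N4 -o exit-timeout=2m ./batch.sh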

To maximize resilience in batch or allocation jobs, disable the
``exit-timeout`` option (set it to ``none``) and minimize the number of
critical ranks by running a flat TBON with ``--conf=tbon.topo=kary:0``.
This configuration allows jobs to continue running after the loss of any
node except rank 0, which is always critical.
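
Putting the two options together (the node count and script name below
are illustrative), the flat topology can then be confirmed from within
the running instance with :core:man1:`flux-overlay`:

.. code-block:: console

   $ flux batch --conf=tbon.topo=kary:0 -o exit-timeout=none -N16 ./batch.sh
   $ flux proxy $(flux job last) flux overlay status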

The same rules apply to jobs running within the instance. For parallel
jobs that are not instances of Flux, all ranks are considered critical by
default. As a result, jobs running on the lost nodes within the instance
are immediately terminated under normal circumstances. Additionally, the
resource set available for scheduling new jobs is reduced by the lost
nodes. This means that pending and newly submitted jobs requesting more
resources than are currently available will trigger a fatal
"unsatisfiable" job exception.
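
As an illustrative sketch (the script and task names and the job count
are hypothetical), a batch script that divides work into many
single-node jobs limits the impact of a lost node to the work that was
running on it:

.. code-block:: sh

   #!/bin/sh
   # batch.sh: submit 32 independent single-node jobs.  If a node is
   # lost, only the jobs running there fail; pending jobs are scheduled
   # on the surviving nodes.
   flux submit --cc=1-32 -N1 ./task.sh

   # Block until the queue is empty, i.e. all submitted jobs have
   # completed.
   flux queue drain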

*************
MPI Questions
*************
