@@ -522,6 +522,84 @@ see what would be submitted to Flux without actually running any jobs
 
 For more help and examples, see :core:man1:`flux-bulksubmit`.
 
+.. _resilient_batch_jobs:
+
+How do I run a batch or alloc job that is resilient to node failures?
+=====================================================================
+TL;DR: Use:
+
+.. code-block:: console
+
+   $ flux batch --conf=tbon.topo=kary:0 -o exit-timeout=none ...
+
+.. note::
+
+   In future versions of Flux, ``-o exit-timeout=none`` may become the
+   default for :core:man1:`flux batch` and :core:man1:`flux alloc`. Check
+   the documentation for your version to see whether it is necessary.
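+
+   A quick check of the installed version:
+
+   .. code-block:: console
+
+      $ flux version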
+
+When a Flux instance running as a job loses a node, what happens next
+depends on two factors: whether the lost node is critical, and the value
+of the ``exit-timeout`` job shell option. If the lost node is critical, the
+instance can no longer function properly, triggering a fatal ``node-failure``
+job exception. If the node is not critical, a non-fatal job exception is
+raised and the leader job shell is notified. After the ``exit-timeout``
+period (if it is set to a value other than ``none``), a fatal job exception
+is raised and the job is terminated.
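+
+If termination on node loss is acceptable but a grace period is desired,
+``exit-timeout`` also accepts a duration in Flux Standard Duration (FSD)
+form rather than ``none``. For example, to allow five minutes before the
+fatal exception is raised (node count and script name are illustrative):
+
+.. code-block:: console
+
+   $ flux batch -N16 -o exit-timeout=5m ./batch.sh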
+
+To maximize resilience in batch or allocation jobs, disable the
+``exit-timeout`` option (set it to ``none``) and minimize the number of
+critical ranks by running a flat TBON with ``--conf=tbon.topo=kary:0``.
+This configuration allows the job to continue running after the loss of
+any node except rank 0, which is always critical.
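+
+Putting this together, here is a minimal sketch of a resilient batch
+script (the names ``resilient.sh`` and ``work.sh`` and the node counts
+are placeholders):
+
+.. code-block:: sh
+
+   #!/bin/sh
+   # resilient.sh: split the work into many small single-node jobs so
+   # that losing a non-critical node only kills the jobs running on it.
+   flux submit --cc=1-8 -N1 ./work.sh
+   # wait for the queue to empty before the batch script exits
+   flux queue drain
+
+Submit it with
+``flux batch -N4 --conf=tbon.topo=kary:0 -o exit-timeout=none resilient.sh``.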
+
+The same rules apply to jobs running within the instance. For parallel
+jobs that are not instances of Flux, all ranks are considered critical by
+default. As a result, jobs running on the lost nodes within the instance are
+immediately terminated under normal circumstances. Additionally, the resource
+set available for scheduling new jobs is reduced by the lost nodes. This
+means that pending and newly submitted jobs requesting more resources than
+are currently available will trigger a fatal ``unsatisfiable`` job exception.
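+
+To confirm the reduced resource set after a node loss, attach to the
+batch instance with :core:man1:`flux-proxy` (``JOBID`` is a placeholder):
+
+.. code-block:: console
+
+   $ flux proxy JOBID flux resource list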
+
+
 *************
 MPI Questions
 *************