
Conversation

vsoch
Member

@vsoch vsoch commented Sep 15, 2025

This feature adds the ability to create a Flux Operator MiniCluster running across some subset of rabbit nodes when the rabbit.mpi directive is defined. For example, this is the minimum a user needs to do:

flux run -N 4 --setattr=rabbit.mpi=yes sleep 400

That does require the rabbit directives to get pushed through the process, but note that I'm separating the logic for the MiniCluster into attributes so it's easier to read and understand - the strings that Marty was showing me were really not intuitive. By default, setting rabbit.mpi to true (or anything) uses a default container base (the base we built with Flux and cxi on Ubuntu 24.04) and interactive mode, and everything beyond that can be customized. Here are more examples:

--setattr=rabbit.mpi.image="ghcr.io/converged-computing/lammps-reax:ubuntu2404-cxi"
--setattr=rabbit.mpi.workdir="/opt/lammps/examples/reaxff/HNS"
--setattr=rabbit.mpi.command='lmp -v x 2 -v y 2 -v z 2 -in in.reax.hnx -nocite'
--setattr=rabbit.mpi.add_flux=false
--setattr=rabbit.mpi.succeed=true
--setattr=rabbit.mpi.tasks=96
--setattr=rabbit.mpi.env.one=ketchup
--setattr=rabbit.mpi.env.two=mustard
--setattr=rabbit.mpi.rabbits=hetchy201,hetchy202
--setattr=rabbit.mpi.nodes=4
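As a rough sketch of how these attributes might be consumed: --setattr nests dotted keys under attributes.system in the jobspec, so the generation code can pull out the whole rabbit.mpi block as a dict and merge it over defaults. The function name and default values below are hypothetical, not the actual implementation:

```python
# Hypothetical sketch of extracting rabbit.mpi settings from a jobspec dict.
# --setattr=rabbit.mpi.image=... nests keys under attributes.system, so the
# settings arrive as a plain nested dict. Defaults here are illustrative only.

def get_rabbit_mpi(jobspec):
    """Return rabbit.mpi settings merged over defaults, or None if unset."""
    system = jobspec.get("attributes", {}).get("system", {})
    mpi = system.get("rabbit", {}).get("mpi")
    if mpi is None:
        return None
    defaults = {"interactive": True, "add_flux": True, "env": {}}
    if not isinstance(mpi, dict):
        # rabbit.mpi=yes (or any scalar) means "use all defaults"
        return defaults
    return {**defaults, **mpi}

jobspec = {
    "attributes": {
        "system": {
            "rabbit": {"mpi": {"tasks": 96, "env": {"one": "ketchup"}}}
        }
    }
}
settings = get_rabbit_mpi(jobspec)
print(settings["tasks"], settings["add_flux"])  # → 96 True
```

The merge means any attribute the user sets wins over the default, while rabbit.mpi=yes alone gives the default interactive MiniCluster.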

Notes

I'll include additional notes here.

flux hop

I added a flux hop command that is able to interact with the same generation classes, but without the requirement of the HPE / workflow operator stuff. This would mimic us manually creating a MiniCluster via CRD on the command line. It's just done with Python. Here is an example:

flux hop python rabbit_client.py \
    --image "ghcr.io/converged-computing/lammps-reax:ubuntu2404-cxi" \
    --command 'lmp -v x 2 -v y 2 -v z 2 -in in.reax.hnx -nocite' \
    --workdir "/opt/lammps/examples/reaxff/HNS" \
    --tasks 96 \
    --nodes 4 \
    --rabbits "hetchy201,hetchy202,hetchy203,hetchy204" \
    --no-add-flux \
    --succeed \
    --env one=ketchup \
    --env two=mustard

It likely won't be used for production given the permissions needed for that, but it will provide us with a means to test (and the command is pretty fun too). It was Marty's idea and I kind of love it. 🐰
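The flags above map naturally onto argparse. A minimal sketch of what the option handling for a command like this could look like (names mirror the example; this is not the actual flux hop source):

```python
import argparse

def make_hop_parser():
    """Sketch of an argparse front end mirroring the flux hop flags above."""
    parser = argparse.ArgumentParser(prog="flux-hop")
    parser.add_argument("--image", help="container image for the MiniCluster")
    parser.add_argument("--command", help="command to run (interactive if unset)")
    parser.add_argument("--workdir", help="working directory in the container")
    parser.add_argument("--tasks", type=int)
    parser.add_argument("--nodes", type=int)
    parser.add_argument("--rabbits", help="comma separated rabbit node names")
    # --no-add-flux flips the add_flux default of True to False
    parser.add_argument("--no-add-flux", dest="add_flux", action="store_false")
    parser.add_argument("--succeed", action="store_true")
    # repeatable --env KEY=VALUE pairs accumulate into a list
    parser.add_argument("--env", action="append", default=[], metavar="KEY=VALUE")
    return parser

args = make_hop_parser().parse_args(
    ["--tasks", "96", "--no-add-flux", "--env", "one=ketchup"]
)
print(args.tasks, args.add_flux, args.env)  # → 96 False ['one=ketchup']
```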

TODO

I wrote TODO for all items we can discuss. I have opinions on most of them but I want to know what you think. Some of them are about defaults, and others about features. Don't feel like you need to read before the Hackathon, I can talk through most of them.

MiniCluster Types

As mentioned, we have two modes of operation:

  1. Creation of a MiniCluster with a Flux job / workflow on rabbits (the path we talked about through coral2_dws.py)
  2. Creation of an a-la-carte MiniCluster with flux hop (primarily for testing or fun)

For the second, we require the rabbit node names, since there is no actual job to get them from. A second class, RabbitMiniCluster, is based on the first and is customized to expect the Workflow CRD object and to get node names from Flux.

RabbitMPI

The RabbitMPI class is a wrapper around a jobspec that translates it into MiniCluster needs (e.g., what container to use, whether to add Flux, whether it should be interactive). I like this design because it means we can populate and generate MiniClusters in ways that don't require Flux jobs. We use the jobspec, but that's just a dictionary of attributes that can be created in another way (e.g., flux hop). I thought about removing the jobspec entirely, but I don't think that's necessary - it just serves as a "standardized" data structure to derive metadata from.
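To make that design concrete, here is a toy stand-in (not the real RabbitMPI class; the default image and attribute names are invented): the jobspec-derived settings are just a dict, and properties answer the questions the MiniCluster generation asks.

```python
class RabbitMPISketch:
    """Toy stand-in for the RabbitMPI idea: wrap a settings dict and
    answer MiniCluster questions via properties. Defaults are invented."""

    DEFAULT_IMAGE = "example.io/flux-cxi-base:ubuntu2404"  # hypothetical name

    def __init__(self, settings):
        self.settings = settings or {}

    @property
    def image(self):
        return self.settings.get("image", self.DEFAULT_IMAGE)

    @property
    def add_flux(self):
        # attribute values may arrive as strings, so normalize "false"/"no"
        value = self.settings.get("add_flux", True)
        if isinstance(value, str):
            return value.lower() not in ("false", "no", "0")
        return bool(value)

    @property
    def interactive(self):
        # no command means we want an interactive MiniCluster
        return "command" not in self.settings

mpi = RabbitMPISketch({"add_flux": "false", "command": "lmp -in input.lmp"})
print(mpi.add_flux, mpi.interactive)  # → False False
```

Because the input is a plain dict, the same class works whether the settings came from a Flux jobspec or from flux hop's command line.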

Todo Items

These are primarily if we move forward with adding this integration. It's just testing for now.

  • We need a testing suite, likely to run on Hetchy. I could make something for GitHub actions but it would have to operate without actual rabbits.
  • Documentation for the set of attributes that can be set (see RabbitMPI for what is currently exposed).
  • If getting job info is redundant, we can add attributes to the jobtap plugin that might be needed.
  • We also likely want the flux_operator.py to be an actual module somewhere in there. I don't really like the style of "dump everything into one file" so I'd want to have like:
flux_k8s/
  ...
  operator/
     minicluster.py
     rabbit_mpi.py

And since the top level module is flux_k8s we can probably just call it operator to avoid a dreaded underscore.

Apologies for the list of dumb names for the flux hop command - this is for fun, and the only piece I asked Gemini to help produce, and I asked for a docker-like generation style with adjective and noun, and mentioned that I'd contribute to the set. I was horrified when it added a comment with my name to do that. I never told it my name. It claimed "statistical anomaly." 🙃 🤯 😨

ping @jameshcorbett @mcfadden8 @milroy

This feature adds the ability to create a Flux Operator MiniCluster
running across some subset of rabbit nodes when the rabbit.mpi
directive is defined. By default, setting that to true (or anything)
uses a default container base and interactive mode, and everything
from that can be customized. In addition, we have a "flux hop" command
that is able to take the same metadata, populate the RabbitMPI Job
object, and create the Flux MiniCluster using the same classes/logic
but without requiring the HPE stuff and Workflow. This could be used
in production, but likely will be for testing or for fun.

Signed-off-by: vsoch <[email protected]>
@vsoch
Member Author

vsoch commented Sep 20, 2025

For our notes, here is the command that worked (for an interactive run) on hetchy. The reason we needed to ask for all 12 nodes was to get around fluxion scheduling and the compute-node-to-rabbit assignment.

flux alloc -N12 -Sdw=xfs_small -Srabbit.mpi.image="ghcr.io/converged-computing/lammps-reax:ubuntu2404-cxi" -Srabbit.mpi.workdir="/opt/lammps/examples/reaxff/HNS" -Srabbit.mpi.add_flux=false -Srabbit.mpi.nodes=2  -qparrypeak  echo success

We need to test:

  • Non-interactive (the command added to the above)
  • Adding a scoped user set allowed to execute this
  • Adding a pre command for the worker nodes to wait slightly to allow the lead broker to come up.

For the last, the workers typically have a retry and it isn't clear why this is failing. It would have to be the case that they are able to connect and then something forces the exit (and that is when they typically cleanly exit, which is what we are seeing).
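For the worker pre-command idea, a small sketch of what the wait could look like in Python (the host and port are assumptions here; the real values would be whatever the MiniCluster exposes for the lead broker):

```python
import socket
import time

def wait_for_lead_broker(host, port, timeout=60, interval=2):
    """Poll until the lead broker's TCP port accepts connections.

    Returns True once a connection succeeds, or False if the timeout
    expires first. host/port are hypothetical in this sketch - they
    would come from the MiniCluster's lead broker service.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Running this as a pre command on the workers would give the lead broker a bounded window to come up before the workers attempt to connect (and potentially hit whatever is forcing the clean exit).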

@mcfadden8

mcfadden8 commented Sep 22, 2025

Does the flux hop command offer a mechanism to provide the rabbit with the paths to the ephemeral file systems that have been created? Having file system access from the compute and rabbit nodes is a key feature required by the demo. For the ephemeral Lustre file system, the path is the same for images running on the rabbit nodes as it is for things running on the compute nodes. For GFS2, there is a necessary mapping that needs to occur. We will only be using Lustre for the demo, but will be looking to use GFS2 when it is supported.

The file system paths are stored in environment variables available to the NNF user containers. This, and semantics for gfs2 path naming are documented here: https://nearnodeflash.github.io/dev/guides/user-containers/readme/?h=user+container#putting-it-all-together
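Picking those paths up inside the container could be as simple as scanning the environment for the storage variables. The "DW_" prefix below is an assumption for illustration only; the exact variable names (and the GFS2 per-node mapping semantics) are in the NNF user-container docs linked above:

```python
import os

def find_storage_paths(environ=None, prefix="DW_"):
    """Collect environment variables that look like ephemeral file system
    mount paths. The prefix is an assumption for this sketch - consult
    the NNF user-container docs for the real variable names, and note
    that GFS2 paths need an additional per-node mapping."""
    environ = os.environ if environ is None else environ
    return {key: value for key, value in environ.items()
            if key.startswith(prefix)}
```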

ping @behlendorf, @jameshcorbett

@vsoch
Member Author

vsoch commented Sep 22, 2025

@mcfadden8 depending on when the demo is, I'm not sure we have a reasonable amount of time to consider filesystems for the demo - a run of LAMMPS that is triggered by the submission of a job is what we had scoped it to. Let us know more details of what you had in mind so we can discuss.

This changeset moves the Flux Operator MiniCluster to be a module,
operator, under flux_k8s. I have also cleaned up the organization
of assets, and better coupled the creation of the MiniCluster with
saving the name / namespace so they do not need to be provided again.
We will need to implement the function to see if a user is allowed
to request a MiniCluster, and work further on adding the additional
securityContext needed for production.

Signed-off-by: vsoch <[email protected]>
Comment on lines +111 to +112
"willow" "accelerator",
"algorithm",
missing comma
