Add a workload manager to GPU cluster #28

@PGijsbers

Description

Our GPU server is shared with the AutoML group, but does not have a workload manager. Currently, that means that division of resources largely happens over chat and/or by unwritten rules (we currently have 2 GPUs reserved by default). This is incredibly wasteful, and it also makes it hard to scale up experiments later on. We want a job scheduler installed so that everyone who needs to run GPU jobs can simply queue the requested jobs, and we do not need to manually ensure people are not using the same physical resources.

Overall, the server is mainly intended for prototype testing, so the workload manager should allow quick turn-around times when reasonable for all users. Allowing users to explicitly set some job priority for this is OK, as we only have a small number of users, who shouldn't abuse this.

I am not sure which workload manager is most appropriate, but I think everyone on our team is already familiar with SLURM.
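For a sense of what the day-to-day workflow would look like if we go with SLURM, here is a minimal sketch of a batch script that queues a single-GPU job. The job name, walltime, and training command are placeholders; `--gres=gpu:1` requests one GPU, and `--nice` lets a user voluntarily lower their own job's priority, which matches the "small number of users who shouldn't abuse this" model above:

```shell
#!/bin/bash
#SBATCH --job-name=prototype-test   # hypothetical job name
#SBATCH --gres=gpu:1                # request one GPU from the scheduler
#SBATCH --time=02:00:00             # short walltime keeps turn-around quick
#SBATCH --nice=100                  # voluntarily deprioritize this job

# hypothetical training command; replace with the actual experiment
python train.py
```

A user would submit this with `sbatch job.sh` and check the queue with `squeue`; SLURM then ensures no two jobs land on the same physical GPU.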

Metadata

Labels: automation (CI/CD and other automation)