Add a workload manager to GPU cluster #28

@PGijsbers

Description

Our GPU server is shared with the AutoML group, but does not have a workload manager. Currently, that means that division of resources largely happens over chat and/or by unwritten rules (we currently have 2 GPUs reserved by default). This is incredibly wasteful, and it also makes it hard to scale up experiments later on. We want a job scheduler installed so that everyone who needs to run GPU jobs can simply queue the requested jobs, and we do not need to manually ensure people are not using the same physical resources.

Overall, the server is mainly intended for prototype testing, so the workload manager should allow quick turn-around times when reasonable for all users. Allowing users to explicitly set some job priority for this is OK, as we only have a small number of users, who shouldn't abuse this.

I am not sure which workload manager is most appropriate, but I think everyone on our team is already familiar with SLURM.
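For a sense of what the day-to-day workflow would look like if we go with SLURM, here is a minimal sketch of a batch script that queues a single-GPU job. The job name, walltime, and training command are placeholders; `--gres=gpu:1` requests one GPU, and `--nice` lets a user voluntarily lower their own job's priority, which matches the "small number of users who shouldn't abuse this" model above:

```shell
#!/bin/bash
#SBATCH --job-name=prototype-test   # hypothetical job name
#SBATCH --gres=gpu:1                # request one GPU from the scheduler
#SBATCH --time=02:00:00             # short walltime keeps turn-around quick
#SBATCH --nice=100                  # voluntarily deprioritize this job

# hypothetical training command; replace with the actual experiment
python train.py
```

A user would submit this with `sbatch job.sh` and check the queue with `squeue`; SLURM then ensures no two jobs land on the same physical GPU.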

Metadata

Labels: automation (CI/CD and other automation)