
Commit be870cb

Merge pull request #252 from maflister/update-gpus
Update cluster docs for new GPU nodes and hardware
2 parents 152ca6b + f689c45 commit be870cb

7 files changed: 25 additions & 12 deletions

docs/cluster/hardware.md

Lines changed: 4 additions & 1 deletion
@@ -1,9 +1,12 @@
 # Overview
 
-The HPC environment became available to MCW researchers in March 2021. The cluster consists of **71** compute nodes, **3,800** CPU cores, and **40** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (ethernet equivalent to Infiniband). Additionally, a **467 TB** NVMe provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
+The HPC environment became available to MCW researchers in March 2021. The cluster consists of **79** compute nodes, **4,200** CPU cores, and **96** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (ethernet equivalent to Infiniband). Additionally, a **467 TB** NVMe provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
 
 ## Cluster
 
 Detailed information is available below. Please note, the table is wide and might require side scrolling to view all data.
 
 {{ read_csv('../../includes/cluster-hardware.csv', keep_default_na=False) }}
+
+!!! tip "Condo hardware"
+    Condo nodes are factored into the overall cluster metrics, but specific hardware details for condo systems are not listed in the table.

docs/cluster/jobs/running-jobs.md

Lines changed: 3 additions & 3 deletions
@@ -142,7 +142,7 @@ The **bigmem** partition contains the large memory nodes, hm01-hm02. Each node i
 
 #### GPU
 
-The **gpu** partition contains the gpu nodes. Nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Nodes gn07-gn08 each have **48 cores**, **480GB RAM**, **4 A40 GPUs**, and a **3.84TB NVMe SSD** for local scratch. See [GPU Jobs](#gpu-jobs) below for more details.
+The **gpu** partition contains the gpu nodes. The original nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Additional nodes have been added over time. Please see the [hardware guide](../hardware.md) for details.
 
 ### QOS
 
@@ -355,7 +355,7 @@ exit
 
 ## GPU Jobs
 
-The cluster includes 8 compute nodes that each have 4 GPUs. Nodes gn01-gn06 each have 48 cores, 360GB RAM, 4 V100 GPUs, and a 480GB SSD for local scratch. Nodes gn07-gn08 each have 48 cores, 480GB RAM, 4 A40 GPUs, and a 3.84TB NVMe SSD for local scratch You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
+In SLURM, GPUs are referred to as generic resources (GRES) for scheduling purposes. You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
 
 In a job script add:
 
@@ -374,7 +374,7 @@ Use the `#SBATCH --partition=gpu` and `#SBATCH --gres=gpu:1` directives to have
 
 ### GPU Type
 
-To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is either `v100` or `a40`. Please note that most jobs will not benefit from specifying a GPU type, and this may delay scheduling of your job.
+To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is `v100`, `a40`, or `l40s`. Please note that most jobs will not benefit from specifying a GPU type, and instead it may potentially delay scheduling of your job.
 
 ### GPU Compute Mode
 
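
As a point of reference for the directives touched in this diff (`--partition=gpu`, `--gres=gpu:N`, and the optional GPU type), a minimal batch script requesting a single GPU might look like the sketch below; the job name, CPU, memory, and walltime values are illustrative placeholders rather than site defaults.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example      # illustrative job name
#SBATCH --partition=gpu             # GPU partition described in the doc above
#SBATCH --gres=gpu:1                # request one GPU of any type
##SBATCH --gres=gpu:l40s:1          # optional: pin a type (v100, a40, l40s); usually unnecessary
#SBATCH --cpus-per-task=4           # placeholder CPU count
#SBATCH --mem=32G                   # placeholder memory request
#SBATCH --time=01:00:00             # placeholder walltime

# Show the GPU(s) assigned to this job (assumes NVIDIA driver tools on the node)
nvidia-smi
```
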
docs/grants.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ The MCW Research Computing Center (RCC) is a division within MCW Information Ser
 
 ## High Performance Computing
 
-The High Performance Computing (HPC) environment includes 71 computational nodes, 3400 processor cores, 29.3 TB of memory, and 40 graphical processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU and GPU intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling is controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software and containerized workloads are supported.
+The High Performance Computing (HPC) environment includes 79 computational nodes, 4,200 processor cores, and 96 graphical processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU and GPU intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling is controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software and containerized workloads are supported.
 
 ## Restricted HPC
 
docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -8,15 +8,15 @@ Research Computing provides services and support for computational research at M
 
 ### HPC Cluster
 
-The {{ hpc_name }} cluster is the institution's primary computational resource, and has been available to MCW researchers since March 2021. The cluster contains over **3,800** CPU cores in a variety of compute node architectures, including large memory and GPUs. All compute nodes are connected by a 100 Gbps RoCEv2 network. The cluster also includes a **467 TB** NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
+The {{ hpc_name }} cluster is the institution's primary computational resource, and has been available to MCW researchers since March 2021. The cluster contains a variety of compute node architectures, including large memory and GPU, all connected by a 100 Gbps network. The cluster also includes a NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
 
 ### ResHPC
 
-**Res**tricted **HPC** (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method, and project specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
+Restricted HPC (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method, and project specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
 
 ### Data Storage
 
-In addition to the cluster's scratch storage, RCC also provides general purpose research storage with a replicated **2.6 PB** filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to user's via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
+In addition to the cluster's scratch storage, RCC also provides general purpose research storage with a replicated filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to users via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
 
 ### Software
 
docs/news/posts/2025-06-04.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+---
+date: 2025-06-04
+categories:
+- Announcements
+---
+
+# New GPU nodes available
+
+Two new GPU nodes are now available. Each new node has 128 cores, 750 GB of memory, 8 L40S GPUs, and a 7 TB local scratch disk. The new GPU nodes are part of the `gpu` partition and no special job configuration is needed. Please see the updated [hardware](../../cluster/hardware.md) and [SLURM](../../cluster/jobs/running-jobs.md#gpu-jobs) guides for details.

docs/storage/rcc-storage.md

Lines changed: 2 additions & 2 deletions
@@ -29,11 +29,11 @@ Each user has the same set of default storage paths:
 
 The home directory is your starting place every time you login to the cluster. It's location is `/home/netid`, where `netid` is your MCW username. The purpose of a home directory is storing user-installed software, user-written scripts, configuration files, etc. Each home directory is only accessible by its owner and is not suitable for data sharing. Home is also not appropriate for large scale research data or temporary job files.
 
-The quota limit is 100 GB and data protection includes replication[^1] and snapshots[^2]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+The quota limit is 100 GB and data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
 
 ### Group
 
-Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large scale, but low performance. It is not meant for high I/O, and so is not mounted to compute nodes. Data protection includes replication[^1] and snapshots[^3]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large scale, but low performance. It is not meant for high I/O, and so is not mounted to compute nodes. Data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
 
 This space is organized by lab group. Each folder in `/group` represents a lab, and is named using the PI's NetID (username). For example, a PI with username "jsmith" would have a group directory located at `/group/jsmith`. Directories within that lab space are organized by purpose and controlled by unique security groups. For example, there is a default `/group/pi_netid/work` directory, which is shared space restricted to lab users. Other shared directories can be created by request for projects that require unique permissions. Additionally, you may have data directories related to your use of a MCW core. These directories will be named for the core and located at `/group/pi_netid/cores`. For example, a Mellowes Center project could be delivered to your group storage and located at `/group/pi_netid/cores/mellowes/example_project1`.
 
includes/cluster-hardware.csv

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
-Nodes,Type,Cores/node,Mem/node (Gb),Disk/node (Gb),Total GPUs,Sockets/node,Cores/socket,Threads/core,CPU Vendor,CPU Model,CPU Base Freq (GHz),CPU Turbo Freq (GHz),GPU Vendor,GPU Model,GPU Mem (Gb)
+Nodes,Type,Cores/node,Mem/node (Gb),Disk/node (Gb),GPUs/node,Sockets/node,Cores/socket,Threads/core,CPU Vendor,CPU Model,CPU Base Freq (GHz),CPU Turbo Freq (GHz),GPU Vendor,GPU Model,GPU Mem (Gb)
 60,CPU,48,384,440,,2,24,1,Intel,6240R,2.4,4.0,,,
 6,GPU,48,384,440,4,2,24,1,Intel,5220R,2.2,4.0,NVIDIA,Tesla V100,32
 2,GPU,48,512,440,4,2,24,1,Intel,6336Y,2.4,3.6,NVIDIA,Ampere A40,48
 1,GPU,40,512,7000,8,2,20,1,Intel,E5-2698 v4,2.2,3.6,NVIDIA,Tesla V100 SXM2,32
-2,Large mem,,1536,440,,2,24,1,Intel,6240R,2.4,4.0,,,
+2,GPU,128,750,7000,8,2,64,1,AMD,EPYC 9554,3.1,3.75,NVIDIA,Ada Lovelace L40S,48
+2,Large Mem,48,1536,440,,2,24,1,Intel,6240R,2.4,4.0,,,
