
Commit be870cb

Merge pull request #252 from maflister/update-gpus
Update cluster docs for new GPU nodes and hardware
2 parents 152ca6b + f689c45 commit be870cb

7 files changed: 25 additions & 12 deletions

docs/cluster/hardware.md

Lines changed: 4 additions & 1 deletion
@@ -1,9 +1,12 @@
 # Overview
 
-The HPC environment became available to MCW researchers in March 2021. The cluster consists of **71** compute nodes, **3,800** CPU cores, and **40** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (ethernet equivalent to Infiniband). Additionally, a **467 TB** NVMe provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
+The HPC environment became available to MCW researchers in March 2021. The cluster consists of **79** compute nodes, **4,200** CPU cores, and **96** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (ethernet equivalent to Infiniband). Additionally, a **467 TB** NVMe provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
 
 ## Cluster
 
 Detailed information is available below. Please note, the table is wide and might require side scrolling to view all data.
 
 {{ read_csv('../../includes/cluster-hardware.csv', keep_default_na=False) }}
+
+!!! tip "Condo hardware"
+    Condo nodes are factored into the overall cluster metrics, but specific hardware details for condo systems are not listed in the table.

docs/cluster/jobs/running-jobs.md

Lines changed: 3 additions & 3 deletions
@@ -142,7 +142,7 @@ The **bigmem** partition contains the large memory nodes, hm01-hm02. Each node i
 
 #### GPU
 
-The **gpu** partition contains the gpu nodes. Nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Nodes gn07-gn08 each have **48 cores**, **480GB RAM**, **4 A40 GPUs**, and a **3.84TB NVMe SSD** for local scratch. See [GPU Jobs](#gpu-jobs) below for more details.
+The **gpu** partition contains the gpu nodes. The original nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Additional nodes have been added over time. Please see the [hardware guide](../hardware.md) for details.
 
 ### QOS
 
@@ -355,7 +355,7 @@ exit
 
 ## GPU Jobs
 
-The cluster includes 8 compute nodes that each have 4 GPUs. Nodes gn01-gn06 each have 48 cores, 360GB RAM, 4 V100 GPUs, and a 480GB SSD for local scratch. Nodes gn07-gn08 each have 48 cores, 480GB RAM, 4 A40 GPUs, and a 3.84TB NVMe SSD for local scratch You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
+In SLURM, GPUs are referred to as generic resources (GRES) for scheduling purposes. You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
 
 In a job script add:
 
@@ -374,7 +374,7 @@ Use the `#SBATCH --partition=gpu` and `#SBATCH --gres=gpu:1` directives to have
 
 ### GPU Type
 
-To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is either `v100` or `a40`. Please note that most jobs will not benefit from specifying a GPU type, and this may delay scheduling of your job.
+To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is `v100`, `a40`, or `l40s`. Please note that most jobs will not benefit from specifying a GPU type, and instead it may potentially delay scheduling of your job.
 
 ### GPU Compute Mode
 
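
As a point of reference for the directives touched in this diff (`--partition=gpu`, `--gres=gpu:N`, and the optional GPU type), a minimal batch script requesting a single GPU might look like the sketch below; the job name, CPU, memory, and walltime values are illustrative placeholders rather than site defaults.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example      # illustrative job name
#SBATCH --partition=gpu             # GPU partition described in the doc above
#SBATCH --gres=gpu:1                # request one GPU of any type
##SBATCH --gres=gpu:l40s:1          # optional: pin a type (v100, a40, l40s); usually unnecessary
#SBATCH --cpus-per-task=4           # placeholder CPU count
#SBATCH --mem=32G                   # placeholder memory request
#SBATCH --time=01:00:00             # placeholder walltime

# Show the GPU(s) assigned to this job (assumes NVIDIA driver tools on the node)
nvidia-smi
```
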
docs/grants.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ The MCW Research Computing Center (RCC) is a division within MCW Information Ser
 
 ## High Performance Computing
 
-The High Performance Computing (HPC) environment includes 71 computational nodes, 3400 processor cores, 29.3 TB of memory, and 40 graphical processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU and GPU intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling is controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software and containerized workloads are supported.
+The High Performance Computing (HPC) environment includes 79 computational nodes, 4,200 processor cores, and 96 graphical processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU and GPU intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling is controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software and containerized workloads are supported.
 
 ## Restricted HPC
 
docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -8,15 +8,15 @@ Research Computing provides services and support for computational research at M
 
 ### HPC Cluster
 
-The {{ hpc_name }} cluster is the institution's primary computational resource, and has been available to MCW researchers since March 2021. The cluster contains over **3,800** CPU cores in a variety of compute node architectures, including large memory and GPUs. All compute nodes are connected by a 100 Gbps RoCEv2 network. The cluster also includes a **467 TB** NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
+The {{ hpc_name }} cluster is the institution's primary computational resource, and has been available to MCW researchers since March 2021. The cluster contains a variety of compute node architectures, including large memory and GPU, all connected by a 100 Gbps network. The cluster also includes a NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
 
 ### ResHPC
 
-**Res**tricted **HPC** (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method, and project specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
+Restricted HPC (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method, and project specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
 
 ### Data Storage
 
-In addition to the cluster's scratch storage, RCC also provides general purpose research storage with a replicated **2.6 PB** filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to user's via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
+In addition to the cluster's scratch storage, RCC also provides general purpose research storage with a replicated filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to users via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
 
 ### Software
 
docs/news/posts/2025-06-04.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+---
+date: 2025-06-04
+categories:
+- Announcements
+---
+
+# New GPU nodes available
+
+Two new GPU nodes are now available. Each new node has 128 cores, 750 GB of memory, 8 L40S GPUs, and a 7 TB local scratch disk. The new GPU nodes are part of the `gpu` partition and no special job configuration is needed. Please see the updated [hardware](../../cluster/hardware.md) and [SLURM](../../cluster/jobs/running-jobs.md#gpu-jobs) guides for details.

docs/storage/rcc-storage.md

Lines changed: 2 additions & 2 deletions
@@ -29,11 +29,11 @@ Each user has the same set of default storage paths:
 
 The home directory is your starting place every time you login to the cluster. It's location is `/home/netid`, where `netid` is your MCW username. The purpose of a home directory is storing user-installed software, user-written scripts, configuration files, etc. Each home directory is only accessible by its owner and is not suitable for data sharing. Home is also not appropriate for large scale research data or temporary job files.
 
-The quota limit is 100 GB and data protection includes replication[^1] and snapshots[^2]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+The quota limit is 100 GB and data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
 
 ### Group
 
-Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large scale, but low performance. It is not meant for high I/O, and so is not mounted to compute nodes. Data protection includes replication[^1] and snapshots[^3]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large scale, but low performance. It is not meant for high I/O, and so is not mounted to compute nodes. Data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
 
 This space is organized by lab group. Each folder in `/group` represents a lab, and is named using the PI's NetID (username). For example, a PI with username "jsmith" would have a group directory located at `/group/jsmith`. Directories within that lab space are organized by purpose and controlled by unique security groups. For example, there is a default `/group/pi_netid/work` directory, which is shared space restricted to lab users. Other shared directories can be created by request for projects that require unique permissions. Additionally, you may have data directories related to your use of a MCW core. These directories will be named for the core and located at `/group/pi_netid/cores`. For example, a Mellowes Center project could be delivered to your group storage and located at `/group/pi_netid/cores/mellowes/example_project1`.
 
includes/cluster-hardware.csv

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
-Nodes,Type,Cores/node,Mem/node (Gb),Disk/node (Gb),Total GPUs,Sockets/node,Cores/socket,Threads/core,CPU Vendor,CPU Model,CPU Base Freq (GHz),CPU Turbo Freq (GHz),GPU Vendor,GPU Model,GPU Mem (Gb)
+Nodes,Type,Cores/node,Mem/node (Gb),Disk/node (Gb),GPUs/node,Sockets/node,Cores/socket,Threads/core,CPU Vendor,CPU Model,CPU Base Freq (GHz),CPU Turbo Freq (GHz),GPU Vendor,GPU Model,GPU Mem (Gb)
 60,CPU,48,384,440,,2,24,1,Intel,6240R,2.4,4.0,,,
 6,GPU,48,384,440,4,2,24,1,Intel,5220R,2.2,4.0,NVIDIA,Tesla V100,32
 2,GPU,48,512,440,4,2,24,1,Intel,6336Y,2.4,3.6,NVIDIA,Ampere A40,48
 1,GPU,40,512,7000,8,2,20,1,Intel,E5-2698 v4,2.2,3.6,NVIDIA,Tesla V100 SXM2,32
-2,Large mem,,1536,440,,2,24,1,Intel,6240R,2.4,4.0,,,
+2,GPU,128,750,7000,8,2,64,1,AMD,EPYC 9554,3.1,3.75,NVIDIA,Ada Lovelace L40S,48
+2,Large Mem,48,1536,440,,2,24,1,Intel,6240R,2.4,4.0,,,
