- The HPC environment became available to MCW researchers in March 2021. The cluster consists of **71** compute nodes, **3,800** CPU cores, and **40** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (ethernet equivalent to Infiniband). Additionally, a **467 TB** NVMe provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
+ The HPC environment became available to MCW researchers in March 2021. The cluster consists of **79** compute nodes, **4,200** CPU cores, and **96** GPUs. The cluster is connected by **7** 100 Gbps switches running RoCEv2 (the Ethernet equivalent of InfiniBand). Additionally, a **467 TB** NVMe filesystem provides scratch storage, and a **2.6 PB** scale-out NAS provides persistent storage.
## Cluster
Detailed information is available below. Please note that the table is wide and may require horizontal scrolling to view all data.

**docs/cluster/jobs/running-jobs.md** (3 additions, 3 deletions)
@@ -142,7 +142,7 @@ The **bigmem** partition contains the large memory nodes, hm01-hm02. Each node i
#### GPU
- The **gpu** partition contains the gpu nodes. Nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Nodes gn07-gn08 each have **48 cores**, **480GB RAM**, **4 A40 GPUs**, and a **3.84TB NVMe SSD** for local scratch. See [GPU Jobs](#gpu-jobs) below for more details.
+ The **gpu** partition contains the gpu nodes. The original nodes gn01-gn06 each have **48 cores**, **360GB RAM**, **4 V100 GPUs**, and a **480GB SSD** for local scratch. Additional nodes have been added over time. Please see the [hardware guide](../hardware.md) for details.
### QOS
@@ -355,7 +355,7 @@ exit
## GPU Jobs
- The cluster includes 8 compute nodes that each have 4 GPUs. Nodes gn01-gn06 each have 48 cores, 360GB RAM, 4 V100 GPUs, and a 480GB SSD for local scratch. Nodes gn07-gn08 each have 48 cores, 480GB RAM, 4 A40 GPUs, and a 3.84TB NVMe SSD for local scratch You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
+ In SLURM, GPUs are referred to as generic resources (GRES) for scheduling purposes. You can add a GPU to your batch or interactive job submission with the `--gres=gpu:N` flag, where ***N*** is the number of GPUs.
In a job script add:
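As a rough sketch only (the exact block in the documentation may differ), a minimal GPU batch script using the `#SBATCH` directives referenced in this file might look like the following; the job name, CPU count, walltime, and `my_gpu_app` command are placeholders, not values from the documentation:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example    # placeholder job name
#SBATCH --partition=gpu           # submit to the gpu partition
#SBATCH --gres=gpu:1              # request one GPU of any type
#SBATCH --cpus-per-task=4         # illustrative CPU count
#SBATCH --time=01:00:00           # illustrative walltime

# Confirm the GPU is visible to the job, then run the application.
# 'my_gpu_app' is a placeholder for your actual program.
nvidia-smi
./my_gpu_app
```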
@@ -374,7 +374,7 @@ Use the `#SBATCH --partition=gpu` and `#SBATCH --gres=gpu:1` directives to have
### GPU Type
- To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is either `v100`or `a40`. Please note that most jobs will not benefit from specifying a GPU type, and this may delay scheduling of your job.
+ To use a specific GPU type, use `--gres=gpu:type:1`, where ***type*** is `v100`, `a40`, or `l40s`. Please note that most jobs will not benefit from specifying a GPU type, and doing so may delay scheduling of your job.
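For instance, a job that specifically needs an L40S could request it as sketched below; the interactive `srun` form and its use here are illustrative assumptions, not taken from the documentation:

```bash
# In a batch script: request one L40S GPU by type
# (omit the type to let the scheduler assign any available GPU).
#SBATCH --partition=gpu
#SBATCH --gres=gpu:l40s:1

# Roughly equivalent interactive request (illustrative):
srun --partition=gpu --gres=gpu:l40s:1 --pty bash
```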

**docs/grants.md** (1 addition, 1 deletion)
@@ -11,7 +11,7 @@ The MCW Research Computing Center (RCC) is a division within MCW Information Ser
## High Performance Computing
- The High Performance Computing (HPC) environment includes 71 computational nodes, 3400 processor cores, 29.3 TB of memory, and 40 graphical processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU and GPU intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling is controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software and containerized workloads are supported.
+ The High Performance Computing (HPC) environment includes 79 computational nodes, 4,200 processor cores, and 96 graphics processing units (GPUs). The nodes are interconnected by 100 Gb/s Ethernet, allowing efficient parallel computing for both CPU- and GPU-intensive workloads. All nodes run the Rocky Linux 8 operating system. Job submission and scheduling are controlled by the Simple Linux Utility for Resource Management (SLURM). SLURM is an open-source HPC scheduling system that automates job submission, controls resource access, and maintains fair use of all systems. Each compute node includes a standardized operating system image, set of compilers, math libraries, and system software. RCC also supports a variety of open-source software as well as containerized workloads.

**docs/index.md** (3 additions, 3 deletions)
@@ -8,15 +8,15 @@ Research Computing provides services and support for computational research at M
### HPC Cluster
- The {{ hpc_name }} cluster is the institution's primary computational resource, and has been available to MCW researchers since March 2021. The cluster contains over **3,800** CPU cores in a variety of compute node architectures, including large memory and GPUs. All compute nodes are connected by a 100 Gbps RoCEv2 network. The cluster also includes a**467 TB** NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
+ The {{ hpc_name }} cluster is the institution's primary computational resource and has been available to MCW researchers since March 2021. The cluster contains a variety of compute node architectures, including large memory and GPU, all connected by a 100 Gbps network. The cluster also includes an NVMe scratch storage filesystem. Please see the [Quick Start guide](cluster/quickstart.md) for more detail.
### ResHPC
- **Res**tricted **HPC** (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method, and project specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
+ Restricted HPC (ResHPC) is a secure way to access and utilize the HPC cluster. It is specifically designed for restricted datasets that have a defined Data Use Agreement (DUA). The ResHPC service is built on the existing HPC cluster, but incorporates a separate, secure login method and project-specific accounts and directories. With ResHPC, you can work with familiar tools while also satisfying complex data provider security requirements. Please see the [ResHPC overview](secure-computing/reshpc.md) to get started.
### Data Storage
- In addition to the cluster's scratch storage, RCC also provides general purpose research storage with a replicated **2.6 PB**filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to user's via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
+ In addition to the cluster's scratch storage, RCC also provides general-purpose research storage with a replicated filesystem. This persistent storage is mounted on the cluster via NFS, or provided directly to users via NFS and SMB. Please see the [Storage Overview](storage/rcc-storage.md) for more detail.
Two new GPU nodes are now available. Each new node has 128 cores, 750 GB of memory, 8 L40S GPUs, and a 7 TB local scratch disk. The new GPU nodes are part of the `gpu` partition and no special job configuration is needed. Please see the updated [hardware](../../cluster/hardware.md) and [SLURM](../../cluster/jobs/running-jobs.md#gpu-jobs) guides for details.
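If you want to confirm which GPU nodes and GPU types are currently available before submitting, a query along these lines should work from a login node (the format string shown is just one reasonable choice):

```bash
# List each node in the gpu partition with its CPU count, memory (MB), and GRES (GPU) configuration.
sinfo --partition=gpu --Node --format="%N %c %m %G"
```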

**docs/storage/rcc-storage.md** (2 additions, 2 deletions)
@@ -29,11 +29,11 @@ Each user has the same set of default storage paths:
The home directory is your starting place every time you log in to the cluster. Its location is `/home/netid`, where `netid` is your MCW username. The purpose of a home directory is storing user-installed software, user-written scripts, configuration files, etc. Each home directory is only accessible by its owner and is not suitable for data sharing. Home is also not appropriate for large-scale research data or temporary job files.
- The quota limit is 100 GB and data protection includes replication[^1] and snapshots[^2]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+ The quota limit is 100 GB, and data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
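To see roughly how much of that 100 GB you are using, a quick check from the command line is shown below; the cluster may also provide a dedicated quota-reporting command, which is not assumed here:

```bash
# Approximate total size of your home directory contents.
du -sh "$HOME"
```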
### Group
- Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large scale, but low performance. It is not meant for high I/O, and so is not mounted to compute nodes. Data protection includes replication[^1] and snapshots[^3]. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
+ Group storage is a shared space for labs to store research data in active projects. Each lab receives 1 TB for free and can expand via [additional paid storage](../storage/paid-storage.md). This space is large-scale but low-performance; it is not meant for high I/O and so is not mounted on compute nodes. Data protection includes replication and snapshots. For more info on snapshots, and how you might recover a file, please see [file recovery](file-recovery.md).
This space is organized by lab group. Each folder in `/group` represents a lab, and is named using the PI's NetID (username). For example, a PI with username "jsmith" would have a group directory located at `/group/jsmith`. Directories within that lab space are organized by purpose and controlled by unique security groups. For example, there is a default `/group/pi_netid/work` directory, which is shared space restricted to lab users. Other shared directories can be created by request for projects that require unique permissions. Additionally, you may have data directories related to your use of an MCW core. These directories will be named for the core and located at `/group/pi_netid/cores`. For example, a Mellowes Center project could be delivered to your group storage and located at `/group/pi_netid/cores/mellowes/example_project1`.
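Putting those examples together, the paths for the hypothetical PI "jsmith" would look something like the following; only `work` and the Mellowes Center project path come from the text above, and anything else would be lab-specific:

```bash
# Default shared lab directory, restricted to lab users:
ls /group/jsmith/work

# Core-delivered data, organized under cores/<core name>/<project>:
ls /group/jsmith/cores/mellowes/example_project1
```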