CDS Compute Cluster

Overview

The CDS Compute Cluster is a high-performance computing environment developed by the Cornell Data Science team to provide local, on-demand compute resources for machine learning and data processing. It connects heterogeneous nodes, some with NVIDIA GPUs, into a unified Slurm-orchestrated cluster. Running on local infrastructure avoids recurring cloud costs while still enabling GPU-accelerated workloads, distributed training, and large-scale data processing.

Architecture

  • Head Node and Compute Nodes: The system includes a dedicated head node for job scheduling and multiple compute nodes with varied CPU and GPU configurations.
  • Networking: All nodes are connected via static IPs on a private LAN. The head node bridges this internal network with campus Wi-Fi.
  • Shared Storage: A shared NFS volume ensures consistent file access across all nodes (see the example mount after this list).
  • Authentication and Sync: Munge provides inter-node authentication, and Chrony ensures time synchronization, which is critical for job coordination.
  • Containerized Environments: Docker is used to create uniform runtime environments across nodes, ensuring compatibility and reproducibility regardless of hardware or operating system differences.
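
For illustration, a minimal sketch of how a compute node could mount the shared volume. The head-node hostname, export path, and mount point below are assumptions, not values from this cluster's actual configuration:

    # /etc/fstab entry on a compute node (hostname and paths are hypothetical)
    head-node:/export/shared  /shared  nfs  defaults,_netdev  0  0

    # Create the mount point and mount without a reboot
    sudo mkdir -p /shared
    sudo mount /shared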

Key Technologies

  • Slurm Scheduler: Manages job queuing and parallel dispatch across CPUs and GPUs.
  • GPU Scheduling: Slurm tracks and allocates GPUs through its generic resource (GRES) mechanism, supporting exclusive access and multi-GPU jobs (see the configuration sketch after this list).
  • Container Support: Docker is used for consistent environments across different node architectures.
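
As a hedged sketch of what GRES-based GPU scheduling typically involves, the snippets below declare GPUs on a node. The node name, GPU type, count, and device paths are illustrative assumptions:

    # gres.conf on a GPU node: map the gpu resource to device files (hypothetical)
    Name=gpu Type=rtx3090 File=/dev/nvidia0
    Name=gpu Type=rtx3090 File=/dev/nvidia1

    # slurm.conf: enable the gpu GRES type and advertise it on the node
    GresTypes=gpu
    NodeName=gpu-node-01 CPUs=16 RealMemory=64000 Gres=gpu:rtx3090:2 State=UNKNOWN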

Features

  • Batch and Interactive Jobs: Users submit batch jobs with sbatch or launch live interactive sessions with srun, as shown below.
  • Resource-Aware Scheduling: Slurm dispatches based on node specifications, including GPU count and memory.
  • Scalable and Modular: New nodes can be added with minimal configuration, and setup is fully documented.
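
To make both submission paths concrete, here is a minimal batch script and the matching commands. The script name and resource sizes are illustrative, not this cluster's actual defaults:

    #!/bin/bash
    # train.sh -- minimal GPU batch job (resource sizes are assumptions)
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:1           # one GPU, allocated through GRES
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --output=train_%j.log  # %j expands to the Slurm job ID

    python train.py

Submit it with sbatch, or request an interactive shell instead:

    # Queue the batch job
    sbatch train.sh

    # Or open an interactive session on a GPU node
    srun --gres=gpu:1 --pty bash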

Engineering Highlights

  • Built entirely in-house using open-source tools.
  • Supports heterogeneous hardware with unified scheduling and storage.
  • Enables real-world machine learning workflows such as LLM inference with vLLM and distributed training (see the sketch after this list).
  • Demonstrates expertise in Linux, HPC, networking, DevOps, and systems engineering.
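
As one concrete illustration of such a workflow, an inference server could be submitted like any other batch job. The model name and port here are assumptions, not what this cluster actually serves:

    #!/bin/bash
    # serve_llm.sh -- hypothetical vLLM inference job
    #SBATCH --job-name=vllm-serve
    #SBATCH --gres=gpu:1

    # Start vLLM's OpenAI-compatible server on the allocated GPU
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000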

Summary

This project mirrors production HPC systems on a smaller scale, delivering cloud-like capabilities with local infrastructure. It showcases hands-on systems design, cluster orchestration, and technical leadership, which are critical skills for infrastructure and platform engineering roles.

Links

  • Architecture
  • Adding a new node