A curated list of resources for mastering GPU engineering, from architecture and kernel programming to large-scale distributed systems and AI acceleration.
- Programming Massively Parallel Processors: A Hands-on Approach — David B. Kirk & Wen-mei W. Hwu. The canonical introduction to CUDA, memory hierarchies, and parallel patterns. Amazon; notes: Abi's Concise Notes
- CUDA by Example — Jason Sanders & Edward Kandrot. A practical introduction to CUDA for beginners. Amazon
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters — Hugging Face. Web Version
- CUDA — NVIDIA’s proprietary GPU programming platform.
- ROCm — AMD’s open compute stack.
- OpenCL — Cross-platform parallel computing standard.
- SYCL / oneAPI — Intel’s C++ abstraction for heterogeneous compute.
- Vulkan Compute — Low-level GPU compute API.
- Kompute — Higher-level, general-purpose GPU compute framework built on Vulkan.
- Metal Performance Shaders — Apple’s GPU framework.
- NVIDIA Nsight Systems — System-wide GPU profiler.
- Nsight Compute — Kernel-level performance analysis.
- Occupancy Calculator — NVIDIA tool (historically a spreadsheet, now integrated into Nsight Compute) for choosing kernel launch configurations.
- CUTLASS — CUDA templates for linear algebra subroutines.
- TensorRT — High-performance deep learning inference.
- OpenAI Triton — Python DSL for writing high-performance GPU kernels.
- Roofline Model — Analytical model to reason about compute/memory bottlenecks.
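The roofline model above can be sketched in a few lines: attainable performance is the minimum of machine peak and bandwidth times arithmetic intensity. The peak/bandwidth numbers below are assumed, roughly A100-class figures used purely for illustration.

```python
def roofline(peak_flops, peak_bw, flops, bytes_moved):
    """Attainable performance (FLOP/s) under the roofline model.

    peak_flops:  machine peak in FLOP/s
    peak_bw:     DRAM bandwidth in bytes/s
    flops:       floating-point operations the kernel performs
    bytes_moved: bytes transferred to/from DRAM
    """
    intensity = flops / bytes_moved          # arithmetic intensity, FLOP/byte
    return min(peak_flops, peak_bw * intensity)

# Illustrative, assumed numbers: ~19.5 TFLOP/s FP32 peak, ~1.5 TB/s HBM.
PEAK, BW = 19.5e12, 1.5e12

# SAXPY (y = a*x + y): 2 FLOPs per element, 12 bytes moved
# (read x, read y, write y) -> intensity = 1/6 FLOP/byte.
n = 1 << 20
saxpy_perf = roofline(PEAK, BW, 2 * n, 12 * n)
# Memory-bound: attainable perf is BW/6 = 0.25 TFLOP/s, far below peak.
```

Kernels whose intensity lands left of the "ridge point" (peak/bandwidth) are memory-bound; optimization effort there should target data movement, not FLOPs.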
- NVIDIA Ampere Whitepaper
- AMD RDNA & CDNA Architectures
- SIMT execution and warp scheduling
- Memory hierarchy and coalescing
- Shared memory and cache optimization
- Warp divergence and thread occupancy
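The occupancy concept above can be sketched as a stripped-down version of NVIDIA's occupancy calculator: blocks per SM are limited by threads, registers, and shared memory, and occupancy is active warps over the SM's maximum. The SM limits below (2048 threads, 64 K registers, 48 KiB shared memory per SM) are assumed illustrative values that vary by architecture, and the sketch ignores the allocation-granularity rounding the real tool applies.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads_sm=2048, max_blocks_sm=32,
              regs_sm=65536, smem_sm=48 * 1024, warp_size=32):
    """Simplified occupancy estimate: fraction of the SM's warp slots
    that can be resident given a kernel's per-block resource usage."""
    by_threads = max_threads_sm // threads_per_block
    by_regs = regs_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_sm // smem_per_block if smem_per_block else max_blocks_sm
    blocks = min(by_threads, by_regs, by_smem, max_blocks_sm)
    active_warps = blocks * (threads_per_block // warp_size)
    return active_warps / (max_threads_sm // warp_size)

# 256 threads/block at 64 registers/thread and 8 KiB shared memory:
# registers cap residency at 4 blocks -> 32 of 64 warps -> 50% occupancy.
occ = occupancy(256, 64, 8 * 1024)
```

Note how the binding constraint shifts: halving register usage to 32 per thread lifts the same kernel to full occupancy.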
- NCCL — Multi-GPU communication primitives.
- vLLM — Inference and serving engine for LLMs.
- Hugging Face Accelerate — Simplified abstractions for distributed training.
- SGLang — Fast serving framework for LLMs and vision-language models.
- Prime Intellect — Platform for decentralized, globally distributed training.
- TensorRT-LLM — NVIDIA library for optimizing LLM inference.
- TGI by Hugging Face — Text Generation Inference, a production LLM serving toolkit.
- Horovod — Distributed deep learning across GPUs.
- NVLink & PCIe Topology — GPU interconnects and bandwidth optimization.
- GPUDirect RDMA — Zero-copy GPU networking.
- Ray Train, DeepSpeed, Megatron-LM — Large-scale GPU orchestration frameworks.
- Iris by AMD — Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
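The communication pattern behind NCCL-style all-reduce can be simulated in plain Python. The sketch below (the function name and step schedule are this sketch's own, not NCCL's API) runs n-1 reduce-scatter steps followed by n-1 all-gather steps around a ring, after which every rank holds the elementwise global sum while each rank only ever talks to its neighbors.

```python
def ring_allreduce(chunks_per_rank):
    """Simulate a ring all-reduce in pure Python.

    chunks_per_rank: one list of numbers per 'GPU'; each list has one
    entry (chunk) per rank. A real ring moves 2*(n-1) messages per rank:
    n-1 reduce-scatter steps followed by n-1 all-gather steps.
    """
    n = len(chunks_per_rank)
    data = [list(c) for c in chunks_per_rank]
    # Reduce-scatter: each rank accumulates partial sums for one chunk.
    for step in range(n - 1):
        snap = [row[:] for row in data]    # all ranks exchange simultaneously
        for r in range(n):
            src = (r - 1) % n              # receive from the left neighbor
            idx = (src - step) % n
            data[r][idx] += snap[src][idx]
    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        snap = [row[:] for row in data]
        for r in range(n):
            src = (r - 1) % n
            idx = (src + 1 - step) % n
            data[r][idx] = snap[src][idx]
    return data                            # every rank: the global sum

# Three 'GPUs', three chunks each: all ranks converge to [12, 15, 18].
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The bandwidth-optimality of this schedule (each rank sends roughly 2x the data size regardless of rank count) is why ring algorithms underpin NCCL's large-message collectives.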
- CUDA C++ Programming Guide
- Triton Tutorials (OpenAI)
- CUDA in 12 hours by freeCodeCamp — video and companion repo.
- Stanford CS149: Parallel Computing (Fall 2025)
- CMU 15-418/618: Parallel Computer Architecture & Programming
- MIT 6.5940: TinyML and Efficient Deep Learning Computing
- GPU MODE video lecture series
- Red Hat vLLM Office Hours video series
- Optimization Techniques for GPU Programming — Hijma et al.
- Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads — Oden & Nölp
- Evolving GPU Architecture — Kirk & Hwu
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision — Gao et al.
- Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis — Niteesh & Ampareeshan
- NVIDIA Research Papers on Model Parallelism and Megatron-LM
- GPU Virtualization and Multi-Tenant Scheduling
- A Survey of Multi-Tenant Deep Learning Inference on GPU
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
- nvprof, nvvp (legacy), Nsight Systems / Compute — NVIDIA profiling tools.
- cuda-memcheck (legacy), compute-sanitizer — Memory and correctness checking tools.
- GPGPU-Sim, Accel-Sim — GPU simulation frameworks.
- Perfetto, Nsight UI — Visual profilers for tracing GPU workloads.
- LeetGPU
- GPU MODE Discord
- GPU Glossary — A dictionary of terms related to programming GPUs.
- PyTorch CUDA Extensions — Custom kernels for PyTorch.
- JAX + XLA — Compiler-based GPU vectorization.
- TensorFlow XLA Compiler — Ahead-of-time GPU graph compilation.
- FlashAttention, FlashConv — Kernel optimization techniques for transformers.
- DeepSpeed, FSDP, Megatron-LM — Distributed training systems.
- FlashAttention and PagedAttention
- Matmul Operations
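The tiling idea behind fast matmul kernels (and, by extension, the blocked computation in FlashAttention) can be sketched in pure Python: the inner block below stands in for the work a CUDA thread block does after staging tiles of A and B in shared memory, so each loaded element is reused `tile` times instead of refetched from DRAM. This is an illustrative sketch, not a performant implementation.

```python
def matmul_tiled(A, B, tile=2):
    """Tiled matrix multiply over nested lists: C = A @ B.

    Looping over (i-tile, j-tile, k-tile) blocks mirrors the
    shared-memory tiling used by GPU matmul kernels.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # One "shared-memory tile" worth of work: all operands
                # in this block would already be staged on-chip.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

FlashAttention applies the same principle to attention: tiles of Q, K, and V stay in on-chip memory while the softmax is computed blockwise, avoiding materializing the full attention matrix in DRAM.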
- GPU scheduling algorithms and runtime systems.
- Memory oversubscription and unified memory models.
- Resource allocation in GPU clusters.
- GPU virtualization
- Kernel fusion and graph execution
- Dataflow optimization
- Persistent threads model
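Kernel fusion, listed above, is easiest to see by counting memory traffic. In this minimal sketch (plain Python standing in for two GPU kernels versus one), the unfused version round-trips the intermediate array through "DRAM", while the fused version keeps it in registers, halving the traffic for the same result.

```python
def unfused(x):
    """Two separate 'kernels': the intermediate y hits memory."""
    y = [v * 2.0 for v in x]           # kernel 1: n reads + n writes
    z = [v + 3.0 for v in y]           # kernel 2: n reads + n writes
    return z                           # total traffic: ~4n elements

def fused(x):
    """One fused 'kernel': the intermediate stays in registers."""
    return [v * 2.0 + 3.0 for v in x]  # total traffic: ~2n elements
```

This is the transformation that frameworks apply automatically via graph execution (e.g. CUDA Graphs, XLA, `torch.compile`): elementwise chains are fused so bandwidth-bound workloads touch DRAM as few times as possible.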
Contributions welcome!
Please read the contribution guidelines before submitting a pull request.
CC BY 4.0 — feel free to share and adapt with attribution.
Inspired by:
“GPU engineering is not just about writing kernels. It’s about understanding how systems work.” — Model Craft