A curated list of resources for mastering GPU engineering, from architecture and kernel programming to large-scale distributed systems and AI acceleration.
- Programming Massively Parallel Processors: A Hands-on Approach — David B. Kirk & Wen-mei W. Hwu. The canonical introduction to CUDA, memory hierarchies, and parallel patterns. Amazon; notes: Abi's Concise Notes
- CUDA by Example — Jason Sanders & Edward Kandrot. A practical introduction to CUDA for beginners. Amazon
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters — Hugging Face. Web Version
- CUDA — NVIDIA’s proprietary GPU programming platform.
- ROCm — AMD’s open compute stack.
- OpenCL — Cross-platform parallel computing standard.
- SYCL / oneAPI — Intel’s C++ abstraction for heterogeneous compute.
- Vulkan Compute — Low-level GPU compute API.
- Kompute — Higher-level, general-purpose GPU compute framework built on Vulkan.
- Metal Performance Shaders — Apple’s GPU framework.
- NVIDIA Nsight Systems — System-wide GPU profiler.
- Nsight Compute — Kernel-level performance analysis.
- Occupancy Calculator — NVIDIA tool (historically a spreadsheet, now integrated into Nsight Compute) for choosing kernel launch configurations.
- CUTLASS — CUDA templates for linear algebra subroutines.
- TensorRT — High-performance deep learning inference.
- OpenAI Triton — Python DSL for writing high-performance GPU kernels.
- Roofline Model — Analytical model to reason about compute/memory bottlenecks.
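The roofline model above can be sketched in a few lines: attainable performance is the minimum of machine peak and bandwidth times arithmetic intensity. The peak/bandwidth numbers below are assumed, roughly A100-class figures used purely for illustration.

```python
def roofline(peak_flops, peak_bw, flops, bytes_moved):
    """Attainable performance (FLOP/s) under the roofline model.

    peak_flops:  machine peak in FLOP/s
    peak_bw:     DRAM bandwidth in bytes/s
    flops:       floating-point operations the kernel performs
    bytes_moved: bytes transferred to/from DRAM
    """
    intensity = flops / bytes_moved          # arithmetic intensity, FLOP/byte
    return min(peak_flops, peak_bw * intensity)

# Illustrative, assumed numbers: ~19.5 TFLOP/s FP32 peak, ~1.5 TB/s HBM.
PEAK, BW = 19.5e12, 1.5e12

# SAXPY (y = a*x + y): 2 FLOPs per element, 12 bytes moved
# (read x, read y, write y) -> intensity = 1/6 FLOP/byte.
n = 1 << 20
saxpy_perf = roofline(PEAK, BW, 2 * n, 12 * n)
# Memory-bound: attainable perf is BW/6 = 0.25 TFLOP/s, far below peak.
```

Kernels whose intensity lands left of the "ridge point" (peak/bandwidth) are memory-bound; optimization effort there should target data movement, not FLOPs.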
- NVIDIA Ampere Whitepaper
- AMD RDNA & CDNA Architectures
- SIMT execution and warp scheduling
- Memory hierarchy and coalescing
- Shared memory and cache optimization
- Warp divergence and thread occupancy
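The occupancy concept above can be sketched as a stripped-down version of NVIDIA's occupancy calculator: blocks per SM are limited by threads, registers, and shared memory, and occupancy is active warps over the SM's maximum. The SM limits below (2048 threads, 64 K registers, 48 KiB shared memory per SM) are assumed illustrative values that vary by architecture, and the sketch ignores the allocation-granularity rounding the real tool applies.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads_sm=2048, max_blocks_sm=32,
              regs_sm=65536, smem_sm=48 * 1024, warp_size=32):
    """Simplified occupancy estimate: fraction of the SM's warp slots
    that can be resident given a kernel's per-block resource usage."""
    by_threads = max_threads_sm // threads_per_block
    by_regs = regs_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_sm // smem_per_block if smem_per_block else max_blocks_sm
    blocks = min(by_threads, by_regs, by_smem, max_blocks_sm)
    active_warps = blocks * (threads_per_block // warp_size)
    return active_warps / (max_threads_sm // warp_size)

# 256 threads/block at 64 registers/thread and 8 KiB shared memory:
# registers cap residency at 4 blocks -> 32 of 64 warps -> 50% occupancy.
occ = occupancy(256, 64, 8 * 1024)
```

Note how the binding constraint shifts: halving register usage to 32 per thread lifts the same kernel to full occupancy.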
- NCCL — Multi-GPU communication primitives.
- vLLM — Inference and serving engine for LLMs.
- Hugging Face Accelerate — Simplified abstractions for distributed training.
- SGLang — Fast serving framework for LLMs and vision-language models.
- Prime Intellect — Platform for decentralized, globally distributed training.
- TensorRT-LLM — NVIDIA library for optimizing LLM inference.
- TGI by Hugging Face — Text Generation Inference, a production LLM serving toolkit.
- Horovod — Distributed deep learning across GPUs.
- NVLink & PCIe Topology — GPU interconnects and bandwidth optimization.
- GPUDirect RDMA — Zero-copy GPU networking.
- Ray Train, DeepSpeed, Megatron-LM — Large-scale GPU orchestration frameworks.
- Iris by AMD — Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
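The communication pattern behind NCCL-style all-reduce can be simulated in plain Python. The sketch below (the function name and step schedule are this sketch's own, not NCCL's API) runs n-1 reduce-scatter steps followed by n-1 all-gather steps around a ring, after which every rank holds the elementwise global sum while each rank only ever talks to its neighbors.

```python
def ring_allreduce(chunks_per_rank):
    """Simulate a ring all-reduce in pure Python.

    chunks_per_rank: one list of numbers per 'GPU'; each list has one
    entry (chunk) per rank. A real ring moves 2*(n-1) messages per rank:
    n-1 reduce-scatter steps followed by n-1 all-gather steps.
    """
    n = len(chunks_per_rank)
    data = [list(c) for c in chunks_per_rank]
    # Reduce-scatter: each rank accumulates partial sums for one chunk.
    for step in range(n - 1):
        snap = [row[:] for row in data]    # all ranks exchange simultaneously
        for r in range(n):
            src = (r - 1) % n              # receive from the left neighbor
            idx = (src - step) % n
            data[r][idx] += snap[src][idx]
    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        snap = [row[:] for row in data]
        for r in range(n):
            src = (r - 1) % n
            idx = (src + 1 - step) % n
            data[r][idx] = snap[src][idx]
    return data                            # every rank: the global sum

# Three 'GPUs', three chunks each: all ranks converge to [12, 15, 18].
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

The bandwidth-optimality of this schedule (each rank sends roughly 2x the data size regardless of rank count) is why ring algorithms underpin NCCL's large-message collectives.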
- CUDA C++ Programming Guide
- Triton Tutorials (OpenAI)
- CUDA in 12 hours by freeCodeCamp — video and companion repo.
- Stanford CS149: Parallel Computing (Fall 2025)
- CMU 15-418/618: Parallel Computer Architecture & Programming
- MIT 6.5940: TinyML and Efficient Deep Learning Computing
- GPU MODE video lecture series
- Red Hat vLLM Office Hours video series
- Optimization Techniques for GPU Programming — Hijma et al.
- Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads — Oden & Nölp
- Evolving GPU Architecture — Kirk & Hwu
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision — Gao et al.
- Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis — Niteesh & Ampareeshan
- NVIDIA Research Papers on Model Parallelism and Megatron-LM
- GPU Virtualization and Multi-Tenant Scheduling
- A Survey of Multi-Tenant Deep Learning Inference on GPU
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
- nvprof, nvvp (legacy), Nsight Systems / Compute — NVIDIA profiling tools.
- cuda-memcheck (legacy), compute-sanitizer — Memory and correctness checking tools.
- GPGPU-Sim, Accel-Sim — GPU simulation frameworks.
- Perfetto, Nsight UI — Visual profilers for tracing GPU workloads.
- LeetGPU
- GPU MODE Discord
- GPU Glossary — A dictionary of terms related to programming GPUs.
- PyTorch CUDA Extensions — Custom kernels for PyTorch.
- JAX + XLA — Compiler-based GPU vectorization.
- TensorFlow XLA Compiler — Ahead-of-time GPU graph compilation.
- FlashAttention, FlashConv — Kernel optimization techniques for transformers.
- DeepSpeed, FSDP, Megatron-LM — Distributed training systems.
- FlashAttention and PagedAttention
- Matmul Operations
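The tiling idea behind fast matmul kernels (and, by extension, the blocked computation in FlashAttention) can be sketched in pure Python: the inner block below stands in for the work a CUDA thread block does after staging tiles of A and B in shared memory, so each loaded element is reused `tile` times instead of refetched from DRAM. This is an illustrative sketch, not a performant implementation.

```python
def matmul_tiled(A, B, tile=2):
    """Tiled matrix multiply over nested lists: C = A @ B.

    Looping over (i-tile, j-tile, k-tile) blocks mirrors the
    shared-memory tiling used by GPU matmul kernels.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # One "shared-memory tile" worth of work: all operands
                # in this block would already be staged on-chip.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

FlashAttention applies the same principle to attention: tiles of Q, K, and V stay in on-chip memory while the softmax is computed blockwise, avoiding materializing the full attention matrix in DRAM.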
- GPU scheduling algorithms and runtime systems.
- Memory oversubscription and unified memory models.
- Resource allocation in GPU clusters.
- GPU virtualization
- Kernel fusion and graph execution
- Dataflow optimization
- Persistent threads model
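Kernel fusion, listed above, is easiest to see by counting memory traffic. In this minimal sketch (plain Python standing in for two GPU kernels versus one), the unfused version round-trips the intermediate array through "DRAM", while the fused version keeps it in registers, halving the traffic for the same result.

```python
def unfused(x):
    """Two separate 'kernels': the intermediate y hits memory."""
    y = [v * 2.0 for v in x]           # kernel 1: n reads + n writes
    z = [v + 3.0 for v in y]           # kernel 2: n reads + n writes
    return z                           # total traffic: ~4n elements

def fused(x):
    """One fused 'kernel': the intermediate stays in registers."""
    return [v * 2.0 + 3.0 for v in x]  # total traffic: ~2n elements
```

This is the transformation that frameworks apply automatically via graph execution (e.g. CUDA Graphs, XLA, `torch.compile`): elementwise chains are fused so bandwidth-bound workloads touch DRAM as few times as possible.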
Contributions welcome!
Please read the contribution guidelines before submitting a pull request.
CC BY 4.0 — feel free to share and adapt with attribution.
Inspired by:
“GPU engineering is not just about writing kernels. It’s about understanding how systems work.” — Model Craft