🚀 The feature, motivation and pitch
Hi everyone,
We’re currently extending our GFM-RAG model to support reasoning over large-scale graphs and would appreciate your insights.
Motivation
The existing message-passing frameworks for GNNs run on a single GPU, so they cannot scale to large graphs because of GPU memory constraints.
Existing distributed (multi-GPU) GNN training frameworks (e.g., PyG, DGL) focus on node-based subgraph partitioning and on learning unconditioned node embeddings.
This may not work well for GNN-as-reasoner models (e.g., NBFNet, ULTRA, and GFM-RAG), which need to propagate messages to all nodes conditioned on specific query nodes to make predictions.
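For context, here is a minimal sketch (not the actual NBFNet/ULTRA/GFM-RAG code) of what "query-conditioned" means here: node states are initialized per query, e.g., as an indicator on the head entity labeled with the query relation, so each layer must propagate messages to all nodes for every query rather than look up a fixed, query-independent embedding table. The function and argument names are illustrative.

```python
import torch

def query_conditioned_init(num_nodes: int, head_index: torch.Tensor,
                           query_rel_emb: torch.Tensor) -> torch.Tensor:
    """Boundary condition for query-conditioned reasoning (NBFNet-style sketch).

    head_index: [num_queries] head entity index of each query (illustrative).
    query_rel_emb: [num_queries, dim] embedding of each query relation.
    Returns hidden states of shape [num_queries, num_nodes, dim] that are zero
    everywhere except at the head entity of each query.
    """
    num_queries, dim = query_rel_emb.shape
    hidden = torch.zeros(num_queries, num_nodes, dim)
    hidden[torch.arange(num_queries), head_index] = query_rel_emb
    return hidden
```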
Solution
We’re considering an SPMD (Single-Program-Multiple-Data) approach to distributed GNN message passing across multiple GPUs, to support GNN reasoning on large-scale graphs.
The idea is to keep a full copy of the node embeddings on each GPU while partitioning only the edges. Message passing would run locally over each GPU's edge shard, and the partial node updates would be all-reduced (summed) across GPUs after each layer.
This differs from the typical node-based subgraph partitioning used in frameworks like DGL or PyG, which may not suit our model architecture, where messages are propagated to all nodes conditioned on certain query nodes during reasoning.
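To make the idea concrete, here is a rough sketch of one such layer, under several assumptions (not a working implementation): `torch.distributed` is already initialized (e.g., via `torchrun`), each rank holds the full node-state tensor plus only its own edge shard, and names like `edge_index_local` and `edge_weight_local` are placeholders for the real message function.

```python
import torch
import torch.distributed as dist

def spmd_edge_partition_layer(hidden: torch.Tensor,
                              edge_index_local: torch.Tensor,
                              edge_weight_local: torch.Tensor) -> torch.Tensor:
    """One message-passing layer over this rank's edge shard (sketch).

    hidden: [num_nodes, dim], replicated on every GPU.
    edge_index_local: [2, num_local_edges] edges assigned to this rank.
    edge_weight_local: [num_local_edges] per-edge weights (placeholder for a
    relation-specific message function).
    """
    src, dst = edge_index_local
    # Compute messages only for the locally stored edges.
    messages = hidden[src] * edge_weight_local.unsqueeze(-1)
    # Aggregate the local messages into a partial node update.
    partial = torch.zeros_like(hidden)
    partial.index_add_(0, dst, messages)
    # Sum the partial aggregations across ranks so every GPU holds the same
    # updated node states before the next layer.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```

The trade-off is that each layer pays one all-reduce over the full node-state tensor instead of the halo/remote-neighbor exchanges required by node-based partitioning.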
[Figures] Comparison between node-level partition and edge-level partition: (1) node-level partition; (2) edge-level partition with SPMD-style message passing.
Do you think this is a feasible and effective feature to explore further?
Alternatives
No response
Additional context
No response