🚀 The feature, motivation and pitch
Hi everyone,
We’re currently extending our GFM-RAG model to support reasoning over large-scale graphs and would appreciate your insights.
Motivation
The existing message-passing frameworks for GNNs run on a single GPU, so they cannot scale to large graphs because of GPU memory constraints.
Existing distributed (multi-GPU) GNN training frameworks (e.g., PyG, DGL) focus on node-based subgraph partitioning and on learning unconditioned node embeddings.
This may not work well for GNN-as-reasoner models (e.g., NBFNet, ULTRA, and GFM-RAG), which need to propagate messages to all nodes conditioned on specific query nodes to make predictions.
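For context, here is a minimal sketch (not the actual NBFNet/ULTRA/GFM-RAG code) of what "query-conditioned" means here: node states are initialized per query, e.g., as an indicator on the head entity labeled with the query relation, so each layer must propagate messages to all nodes for every query rather than look up a fixed, query-independent embedding table. The function and argument names are illustrative.

```python
import torch

def query_conditioned_init(num_nodes: int, head_index: torch.Tensor,
                           query_rel_emb: torch.Tensor) -> torch.Tensor:
    """Boundary condition for query-conditioned reasoning (NBFNet-style sketch).

    head_index: [num_queries] head entity index of each query (illustrative).
    query_rel_emb: [num_queries, dim] embedding of each query relation.
    Returns hidden states of shape [num_queries, num_nodes, dim] that are zero
    everywhere except at the head entity of each query.
    """
    num_queries, dim = query_rel_emb.shape
    hidden = torch.zeros(num_queries, num_nodes, dim)
    hidden[torch.arange(num_queries), head_index] = query_rel_emb
    return hidden
```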
Solution
We’re considering an SPMD (Single-Program-Multiple-Data) approach to distributed GNN message passing across multiple GPUs, to support GNN reasoning on large-scale graphs.
The idea is to keep a full copy of the node embeddings on each GPU while partitioning only the edges. Message passing would run locally over each GPU's edge shard, and the partial node updates would be all-reduced (summed) across GPUs after each layer.
This differs from the typical node-based subgraph partitioning used in frameworks like DGL or PyG, which may not suit our model architecture, where messages are propagated to all nodes conditioned on certain query nodes during reasoning.
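To make the idea concrete, here is a rough sketch of one such layer, under several assumptions (not a working implementation): `torch.distributed` is already initialized (e.g., via `torchrun`), each rank holds the full node-state tensor plus only its own edge shard, and names like `edge_index_local` and `edge_weight_local` are placeholders for the real message function.

```python
import torch
import torch.distributed as dist

def spmd_edge_partition_layer(hidden: torch.Tensor,
                              edge_index_local: torch.Tensor,
                              edge_weight_local: torch.Tensor) -> torch.Tensor:
    """One message-passing layer over this rank's edge shard (sketch).

    hidden: [num_nodes, dim], replicated on every GPU.
    edge_index_local: [2, num_local_edges] edges assigned to this rank.
    edge_weight_local: [num_local_edges] per-edge weights (placeholder for a
    relation-specific message function).
    """
    src, dst = edge_index_local
    # Compute messages only for the locally stored edges.
    messages = hidden[src] * edge_weight_local.unsqueeze(-1)
    # Aggregate the local messages into a partial node update.
    partial = torch.zeros_like(hidden)
    partial.index_add_(0, dst, messages)
    # Sum the partial aggregations across ranks so every GPU holds the same
    # updated node states before the next layer.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```

The trade-off is that each layer pays one all-reduce over the full node-state tensor instead of the halo/remote-neighbor exchanges required by node-based partitioning.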
[Figures] Comparison between node-level partition and edge-level partition: (1) node-level partition; (2) edge-level partition with SPMD-style message passing.
Do you think this is a feasible and effective feature to explore further?
Alternatives
No response
Additional context
No response