
Conversation

siju-samuel
Contributor

This PR introduces foundational support for enabling XPU devices with XCCL as the backend in TorchFT, as per the RFC.

Key highlights:

  • Added ProcessGroupXCCL/ProcessGroupBabyXCCL implementation
  • Integrated XPU device handling into manager/processgroup
  • Managed streams and events
  • Updated the train_ddp.py example to run in an Intel GPU environment
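
A minimal sketch of how these pieces might be used together, assuming the XCCL process groups can be constructed without arguments like their NCCL counterparts (the actual signatures may differ):

    import torch
    from torchft.process_group import ProcessGroupBabyXCCL, ProcessGroupXCCL

    def build_xpu_pg(use_subprocess: bool = False):
        # Pick the XCCL-backed process group when an Intel GPU (XPU) is present.
        if torch.xpu.is_available():
            return ProcessGroupBabyXCCL() if use_subprocess else ProcessGroupXCCL()
        raise RuntimeError("No XPU device found; use the NCCL/Gloo process groups instead.")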

cc @tushar00jain @jeromean @gujinghui @zhangxiaoli73 @rbabukv


meta-cla bot commented Aug 20, 2025

Hi @siju-samuel!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla bot added the CLA Signed label on Aug 20, 2025

meta-cla bot commented Aug 20, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@tushar00jain requested a review from d4l3k on August 21, 2025 19:42
@siju-samuel force-pushed the initial_xpu_support branch 3 times, most recently from 02a0cfc to 616eec9, on August 25, 2025 11:08
@tushar00jain
Contributor

@siju-samuel looks good, would you mind running train_diloco.py with

  • gloo + gpu tensors
  • gloo + cpu tensors
  • nccl + gpu tensors
  • xccl + gpu tensors

and posting a screenshot of the GPU profiles here, to make sure this change doesn't break the communication/computation overlap?
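
For reference, the traces could be captured along these lines with torch.profiler (a sketch only; ProfilerActivity.XPU needs a recent PyTorch build with Intel GPU support, and train_step() stands in for one train_diloco.py iteration):

    import torch
    from torch.profiler import ProfilerActivity, profile

    # Trace CPU plus whichever device backend is available.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        activities.append(ProfilerActivity.XPU)

    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(10):
            train_step()  # placeholder for one training iteration
    prof.export_chrome_trace("profile.json")  # open in chrome://tracing or Perfetto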

@siju-samuel
Contributor Author

@tushar00jain Attaching the profiling data for train_diloco.py with and without my changes.
Profiling was done on an A6000 device.

  • Gloo + GPU tensors
gloo-gpu
  • Gloo + CPU tensors
gloo-cpu
  • NCCL + GPU tensors
nccl-gpu
  • XCCL + GPU tensors
xccl-gpu

The corresponding JSON profiles are included in the ZIP:
gloo-nccl-xccl-profiles.zip

Please let me know if anything else is needed.

Also, I noticed that in the last CI run, a few tests are failing.

Mac unittest failures:

torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_failure_recovery_0 FAILED [ 13%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_0 FAILED [ 13%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_1 FAILED [ 14%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_2 FAILED [ 14%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_3 FAILED [ 15%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_4 FAILED [ 15%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_5 FAILED [ 16%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_00 FAILED [ 25%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_01 FAILED [ 26%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_02 FAILED [ 26%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_03 FAILED [ 27%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_04 FAILED [ 27%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_05 FAILED [ 28%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_06 FAILED [ 28%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_07 FAILED [ 29%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_08 FAILED [ 29%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_09 FAILED [ 30%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_10 FAILED [ 30%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_11 FAILED [ 31%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_recovery_0 FAILED [ 32%]

Not sure why these test cases are failing on Mac.

CUDA-related failures:

torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_baby_nccl FAILED [  4%]
torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_baby_nccl_inplace FAILED [  4%]
torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_gloo FAILED [  4%]

When I ran them locally, the first two fail with
TypeError: cannot pickle 'torch.Event' object
and the third test case, test_pg_transport_gloo, passes locally.

torchft_test/checkpointing/pg_transport_test.py:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/transport_test.py:146: in run_multi_recovery_test
    transports.append(fut.result())
/usr/lib/python3.12/concurrent/futures/_base.py:449: in result
    return self.__get_result()
/usr/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/usr/lib/python3.12/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/transport_test.py:86: in run
    got = transport.recv_checkpoint(
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/pg_transport.py:239: in recv_checkpoint
    self._pg.recv([len_t], src_rank, tag=1).wait(timeout)
/usr/local/lib/python3.12/dist-packages/torchft/process_group.py:1771: in recv
    return self._run_func("recv", tensors, src_rank, tag)
/usr/local/lib/python3.12/dist-packages/torchft/process_group.py:1674: in _run_func
    pipe.send(
/usr/local/lib/python3.12/dist-packages/torchft/multiprocessing.py:15: in send
    self._pipe.send(obj)
/usr/lib/python3.12/multiprocessing/connection.py:206: in send
    self._send_bytes(_ForkingPickler.dumps(obj))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'multiprocessing.reduction.ForkingPickler'>, obj = ('func', 0, 'recv', ([tensor([0], device='cuda:1')], 0, 1), {}, device(type='cuda', index=1), ...), protocol = None

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
>       cls(buf, protocol).dump(obj)
E       TypeError: cannot pickle 'torch.Event' object

/usr/lib/python3.12/multiprocessing/reduction.py:51: TypeError

Why is torch.Event being pickled here? If pickling is required, how should we handle serialization of a torch.Event object?
Instead of torch.Event, should I add conditional checks to handle CUDA/XPU events and streams?
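
For illustration, the conditional approach mentioned above could look roughly like this (a sketch only, not the actual fix that later landed in c65eed8):

    import torch

    def make_device_event(device: torch.device):
        # Return a backend-specific event for CUDA/XPU; the CPU/Gloo path needs no
        # device event, which avoids sending an event object over the pipe at all.
        if device.type == "cuda":
            return torch.cuda.Event()
        if device.type == "xpu":
            return torch.xpu.Event()
        return None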

@tushar00jain
Contributor

tushar00jain commented Sep 3, 2025

@siju-samuel thanks for putting this together! The before/after for the profiles look good. Do we need any special machines to test out XCCL? Is there XCCL support in the pytorch profiler? The profiler doesn't seem to show the allreduce.

Some of the tests you mentioned that are failing in CI are configured to be skipped, so I'm not sure why they're running.

@siju-samuel
Contributor Author

Do we need any special machines to test out XCCL?

Yes, this is supported only with Intel GPU (XPU) hardware.

Is there XCCL support in the pytorch profiler? The profiler doesn't seem to show the allreduce.

Yes, it is supported, but the trace collection method at the kernel level is slightly different. In the profile data I shared, the allreduce kernel appears alongside other kernels in the same stream. I need to check whether there is an actual issue on XPU or only a tracing problem.

The CI failure due to TypeError: cannot pickle 'torch.Event' object is now fixed with commit c65eed8.

@tushar00jain
Contributor

@siju-samuel for the Mac unit tests, maybe the skipIf is not working with parameterized expand. We can add an explicit return statement inside the test if the platform is darwin; that should make the Mac unit tests pass.
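
A sketch of that explicit skip (the test and parameter names here are just illustrative):

    import sys
    from unittest import TestCase
    from parameterized import parameterized

    class DiLoCoMockedUpdateTest(TestCase):
        @parameterized.expand([(0,), (1,)])
        def test_diloco_mocked_updates(self, variant: int) -> None:
            if sys.platform == "darwin":
                self.skipTest("not supported on macOS")  # or simply `return`, as suggested
            ...  # rest of the test body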

@siju-samuel
Contributor Author

Thanks @tushar00jain.
You are right: the skipIf was not working with parameterized (only pytest's skipif works with parameterized).
Submitted a patch to skip inside the function.

@siju-samuel
Contributor Author

Hi @tushar00jain, just a gentle reminder whenever you get a chance to review this PR and trigger the CI. Thanks!

@siju-samuel
Contributor Author

            if torch.accelerator.is_available():
>               torch.accelerator.current_stream().synchronize()
E               RuntimeError: Backend doesn't support synchronizing streams.

The previous CI run hit a strange error while using the torch.accelerator API in a Mac unit test.
I switched to another utility function for the same purpose.
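
For reference, a backend-guarded synchronization helper along these lines avoids the error (the helper name is made up; the actual utility used in the patch may differ):

    import torch

    def sync_current_stream() -> None:
        # Only synchronize on backends that actually expose streams.
        if torch.cuda.is_available():
            torch.cuda.current_stream().synchronize()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.current_stream().synchronize()
        # CPU/MPS-only environments (e.g. the Mac CI runners) have nothing to synchronize here.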

@siju-samuel
Contributor Author

torchft/process_group_test.py::BabyNcclMultiPgTest::test_collective_08_reduce_scatter FAILED [ 78%]
        expected_sum = expected_row * world_sz
>       torch.testing.assert_close(out, expected_sum)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 2 / 2 (100.0%)
E       Greatest absolute difference: 0.0 at index (0,) (up to 1e-05 allowed)
E       Greatest relative difference: 0.0 at index (0,) (up to 1.3e-06 allowed)

Looks like a random failure; the CUDA tests were passing before, and the Mac tests passed in the last run.
@tushar00jain could you please trigger CI once again? Thanks!

@tushar00jain
Contributor

@siju-samuel I can trigger it, but I haven't seen those failures before. Usually that's an indication that something is wrong. Can you run it a couple of times locally?
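
One quick way to repeat that single test locally a few times (the node ID is copied from the failure above):

    import subprocess
    import sys

    TEST_ID = "torchft/process_group_test.py::BabyNcclMultiPgTest::test_collective_08_reduce_scatter"
    for attempt in range(5):
        # Run each attempt in a fresh process so CUDA state doesn't leak between runs.
        result = subprocess.run([sys.executable, "-m", "pytest", "-x", TEST_ID])
        print(f"attempt {attempt}: exit code {result.returncode}")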

@tushar00jain tushar00jain merged commit f5888e9 into meta-pytorch:main Sep 19, 2025
8 of 9 checks passed