
Conversation

siju-samuel
Contributor

This PR introduces foundational support for enabling XPU devices with XCCL as the backend in TorchFT, as per the RFC.

Key highlights:

  • Added ProcessGroupXCCL/ProcessGroupBabyXCCL implementation
  • Integrated XPU device handling into manager/processgroup
  • Managed streams and events
  • Updated the train_ddp.py example to run in an Intel GPU environment
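
A minimal sketch of how these pieces might be used together, assuming the XCCL process groups can be constructed without arguments like their NCCL counterparts (the actual signatures may differ):

    import torch
    from torchft.process_group import ProcessGroupBabyXCCL, ProcessGroupXCCL

    def build_xpu_pg(use_subprocess: bool = False):
        # Pick the XCCL-backed process group when an Intel GPU (XPU) is present.
        if torch.xpu.is_available():
            return ProcessGroupBabyXCCL() if use_subprocess else ProcessGroupXCCL()
        raise RuntimeError("No XPU device found; use the NCCL/Gloo process groups instead.")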

cc @tushar00jain @jeromean @gujinghui @zhangxiaoli73 @rbabukv


meta-cla bot commented Aug 20, 2025

Hi @siju-samuel!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla bot added the CLA Signed label on Aug 20, 2025

meta-cla bot commented Aug 20, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@tushar00jain requested a review from d4l3k on August 21, 2025 19:42
@siju-samuel force-pushed the initial_xpu_support branch 3 times, most recently from 02a0cfc to 616eec9, on August 25, 2025 11:08
@tushar00jain
Contributor

@siju-samuel looks good, would you mind running train_diloco.py with

  • gloo + gpu tensors
  • gloo + cpu tensors
  • nccl + gpu tensors
  • xccl + gpu tensors

and posting a screenshot of the GPU profiles here, to make sure this change doesn't break the communication/computation overlap?
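
For reference, the traces could be captured along these lines with torch.profiler (a sketch only; ProfilerActivity.XPU needs a recent PyTorch build with Intel GPU support, and train_step() stands in for one train_diloco.py iteration):

    import torch
    from torch.profiler import ProfilerActivity, profile

    # Trace CPU plus whichever device backend is available.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        activities.append(ProfilerActivity.XPU)

    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(10):
            train_step()  # placeholder for one training iteration
    prof.export_chrome_trace("profile.json")  # open in chrome://tracing or Perfetto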

@siju-samuel
Contributor Author

@tushar00jain Attaching the profiling data for train_diloco.py with and without my changes.
Profiling was done on an A6000 device.

  • Gloo + GPU tensors
gloo-gpu
  • Gloo + CPU tensors
gloo-cpu
  • NCCL + GPU tensors
nccl-gpu
  • XCCL + GPU tensors
xccl-gpu

The corresponding JSON profiles are included in the ZIP:
gloo-nccl-xccl-profiles.zip

Please let me know if anything else is needed.

Also, I noticed that in the last CI run, a few tests are failing.

Mac unittest failures:

torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_failure_recovery_0 FAILED [ 13%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_0 FAILED [ 13%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_1 FAILED [ 14%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_2 FAILED [ 14%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_3 FAILED [ 15%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_4 FAILED [ 15%]
torchft/diloco_regression_test.py::DiLoCoMockedUpdateTest::test_diloco_mocked_updates_5 FAILED [ 16%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_00 FAILED [ 25%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_01 FAILED [ 26%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_02 FAILED [ 26%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_03 FAILED [ 27%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_04 FAILED [ 27%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_05 FAILED [ 28%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_06 FAILED [ 28%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_07 FAILED [ 29%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_08 FAILED [ 29%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_09 FAILED [ 30%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_10 FAILED [ 30%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_commit_failure_11 FAILED [ 31%]
torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_streaming_diloco_recovery_0 FAILED [ 32%]

Not sure why these test cases are failing on Mac.

CUDA-related failures:

torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_baby_nccl FAILED [  4%]
torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_baby_nccl_inplace FAILED [  4%]
torchft/checkpointing/pg_transport_test.py::PGTransportTest::test_pg_transport_gloo FAILED [  4%]

When I ran them locally, the first two fail with
TypeError: cannot pickle 'torch.Event' object
and the third test case, test_pg_transport_gloo, passes locally.

torchft_test/checkpointing/pg_transport_test.py:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/transport_test.py:146: in run_multi_recovery_test
    transports.append(fut.result())
/usr/lib/python3.12/concurrent/futures/_base.py:449: in result
    return self.__get_result()
/usr/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/usr/lib/python3.12/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/transport_test.py:86: in run
    got = transport.recv_checkpoint(
/usr/local/lib/python3.12/dist-packages/torchft/checkpointing/pg_transport.py:239: in recv_checkpoint
    self._pg.recv([len_t], src_rank, tag=1).wait(timeout)
/usr/local/lib/python3.12/dist-packages/torchft/process_group.py:1771: in recv
    return self._run_func("recv", tensors, src_rank, tag)
/usr/local/lib/python3.12/dist-packages/torchft/process_group.py:1674: in _run_func
    pipe.send(
/usr/local/lib/python3.12/dist-packages/torchft/multiprocessing.py:15: in send
    self._pipe.send(obj)
/usr/lib/python3.12/multiprocessing/connection.py:206: in send
    self._send_bytes(_ForkingPickler.dumps(obj))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'multiprocessing.reduction.ForkingPickler'>, obj = ('func', 0, 'recv', ([tensor([0], device='cuda:1')], 0, 1), {}, device(type='cuda', index=1), ...), protocol = None

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
>       cls(buf, protocol).dump(obj)
E       TypeError: cannot pickle 'torch.Event' object

/usr/lib/python3.12/multiprocessing/reduction.py:51: TypeError

Why is torch.Event being pickled here? If pickling is required, how should we handle serialization of a torch.Event object?
Instead of torch.Event, should I add conditional checks to handle CUDA/XPU events and streams?
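
For illustration, the conditional approach mentioned above could look roughly like this (a sketch only, not the actual fix that later landed in c65eed8):

    import torch

    def make_device_event(device: torch.device):
        # Return a backend-specific event for CUDA/XPU; the CPU/Gloo path needs no
        # device event, which avoids sending an event object over the pipe at all.
        if device.type == "cuda":
            return torch.cuda.Event()
        if device.type == "xpu":
            return torch.xpu.Event()
        return None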

@tushar00jain
Contributor

tushar00jain commented Sep 3, 2025

@siju-samuel thanks for putting this together! The before/after for the profiles look good. Do we need any special machines to test out XCCL? Is there XCCL support in the pytorch profiler? The profiler doesn't seem to show the allreduce.

Some of the tests you mentioned that are failing in CI are configured to be skipped, so I'm not sure why they're running.

@siju-samuel
Contributor Author

Do we need any special machines to test out XCCL?

Yes, this is supported only with Intel GPU (XPU) hardware.

Is there XCCL support in the pytorch profiler? The profiler doesn't seem to show the allreduce.

Yes, it is supported, but the trace collection method at the kernel level is slightly different. In the profile data I shared, the allreduce kernel appears alongside other kernels in the same stream. I need to check whether there is an actual issue on XPU or only a tracing problem.

The CI failure due to TypeError: cannot pickle 'torch.Event' object is now fixed with commit c65eed8.

@tushar00jain
Contributor

@siju-samuel for the Mac unit tests, maybe the skipIf is not working with parameterized expand. We can add an explicit return statement inside the test if the platform is darwin; that should make the Mac unit tests pass.
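
A sketch of that explicit skip (the test and parameter names here are just illustrative):

    import sys
    from unittest import TestCase
    from parameterized import parameterized

    class DiLoCoMockedUpdateTest(TestCase):
        @parameterized.expand([(0,), (1,)])
        def test_diloco_mocked_updates(self, variant: int) -> None:
            if sys.platform == "darwin":
                self.skipTest("not supported on macOS")  # or simply `return`, as suggested
            ...  # rest of the test body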

@siju-samuel
Contributor Author

Thanks @tushar00jain.
You are right: the skipIf was not working with parameterized (only pytest's skipif works with parameterized).
Submitted a patch to skip inside the function.

@siju-samuel
Contributor Author

Hi @tushar00jain, just a gentle reminder whenever you get a chance to review this PR and trigger the CI. Thanks!

@siju-samuel
Contributor Author

            if torch.accelerator.is_available():
>               torch.accelerator.current_stream().synchronize()
E               RuntimeError: Backend doesn't support synchronizing streams.

The previous CI run hit a strange error while using the torch.accelerator API in a Mac unit test.
I switched to another utility function for the same purpose.
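
For reference, a backend-guarded synchronization helper along these lines avoids the error (the helper name is made up; the actual utility used in the patch may differ):

    import torch

    def sync_current_stream() -> None:
        # Only synchronize on backends that actually expose streams.
        if torch.cuda.is_available():
            torch.cuda.current_stream().synchronize()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.current_stream().synchronize()
        # CPU/MPS-only environments (e.g. the Mac CI runners) have nothing to synchronize here.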

@siju-samuel
Contributor Author

torchft/process_group_test.py::BabyNcclMultiPgTest::test_collective_08_reduce_scatter FAILED [ 78%]
        expected_sum = expected_row * world_sz
>       torch.testing.assert_close(out, expected_sum)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 2 / 2 (100.0%)
E       Greatest absolute difference: 0.0 at index (0,) (up to 1e-05 allowed)
E       Greatest relative difference: 0.0 at index (0,) (up to 1.3e-06 allowed)

Looks like a random failure; the CUDA tests were passing before, and the Mac tests passed in the last run.
@tushar00jain could you please trigger CI once again? Thanks!

@tushar00jain
Contributor

@siju-samuel I can trigger it, but I haven't seen those failures before. Usually that's an indication that something is wrong. Can you run it a couple of times locally?
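
One quick way to repeat that single test locally a few times (the node ID is copied from the failure above):

    import subprocess
    import sys

    TEST_ID = "torchft/process_group_test.py::BabyNcclMultiPgTest::test_collective_08_reduce_scatter"
    for attempt in range(5):
        # Run each attempt in a fresh process so CUDA state doesn't leak between runs.
        result = subprocess.run([sys.executable, "-m", "pytest", "-x", TEST_ID])
        print(f"attempt {attempt}: exit code {result.returncode}")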

@tushar00jain tushar00jain merged commit f5888e9 into meta-pytorch:main Sep 19, 2025
8 of 9 checks passed