[Intel GPU] Extending TorchFT to Support Intel GPU with XCCL Backend #260
Hi @siju-samuel! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
02a0cfc to 616eec9
616eec9 to f21ea28
@siju-samuel looks good. Would you mind running
and posting a screenshot of the GPU profiles here, to make sure this change doesn't break the communication/computation overlap?
@tushar00jain Attaching the profiling data for

The corresponding JSON profiles are included in the ZIP:

Please let me know if anything else is needed.

Also, I noticed that a few tests failed in the last CI run.

Mac unittest failures:

I'm not sure why these test cases are failing on Mac.

CUDA-related failures:

When I ran them locally, the first two failed due to

Why is
@siju-samuel thanks for putting this together! The before/after profiles look good. Do we need any special machines to test XCCL? Is there XCCL support in the PyTorch profiler? The profiler doesn't seem to show the allreduce. Some of the failing CI tests you mentioned are configured to be skipped, so I'm not sure why they're running.
a712b82 to c65eed8
Yes, this is supported only on Intel GPU (XPU) hardware.
Yes, the profiler is supported, but the trace collection method at the kernel level is slightly different. In the profile data I provided, the allreduce kernel appears alongside other kernels in the same stream. I need to check whether this is an actual issue on XPU or only a tracing problem.
This is now fixed in commit c65eed8.
@siju-samuel for the Mac unit tests, maybe the skipif is not working with parameterized.expand. We can add an explicit return statement inside the test if the platform is darwin. That should make the Mac unit tests pass.
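The explicit platform guard suggested above could look roughly like the following. This is a minimal sketch using only the standard library, not the actual TorchFT test: the test class, the `is_darwin` helper, and the `BACKENDS` list are hypothetical names, and `subTest` stands in for the parameterized decorator so the example stays self-contained.

```python
import sys
import unittest

# Hypothetical parameter set, for illustration only.
BACKENDS = ["gloo", "nccl", "xccl"]


def is_darwin(platform: str = sys.platform) -> bool:
    """Return True when running on macOS (sys.platform == "darwin")."""
    return platform == "darwin"


class ProcessGroupSmokeTest(unittest.TestCase):
    def test_backend_names(self) -> None:
        # Explicit guard inside the test body: it still takes effect even
        # if a decorator-based skip is not applied to the expanded tests.
        if is_darwin():
            return
        for backend in BACKENDS:
            with self.subTest(backend=backend):
                self.assertTrue(backend.islower())
```

A bare `return` makes the test count as passed on macOS. Calling `self.skipTest("not supported on macOS")` instead would report it as skipped, which may be easier to spot in CI logs.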
Thanks @tushar00jain.
Hi @tushar00jain, just a gentle reminder whenever you get a chance to review this PR and trigger the CI. Thanks!
ea30786 to e792a97
e792a97 to 41829de
The previous CI run hit a weird error when using the torch.accelerator API in the Mac unit test.
Looks like a random failure. The CUDA tests were passing before.
@siju-samuel I can trigger the CI, but I haven't seen those failures before. Usually that's an indication that something is wrong. Can you run the tests a couple of times locally?
This PR introduces the foundational support for enabling XPU devices with XCCL as the backend in TorchFT as per RFC
Key highlights:
- ProcessGroupXCCL/ProcessGroupBabyXCCL implementation

cc @tushar00jain @jeromean @gujinghui @zhangxiaoli73 @rbabukv
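To illustrate what selecting XCCL as the backend for XPU devices means in practice, here is a minimal sketch of a device-type to backend-name lookup. It is not code from this PR: `DEVICE_TO_BACKEND` and `backend_for_device` are hypothetical names, and the mapping assumes the usual PyTorch backend strings (`gloo` for CPU, `nccl` for CUDA, `xccl` for XPU).

```python
from typing import Dict

# Assumed mapping from device type to collective backend name.
DEVICE_TO_BACKEND: Dict[str, str] = {
    "cpu": "gloo",
    "cuda": "nccl",
    "xpu": "xccl",
}


def backend_for_device(device_type: str) -> str:
    """Look up the backend name for a device type, failing loudly if unknown."""
    try:
        return DEVICE_TO_BACKEND[device_type]
    except KeyError:
        raise ValueError(
            f"no collective backend known for device type {device_type!r}"
        ) from None
```

Keeping the lookup in one table means a new device type only needs one new entry, and unsupported devices fail with a clear error instead of silently falling back.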