Hi, I'm trying to run the MetaShuffling kernel on an 8xH100 node with the following configuration:
torch: 2.8.0.dev20250522+cu128
triton: commit 2db6370f (ws-3.2.x branch)
fbgemm: commit 0f8fde4
CUDA Toolkit: 12.8
However, when I run python -m moe.layers_test --testing under fbgemm_gpu/experimental/gen_ai/test, I get the following error (a quick version check and a note on the secondary failure are included at the end of this report):
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Fatal Python error: Segmentation fault
Current thread 0x00007f8daecc7540 (most recent call first):
File "/mnt/task_runtime/triton/python/triton/backends/nvidia/compiler.py", line 256 in make_ttgir
File "/mnt/task_runtime/triton/python/triton/backends/nvidia/compiler.py", line 391 in <lambda>
File "/mnt/task_runtime/triton/python/triton/compiler/compiler.py", line 279 in compile
File "/mnt/task_runtime/triton/python/triton/runtime/jit.py", line 623 in run
File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 149 in kernel_call
File "/mnt/task_runtime/triton/python/triton/testing.py", line 117 in do_bench
File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 163 in _bench
File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 183 in run
File "/mnt/task_runtime/triton/python/triton/runtime/jit.py", line 330 in <lambda>
File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py", line 1124 in _grouped_gemm
File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py", line 1139 in grouped_gemm
File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 1188 in _routed_expert
File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 781 in _no_comm_forward
File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 559 in forward
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778 in _call_impl
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767 in _wrapped_call_impl
File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 168 in run_demo
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357 in wrapper
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 616 in _wrap
File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 90 in _wrap
File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cuda_utils, __triton_launcher (total: 26)
W0527 16:53:04.179000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391173 via signal SIGTERM
W0527 16:53:04.180000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391174 via signal SIGTERM
W0527 16:53:04.181000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391175 via signal SIGTERM
W0527 16:53:04.181000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391176 via signal SIGTERM
W0527 16:53:04.182000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391177 via signal SIGTERM
W0527 16:53:04.182000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391178 via signal SIGTERM
W0527 16:53:04.183000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391179 via signal SIGTERM
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] failed (exitcode: -11) local_rank: 0 (pid: 391172) of fn: run_demo (start_method: spawn)
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 692, in _poll
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] self._pc.join(-1)
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 196, in join
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] raise ProcessExitedException(
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=1, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:4785, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=2, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:6321, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=3, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:7857, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=0, retry=4, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:11441, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[E527 16:53:12.548275280 ProcessGroupGloo.cpp:69] Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:152] timed out connecting: SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
W0527 16:53:13.708000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393605 via signal SIGTERM
W0527 16:53:13.709000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393606 via signal SIGTERM
W0527 16:53:13.710000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393607 via signal SIGTERM
W0527 16:53:13.711000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393608 via signal SIGTERM
W0527 16:53:13.712000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393610 via signal SIGTERM
W0527 16:53:13.714000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393611 via signal SIGTERM
W0527 16:53:13.715000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393612 via signal SIGTERM
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] failed (exitcode: 1) local_rank: 4 (pid: 393609) of fn: run_demo (start_method: spawn)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 692, in _poll
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] self._pc.join(-1)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] raise ProcessRaisedException(msg, error_index, failed_process.pid)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] torch.multiprocessing.spawn.ProcessRaisedException:
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] -- Process 4 terminated with the following error:
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] fn(i, *args)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 616, in _wrap
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] ret = record(fn)(*args_)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] ^^^^^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] return f(*args, **kwargs)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] ^^^^^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 220, in run_demo
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] torch.distributed.destroy_process_group()
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2124, in destroy_process_group
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] assert pg is not None
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] ^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] AssertionError
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 280, in <module>
main()
File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 271, in main
launcher.elastic_launch(get_launch_config(), entrypoint=run_demo)(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_demo FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-27_16:53:12
host : bolt-7dzarm5enq-nn8kv4xabc.bolt-pods.turi-bolt.svc.cluster.local
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 393609)
error_file: /mnt/tmp/torchelastic_qpzswico/DEFAULT_RUN_ID_9c7b2wb0/attempt_1/4/error.json
traceback : Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 220, in run_demo
torch.distributed.destroy_process_group()
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2124, in destroy_process_group
assert pg is not None
^^^^^^^^^^^^^^
AssertionError
============================================================
The full log is too long to paste, but I think this is the informative part. Could anyone help?
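
For reference, this is how I double-check the versions inside the container (a minimal sanity check; since my Triton is built from the ws-3.2.x branch, triton.__version__ may not exactly match the commit id listed above):

```python
import torch
import triton

# Report the exact torch / CUDA / Triton versions the test process actually sees.
print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)
print("triton:", triton.__version__)
print("device:", torch.cuda.get_device_name(0))
```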
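
Side note on the second traceback: the AssertionError on rank 4 looks like fallout from the rank-0 segfault rather than a separate bug; by the time run_demo reaches torch.distributed.destroy_process_group(), the default process group was never (or is no longer) initialized. Guarding the teardown like this (a sketch for illustration, not the actual layers_test.py code) keeps the report focused on the real SIGSEGV in Triton's make_ttgir:

```python
import torch.distributed as dist

# Tear down the default process group only if it is still alive;
# after a peer rank crashes during setup, it may never have been created.
if dist.is_initialized():
    dist.destroy_process_group()
```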