MoE layers test triton compilation error #4192

Description (@moderato)

Hi, I'm trying to run the MetaShuffling kernel on an 8xH100 node with the following configuration:

torch: 2.8.0.dev20250522+cu128
triton: commit 2db6370f (ws-3.2.x branch)
fbgemm: commit 0f8fde4
CUDA Toolkit: 12.8
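
For reference, a quick way to confirm the Python-visible parts of this environment (a minimal sketch; the commit ids above come from git, and fbgemm_gpu may not expose __version__ in this build):

import torch
import triton
import fbgemm_gpu  # noqa: F401

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("triton:", triton.__version__)
print("fbgemm_gpu:", getattr(fbgemm_gpu, "__version__", "not exposed"))
print("gpu:", torch.cuda.get_device_name(0))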

However, when I run python -m moe.layers_test --testing from fbgemm_gpu/experimental/gen_ai/test, I get the following error:

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Fatal Python error: Segmentation fault

Current thread 0x00007f8daecc7540 (most recent call first):
  File "/mnt/task_runtime/triton/python/triton/backends/nvidia/compiler.py", line 256 in make_ttgir
  File "/mnt/task_runtime/triton/python/triton/backends/nvidia/compiler.py", line 391 in <lambda>
  File "/mnt/task_runtime/triton/python/triton/compiler/compiler.py", line 279 in compile
  File "/mnt/task_runtime/triton/python/triton/runtime/jit.py", line 623 in run
  File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 149 in kernel_call
  File "/mnt/task_runtime/triton/python/triton/testing.py", line 117 in do_bench
  File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 163 in _bench
  File "/mnt/task_runtime/triton/python/triton/runtime/autotuner.py", line 183 in run
  File "/mnt/task_runtime/triton/python/triton/runtime/jit.py", line 330 in <lambda>
  File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py", line 1124 in _grouped_gemm
  File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py", line 1139 in grouped_gemm
  File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 1188 in _routed_expert
  File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 781 in _no_comm_forward
  File "/usr/local/lib/python3.12/dist-packages/fbgemm_gpu/experimental/gen_ai/moe/layers.py", line 559 in forward
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1778 in _call_impl
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1767 in _wrapped_call_impl
  File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 168 in run_demo
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357 in wrapper
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 616 in _wrap
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 90 in _wrap
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cuda_utils, __triton_launcher (total: 26)

W0527 16:53:04.179000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391173 via signal SIGTERM
W0527 16:53:04.180000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391174 via signal SIGTERM
W0527 16:53:04.181000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391175 via signal SIGTERM
W0527 16:53:04.181000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391176 via signal SIGTERM
W0527 16:53:04.182000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391177 via signal SIGTERM
W0527 16:53:04.182000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391178 via signal SIGTERM
W0527 16:53:04.183000 391100 torch/multiprocessing/spawn.py:169] Terminating process 391179 via signal SIGTERM
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] failed (exitcode: -11) local_rank: 0 (pid: 391172) of fn: run_demo (start_method: spawn)
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 692, in _poll
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     self._pc.join(-1)
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 196, in join
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     raise ProcessExitedException(
E0527 16:53:05.358000 391100 torch/distributed/elastic/multiprocessing/api.py:737] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
TMA benchmarks will be running with experimental grid constant TMA descriptor.
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=1, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:4785, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=2, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:6321, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=1, retry=3, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:7857, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[/pytorch/third_party/gloo/gloo/transport/tcp/debug_logger.cc:9] ERROR failed to connect, willRetry=0, retry=4, retryLimit=3, rank=4, size=8, local=[240.62.128.148]:11441, remote=[240.62.128.148]:3473$4, error=SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
[E527 16:53:12.548275280 ProcessGroupGloo.cpp:69] Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:152] timed out connecting: SO_ERROR: Connection refused, remote=[240.62.128.148]:3473$4
W0527 16:53:13.708000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393605 via signal SIGTERM
W0527 16:53:13.709000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393606 via signal SIGTERM
W0527 16:53:13.710000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393607 via signal SIGTERM
W0527 16:53:13.711000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393608 via signal SIGTERM
W0527 16:53:13.712000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393610 via signal SIGTERM
W0527 16:53:13.714000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393611 via signal SIGTERM
W0527 16:53:13.715000 391100 torch/multiprocessing/spawn.py:169] Terminating process 393612 via signal SIGTERM
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] failed (exitcode: 1) local_rank: 4 (pid: 393609) of fn: run_demo (start_method: spawn)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 692, in _poll
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     self._pc.join(-1)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     raise ProcessRaisedException(msg, error_index, failed_process.pid)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] torch.multiprocessing.spawn.ProcessRaisedException: 
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] 
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] -- Process 4 terminated with the following error:
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] Traceback (most recent call last):
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     fn(i, *args)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 616, in _wrap
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     ret = record(fn)(*args_)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]           ^^^^^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     return f(*args, **kwargs)
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]            ^^^^^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 220, in run_demo
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     torch.distributed.destroy_process_group()
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2124, in destroy_process_group
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]     assert pg is not None
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737]            ^^^^^^^^^^^^^^
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] AssertionError
E0527 16:53:14.480000 391100 torch/distributed/elastic/multiprocessing/api.py:737] 
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 280, in <module>
    main()
  File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 271, in main
    launcher.elastic_launch(get_launch_config(), entrypoint=run_demo)(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_demo FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-27_16:53:12
  host      : bolt-7dzarm5enq-nn8kv4xabc.bolt-pods.turi-bolt.svc.cluster.local
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 393609)
  error_file: /mnt/tmp/torchelastic_qpzswico/DEFAULT_RUN_ID_9c7b2wb0/attempt_1/4/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/mnt/task_runtime/fbgemm_gpu/experimental/gen_ai/test/moe/layers_test.py", line 220, in run_demo
      torch.distributed.destroy_process_group()
    File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2124, in destroy_process_group
      assert pg is not None
             ^^^^^^^^^^^^^^
  AssertionError
  
============================================================

The full log is too long to paste, but I think this is the informative part. Could anyone help?
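
In case it helps with triage: below is a minimal single-GPU sketch I'd use to check whether the Triton make_ttgir segfault reproduces outside the distributed test harness. The problem sizes and the grouped_gemm call are my assumptions pieced together from the traceback; the actual signature in fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py may differ.

# Minimal single-GPU repro sketch (no torchelastic / Gloo involved).
# NOTE: the problem sizes and the grouped_gemm call below are assumptions
# based on the traceback; the real signature in grouped_gemm.py may differ.
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.grouped_gemm import grouped_gemm


def main() -> None:
    torch.cuda.set_device(0)
    G, M, N, K = 8, 128, 4096, 5120  # assumed expert count and per-expert shapes
    x = torch.randn(G * M, K, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(G * N, K, dtype=torch.bfloat16, device="cuda")
    m_sizes = torch.full((G,), M, dtype=torch.int32, device="cuda")
    # If the crash is really in Triton's make_ttgir, it should segfault here
    # during the first (autotuning) call, independent of the MoE layer.
    out = grouped_gemm(x, w, m_sizes)
    torch.cuda.synchronize()
    print(out.shape)


if __name__ == "__main__":
    main()

If this also segfaults, the problem is presumably in the Triton branch/commit rather than in the MoE layer or the Gloo setup.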
