Failed UT: multi_client_test_nccl_local_2gpus #1980

@i-chaochen

Root cause: tensorflow@f734ee8
Initial fix: d29b6d6 or tensorflow#59501

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //tensorflow/dtensor/python/tests:multi_client_test_nccl_local_2gpus
-----------------------------------------------------------------------------
2023-01-31 11:16:27.156744: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
2023-01-31 11:16:27.170465: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Check per client log in Test artifacts.
2023-01-31 11:16:28.129654: E tensorflow/tsl/lib/monitoring/collection_registry.cc:81] Cannot register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay
2023-01-31 11:16:28.143067: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Could it be that AMD GPUs do not support multiple NCCL managers?
tensorflow#58090
