
Conversation

czl66

@czl66 czl66 commented Dec 24, 2024

In my practice with the aishell conformer_ctc ASR task, I found that the script only implements single-machine multi-GPU training, which is inconvenient for our GPU servers. So I modified train.py; I hope it is helpful for the icefall community. :)
[Screenshot attached: WeCom screenshot dad4f422-924e-4b74-82e3-9fca53ce6ca0]

@csukuangfj
Collaborator

Could you describe how to run it for multi-node multi-GPU training?

@czl66
Author

czl66 commented Dec 24, 2024

Could you describe how to run it for multi-node multi-GPU training?

Yes, here is the code for the main bash script:

# Per-node launch script.
# $1 = node_rank, $2 = total number of nodes (WORLD_SIZE),
# $3 = comma-separated GPU ids assigned to this node.
node_rank=$1
WORLD_SIZE=$2
export CUDA_VISIBLE_DEVICES=$3
echo "WORKER INFO:: node_rank=$node_rank, WORLD_SIZE=$WORLD_SIZE, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Number of GPUs on this node = number of comma-separated entries.
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

DISTRIBUTED_ARGS="
    --nnodes ${WORLD_SIZE:-1} \
    --nproc_per_node $gpu_num \
    --node_rank ${node_rank:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-26669}
"
torchrun $DISTRIBUTED_ARGS ./conformer_ctc/train.py --world-size $gpu_num --max-duration 200 --num-epochs 100

You also need to write another script to start the training, which assigns the node rank, the WORLD_SIZE, and the GPUs to each node; a sketch of such a launcher follows.
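
As an illustration only (not part of this PR), a minimal sketch of such a launcher, assuming the per-node script above is saved as run_node.sh on every machine (the script name, hostnames, IP address, and path below are all hypothetical), each machine uses all 8 of its GPUs as a single node, and the first machine acts as the master:

export MASTER_ADDR=10.0.0.1        # hypothetical address of the first machine
export MASTER_PORT=26669
HOSTS=(node0 node1 node2 node3)    # hypothetical hostnames, one per machine
WORLD_SIZE=${#HOSTS[@]}            # total number of nodes (here: 4)
for node_rank in "${!HOSTS[@]}"; do
  # Start the per-node script on each machine; every machine uses GPUs 0-7.
  ssh "${HOSTS[$node_rank]}" \
    "cd /path/to/icefall/egs/aishell/ASR && \
     MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT \
     bash run_node.sh $node_rank $WORLD_SIZE 0,1,2,3,4,5,6,7" &
done
wait

Equivalently, you can log in to each machine and run the same bash run_node.sh <node_rank> <WORLD_SIZE> <gpu_ids> command by hand.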

@czl66
Author

czl66 commented Dec 24, 2024

e.g., if you have 4 machines and each machine has 8 GPUs: if each node is assigned one GPU, the total number of nodes is 32, and you should pass $1=0,1,2,...,31, $2=32, and $3='0', '1', '2', ..., '7' one by one. If instead each node is assigned 2 GPUs, the total number of nodes is 16, and you should pass $1=0,1,2,...,15, $2=16, and $3='0,1', '2,3', '4,5', '6,7' respectively. Concrete commands for the second case are sketched below.
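
For concreteness, here is a sketch of the commands for the 2-GPUs-per-node case (16 nodes, 4 launcher processes per machine). The script name run_node.sh is hypothetical, and MASTER_ADDR/MASTER_PORT are assumed to be exported on every machine, pointing at the machine that hosts node_rank 0:

# on machine 0:
bash run_node.sh 0 16 '0,1' &
bash run_node.sh 1 16 '2,3' &
bash run_node.sh 2 16 '4,5' &
bash run_node.sh 3 16 '6,7' &
# on machine 1:
bash run_node.sh 4 16 '0,1' &
bash run_node.sh 5 16 '2,3' &
bash run_node.sh 6 16 '4,5' &
bash run_node.sh 7 16 '6,7' &
# ... and so on, up to node_rank 15 with GPUs '6,7' on machine 3.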

@czl66
Author

czl66 commented Dec 24, 2024


The single-machine version is also provided:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
torchrun --nproc_per_node $gpu_num ./conformer_ctc/train.py --world-size $gpu_num --max-duration 200 --num-epochs 100

@czl66
Author

czl66 commented Dec 24, 2024

Also, when I used decode.py for ctc_decoding, I found that the speed was really slow: even after several hours had passed, no recognition results were produced. After debugging, I found that num_workers caused this problem.
Moreover, I found that the decoding process only feeds one sample at a time, which is still too slow, so I converted the MonoCut to a dict, which makes batch-wise decoding work. I hope my code is helpful. :) @csukuangfj

@yfyeung
Collaborator

yfyeung commented Dec 24, 2024

There is no need to modify egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py.

To enable multi-node multi-GPU support, simply modify the train.py file with the following changes:

Add the following; the comments indicate roughly where each fragment belongs in train.py:

# Imports, near the top of train.py:
from icefall.dist import (
    cleanup_dist,
    get_local_rank,
    get_rank,
    get_world_size,
    setup_dist,
)

# In get_parser():
    parser.add_argument(
        "--use-multi-node",
        type=str2bool,
        default=False,
        help="""True if using multi-node multi-GPU.
        You are not supposed to set it directly.
        """,
    )

# In run(rank, world_size, args):
    if params.use_multi_node:
        local_rank = get_local_rank()
    else:
        local_rank = rank
    logging.info(f"rank: {rank}, world_size: {world_size}, local_rank: {local_rank}")
    if world_size > 1:
        setup_dist(rank, world_size, params.master_port, params.use_multi_node)

    device = torch.device("cpu")
    if torch.cuda.is_available():
        device = torch.device("cuda", local_rank)
    logging.info(f"Device: {device}, rank: {rank}, local_rank: {local_rank}")

    if world_size > 1:
        logging.info("Using DDP")
        model = DDP(model, device_ids=[local_rank])

# In main():
    if args.use_multi_node:
        rank = get_rank()
        world_size = get_world_size()
        args.world_size = world_size
        run(rank=rank, world_size=world_size, args=args)
    else:
        world_size = args.world_size
        assert world_size >= 1
        if world_size > 1:
            mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
        else:
            run(rank=0, world_size=1, args=args)
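
A minimal usage sketch, assuming the changes above are applied to ./conformer_ctc/train.py; the master address 10.0.0.1 and the 4-machine, 8-GPU setup are hypothetical. The same command would be run on every machine, with NODE_RANK set to 0..3 per machine:

torchrun \
  --nnodes 4 \
  --nproc_per_node 8 \
  --node_rank $NODE_RANK \
  --master_addr 10.0.0.1 \
  --master_port 26669 \
  ./conformer_ctc/train.py \
  --use-multi-node True \
  --max-duration 200

Here torchrun sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables, which get_rank(), get_world_size(), and get_local_rank() from icefall.dist are presumably reading.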

@czl66
Author

czl66 commented Dec 24, 2024

There is no need to modify egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py.

Yeah, you are absolutely right. In addition, I think calling barrier() is a must.

@czl66
Author

czl66 commented Dec 24, 2024

There is no need to modify egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py.

By the way, if you set the batch_size of test_dataloaders in egs/aishell/ASR/tdnn_lstm_ctc/asr_datamodule.py from None to some int value like 1 or 10, it will cause this error:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'lhotse.cut.mono.MonoCut'>.
That's why I modified it. I hope this info is useful.

@yfyeung
Collaborator

yfyeung commented Dec 24, 2024

I think there is no need for torch.distributed.barrier() because DistributedDataParallel already broadcasts parameters from rank 0 during initialization. This built-in synchronization ensures all ranks have consistent weights without requiring an explicit barrier.
