**Describe the bug**

**Reproduction**

I used `torchrun` to kick off the run across multiple instances. Here's my config:
**Environment**

The environment script seems to be broken. I'm running in a Docker container on Amazon SageMaker with the latest version of MMEngine and MMDetection 3.0.

**Error traceback**

---

Thanks for reporting the bug! I checked the code and found it's a bug in our `dist.collect_results_cpu`. The bug occurs when you are evaluating a model on multiple instances without shared storage. We'll fix it ASAP. Some workarounds for now:

- Make sure a `.dist_test` folder that is shared across instances exists in the directory where you run `torchrun`. If you have shared storage in another directory (e.g. `/mnt/your_shared`), you may create a soft link via `ln -s`. If you don't have one, you may try mounting through `nfs` or something similar.
- Add `collect_device='gpu'` to your metrics' config to enable GPU collecting (see the sketch after this list). This is currently experimental and may not be that stable.
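
If you go the GPU-collecting route, it's a one-line addition to the evaluator config. A minimal sketch of what that could look like in an MMDetection 3.x config; the `CocoMetric` type and the annotation path are placeholder assumptions, so keep your own metric settings and only add `collect_device`:

```python
# Sketch of an MMDetection 3.x evaluator config with GPU result collecting.
# The metric type and ann_file below are placeholders -- keep your existing
# values and only add the `collect_device` key.
val_evaluator = dict(
    type='CocoMetric',
    ann_file='data/coco/annotations/instances_val2017.json',
    metric='bbox',
    # Gather per-rank results over the GPU communication backend instead of
    # writing temporary files, so no shared filesystem is needed.
    collect_device='gpu',
)
test_evaluator = val_evaluator
```

Since `collect_device` comes from MMEngine's `BaseMetric`, the same change should apply to metrics other than `CocoMetric` as well.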

---

Yep, that matches my scenario (multiple AWS p3.16xlarge instances without shared storage), thanks!

---

When I set

---

Is this issue resolved? I'm hitting the same issue.