**Describe the bug**

**Reproduction**

I used `torchrun` to kick off the run across multiple instances. Here's my config:
**Environment**

The environment script seems to be broken. I'm running in a Docker container on Amazon SageMaker with the latest version of MMEngine and MMDetection 3.0.

**Error traceback**

---

Thanks for reporting the bug! I checked the code and found it's a bug in our `dist.collect_results_cpu`. The bug occurs when you are evaluating a model on multiple instances without shared storage. We'll fix it ASAP. Some workarounds for now:

- Make sure a `.dist_test` folder that is shared across instances exists in the directory where you run `torchrun`. If you have shared storage in another directory (e.g. `/mnt/your_shared`), you may create a soft link via `ln -s`. If you don't have one, you may try mounting through `nfs` or something similar.
- Add `collect_device='gpu'` to your metrics' config to enable GPU collecting (see the sketch after this list). This is currently experimental and may not be that stable.
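
If you go the GPU-collecting route, it's a one-line addition to the evaluator config. A minimal sketch of what that could look like in an MMDetection 3.x config; the `CocoMetric` type and the annotation path are placeholder assumptions, so keep your own metric settings and only add `collect_device`:

```python
# Sketch of an MMDetection 3.x evaluator config with GPU result collecting.
# The metric type and ann_file below are placeholders -- keep your existing
# values and only add the `collect_device` key.
val_evaluator = dict(
    type='CocoMetric',
    ann_file='data/coco/annotations/instances_val2017.json',
    metric='bbox',
    # Gather per-rank results over the GPU communication backend instead of
    # writing temporary files, so no shared filesystem is needed.
    collect_device='gpu',
)
test_evaluator = val_evaluator
```

Since `collect_device` comes from MMEngine's `BaseMetric`, the same change should apply to metrics other than `CocoMetric` as well.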

---

Yep, that matches my scenario (multiple AWS p3.16xlarge instances without shared storage), thanks!

---

When I set

---

Is this issue resolved? I'm hitting the same issue.