During training (on a Tesla V100-PCIE-16GB) I get the following error:
Train: 0%| | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
File "/anaconda/envs/rtfm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda/envs/rtfm/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/finetune.py", line 451, in <module>
main(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/finetune.py", line 408, in main
results = train(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/train_utils.py", line 274, in train
batch[key] = batch[key].to(f"cuda:{local_rank}")
RuntimeError: Invalid device string: 'cuda:None'
This traces to `batch[key] = batch[key].to(f"cuda:{local_rank}")`, where `local_rank` is `None`, hence `Invalid device string: 'cuda:None'`. How is this supposed to work? The function's default is `local_rank=None`, which should be invalid, since it must be an `int`, right? In `evaluate()` the parameter is annotated `local_rank: int` only.
By adding

local_rank = 0
rank = 0
print("WARNING! Overwriting local_rank and rank to 0!")

the issue can be worked around.
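A less invasive workaround might be to derive the rank from the environment instead of hardcoding it, since `torchrun`/`torch.distributed` launchers export `LOCAL_RANK` for each worker. This is only a sketch of that idea; how `rtfm` actually launches its workers is an assumption here, and `resolve_local_rank` is a hypothetical helper, not part of the repo:

```python
import os

def resolve_local_rank(local_rank=None):
    """Return an int device index: the explicit argument if given,
    else the LOCAL_RANK env var (set by torchrun), else 0 for single-GPU runs."""
    if local_rank is not None:
        return int(local_rank)
    return int(os.environ.get("LOCAL_RANK", 0))

# With this guard, the device string can never become 'cuda:None':
device = f"cuda:{resolve_local_rank()}"
```

Under single-GPU training this resolves to `cuda:0`, matching the hardcoded workaround above, while still respecting the launcher's rank assignment in a distributed run.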
rtfm/rtfm/train_utils.py
Line 274 in 9884a6b