Currently, WandB runs are named after the value of output_dir. To avoid collisions in WandB, every run therefore needs its own output_dir; on repeated runs that reuse one, you sporadically hit this error:
[actor]: 2025-11-07 08:15:29,293 - pipelinerl.utils - ERROR - Exception in actor: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
[preprocessor]: 2025-11-07 08:15:29,293 - pipelinerl.utils - ERROR - Exception in preprocess: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
[preprocessor]: 2025-11-07 08:15:29,298 - pipelinerl.utils - ERROR - Traceback: Traceback (most recent call last):
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 997, in init
    result = wait_with_progress(
             ^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 23, in wait_with_progress
    return wait_all_with_progress(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 77, in wait_all_with_progress
    return asyncer.run(progress_loop_with_timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/wandb/sdk/lib/asyncio_manager.py", line 136, in run
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/admin/home/lewis/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/admin/home/lewis/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
TimeoutError: Timed out waiting for response on p9imn6j7ynez
Moreover, repeated runs write to the same WandB run, which is counterintuitive and differs from other frameworks, which assign a new WandB ID per run (usually the auto-generated one).
It would be good to expose a run_name arg so that users can set the desired run name in the config or at runtime while keeping a fixed value for output_dir (e.g. useful when debugging).
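To make the request concrete, here is a rough sketch of how such an option might be resolved into arguments for `wandb.init`. The helper `wandb_init_kwargs` and the way the fallback reuses the `output_dir` basename as the run ID are assumptions for illustration, not PipelineRL's actual code; only `name`, `id`, and `resume` are real `wandb.init` parameters.

```python
from pathlib import Path
from typing import Any, Optional


def wandb_init_kwargs(output_dir: str, run_name: Optional[str] = None) -> dict[str, Any]:
    """Hypothetical helper: build keyword arguments for wandb.init."""
    if run_name is not None:
        # Explicit name and no `id`, so WandB auto-generates a fresh run ID
        # even when output_dir stays fixed.
        return {"name": run_name}
    # Assumed fallback mirroring the behaviour described in this issue:
    # derive both the name and a stable ID from output_dir, so repeated
    # runs land on the same WandB run.
    name = Path(output_dir).name
    return {"name": name, "id": name, "resume": "allow"}


if __name__ == "__main__":
    # Fixed output_dir for debugging, distinct run name per launch.
    print(wandb_init_kwargs("results/debug", run_name="debug-attempt-2"))
    print(wandb_init_kwargs("results/debug"))
```

The resulting dict would be passed as `wandb.init(**kwargs)`. Leaving `id` unset when `run_name` is given is the design choice that decouples the WandB run from output_dir.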