-
Notifications
You must be signed in to change notification settings - Fork 68
Open
Labels
Description
Over the last few days using orbax checkpoint installed from the repo directly I've had increasing numbers of grpc key awaitable_signals_contract not found issues pop up. This is on midsize TPU clusters writing to gcs. So far the checkpoints themselves seem to be ok and everything continues as normal but I wanted to flag it anyway. Shape of the error looks like this with the key number before the slash changing:
WARNING JaxRuntimeError raised while trying to get key '295/awaitable_signals_contract'.
Traceback (most recent call last):
File "/home/redacted/orbax/checkpoint/orbax/checkpoint/_src/futures/signaling_client.py", line 141, in key_value_try_get
return str(self._client.key_value_try_get(key))
jaxlib.xla_extension.XlaRuntimeError: NOT_FOUND: Config key 295/awaitable_signals_contract not found.
Additional GRPC error information from remote target coordination_service while calling /tensorflow.CoordinationService/TryGetKeyValue:
:{"created":"@1746905336.599367778","description":"Error received from peer ipv4:ip_redacted:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Config key 295/awaitable_signals_contract not found.","grpc_status":5}
I am not sure if the error is actually new or if it's just properly highlighted now because of the changes in 8bbe3a4