Description
Hi, I have been running HSDS on an Azure Kubernetes Service (AKS) cluster for a couple of months and it has been very solid so far. Recently the HSDS cluster has started to become unresponsive without warning, even though curling the /about endpoint still reports it as operational.
Running hsinfo sometimes works and sometimes returns this error:
username: admin (admin)
password: ********************************
ERROR 2025-05-12 13:02:47,642 got <class 'requests.exceptions.RetryError'> exception: HTTPConnectionPool(host='<hostip>', port=80): Max retries exceeded with url: /?domain=%2Fhome (Caused by ResponseError('too many 503 error responses'))
home: NO ACCESS
server version: 0.9.1
node count: 4
up: 20 hours 21 min 47 sec
h5pyd version: 0.19.0
Running the hsls command also returns an error, although it did manage to run once while the cluster was in the broken state:
folder /
admin folder 2025-01-14 13:40:43 /home/
ERROR 2025-05-12 13:02:58,239 got <class 'requests.exceptions.RetryError'> exception: HTTPConnectionPool(host=<hostip>, port=80): Max retries exceeded with url: /?domain=%2Fhome (Caused by ResponseError('too many 503 error responses'))
What is noticeable is that all the pods report that they were started only recently, at a time that correlates with the first "not all dn_ids found" errors. There was no manual restart, but since the cluster nodes also report being created at that time, this suggests an automatic update of the node operating system or something similar (unfortunately I was not able to verify this in the AKS logs). It seems that the automatic restart of the pods does not work properly. I have seen this issue occur in two different AKS clusters where I host HSDS.
When I manually restart all pods of the deployment via kubectl rollout restart deployment hsds it runs smoothly again.
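Until the cause is understood, the manual workaround could in principle be automated. The following is only a minimal sketch under stated assumptions, not something HSDS provides: it assumes a separate probe that records the HTTP status of domain requests (the /about endpoint keeps reporting healthy, so it is not a useful signal here). The helper names `should_restart`, `restart_command` and `restart_if_stuck`, and the threshold of 5 consecutive 503 responses, are hypothetical. The kubectl command is the same one used for the manual restart above.

```python
import subprocess


def should_restart(recent_statuses, threshold=5):
    """Return True if the last `threshold` probe results were all HTTP 503."""
    if len(recent_statuses) < threshold:
        return False
    return all(status == 503 for status in recent_statuses[-threshold:])


def restart_command(deployment="hsds"):
    """Build the kubectl command that was used for the manual restart."""
    return ["kubectl", "rollout", "restart", "deployment", deployment]


def restart_if_stuck(recent_statuses, threshold=5, dry_run=True):
    """Return the restart command if the probe history looks stuck, else None.

    With dry_run=False the command is also executed via subprocess.
    """
    if not should_restart(recent_statuses, threshold):
        return None
    cmd = restart_command()
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

With `dry_run=True` (the default) nothing is executed, so the decision logic can be tried against a recorded list of status codes before it is allowed to touch the cluster.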
Some logs of the hsds pods:
Defaulted container "sn" out of: sn, dn
2025-05-11T14:41:01.188972556Z WARN> ClientError: Cannot connect to host 10.244.1.8:6101 ssl:default [Connect call failed ('10.244.1.8', 6101)]
2025-05-11T14:41:01.189338640Z ERROR> k8s_get_dn_urls - Exception: Internal Server Error from /info request
2025-05-11T14:41:01.192585103Z WARN> scaling - got 3 dn_ids expected 4
2025-05-11T14:41:01.192715000Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:11.255341427Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:21.343061680Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:31.374071584Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:41.420208288Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:51.469857952Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:01.517216379Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:11.580419795Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:21.643042332Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:31.704066186Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:41.755255130Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:51.810123758Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:43:01.915280706Z WARN> scaling - node_numbers not consecutive - got: [2, 3]
2025-05-11T17:45:27.859414157Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T04:12:21.586426975Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T05:11:29.196915404Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T05:13:49.556992518Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T07:07:20.263299333Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T10:09:10.710862955Z WARN> select param is 620 characters long
2025-05-12T10:09:10.721442661Z WARN> select param is 721 characters long
2025-05-12T10:09:10.722004151Z WARN> select param is 721 characters long
2025-05-12T10:09:10.722612712Z WARN> select param is 721 characters long
2025-05-12T10:09:10.723117726Z WARN> select param is 721 characters long
2025-05-12T10:09:10.757419197Z WARN> select param is 720 characters long
2025-05-12T10:09:20.973273745Z WARN> select param is 620 characters long
2025-05-12T10:09:20.980846193Z WARN> select param is 620 characters long
2025-05-12T10:09:31.121470759Z WARN> select param is 703 characters long
2025-05-12T10:09:31.131655977Z WARN> select param is 620 characters long
2025-05-12T10:09:31.144136849Z WARN> select param is 620 characters long
2025-05-12T10:17:58.077461606Z WARN> select param is 620 characters long
2025-05-12T10:17:58.081012981Z WARN> select param is 703 characters long
2025-05-12T10:17:58.107836748Z WARN> select param is 512 characters long
2025-05-12T10:18:08.295661050Z WARN> select param is 620 characters long
2025-05-12T10:18:08.296238855Z WARN> select param is 620 characters long
2025-05-12T10:18:08.304142039Z WARN> select param is 620 characters long
2025-05-12T10:18:08.318738679Z WARN> select param is 512 characters long
2025-05-12T10:18:08.338515108Z WARN> select param is 821 characters long
2025-05-12T10:18:18.439841846Z WARN> select param is 620 characters long
2025-05-12T10:18:18.449873429Z WARN> select param is 620 characters long
2025-05-12T10:18:18.452095883Z WARN> select param is 620 characters long
2025-05-12T11:26:03.116042296Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T11:26:03.116308489Z WARN> domain: hsdscontainer/db not found
2025-05-12T11:26:03.116316421Z WARN> fetch result - not found error for: /db
2025-05-12T11:26:03.120221251Z WARN> get_domains - domain: /db not found in crawler dict
2025-05-12T11:26:03.204039312Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T11:26:03.204239955Z WARN> domain: hsdscontainer/db not found
2025-05-12T11:26:03.204248148Z WARN> fetch result - not found error for: /db
2025-05-12T11:26:03.204285204Z WARN> get_domains - domain: /db not found in crawler dict
2025-05-11T14:41:01.132076863Z WARN> ClientError: Cannot connect to host 10.244.1.8:6101 ssl:default [Connect call failed ('10.244.1.8', 6101)]
2025-05-11T14:41:01.132455476Z ERROR> k8s_get_dn_urls - Exception: Internal Server Error from /info request
2025-05-11T14:41:01.135939658Z WARN> scaling - got 3 dn_ids expected 4
2025-05-11T14:41:01.136062093Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:11.183341071Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:21.234603968Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:31.281539794Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:41.328286034Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:41:51.402948947Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:01.474730805Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:11.556206076Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:21.643824831Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:31.718458517Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:41.774393702Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:42:51.836871434Z WARN> not all dn_ids found, got: ['dn-ecbed', 'dn-7811b', 'dn-90954']
2025-05-11T14:43:01.917405594Z WARN> scaling - node_numbers not consecutive - got: [2, 3]
2025-05-11T15:54:13.376909288Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-11T16:11:18.436844293Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-11T16:45:18.576590760Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-11T17:50:58.147116960Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-11T22:49:08.902611889Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T00:36:51.453031159Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T00:46:34.141227642Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T00:58:27.314176357Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T02:55:42.823295632Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T03:56:51.112958156Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T10:09:10.726935246Z WARN> select param is 620 characters long
2025-05-12T10:09:10.748468746Z WARN> select param is 620 characters long
2025-05-12T10:09:20.956553496Z WARN> select param is 620 characters long
2025-05-12T10:09:20.959156595Z WARN> select param is 620 characters long
2025-05-12T10:09:20.978175382Z WARN> select param is 620 characters long
2025-05-12T10:09:20.992750976Z WARN> select param is 821 characters long
2025-05-12T10:09:31.113686692Z WARN> select param is 721 characters long
2025-05-12T10:09:31.114547264Z WARN> select param is 721 characters long
2025-05-12T10:09:31.115527327Z WARN> select param is 721 characters long
2025-05-12T10:09:31.116513298Z WARN> select param is 721 characters long
2025-05-12T10:09:31.117506281Z WARN> select param is 620 characters long
2025-05-12T10:09:31.127903420Z WARN> select param is 512 characters long
2025-05-12T10:18:18.439652434Z WARN> select param is 731 characters long
2025-05-12T10:18:18.441464792Z WARN> select param is 731 characters long
2025-05-12T10:18:18.457957505Z WARN> select param is 620 characters long
2025-05-12T11:02:50.994200826Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T11:02:50.994490955Z WARN> domain: hsdscontainer/db not found
2025-05-12T11:02:50.994498176Z WARN> fetch result - not found error for: /db
2025-05-12T11:02:50.994586971Z WARN> get_domains - domain: /db not found in crawler dict
2025-05-12T11:02:51.057151852Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T11:02:51.057360348Z WARN> domain: hsdscontainer/db not found
2025-05-12T11:02:51.057366878Z WARN> fetch result - not found error for: /db
2025-05-12T11:02:51.057444105Z WARN> get_domains - domain: /db not found in crawler dict
Defaulted container "sn" out of: sn, dn
2025-05-11T14:44:19.683711106Z WARN> scaling - node_numbers not consecutive - got: [-1, -1, 0, 1]
2025-05-11T17:05:08.308980409Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-11T20:46:56.847148063Z ERROR> Unexpected HTTPInternalServerError exception in doHealthCheck: Internal Server Error
2025-05-12T10:08:51.956746127Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T10:08:51.956946269Z WARN> domain: hsdscontainer/db not found
2025-05-12T10:08:51.956953290Z WARN> fetch result - not found error for: /db
2025-05-12T10:08:51.957536942Z WARN> get_domains - domain: /db not found in crawler dict
2025-05-12T10:08:52.038433833Z WARN> Object: http://10.244.1.8:6101/domains not found
2025-05-12T10:08:52.038653385Z WARN> domain: hsdscontainer/db not found
2025-05-12T10:08:52.038661127Z WARN> fetch result - not found error for: /db
2025-05-12T10:08:52.038807758Z WARN> get_domains - domain: /db not found in crawler dict
2025-05-12T10:09:10.709447534Z WARN> select param is 731 characters long
2025-05-12T10:09:10.710076395Z WARN> select param is 731 characters long
2025-05-12T10:09:10.721969825Z WARN> select param is 620 characters long
2025-05-12T10:09:10.723074826Z WARN> select param is 703 characters long
2025-05-12T10:09:10.727338693Z WARN> select param is 620 characters long
2025-05-12T10:09:10.727926862Z WARN> select param is 620 characters long
2025-05-12T10:09:10.731749611Z WARN> select param is 620 characters long
2025-05-12T10:09:10.738564935Z WARN> select param is 620 characters long
2025-05-12T10:09:20.951849817Z WARN> select param is 731 characters long
2025-05-12T10:09:20.951907544Z WARN> select param is 731 characters long
2025-05-12T10:09:20.966722196Z WARN> select param is 703 characters long
2025-05-12T10:09:20.969765496Z WARN> select param is 620 characters long
2025-05-12T10:09:20.970208425Z WARN> select param is 620 characters long
2025-05-12T10:09:20.977098671Z WARN> select param is 721 characters long
2025-05-12T10:09:20.977589432Z WARN> select param is 721 characters long
2025-05-12T10:09:20.978174527Z WARN> select param is 721 characters long
2025-05-12T10:09:20.978750968Z WARN> select param is 721 characters long
2025-05-12T10:09:20.979427130Z WARN> select param is 620 characters long
2025-05-12T10:09:31.095829647Z WARN> select param is 731 characters long
2025-05-12T10:09:31.096910061Z WARN> select param is 731 characters long
2025-05-12T10:09:31.105725173Z WARN> select param is 620 characters long
2025-05-12T10:09:31.106109805Z WARN> select param is 620 characters long
2025-05-12T10:09:31.109941155Z WARN> select param is 721 characters long
2025-05-12T10:09:31.110459398Z WARN> select param is 721 characters long
2025-05-12T10:09:31.111022049Z WARN> select param is 721 characters long
2025-05-12T10:09:31.111707535Z WARN> select param is 721 characters long
2025-05-12T10:09:31.122561950Z WARN> select param is 720 characters long
2025-05-12T10:09:31.125992604Z WARN> select param is 620 characters long
2025-05-12T10:09:31.128788470Z WARN> select param is 821 characters long
2025-05-12T10:17:58.071601513Z WARN> select param is 620 characters long
2025-05-12T10:17:58.086751333Z WARN> select param is 721 characters long
2025-05-12T10:17:58.087332322Z WARN> select param is 721 characters long
2025-05-12T10:17:58.087855294Z WARN> select param is 721 characters long
2025-05-12T10:17:58.088446558Z WARN> select param is 721 characters long
2025-05-12T10:17:58.091847233Z WARN> select param is 620 characters long
2025-05-12T10:17:58.093061150Z WARN> select param is 620 characters long
2025-05-12T10:18:08.289839561Z WARN> select param is 731 characters long
2025-05-12T10:18:08.290545409Z WARN> select param is 731 characters long
2025-05-12T10:18:08.314513279Z WARN> select param is 620 characters long
2025-05-12T10:18:08.315536407Z WARN> select param is 620 characters long
2025-05-12T10:18:08.316131788Z WARN> select param is 620 characters long
2025-05-12T10:18:08.318705852Z WARN> select param is 721 characters long
2025-05-12T10:18:08.319222113Z WARN> select param is 721 characters long
2025-05-12T10:18:08.319809522Z WARN> select param is 721 characters long
2025-05-12T10:18:08.320492896Z WARN> select param is 721 characters long
2025-05-12T10:18:08.323175464Z WARN> select param is 620 characters long
2025-05-12T10:18:08.327344392Z WARN> select param is 620 characters long
2025-05-12T10:18:18.448973613Z WARN> select param is 721 characters long
2025-05-12T10:18:18.448991270Z WARN> select param is 721 characters long
2025-05-12T10:18:18.449343232Z WARN> select param is 721 characters long
2025-05-12T10:18:18.449854776Z WARN> select param is 721 characters long
2025-05-12T10:18:18.457936509Z WARN> select param is 512 characters long
2025-05-12T10:18:18.459066187Z WARN> select param is 620 characters long
2025-05-12T10:18:18.465449302Z WARN> select param is 620 characters long
2025-05-12T09:24:29.877099745Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.878137275Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.878464891Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.883460091Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.893297197Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.957138787Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.957742691Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:29.966824214Z WARN> returning 503 - node_state: WAITING
2025-05-12T09:24:38.930946205Z WARN> not all dn_ids found, got: ['dn-aeb2d', 'dn-90954', 'dn-f71db']
...
repetitive warnings omitted
...
2025-05-12T11:39:40.966040865Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:39:40.974103224Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:39:42.976364487Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:39:46.979700681Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:39:48.119343001Z WARN> not all dn_ids found, got: ['dn-aeb2d', 'dn-90954', 'dn-f71db']
2025-05-12T11:39:54.981560289Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:39:58.173053474Z WARN> not all dn_ids found, got: ['dn-aeb2d', 'dn-90954', 'dn-f71db']
2025-05-12T11:39:59.250147390Z WARN> returning 503 - node_state: WAITING
2025-05-12T11:40:08.219730428Z WARN> not all dn_ids found, got: ['dn-aeb2d', 'dn-90954', 'dn-f71db']
2025-05-12T11:40:10.984159278Z WARN> returning 503 - node_state: WAITING
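Since the logs above are very repetitive, a condensed view may be easier to read. The following is a small sketch that groups lines by level and message, assuming the `<timestamp>Z LEVEL> message` layout seen in the excerpts. The function name `summarize` and the normalisation rules (bracketed lists collapsed to `[...]`, the select param length replaced by `N`) are my own choices for illustration and are not part of HSDS.

```python
import re
from collections import Counter

# Matches lines such as:
# 2025-05-12T09:24:29.877099745Z WARN> returning 503 - node_state: WAITING
LOG_LINE = re.compile(r"^(?P<ts>\S+Z)\s+(?P<level>[A-Z]+)>\s+(?P<msg>.*)$")


def summarize(lines):
    """Count log messages per (level, normalised message)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            # Skip non-log lines, e.g. kubectl's 'Defaulted container' notice.
            continue
        msg = match.group("msg")
        # Collapse bracketed lists (dn_ids, node numbers, connect details).
        msg = re.sub(r"\[.*\]", "[...]", msg)
        # Collapse the varying length in the 'select param' warnings.
        msg = re.sub(r"is \d+ characters", "is N characters", msg)
        counts[(match.group("level"), msg)] += 1
    return counts


if __name__ == "__main__":
    import sys

    for (level, msg), n in summarize(sys.stdin).most_common():
        print(f"{n:6d}  {level:5s}  {msg}")
```

Saved under a file name of your choice, it can read the output of kubectl logs from standard input.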