-
Couldn't load subscription status.
- Fork 60
Fix core status tracking #13725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Fix core status tracking #13725
Conversation
20265c0 to
421f373
Compare
Signed-off-by: Younes Khoudli <[email protected]>
This header allows messages to not trigger the scheduling of a worker group. It is intended for worker status requests. Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
…s endpoint for /list instead Signed-off-by: Younes Khoudli <[email protected]>
There is no point keeping a message in the request queue if editoast isn't waiting for a response anymore. Setting ttl avoids having core doing useless work. Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
421f373 to
6782d24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue somewhere to keep track of the context?
Either way, LGTM for core, I haven't checked the rest.
Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
bec96f4 to
8786b9c
Compare
|
The issue is #13058 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| /// Should this request avoid starting a new worker if none is running? | ||
| fn no_worker_load(&self) -> bool { | ||
| false | ||
| } | ||
|
|
||
| /// Returns the timeout override for this request, if any. | ||
| fn override_timeout(&self) -> Option<Duration> { | ||
| None | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: associated constants maybe?
| impl AsCoreRequest<()> for StatusRequest { | ||
| const URL_PATH: &'static str = "/status"; | ||
|
|
||
| fn worker_id(&self) -> Option<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For later: maybe enum core_client::WorkerKey instead of Option<String>?
| .await; | ||
|
|
||
| let status = match status { | ||
| Ok(_) => WorkerStatus::Ready, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Ok(_) => WorkerStatus::Ready, | |
| Ok(()) => WorkerStatus::Ready, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR. I have not tried it yet, but I think it does not work as expected.
How each element behaves (If I have understood the PR code correctly):
- The status endpoint (core side) returns "ready" and is ack only once the core is ready (infra + cache loaded).
- Status messages have a timeout set to 10s.
- When the status is not ready, we send another message to load the worker.
- "load_worker" messages are ack only when the core is ready.
How the system behaves (given the previous elements):
- Opening the STDCM page (for the first time) will load a worker only 10seconds after the page is opened -> It's a bit of a waste to lose 10 seconds for ‘nothing’.
- Each
worker_loadeditoast endpoint call will take 10s, which is weird from an API user point of view. - Calling multiple time the
worker_loadendpoint while the worker is loading will accumulateworker_loadmessages in the queue, which might trigger the scale of another worker. Since the frontend is polling the status, it will probably happen.
I don't know how to fix these issues easily:
- Reducing the timeout is problematic when there are many messages in the queue and core is up.
- We must differentiate between when a worker is loading and when there are no workers at all. To call loading only in the second case.
The ttl of the messages on the queue is directly linked to the timeout of the messages in editoast
We could call worker load eagerly. If a core worker core is up, and has the correct infra version it would be a no-op. We could also reduce the worker load ttl, or even remove the status endpoint entirely if we never want to know the status of core without loading an infra/timetable Note that that might be even better, currently the status endpoint doesn't use the infra_version.
Yeah, that's not ideal, but that the tradeoff of making an actual status check vs trying to track the status. Note that they only take 10s if core isn't responding, otherwise they stay fast. |
Fixes #13058