
Conversation

@Khoyo Khoyo commented Oct 18, 2025

Fixes #13058

@github-actions github-actions bot added the area:core (Work on Core Service), area:editoast (Work on Editoast Service), and area:osrdyne labels Oct 18, 2025
@Khoyo Khoyo changed the title from "Yk/osrdyne status fix" to "Fix core status tracking" Oct 18, 2025
@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch 2 times, most recently from 20265c0 to 421f373 on October 20, 2025 at 05:35
Khoyo added 9 commits October 20, 2025 11:12
Signed-off-by: Younes Khoudli <[email protected]>
This header allows messages not to trigger the scheduling of a worker
group. It is intended for worker status requests.

Signed-off-by: Younes Khoudli <[email protected]>
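
A minimal sketch of what attaching such a header could look like with the lapin AMQP client (the header name x-osrdyne-skip-scheduling is an assumption for illustration; the PR's actual name is not shown here):

    use lapin::types::{AMQPValue, FieldTable};

    // Hypothetical header name; the real one used by the PR may differ.
    const SKIP_SCHEDULING_HEADER: &str = "x-osrdyne-skip-scheduling";

    /// Build the AMQP headers for a worker status request so that osrdyne
    /// does not schedule a worker group when it sees the message.
    fn status_request_headers() -> FieldTable {
        let mut headers = FieldTable::default();
        headers.insert(SKIP_SCHEDULING_HEADER.into(), AMQPValue::Boolean(true));
        headers
    }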
Signed-off-by: Younes Khoudli <[email protected]>
…s endpoint for /list instead

Signed-off-by: Younes Khoudli <[email protected]>
There is no point keeping a message in the request queue if editoast
isn't waiting for a response anymore. Setting a TTL avoids having core
do useless work.

Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch from 421f373 to 6782d24 on October 20, 2025 at 09:12
@Khoyo Khoyo marked this pull request as ready for review October 20, 2025 09:48
@Khoyo Khoyo requested review from a team as code owners October 20, 2025 09:48
@Khoyo Khoyo requested a review from eckter October 20, 2025 09:48

@eckter eckter left a comment


Is there an issue somewhere to keep track of the context?

Either way, LGTM for core; I haven't checked the rest.

@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch from bec96f4 to 8786b9c on October 20, 2025 at 10:02

Khoyo commented Oct 20, 2025

The issue is #13058

@leovalais leovalais left a comment


LGTM

Comment on lines +164 to +173
/// Should this request avoid starting a new worker if none is running?
fn no_worker_load(&self) -> bool {
false
}

/// Returns the timeout override for this request, if any.
fn override_timeout(&self) -> Option<Duration> {
None
}


nit: associated constants maybe?
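
For illustration, a sketch of how that could look (the trait already uses an associated constant for URL_PATH, as in the excerpt below; the other constant names here are made up):

    use std::time::Duration;

    trait AsCoreRequest<R> {
        const URL_PATH: &'static str;

        /// Should this request avoid starting a new worker if none is running?
        const NO_WORKER_LOAD: bool = false;

        /// Timeout override for this request, if any. Duration::from_secs is
        /// const, so an impl can write Some(Duration::from_secs(10)) inline.
        const OVERRIDE_TIMEOUT: Option<Duration> = None;
    }

Constants avoid a dynamic call, but unlike the methods they cannot depend on &self, so this only works if the values never vary per request instance.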

impl AsCoreRequest<()> for StatusRequest {
const URL_PATH: &'static str = "/status";

fn worker_id(&self) -> Option<String> {

For later: maybe enum core_client::WorkerKey instead of Option<String>?
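
A sketch of what that could look like, purely as an assumption (the variant names are invented; the PR keeps Option<String>):

    /// Hypothetical core_client::WorkerKey replacing Option<String>:
    /// the enum makes "no specific worker" explicit instead of relying on None.
    pub enum WorkerKey {
        /// The request is not addressed to a specific worker (e.g. a status probe).
        Any,
        /// The request targets the worker identified by this key.
        Worker(String),
    }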

.await;

let status = match status {
Ok(_) => WorkerStatus::Ready,

Suggested change
Ok(_) => WorkerStatus::Ready,
Ok(()) => WorkerStatus::Ready,

@flomonster flomonster left a comment


Thanks for this PR. I have not tried it yet, but I think it does not work as expected.

How each element behaves (if I have understood the PR code correctly):

  • The status endpoint (core side) returns "ready" and is only acked once the core is ready (infra + cache loaded).
  • Status messages have a timeout set to 10s.
  • When the status is not ready, we send another message to load the worker.
  • "load_worker" messages are only acked once the core is ready.

How the system behaves (given the previous elements):

  • Opening the STDCM page for the first time will only load a worker 10 seconds after the page is opened, which wastes 10 seconds for ‘nothing’.
  • Each worker_load editoast endpoint call will take 10s, which is weird from an API user point of view.
  • Calling the worker_load endpoint multiple times while the worker is loading will accumulate worker_load messages in the queue, which might trigger scaling up another worker. Since the frontend polls the status, this will probably happen.

I don't know how to fix these issues easily:

  • Reducing the timeout is problematic when there are many messages in the queue and core is up.
  • We must differentiate between a worker that is still loading and no worker at all, so that we only trigger a load in the second case (see the sketch below).
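
One possible shape for that distinction, as a sketch (only the Ready variant is visible in this PR; the other variants are assumptions):

    enum WorkerStatus {
        /// No worker exists for this key yet; a load must be scheduled.
        NotStarted,
        /// A worker exists but has not finished loading infra + timetable.
        Loading,
        /// The worker answered the status probe.
        Ready,
    }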


Khoyo commented Oct 21, 2025

Calling the worker_load endpoint multiple times while the worker is loading will accumulate worker_load messages in the queue

The TTL of the messages on the queue is directly linked to the timeout of the messages in editoast, so a worker_load message editoast has given up on expires from the queue instead of accumulating.
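
A rough sketch of that link, assuming the lapin AMQP client (the helper and its name are illustrative, not the PR's code): each request is published with a per-message expiration equal to the editoast timeout, so RabbitMQ drops it from the queue once editoast has stopped waiting.

    use lapin::{options::BasicPublishOptions, BasicProperties, Channel};
    use std::time::Duration;

    /// Publish a core request whose queue TTL matches the editoast-side
    /// timeout, so core never picks up a request editoast gave up on.
    async fn publish_with_ttl(
        channel: &Channel,
        routing_key: &str,
        payload: &[u8],
        timeout: Duration,
    ) -> Result<(), lapin::Error> {
        // RabbitMQ expresses per-message TTL in milliseconds, as a string.
        let ttl_ms = timeout.as_millis().to_string();
        channel
            .basic_publish(
                "", // default exchange (an assumption for this sketch)
                routing_key,
                BasicPublishOptions::default(),
                payload,
                BasicProperties::default().with_expiration(ttl_ms.into()),
            )
            .await? // queued on the channel
            .await?; // publisher confirm from the broker
        Ok(())
    }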

Opening the STDCM page for the first time will only load a worker 10 seconds after the page is opened, which wastes 10 seconds for ‘nothing’.

We could call worker load eagerly. If a core worker is up and has the correct infra version, it would be a no-op. We could also reduce the worker load TTL, or even remove the status endpoint entirely if we never want to know the status of core without loading an infra/timetable.

Note that this might even be better: currently the status endpoint doesn't use the infra_version.

Each worker_load editoast endpoint call will take 10s, which is weird from an API user point of view.

Yeah, that's not ideal, but that's the tradeoff of making an actual status check versus trying to track the status. Note that they only take 10s if core isn't responding; otherwise they stay fast.


Labels

area:core (Work on Core Service), area:editoast (Work on Editoast Service), area:osrdyne

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The worker core appears ready even though it has not finished loading the timetable.

5 participants