
Conversation

@Khoyo Khoyo commented Oct 18, 2025

Fixes #13058

@github-actions github-actions bot added the area:core (Work on Core Service), area:editoast (Work on Editoast Service), and area:osrdyne labels Oct 18, 2025
@Khoyo Khoyo changed the title from "Yk/osrdyne status fix" to "Fix core status tracking" Oct 18, 2025
@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch 2 times, most recently from 20265c0 to 421f373 on October 20, 2025 at 05:35
Khoyo added 9 commits October 20, 2025 11:12
Signed-off-by: Younes Khoudli <[email protected]>
This header allows messages not to trigger the scheduling of a worker
group. It is intended for worker status requests.

Signed-off-by: Younes Khoudli <[email protected]>
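
A minimal sketch of what attaching such a header could look like with the lapin AMQP client (the header name x-osrdyne-skip-scheduling is an assumption for illustration; the PR's actual name is not shown here):

    use lapin::types::{AMQPValue, FieldTable};

    // Hypothetical header name; the real one used by the PR may differ.
    const SKIP_SCHEDULING_HEADER: &str = "x-osrdyne-skip-scheduling";

    /// Build the AMQP headers for a worker status request so that osrdyne
    /// does not schedule a worker group when it sees the message.
    fn status_request_headers() -> FieldTable {
        let mut headers = FieldTable::default();
        headers.insert(SKIP_SCHEDULING_HEADER.into(), AMQPValue::Boolean(true));
        headers
    }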
Signed-off-by: Younes Khoudli <[email protected]>
…s endpoint for /list instead

Signed-off-by: Younes Khoudli <[email protected]>
There is no point keeping a message in the request queue if editoast
isn't waiting for a response anymore. Setting a TTL avoids having core
do useless work.

Signed-off-by: Younes Khoudli <[email protected]>
Signed-off-by: Younes Khoudli <[email protected]>
@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch from 421f373 to 6782d24 on October 20, 2025 at 09:12
@Khoyo Khoyo marked this pull request as ready for review October 20, 2025 09:48
@Khoyo Khoyo requested review from a team as code owners October 20, 2025 09:48
@Khoyo Khoyo requested a review from eckter October 20, 2025 09:48

@eckter eckter left a comment


Is there an issue somewhere to keep track of the context?

Either way, LGTM for core; I haven't checked the rest.

@Khoyo Khoyo force-pushed the yk/osrdyne-status-fix branch from bec96f4 to 8786b9c on October 20, 2025 at 10:02

Khoyo commented Oct 20, 2025

The issue is #13058

@leovalais leovalais left a comment


LGTM

Comment on lines +164 to +173
/// Should this request avoid starting a new worker if none is running?
fn no_worker_load(&self) -> bool {
false
}

/// Returns the timeout override for this request, if any.
fn override_timeout(&self) -> Option<Duration> {
None
}


nit: associated constants maybe?
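
For illustration, a sketch of how that could look (the trait already uses an associated constant for URL_PATH, as in the excerpt below; the other constant names here are made up):

    use std::time::Duration;

    trait AsCoreRequest<R> {
        const URL_PATH: &'static str;

        /// Should this request avoid starting a new worker if none is running?
        const NO_WORKER_LOAD: bool = false;

        /// Timeout override for this request, if any. Duration::from_secs is
        /// const, so an impl can write Some(Duration::from_secs(10)) inline.
        const OVERRIDE_TIMEOUT: Option<Duration> = None;
    }

Constants avoid a dynamic call, but unlike the methods they cannot depend on &self, so this only works if the values never vary per request instance.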

impl AsCoreRequest<()> for StatusRequest {
const URL_PATH: &'static str = "/status";

fn worker_id(&self) -> Option<String> {

For later: maybe enum core_client::WorkerKey instead of Option<String>?
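
A sketch of what that could look like, purely as an assumption (the variant names are invented; the PR keeps Option<String>):

    /// Hypothetical core_client::WorkerKey replacing Option<String>:
    /// the enum makes "no specific worker" explicit instead of relying on None.
    pub enum WorkerKey {
        /// The request is not addressed to a specific worker (e.g. a status probe).
        Any,
        /// The request targets the worker identified by this key.
        Worker(String),
    }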

.await;

let status = match status {
Ok(_) => WorkerStatus::Ready,

Suggested change
Ok(_) => WorkerStatus::Ready,
Ok(()) => WorkerStatus::Ready,

@flomonster flomonster left a comment


Thanks for this PR. I have not tried it yet, but I think it does not work as expected.

How each element behaves (if I have understood the PR code correctly):

  • The status endpoint (core side) returns "ready" and is only acked once the core is ready (infra + cache loaded).
  • Status messages have a timeout set to 10s.
  • When the status is not ready, we send another message to load the worker.
  • "load_worker" messages are only acked once the core is ready.

How the system behaves (given the previous elements):

  • Opening the STDCM page for the first time will only load a worker 10 seconds after the page is opened, which wastes 10 seconds for ‘nothing’.
  • Each worker_load editoast endpoint call will take 10s, which is weird from an API user point of view.
  • Calling the worker_load endpoint multiple times while the worker is loading will accumulate worker_load messages in the queue, which might trigger scaling up another worker. Since the frontend polls the status, this will probably happen.

I don't know how to fix these issues easily:

  • Reducing the timeout is problematic when there are many messages in the queue and core is up.
  • We must differentiate between a worker that is still loading and no worker at all, so that we only trigger a load in the second case (see the sketch below).
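
One possible shape for that distinction, as a sketch (only the Ready variant is visible in this PR; the other variants are assumptions):

    enum WorkerStatus {
        /// No worker exists for this key yet; a load must be scheduled.
        NotStarted,
        /// A worker exists but has not finished loading infra + timetable.
        Loading,
        /// The worker answered the status probe.
        Ready,
    }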


Khoyo commented Oct 21, 2025

Calling the worker_load endpoint multiple times while the worker is loading will accumulate worker_load messages in the queue

The TTL of the messages on the queue is directly linked to the timeout of the messages in editoast, so a worker_load message editoast has given up on expires from the queue instead of accumulating.
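
A rough sketch of that link, assuming the lapin AMQP client (the helper and its name are illustrative, not the PR's code): each request is published with a per-message expiration equal to the editoast timeout, so RabbitMQ drops it from the queue once editoast has stopped waiting.

    use lapin::{options::BasicPublishOptions, BasicProperties, Channel};
    use std::time::Duration;

    /// Publish a core request whose queue TTL matches the editoast-side
    /// timeout, so core never picks up a request editoast gave up on.
    async fn publish_with_ttl(
        channel: &Channel,
        routing_key: &str,
        payload: &[u8],
        timeout: Duration,
    ) -> Result<(), lapin::Error> {
        // RabbitMQ expresses per-message TTL in milliseconds, as a string.
        let ttl_ms = timeout.as_millis().to_string();
        channel
            .basic_publish(
                "", // default exchange (an assumption for this sketch)
                routing_key,
                BasicPublishOptions::default(),
                payload,
                BasicProperties::default().with_expiration(ttl_ms.into()),
            )
            .await? // queued on the channel
            .await?; // publisher confirm from the broker
        Ok(())
    }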

Opening the STDCM page for the first time will only load a worker 10 seconds after the page is opened, which wastes 10 seconds for ‘nothing’.

We could call worker load eagerly. If a core worker is up and has the correct infra version, it would be a no-op. We could also reduce the worker load TTL, or even remove the status endpoint entirely if we never want to know the status of core without loading an infra/timetable.

Note that this might even be better: currently the status endpoint doesn't use the infra_version.

Each worker_load editoast endpoint call will take 10s, which is weird from an API user point of view.

Yeah, that's not ideal, but that's the tradeoff of making an actual status check versus trying to track the status. Note that they only take 10s if core isn't responding; otherwise they stay fast.


Labels

area:core (Work on Core Service), area:editoast (Work on Editoast Service), area:osrdyne

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The worker core appears ready even though it has not finished loading the timetable.

5 participants