Member

@volokluev volokluev commented Sep 17, 2025

This PR implements a routing strategy that routes queries to TIER_1 and shrinks the time window instead of falling back to downsampled storage tiers.

Design decisions

Window Sizing Algorithm

This is the easiest thing to change in this PR (and it likely will change). At the moment, it looks up the number of outcomes for the requested time range and shrinks the time window accordingly, assuming that datapoints are uniformly distributed across time (which is not true). The most recent datapoints are prioritized: the window is shrunk toward its end timestamp.
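A minimal sketch of the sizing arithmetic described above, based on the diff fragments in this PR (`factor = ingested_items / max_items`, new start = end minus `floor(window_length / factor)`). The function name `shrink_window`, the plain-integer timestamps, and the rule that no adjustment happens when the estimate already fits are assumptions for illustration; the actual code works with `TimestampProto` and `RoutingContext`.

```python
import math
from typing import Optional, Tuple


def shrink_window(
    start_ts: int, end_ts: int, ingested_items: int, max_items: int
) -> Optional[Tuple[int, int]]:
    """Return a narrowed (start, end) window in epoch seconds.

    Returns None when no adjustment is needed (assumed here to mean the
    estimated item count already fits within max_items).
    """
    if max_items <= 0:
        raise ValueError("max_items must be positive")

    # How many times too large the estimated result set is.
    factor = ingested_items / max_items
    if factor <= 1:
        return None

    # Uniform-distribution assumption: 1/factor of the window holds
    # roughly max_items datapoints.
    window_length = end_ts - start_ts
    new_start = end_ts - math.floor(window_length / factor)

    # Keep the original end so the most recent data is served first.
    return (new_start, end_ts)
```

For example, a one-hour window estimated to hold ten times the item budget would be cut to its most recent six minutes.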

Pagination

Only the TraceItemTable endpoint makes use of this routing strategy's recommendations. The endpoint and the routing strategy interact across queries, which keeps the client-side UX simple: all the client has to do is pass the page_token along with each request and not worry about anything else.

Here's a diagram explaining the flow:

┌─────────────────┐              ┌────────────────┐                               
│                 │              │                │                               
│                 │              │                │                               
│                 │              │   Routing      │                               
│   Client        ├─────────────►│   Strategy     ┼───────────────────┐           
│                 │  page_token  │                │                   │           
│                 │              │                │                   │           
│                 │              │                │            Narrows│time window
└─────────────────┘              └────────────────┘            For Endpoint       
        ▲                                                             │           
        │                                                             │           
        │                         ┌────────────────┐                  │           
        │                         │                │                  │           
        │                         │                │                  │           
        │                         │ TraceItemTable │◄─────────────────┘           
        └─────────────────────────┼ Endpoint       │                              
             Encodes              │                │                              
             Time Window          └────────────────┘                              
             In Page Token                                                        
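To illustrate the idea of carrying the time window inside the page token, here is a hypothetical sketch. The token format used by this PR is not shown in this conversation; the JSON plus base64 encoding, the field names, and the function names below are all assumptions chosen only to make the round trip concrete.

```python
import base64
import json
from typing import Tuple


def encode_page_token(start_ts: int, end_ts: int, offset: int = 0) -> str:
    """Pack the current window and row offset into an opaque string (hypothetical format)."""
    payload = json.dumps(
        {"start": start_ts, "end": end_ts, "offset": offset}, sort_keys=True
    )
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_page_token(token: str) -> Tuple[int, int, int]:
    """Recover (start_ts, end_ts, offset) from a token produced above."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return payload["start"], payload["end"], payload["offset"]
```

The client treats the token as opaque and simply echoes it back, which is what keeps the client side simple.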

To facilitate pagination, the TraceItemTable endpoint now queries for limit + 1 rows so that it knows whether there are more items in the current window or whether it can move on to the next one.
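The limit + 1 technique can be sketched as follows. The helper name `split_page` is hypothetical and the rows are plain Python values rather than the endpoint's actual result type.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_page(rows: Sequence[T], limit: int) -> Tuple[List[T], bool]:
    """Split rows fetched with LIMIT limit + 1 into a page and a has-more flag.

    If the extra row came back, the current window still has unseen items;
    otherwise the window is exhausted and the next one can be queried.
    """
    has_more_in_window = len(rows) > limit
    # Never return the sentinel row to the client.
    return list(rows[:limit]), has_more_in_window
```

Fetching one extra row avoids a separate count query just to decide whether the window is exhausted.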

What's missing

  1. This functionality can be tested more rigorously. I deliberately did not spend too much time on testing because I know it will change, and the priority is to get something out there to try.
  2. We could probably have more observability into what the strategy is doing and a better understanding of our success metrics. These will be added as we understand the problem more.

@volokluev volokluev requested review from a team as code owners September 17, 2025 23:58
@volokluev volokluev changed the title feat(cbrs): Time Window Routing feat(cbrs): Time Window Routing (DO NOT MERGE) Sep 17, 2025
routing_context.extra_info["estimation_sql"] = res.extra.get("sql", "")
return cast(int, res.result.get("data", [{}])[0].get("num_items", 0))

def _adjust_time_window(self, routing_context: RoutingContext) -> TimeWindow | None:
Member

when does this function return None?

Member Author

if there is no adjustment to be made

window_length = original_end_ts - original_start_ts

start_timestamp_proto = TimestampProto(
seconds=original_end_ts - math.floor((window_length / factor))
Member

Is this so that we prioritize more recent data, and the user will paginate forwards?

Member Author

correct

start_timestamp: TimestampProto
end_timestamp: TimestampProto

def length_hours(self) -> float:
Member

where is this used?

Member Author

debugging :D

ingested_items = self.get_ingested_items_for_timerange(
    routing_context, original_time_window
)
factor = ingested_items / max_items
Member Author

this is how we actually shrink the time window

Comment on lines 44 to 47
# TODO import these from sentry-relay
class OutcomeCategory:
    SPAN_INDEXED = 16
    LOG_ITEM = 23
Member Author

copy pasta, remove

Comment on lines 54 to 57
_ITEM_TYPE_TO_OUTCOME = {
    TraceItemType.TRACE_ITEM_TYPE_SPAN: OutcomeCategory.SPAN_INDEXED,
    TraceItemType.TRACE_ITEM_TYPE_LOG: OutcomeCategory.LOG_ITEM,
}
Member Author

copy pasted, remove this

column("category"),
_ITEM_TYPE_TO_OUTCOME.get(
in_msg_meta.trace_item_type,
OutcomeCategory.SPAN_INDEXED,
Member Author

this is wrong, the default should not be span indexed
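One possible way to avoid a silent fallback, sketched with stand-in types: fail loudly when an item type has no mapped outcome category instead of defaulting to `SPAN_INDEXED`. The `ItemType` enum and the `outcome_category_for` helper are hypothetical and stand in for `TraceItemType` and the lookup in the diff. Only the two category values (16 and 23) come from the PR, and this is not necessarily the fix that was applied.

```python
from enum import Enum


class ItemType(Enum):
    # Stand-in for the TraceItemType proto enum; values are illustrative only.
    SPAN = "span"
    LOG = "log"
    OTHER = "other"


class OutcomeCategory:
    SPAN_INDEXED = 16
    LOG_ITEM = 23


_ITEM_TYPE_TO_OUTCOME = {
    ItemType.SPAN: OutcomeCategory.SPAN_INDEXED,
    ItemType.LOG: OutcomeCategory.LOG_ITEM,
}


def outcome_category_for(item_type: ItemType) -> int:
    """Look up the outcome category, raising instead of guessing a default."""
    try:
        return _ITEM_TYPE_TO_OUTCOME[item_type]
    except KeyError:
        raise ValueError(
            f"no outcome category mapped for {item_type!r}"
        ) from None
```

Raising here means an unmapped item type surfaces as an error rather than producing an item estimate based on span counts.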

@volokluev volokluev changed the title feat(cbrs): Time Window Routing (DO NOT MERGE) feat(cbrs): Time Window Routing Sep 19, 2025
codecov bot commented Sep 19, 2025

✅ All tests passed in 1293.61s
