-
-
Notifications
You must be signed in to change notification settings - Fork 61
ref(cbrs): move EAP allocation policies into the routing strategy layer #7411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4b54c00 to
325e6eb
Compare
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a refactor of https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py, because it's also used in outcomes based strategy
snuba/web/db_query.py
Outdated
| if attribution_info.app_id.key == "eap": | ||
| return | ||
|
|
||
| quota_allowances: dict[str, QuotaAllowance] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to avoid executing https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L848-L890, which is implemented elsewhere in the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the reason why I decided to use attribution_info.app_id.key instead of team or feature is because they can also be set to ourlogs even for eap queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are no allocation policies on the storage, you won't execute anything though right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing would execute for eap because I added the if statement, but if i didnt do that then https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L848-L890 would've executed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very hacky way to do this. You're depending on some metadata injected way up somewhere in the pipeline to be set to a specific value.
The much more reasonable thing would be to return early if allocation_policies is empty
snuba/web/db_query.py
Outdated
| if attribution_info.app_id.key == "eap": | ||
| return | ||
|
|
||
| quota_allowances: dict[str, QuotaAllowance] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just changing the type from Any to QuotaAllowance because that's more accurate
snuba/web/rpc/__init__.py
Outdated
| span.description = self.config_key() | ||
| self.routing_context = RoutingContext(timer=self._timer, in_msg=in_msg) | ||
| self.routing_context = RoutingContext( | ||
| timer=self._timer, in_msg=in_msg, query_id=uuid.uuid4().hex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query_id is later needed for getting & updating quota allowances, so might as well just set it here. Originally query_id is randomly generated right before applying allocation policies, and it seems like the only thing that matters is that query_id stays consistent for getting & updating quota allowances, because it later gets reset in the query pipeline, and ofc for querylog
| "request_invalid", | ||
| tags=self._timer.tags, | ||
| ) | ||
| elif isinstance(error, AllocationPolicyViolations): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So actually, this metric was never raised because if therre is a AllocationPolicyViolations, then we raise a QueryException: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L667-L690
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| out = self._execute(in_msg) | ||
| else: | ||
| raise AllocationPolicyViolations | ||
| self.metrics.increment( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also open to just not having this metric because we emit more detailed ones in output_metrics
snuba/web/rpc/__init__.py
Outdated
| ) | ||
| # thoughts on making AlocationPolicyViolations a subclass of QueryException? QueryOrError expects QueryException | ||
| raise QueryException.from_args( | ||
| AllocationPolicyViolations.__name__, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snuba/web/rpc/__init__.py
Outdated
| output_metrics_error = None | ||
| try: | ||
| self.routing_decision.strategy.output_metrics(self.routing_decision) | ||
| if self.routing_decision.routing_context.query_result is not None or isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I'm trying to do: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L708-L720
snuba/web/rpc/__init__.py
Outdated
| tags=self._timer.tags, | ||
| ) | ||
| else: | ||
| elif not isinstance(error, QueryException): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why I wanted AllocationPolicyViolations to inherit from QueryException (and then I would raise AllocationPolicyViolations as AllocationPolicyViolations, so then this metric gets emitted for non-allocation policy violation errors, since rate limiting shouldn't count as request_error anyways
| ) as span: | ||
| request_meta = extract_message_meta(routing_context.in_msg) | ||
| recommendations[allocation_policy_name] = allocation_policy.get_quota_allowance( | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that this is what tenant_ids is across all endpoints: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L825
Maybe we should pull this into RoutingContext, but it's getting a little bloated idk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's only one project id we use the project id
| ) | ||
| record_query(_construct_hacky_querylog_payload(self, routing_decision)) | ||
|
|
||
| @final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| }, | ||
| "time_budget": 8000, | ||
| }, | ||
| "allocation_policies_recommendations": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what the new querylog payload looks like;
{
"sql": "",
"sql_anonymized": "",
"start_timestamp": 1758491216,
"end_timestamp": 1758494816,
"stats": {
"final": false,
"cache_hit": 0,
"max_threads": 0,
"clickhouse_table": "na",
"query_id": "dc409fabfddd452f8c842511a24dcc62",
"is_duplicate": 0,
"consistent": false,
"can_run": true,
"is_throttled": false,
"strategy": "MetricsStrategy",
"source_request_id": "fe40d336-153b-4257-8a3e-b9b5fec9f854",
"extra_info": {
"sampling_in_storage_estimation_time_overhead": {
"type": "timing",
"value": 19,
"tags": null
},
"time_budget": 8000,
"sampling_in_storage_routing_success": {
"type": "increment",
"value": 1,
"tags": {
"tier": "TIER_8"
}
},
"sampling_in_storage_my_metric": {
"type": "increment",
"value": 1,
"tags": {
"a": "b",
"c": "d"
}
},
"sampling_in_storage_query_timing": {
"type": "timing",
"value": 1.0,
"tags": {
"tier": "TIER_8"
}
}
},
"clickhouse_settings": {},
"result_info": {
"meta": {},
"profile": {
"bytes": 420,
"elapsed": 1.0
},
"stats": {
"stat": 1
},
"sql": "SELECT * FROM your_mom"
},
"routed_tier": "TIER_8",
"allocation_policies_recommendations": {
"ConcurrentRateLimitAllocationPolicy": {
"can_run": true,
"max_threads": 10,
"explanation": {},
"is_throttled": false,
"throttle_threshold": 22,
"rejection_threshold": 22,
"quota_used": 1,
"quota_unit": "concurrent_queries",
"suggestion": "no_suggestion",
"max_bytes_to_read": 0
},
"ReferrerGuardRailPolicy": {
"can_run": true,
"max_threads": 10,
"explanation": {},
"is_throttled": false,
"throttle_threshold": 1000000000000,
"rejection_threshold": 1000000000000,
"quota_used": 0,
"quota_unit": "no_units",
"suggestion": "no_suggestion",
"max_bytes_to_read": 0
},
"BytesScannedRejectingPolicy": {
"can_run": true,
"max_threads": 10,
"explanation": {},
"is_throttled": false,
"throttle_threshold": 1000000000000,
"rejection_threshold": 1000000000000,
"quota_used": 0,
"quota_unit": "no_units",
"suggestion": "no_suggestion",
"max_bytes_to_read": 0
}
}
},
"status": "0",
"trace_id": "",
"profile": {
"time_range": null,
"table": "eap_items",
"all_columns": [],
"multi_level_condition": false,
"where_profile": {
"columns": [],
"mapping_cols": []
},
"array_join_cols": [],
"groupby_cols": []
},
"result_profile": {
"bytes": 420,
"progress_bytes": 0,
"elapsed": 1.0
},
"request_status": "na",
"slo": "na"
}
snuba/util.py
Outdated
| def get_max_bytes_to_read(quota_allowances: dict[str, QuotaAllowance]) -> int: | ||
| """ | ||
| alternative implementation: | ||
| non_zero_max_bytes = [qa.max_bytes_to_read for qa in quota_allowances.values() if qa.max_bytes_to_read > 0] | ||
| return min(non_zero_max_bytes) if non_zero_max_bytes else 0 | ||
| """ | ||
| max_bytes_to_read = min( | ||
| [qa.max_bytes_to_read for qa in quota_allowances.values()], | ||
| key=lambda mb: float("inf") if mb == 0 else mb, | ||
| ) | ||
| if max_bytes_to_read != 0: | ||
| return max_bytes_to_read | ||
| return 0 | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too high up in the folder structure. It's specific to allocation policies
snuba/web/rpc/__init__.py
Outdated
| "request_rate_limited", | ||
| tags=self._timer.tags, | ||
| ) | ||
| # thoughts on making AlocationPolicyViolations a subclass of QueryException? QueryOrError expects QueryException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure go for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decided against it because it's suppsoed to be chained together not inheritance relationship: https://github.com/getsentry/snuba/blob/master/snuba/web/__init__.py#L20-L22
| ) | ||
|
|
||
| def _get_routing_decision(self, routing_context: RoutingContext) -> RoutingDecision: | ||
| def _get_combined_allocation_policies_recommendations( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other routing strategies will use this. I would put this into a utility function in the allocation policies. All this is doing is combining the results of many policies which should fall into the responsibility of the allocation policy subsystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in person, going to put this in BaseRoutingStrategy and other strats can override it. It's up to the routing strategy to decide how to combine recommendations, whether its voting or weighted voting or all or any or whatever idk
5aaa5c4 to
6731d49
Compare
snuba/web/rpc/__init__.py
Outdated
| "experiments": {}, | ||
| }, | ||
| ) | ||
| error.__cause__ = AllocationPolicyViolations.from_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the QueryException "chaining"
| recommendations[allocation_policy_name] = allocation_policy.get_quota_allowance( | ||
| { | ||
| "organization_id": request_meta.organization_id, | ||
| "referrer": request_meta.referrer, | ||
| }, | ||
| routing_context.query_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential bug: The call to get_quota_allowance is missing the required project_id, causing two critical allocation policies to systematically fail for all EAP queries.
-
Description: The
get_quota_allowancemethod is called with a dictionary containing onlyorganization_idandreferrer. However, two of the three allocation policies,ConcurrentRateLimitAllocationPolicyandBytesScannedRejectingPolicy, require aproject_id. This omission causes them to raise anInvalidTenantsForAllocationPolicyexception. While this exception is handled and results in acan_run=Falseresponse instead of a crash, it constitutes a significant functional bug. It means that these critical resource management policies are not being correctly applied to EAP queries, leading to all such queries being rejected due to missing tenant information rather than actual resource constraints. -
Suggested fix: The
project_idneeds to be extracted fromrequest_meta.project_idsand included in the dictionary passed toallocation_policy.get_quota_allowance. Sinceproject_idsis a list, logic should be added to select the appropriateproject_idfrom the list, likely the first one if the list is not empty.
severity: 0.7, confidence: 0.95
Did we get this right? 👍 / 👎 to inform future reviews.
| def get_max_bytes_to_read(quota_allowances: dict[str, QuotaAllowance]) -> int: | ||
| """ | ||
| alternative implementation: | ||
| non_zero_max_bytes = [qa.max_bytes_to_read for qa in quota_allowances.values() if qa.max_bytes_to_read > 0] | ||
| return min(non_zero_max_bytes) if non_zero_max_bytes else 0 | ||
| """ | ||
| max_bytes_to_read = min( | ||
| [qa.max_bytes_to_read for qa in quota_allowances.values()], | ||
| key=lambda mb: float("inf") if mb == 0 else mb, | ||
| ) | ||
| if max_bytes_to_read != 0: | ||
| return max_bytes_to_read | ||
| return 0 | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this file already has a bunch of stuff in it. I would put this function it its own file
snuba/web/db_query.py
Outdated
| if attribution_info.app_id.key == "eap": | ||
| return | ||
|
|
||
| quota_allowances: dict[str, QuotaAllowance] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very hacky way to do this. You're depending on some metadata injected way up somewhere in the pipeline to be set to a specific value.
The much more reasonable thing would be to return early if allocation_policies is empty
| alternative implementation: | ||
| non_zero_max_bytes = [qa.max_bytes_to_read for qa in quota_allowances.values() if qa.max_bytes_to_read > 0] | ||
| return min(non_zero_max_bytes) if non_zero_max_bytes else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the point of this docstring?
snuba/web/rpc/__init__.py
Outdated
| if self.routing_decision.routing_context.query_result is not None or isinstance( | ||
| error, QueryException | ||
| ): | ||
| self.routing_decision.strategy.update_allocation_policies_balances(self.routing_decision, error) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't quite smell right to me. The routing strategy should contain this functionality within itself. it could be that output_metrics is not a good name for the method being called above but overall, there's something a routing strategy is supposed to do before the call (decide the routing tier/time window/ etc) and after the call (output metrics, update quota balances, etc). This is relying on the caller to know something three layers deep. That is a leaky abstraction
| ) as span: | ||
| request_meta = extract_message_meta(routing_context.in_msg) | ||
| recommendations[allocation_policy_name] = allocation_policy.get_quota_allowance( | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's only one project id we use the project id
48bf012 to
dc0294c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good progress, a few usability questions and suggestions
snuba/web/rpc/__init__.py
Outdated
| elif not ( | ||
| isinstance(error, QueryException) | ||
| and error.exception_type == AllocationPolicyViolations.__name__ | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is kind of ugly, is there no other way to do it?
snuba/web/rpc/__init__.py
Outdated
| if routing_strategy_after_execute_error is not None: | ||
| self.metrics.increment("after_execute_failure") | ||
| sentry_sdk.capture_message( | ||
| f"Error in routing strategy output metrics: {output_metrics_error}" | ||
| f"Error in routing strategy after execute: {routing_strategy_after_execute_error}" | ||
| ) | ||
| if settings.RAISE_ON_ROUTING_STRATEGY_FAILURES: | ||
| raise output_metrics_error | ||
| raise routing_strategy_after_execute_error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this exception handling should probably be taken care of in the routing strategies
| combined_allocation_policies_recommendations = ( | ||
| self._get_combined_allocation_policies_recommendations(routing_context) | ||
| ) | ||
|
|
||
| routing_decision = RoutingDecision( | ||
| routing_context=routing_context, | ||
| strategy=self, | ||
| tier=Tier.TIER_1, | ||
| clickhouse_settings={}, | ||
| can_run=True, | ||
| clickhouse_settings=combined_allocation_policies_recommendations["settings"], | ||
| can_run=combined_allocation_policies_recommendations["can_run"], | ||
| is_throttled=combined_allocation_policies_recommendations["is_throttled"], | ||
| ) | ||
|
|
||
| if not routing_decision.can_run: | ||
| return routing_decision | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have more than one routing strategy in use right now. Can we provide sane default behavior so that this doesn't have to be copy-pasted into every new routing strategy we use?
| with mock.patch.object( | ||
| BaseRoutingStrategy, | ||
| "get_allocation_policies", | ||
| return_value=[ | ||
| ThrottleAllocationPolicy( | ||
| ResourceIdentifier(StorageKey("doesntmatter")), ["a", "b", "c"], {} | ||
| ), | ||
| ThrottleAllocationPolicyDuplicate( | ||
| ResourceIdentifier(StorageKey("doesntmatter")), ["a", "b", "c"], {} | ||
| ), | ||
| ], | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why mock when you can extend and override?
789a200 to
c7cfcda
Compare
snuba/web/rpc/__init__.py
Outdated
| tenant_ids={ | ||
| "organization_id": request_meta.organization_id, | ||
| "referrer": request_meta.referrer, | ||
| **( | ||
| {"project_id": request_meta.project_ids[0]} | ||
| if hasattr(request_meta, "project_ids") and len(request_meta.project_ids) == 1 | ||
| else {} | ||
| ), | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like something that should be done in the routing strategy, not the rpc endpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not set an empty tenant_ids in rpc endpoint and then populate it in routing strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus, routing strategy selector should be using the fields set in tenant_ids (ie organization_id & project_id), and routing strategy selector happens before routing strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need tenant_ids at all in the routing context. This is information can entirely be calculated from the request_meta object. A cleaner solution would be to define a tenant_ids property:
class RoutingContext:
@property
def tenant_ids(self):
return {
"organization_id": request_meta.organization_id,
"referrer": request_meta.referrer,
**(
{"project_id": request_meta.project_ids[0]}
if hasattr(request_meta, "project_ids") and len(request_meta.project_ids) == 1
else {}
)
}
snuba/web/rpc/__init__.py
Outdated
| "request_rate_limited", | ||
| tags=self._timer.tags, | ||
| ) | ||
| error = QueryException.from_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reason why you have to wrap it in a QueryException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A combination of:
- this is an existing pattern: https://github.com/getsentry/snuba/blob/master/snuba/web/db_query.py#L682-L691.
- I want this to be caught by: https://github.com/getsentry/snuba/blob/master/snuba/web/rpc/__init__.py#L424-L425
- Making AllocationPolicyViolations inherit from QueryException goes against the initial design: ref(cbrs): move EAP allocation policies into the routing strategy layer #7411 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment is not blocking for approval but here are my thoughts in reply:
- I think this pattern has always been fairly dysfunctional and we're just perpetuating it. The only thing that really expects a
QueryExceptionis the AllocationPolicy:
https://github.com/getsentry/snuba/blob/master/snuba/query/allocation_policies/__init__.py#L45
but this could easily be changed to allow any kind of errors, especially AllocationPolicyViolation(s)
-
You could make a
RPCAllocationPolicyException(RPCException)class and raise it here and it would be caught -
You don't need to inherit it from
QueryExceptionit can just be its own thing.QueryExceptiondenotes a problem with clickhouse query execution and we're now muddying its meaning to also mean the RPC request failed which it could be two differeint things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're not going to make the updates above, at the very least I would ask that this ugly chaining thing be abstracted into its own component rather than taking up a whole bunch of space in the RPC method. This is a core component and its functionality should be easy to understand at a glance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heres a place that also expects a QueryException: https://github.com/getsentry/snuba/blob/master/snuba/web/views.py#L468-L471
I think for now I'll just abstract it into its own component and make a follow up pr
snuba/web/rpc/storage_routing/routing_strategies/storage_routing.py
Outdated
Show resolved
Hide resolved
snuba/web/rpc/storage_routing/routing_strategies/storage_routing.py
Outdated
Show resolved
Hide resolved
snuba/web/rpc/storage_routing/routing_strategies/storage_routing.py
Outdated
Show resolved
Hide resolved
ad16694 to
0282a73
Compare
20288e8 to
e2015f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with recommended changes so you're not blocked while I'm on PTO. Please give consideration to my suggestions though and make sure you have a rollout plan

https://linear.app/getsentry/issue/EAP-79/move-eap-allocation-policies-into-the-outcomes-routing-strategy