[DENG-9732] Filter out remotecontent events from event etl #8140
base: main
Conversation
AND
)
AND
) IS FALSE
issue: Using IS FALSE means that the event will be excluded if one or more of the fields being checked is null (including when an event extra being checked is not present) while the rest of the non-null field conditions match. If that isn't the intended behavior, I'd recommend using IS NOT TRUE instead (same goes for the other usages of IS FALSE).
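For example, the difference when the wrapped condition evaluates to NULL (a standalone illustration, not the actual filter):

SELECT
  -- NULL IS FALSE evaluates to FALSE, so the row would be filtered out
  CAST(NULL AS BOOL) IS FALSE AS kept_with_is_false,
  -- NULL IS NOT TRUE evaluates to TRUE, so the row would be kept
  CAST(NULL AS BOOL) IS NOT TRUE AS kept_with_is_not_true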
TIMESTAMP_ADD(
  TIMESTAMP_TRUNC(submission_timestamp, HOUR),
  -- Aggregates event counts over 60-minute intervals
  INTERVAL(DIV(EXTRACT(MINUTE FROM submission_timestamp), 60) * 60) MINUTE
) AS window_start,
TIMESTAMP_ADD(
  TIMESTAMP_TRUNC(submission_timestamp, HOUR),
  INTERVAL((DIV(EXTRACT(MINUTE FROM submission_timestamp), 60) + 1) * 60) MINUTE
) AS window_end,
suggestion (non-blocking): I see this was existing code just being moved into this CTE, but this logic doesn't make sense to me, since the extracted minute will always be less than 60, in which case DIV(EXTRACT(MINUTE FROM submission_timestamp), 60) would always evaluate to zero, and this could be simplified:
-  TIMESTAMP_ADD(
-    TIMESTAMP_TRUNC(submission_timestamp, HOUR),
-    -- Aggregates event counts over 60-minute intervals
-    INTERVAL(DIV(EXTRACT(MINUTE FROM submission_timestamp), 60) * 60) MINUTE
-  ) AS window_start,
-  TIMESTAMP_ADD(
-    TIMESTAMP_TRUNC(submission_timestamp, HOUR),
-    INTERVAL((DIV(EXTRACT(MINUTE FROM submission_timestamp), 60) + 1) * 60) MINUTE
-  ) AS window_end,
+  TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS window_start,
+  TIMESTAMP_ADD(
+    TIMESTAMP_TRUNC(submission_timestamp, HOUR),
+    INTERVAL 1 HOUR
+  ) AS window_end,
Or am I missing something?
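A quick sanity check of that claim (EXTRACT(MINUTE ...) always returns 0-59, so the DIV by 60 is always 0):

SELECT
  -- Minute is 59, DIV(59, 60) = 0, so the added interval is always 0 minutes
  DIV(EXTRACT(MINUTE FROM TIMESTAMP "2025-01-01 12:59:00 UTC"), 60) AS always_zero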
Yes, it's equivalent. I'll change it.
event_name,
event_extra_key,
country,
normalized_app_name,
suggestion (blocking): I presume trying to group by normalized_app_name would fail, given you removed that column from this CTE above.
normalized_app_name,
The query works because normalized_app_name is in the source table, but yes, this should be removed.
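For context: BigQuery allows grouping by a column of the queried table even when it isn't in the SELECT list; a minimal illustration with a made-up table:

WITH src AS (
  SELECT 1 AS total_events, "Fenix" AS normalized_app_name
)
-- normalized_app_name is not selected, but grouping by it still works
SELECT
  SUM(total_events) AS total_events
FROM src
GROUP BY normalized_app_name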
experiment,
experiment_branch
nitpick: Indentation.
"{{ dataset['canonical_app_name'] }}" AS normalized_app_name, | ||
* REPLACE (SUM(total_events) AS total_events), |
issue: This is changing the order of the columns being returned, which seems like something we should avoid if we can.
"{{ dataset['canonical_app_name'] }}" AS normalized_app_name, | |
* REPLACE (SUM(total_events) AS total_events), | |
submission_date, | |
window_start, | |
window_end, | |
event_category, | |
event_name, | |
event_extra_key, | |
country, | |
"{{ dataset['canonical_app_name'] }}" AS normalized_app_name, | |
channel, | |
version, | |
experiment, | |
experiment_branch, | |
SUM(total_events) AS total_events, |
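To illustrate with made-up columns: * REPLACE keeps the replaced column in its original position, but a column listed before * is prepended to the output, which is what shifts the order here:

WITH base AS (
  SELECT DATE "2025-01-01" AS submission_date, "US" AS country, 5 AS total_events
)
-- Output columns come back as: normalized_app_name, submission_date, country, total_events
SELECT
  "Firefox" AS normalized_app_name,
  * REPLACE (total_events * 2 AS total_events)
FROM base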
AND (
  event.category = "uptake.remotecontent.result"
  AND event.name IN ("uptake_remotesettings", "uptake_normandy")
  AND mozfun.norm.extract_version(client_info.app_display_version, 'major') >= 143
suggestion (non-blocking): Here we have an app_version_major column readily available.
-  AND mozfun.norm.extract_version(client_info.app_display_version, 'major') >= 143
+  AND app_version_major >= 143
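Combined with the earlier IS NOT TRUE note, the exclusion filter might end up looking roughly like this (a sketch, not the exact query):

WHERE
  (
    event.category = "uptake.remotecontent.result"
    AND event.name IN ("uptake_remotesettings", "uptake_normandy")
    AND app_version_major >= 143
    -- IS NOT TRUE also keeps events where any of these fields is NULL,
    -- instead of dropping them as IS FALSE would
  ) IS NOT TRUE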
Integration report for "Clean up + only filter release channel"
Description
This filters the high-volume uptake.remotecontent.result events from the legacy and Glean event tables. This might fix the event_monitoring_aggregates_v1 failures, but I'm still waiting for the queries to finish. I also rearranged the event_monitoring_aggregates_v1 query a bit to aggregate events per ping before aggregating per app, since I found that it uses slightly less slot time and shuffle.
Related Tickets & Documents
Reviewer, please follow this checklist