Support exemplars in WSGI and ASGI histogram #3739
Conversation
Force-pushed from 26d8015 to dd03138
I recently came across this issue in the FastAPI instrumentation and tracked it down to the underlying ASGI instrumentation, though it seems to be present in WSGI and Tornado as well, as mentioned in the issue. Just thought I would chip in here a bit. I see two possible solutions:

Option A: defer closing the span until metrics are recorded. This is what you have started on for WSGI and seems like the more intuitive and clean method. The span will end a bit later with this method, but the difference should be negligible.

Option B: re-create the span context after closing the span and pass it to the metrics:

```python
finally:
    ...
    # Create span context for exemplars
    span_ctx = set_span_in_context(span)
    if self.duration_histogram_old:
        duration_attrs_old = _parse_duration_attrs(
            req_attrs, _StabilityMode.DEFAULT, context=span_ctx,
        )
    ...
```

This is a more minimal and local patch, but less clean. I'd be happy to chip in if you need any assistance with the PR.
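For concreteness, here is a minimal sketch of what Option A could look like, with hypothetical wrapper and instrument names (the real instrumentation code is more involved):

```python
from timeit import default_timer

from opentelemetry import metrics, trace

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
duration_histogram = meter.create_histogram(
    "http.server.request.duration", unit="s"
)


def call_wsgi_app(wsgi_app, environ, start_response):
    # Hypothetical wrapper: keep the span current until the histogram
    # has been recorded, and only end the span afterwards.
    span = tracer.start_span("HTTP request")
    start = default_timer()
    try:
        with trace.use_span(span, end_on_exit=False):
            try:
                return wsgi_app(environ, start_response)
            finally:
                # The span is still the active span here, so the
                # exemplar reservoir samples its trace and span IDs.
                duration_histogram.record(default_timer() - start)
    finally:
        span.end()
```

The trade-off is the one noted above: the span's end timestamp now also covers the metric recording, which should be negligible.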
Thanks for the help! I properly linked the issue now if there's more discussion here. Like I said in the ticket, I'm not sure if the span context is the issue (unless I have misunderstood something), but let's continue discussing there 🙏 I'd love another set of eyes.
I made a PR felix-hilden#1 with my proposed change into your branch, including some tests as well. The diff is a bit large since it's based on a more recent version of the upstream repo.
Force-pushed from dd03138 to dba7521
Hi @felix-hilden and @Chromatius, thanks for creating this! Are either of you still working on this to go from WIP to Ready? I faced a similar issue with a FastAPI app, in particular the missing exemplars.

otel-collector debug output for reference:

```
Metric #1
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests.
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> url.scheme: Str(http)
     -> network.protocol.version: Str(1.1)
     -> http.request.method: Str(GET)
     -> http.route: Str(/latency/)
     -> http.response.status_code: Int(200)
StartTimestamp: 2025-11-07 23:12:16.852295758 +0000 UTC
Timestamp: 2025-11-07 23:12:26.860803262 +0000 UTC
Count: 5
Sum: 1.119976
Min: 0.152042
Max: 0.302494
Bucket [-1.010889, -1.000000), Count: 0
Bucket (0.151906, 0.153560], Count: 1
Bucket (0.153560, 0.155232], Count: 0
Bucket (0.155232, 0.156923], Count: 0
...
Exemplars:
Exemplar #0
     -> Trace ID: 87672a88b4077d5d8a93e7575a66f456
     -> Span ID: 2570bb41a5115383
     -> Timestamp: 2025-11-07 23:12:16.909593758 +0000 UTC
     -> Value: 0.270580
Exemplar #1
     -> Trace ID: c7410a16f8bfeafa98555181db29f29b
     -> Span ID: 5f597c83aaae16c5
     -> Timestamp: 2025-11-07 23:12:19.284033092 +0000 UTC
     -> Value: 0.302494
Exemplar #2
     -> Trace ID: 0904ec56214e0bf09df9e127aaa8e9fe
     -> Span ID: da49599715c86831
     -> Timestamp: 2025-11-07 23:12:21.503676343 +0000 UTC
     -> Value: 0.152042
```
Awesome! Could you please check the failing unit tests in the CI/CD? Please also do a `tox -e ruff` run.
Reviewed lines:

```python
def _iterate_and_close_with_span(
    iterable: Iterable[T], span: trace.Span
```
There are some more test failures that seem to have to do with missing span attributes. It might be that the span is ending too early.
What about this?:

- Change `_iterate_and_close_with_span` back to `_end_span_after_iterating`, with the `span.end` and `context.detach` call.
- Keep the `span.end` and `context.detach` in the `except` block above. Essentially keep the span ending as before.
- But keep everything to do with `set_span_in_context`, because that is the main fix, similar to "Fix FlaskInstrumentor exemplars generation for http.server.(request.)duration" #3912 (see the sketch below).
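A rough sketch of that shape, with made-up helper and variable names rather than the PR's actual code:

```python
from timeit import default_timer

from opentelemetry import context, metrics, trace
from opentelemetry.trace.propagation import set_span_in_context

meter = metrics.get_meter(__name__)
duration_histogram = meter.create_histogram(
    "http.server.request.duration", unit="s"
)


def finish_request(span: trace.Span, token, start: float, attrs: dict):
    # Hypothetical helper: end the span and detach the context exactly
    # as before, so span timing and lifecycle stay unchanged.
    span.end()
    if token is not None:
        context.detach(token)
    # The main fix: pass the span's context explicitly when recording.
    # A span's SpanContext remains valid after end(), so the exemplar
    # reservoir can still attach the request's trace and span IDs.
    duration_histogram.record(
        default_timer() - start,
        attrs,
        context=set_span_in_context(span),
    )
```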
EDIT: Also please pull! I resolved some Changelog merge conflicts before triggering the last CI/CD run.
Thanks! Will do. I thought I'd leave the changelog conflict for last in case this takes a while still, but let's try not to 😄
Thank you for the pointers too, I'll see today about doing just that now that the test failures are clear 👍 Or if not, proposing something different.
Alright, sorry if I'm being a dummy, but I want to make sure I understood what your suggestion meant: my initial attempt to extend the span was incorrect. It was not only unhelpful but also inappropriate, because we can reconstruct the context after the fact and pass it explicitly into the metrics instead. I've now reverted it!
I confirmed that this fixed the WSGI tests, but for some reason I wasn't able to reproduce the ASGI problem on 3.12 or PyPy locally. Curious if you have any ideas or experience with how flaky these tests are. But let's try to run once more.
Thanks again for all the help!
Not being a dummy! 😃 Yes, your last change is what I meant 👍 👍
New challenge -- thank you for your patience. The maintainers just released version 1.39.0/0.60b0 of OTel Python, so more tests are failing now, but only because of the dependency conflicts. I won't merge main into your branch myself this time to avoid complicating things, so could you please do that and try again? It looks like that's why so many failed when I triggered CI/CD just now for yesterday's 95abee7. Re: flaky tests at other times, the odd one that fails citing a missing SHA etc. just happens sometimes, and it's something we're aware of.
One more request: in addition to `tox -e ruff`, please also run `tox -e lint-instrumentation-asgi` and `tox -e lint-instrumentation-wsgi`. They're all supposed to help with lint/format, but it's not a perfect 1:1.
Done now 👍
Gotta say, I tried to run things locally, but the specific lint runs still ended with the same issue. The test tox envs ran fine at first, but after clearing the `.tox` dir the error popped up.
I've now rebased the branch on the latest main, cleaned up the commit history to make sure there are no weird changes at any point, and force-pushed the branch. This had no effect locally, but maybe I'm missing something :/ Could you approve the run once again?
Okay, sorry, I think it'll be a while... combining all of "hard for me to reproduce", "very specific", "seemingly unrelated", and "deeply technical project I'm new to" is a bit much 😅 I'll try to set things up on a different OS next and see if it makes a difference.
No, good point, it's also hard when you can't trigger the CI/CD yourself. I'll take a look at this and get back to you!
Ok, so I triggered a re-run of the ASGI tests 3 times, and every time it was the same failure in the existing test_user_agent_synthetic_test_detection for instrumentation-asgi 3.12 Ubuntu. I think there are two real, pre-existing issues with the tests that only became apparent after adding more coverage: (1) synthetic_test_detection was getting away with not properly resetting test state, and (2) PyPy needs more time tolerance for the basic_metric_success assertions.
I posted fixes in 242777b and c106c3c on my own fork. I don't seem to have permissions to make a PR from my fork into yours. Could you try cherry-picking or porting the changes into your branch? Or let me know if another way is better!
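If it helps to see the shape of fix (1), it amounts to roughly the following, assuming the repo's usual `TestBase` with in-memory exporters (the class name, user agents, and test body here are illustrative):

```python
from opentelemetry.test.test_base import TestBase


class TestSyntheticDetection(TestBase):
    def test_user_agent_synthetic_test_detection(self):
        for user_agent in ("test-agent/1.0", "synthetic-tester/2.0"):
            with self.subTest(user_agent=user_agent):
                # Re-run setUp for every subtest: the base setUp clears
                # the in-memory span/metric exporters, so data recorded
                # in one subtest cannot leak into the next one's asserts.
                self.setUp()
                ...  # drive a request with this User-Agent and assert
```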
Thank you! Okay nice, let me also trace through the changes:
- 242777b: setUp is needed for subtests, and that's where the base clears the memory exporter too. This I get, but two questions: is there merit in calling that in the `subTest` setup? Or can we not assume everyone setting up subtests wants this? Secondly, there's a third test that does similar things, `test_user_agent_synthetic_bot_detection`. Should that be changed too?
- c106c3c: durations are counted as milliseconds when passed into those asserts, and because we introduced some processing this started suddenly failing? This I'm not so sure about. I guess we use a delta in both directions not because we expect it to be larger, but just for simplicity?
That's weird, but I'll incorporate them here, many thanks again 🙏
"test_user_agent_synthetic_bot_detection". Should that be changed too?
Good point. I think everyone doing "subTest" should get this change.
> because we introduced some processing this started suddenly failing
Yes you're right! Response time is going to take a bit longer now because of the added manual context passing. The test change suggests it's going to add up to another 25 ms. To me this is small enough, but wdyt?
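In other words, fix (2) is just a wider symmetric delta on the millisecond assertion, something like this (the helper name and numbers are illustrative):

```python
import unittest


class DurationToleranceExample(unittest.TestCase):
    def assert_duration_close(self, recorded_ms: float, expected_ms: float):
        # Durations reach these asserts in milliseconds; a symmetric
        # 25 ms delta keeps things simple and leaves headroom for the
        # added context handling and for PyPy's slower runs.
        self.assertAlmostEqual(recorded_ms, expected_ms, delta=25)
```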
Force-pushed from e76548b to dac1b53
Force-pushed from dac1b53 to ab3d715
Description
I'm attempting to fix exemplar support for WSGI and ASGI applications by moving the metric recording to happen within the active span. This will eliminate the need for similar custom middleware metrics, which already emit exemplars out of the box.
Fixes #3031
Type of change
I'm not sure whether this is an omission, or something that was deliberately not developed yet. It has not been explicitly advertised, but I would certainly expect it to behave like this.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Does This PR Require a Core Repo Change?
Checklist:
See contributing.md for styleguide, changelog guidelines, and more.