
Conversation

@felix-hilden commented Sep 7, 2025

Description

I'm attempting to fix exemplar support for WSGI and ASGI applications by moving the metric recording so that it happens within the active span. This eliminates the need for the custom middleware metrics that some applications already use to emit exemplars out of the box.

Fixes #3031

Type of change

I'm not sure whether this is an omission or something that was deliberately left undeveloped. It has not been explicitly advertised, but I would certainly expect it to behave like this.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Test A

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

linux-foundation-easycla bot commented Sep 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@felix-hilden marked this pull request as draft September 7, 2025 07:50
@felix-hilden force-pushed the 3031-asgi-wsgi-exemplars branch from 26d8015 to dd03138 on September 7, 2025 07:52
@felix-hilden changed the title from "WIP: try wsgi exemplar span and token move" to "WIP: Support exemplars in WSGI and ASGI instrumentation" Sep 7, 2025
@felix-hilden changed the title from "WIP: Support exemplars in WSGI and ASGI instrumentation" to "WIP: Support exemplars in WSGI and ASGI histogram" Sep 7, 2025
@Chromatius

I recently came across this issue in the FastAPI instrumentation and tracked it down to the underlying ASGI instrumentation, though it seems to be present in WSGI and Tornado as well, as mentioned in the issue. Just thought I'd chip in here a bit.

I see two possible solutions:

Option A: defer closing the span until metrics are recorded

This is what you have started on for WSGI, and it seems like the more intuitive and clean approach. The span duration will become slightly longer with this method, but the difference should be negligible.
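A minimal sketch of what Option A amounts to, using stand-in `Span` and `Histogram` classes rather than the real OpenTelemetry types (so this runs without any dependencies): the key point is purely the ordering — the histogram records while the span is still open, so an exemplar-capable SDK can sample the active span.

```python
# Illustrative stand-ins only; the real types live in opentelemetry-api/sdk.
class Span:
    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True


class Histogram:
    def __init__(self):
        self.span_was_open = None

    def record(self, value, span):
        # An SDK with exemplars enabled would sample the active span here,
        # so recording must happen before span.end() is called.
        self.span_was_open = not span.ended


span, histogram = Span(), Histogram()
try:
    pass  # ... handle the request ...
finally:
    histogram.record(0.15, span)  # record first, while the span is open
    span.end()                    # only then end the span

print(histogram.span_was_open)  # prints True
```

Swapping the two lines in the `finally` block reproduces the bug: the span would already be ended when the metric records, and no exemplar could be attached.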

Option B: re-create span context after closing the span and pass it to metrics

```python
finally:
    ...
    # Create span context for exemplars
    span_ctx = set_span_in_context(span)

    if self.duration_histogram_old:
        duration_attrs_old = _parse_duration_attrs(
            req_attrs, _StabilityMode.DEFAULT, context=span_ctx,
        )
    ...
```

This is a more minimal and local patch but less clean.

I'd be happy to chip in if you need any assistance with the PR.

@felix-hilden
Author

Thanks for the help! I've now properly linked the issue in case there's more discussion there. Like I said in the ticket, I'm not sure the span context is the issue (unless I have misunderstood something), but let's continue discussing there 🙏 I'd love another set of eyes.

@Chromatius

I made a PR felix-hilden#1 with my proposed change into your branch including some tests as well. The diff is a bit large since it's based on a more recent version of the upstream repo.

@felix-hilden force-pushed the 3031-asgi-wsgi-exemplars branch from dd03138 to dba7521 on September 13, 2025 06:17
@tammy-baylis-swi
Contributor

Hi @felix-hilden and @Chromatius, thanks for creating this! Are either of you still working on this to take it from WIP to ready?

I faced a similar issue with a FastAPI app, in particular the missing exemplars for http.server.request.duration. When I applied the changes in this PR to AsgiInstrumentor in my local setup, it fixed things!

otel-collector debug output for reference:

```
Metric #1
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests.
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> url.scheme: Str(http)
     -> network.protocol.version: Str(1.1)
     -> http.request.method: Str(GET)
     -> http.route: Str(/latency/)
     -> http.response.status_code: Int(200)
StartTimestamp: 2025-11-07 23:12:16.852295758 +0000 UTC
Timestamp: 2025-11-07 23:12:26.860803262 +0000 UTC
Count: 5
Sum: 1.119976
Min: 0.152042
Max: 0.302494
Bucket [-1.010889, -1.000000), Count: 0
Bucket (0.151906, 0.153560], Count: 1
Bucket (0.153560, 0.155232], Count: 0
Bucket (0.155232, 0.156923], Count: 0
...
Exemplars:
Exemplar #0
     -> Trace ID: 87672a88b4077d5d8a93e7575a66f456
     -> Span ID: 2570bb41a5115383
     -> Timestamp: 2025-11-07 23:12:16.909593758 +0000 UTC
     -> Value: 0.270580
Exemplar #1
     -> Trace ID: c7410a16f8bfeafa98555181db29f29b
     -> Span ID: 5f597c83aaae16c5
     -> Timestamp: 2025-11-07 23:12:19.284033092 +0000 UTC
     -> Value: 0.302494
Exemplar #2
     -> Trace ID: 0904ec56214e0bf09df9e127aaa8e9fe
     -> Span ID: da49599715c86831
     -> Timestamp: 2025-11-07 23:12:21.503676343 +0000 UTC
     -> Value: 0.152042
```

@felix-hilden marked this pull request as ready for review November 22, 2025 09:14
@felix-hilden changed the title from "WIP: Support exemplars in WSGI and ASGI histogram" to "Support exemplars in WSGI and ASGI histogram" Nov 22, 2025
@felix-hilden requested a review from a team as a code owner November 22, 2025 09:32
@tammy-baylis-swi
Contributor

Awesome! Could you please check the failing unit tests in the CI/CD? Please also run `tox -e ruff` locally and commit the lint/format fixes.

Comment on lines 747 to 748:

```python
def _iterate_and_close_with_span(
    iterable: Iterable[T], span: trace.Span
```
@tammy-baylis-swi (Contributor) commented Dec 1, 2025

There are some more test failures that seem to be related to missing span attributes. It might be that the span is ending too early.

What about this?:

  1. Change `_iterate_and_close_with_span` back to `_end_span_after_iterating`, with the `span.end` and `context.detach` calls.
  2. Keep the `span.end` and `context.detach` in the except block above. Essentially, keep the span ending as before.
  3. But keep everything to do with `set_span_in_context`, because that is the main fix, similar to #3912 (Fix FlaskInstrumentor exemplars generation for http.server.(request.)duration).

EDIT: Also please pull! I resolved some Changelog merge conflicts before triggering the last CI/CD run.

Author

Thanks! Will do. I thought I'd leave the changelog conflict for last in case this takes a while, but let's try to avoid that 😄

Thank you for the pointers too. I'll try to do exactly that today now that the test failures are clear 👍 Or, if that doesn't work out, I'll propose something different.

Author

Alright, sorry if I'm being a dummy, but I want to make sure I understood your suggestion correctly: my initial attempt to extend the span's lifetime was incorrect. It was not only unhelpful but also unnecessary, because we can reconstruct the context after the fact and pass it explicitly into the metrics. I've now reverted it!
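If I've understood it right, the reverted approach boils down to something like this sketch. These are stand-in classes so the snippet runs on its own; in the real code the helper is `opentelemetry.trace.set_span_in_context`, and the SDK's `Histogram.record` accepts a `context` argument.

```python
# Stand-ins for illustration; names mirror the real API but are not it.
class Span:
    def __init__(self, span_id):
        self.span_id = span_id
        self.ended = False

    def end(self):
        self.ended = True


def set_span_in_context(span):
    # The real helper returns a Context object carrying the span;
    # a plain dict stands in for it here.
    return {"current-span": span}


class Histogram:
    def record(self, value, context=None):
        # Even though the span has already ended, the supplied context
        # still identifies it, so an exemplar can reference this span.
        span = context["current-span"]
        return (value, span.span_id)


span = Span("span-1")
span.end()                       # the span ends exactly as before
ctx = set_span_in_context(span)  # reconstruct the context afterwards
print(Histogram().record(0.27, context=ctx))  # prints (0.27, 'span-1')
```

The span lifecycle is untouched; only the metric call gains an explicit context instead of relying on the (already-detached) ambient one.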

I confirmed that this fixes the WSGI tests, but for some reason I wasn't able to reproduce the ASGI problem on 3.12 or PyPy locally. I'm curious whether you have any ideas, or experience with how flaky these tests are. But let's try running once more.

Thanks again for all the help!

Contributor

Not being a dummy! 😃 Yes, your last change is what I meant 👍 👍

New challenge, thank you for your patience. The maintainers just released version 1.39.0/0.60b0 of OTel Python, so more tests are failing now, but only because of dependency conflicts. I won't merge main into your branch myself this time to avoid complicating things, so could you please do that and try again? It looks like that's why so many failed when I triggered CI/CD just now for yesterday's 95abee7. Re. flaky tests at other times: the odd one that fails citing a missing SHA etc. just happens sometimes, and it's something we're aware of.

One more request: in addition to `tox -e ruff`, please also run `tox -e lint-instrumentation-asgi` and `tox -e lint-instrumentation-wsgi`. They're all supposed to help with lint/format, but it's not a perfect 1:1.

Author

Done now 👍

Gotta say, I tried running things locally, but the specific lint runs still ended with the same issue. The test tox envs ran fine at first, but after I cleared the `.tox` dir, the error popped up there too.

I've now rebased the branch on the latest main, cleaned up the commit history to make sure there are no weird changes at any point, and force pushed the branch. This had no effect locally, but maybe I'm missing something :/ Could you approve the run once again?

Author

Okay, sorry, I think it'll be a while... combining all of "hard for me to reproduce", "very specific", "seemingly unrelated", and "deeply technical project I'm new to" is a bit much 😅 I'll try setting things up on a different OS next and see if it makes a difference.

Contributor

No, good point: it's also hard when you can't trigger the CI/CD yourself. I'll take a look at this and get back to you!

Contributor

Ok, so I triggered a re-run of the ASGI tests 3 times, and every time it was the same failure in the existing `test_user_agent_synthetic_test_detection` for instrumentation-asgi on 3.12 Ubuntu. I think there are two real, pre-existing issues with the tests that only became apparent after adding more coverage: (1) `synthetic_test_detection` was getting away with not properly resetting test state, and (2) PyPy needs more time tolerance for the `basic_metric_success` assertions.

I posted fixes in 242777b and c106c3c on my own fork. I don't seem to have permissions to make a PR from my fork into yours. Could you try cherry-picking or porting the changes into your branch? Or let me know if another way is better!

Author

Thank you! Okay, nice. Let me also trace through the changes:

  • 2427: setup is needed for subtests, and that's where the base class clears the memory exporter too. This I get, but two questions: is there merit in calling that in the subTest setup itself? Or can we not assume that everyone setting up subtests wants this? Secondly, there's a third test that does similar things, called test_user_agent_synthetic_bot_detection. Should that be changed too?
  • c106: durations are counted as milliseconds when passed into those asserts, and because we introduced some processing, this suddenly started failing? This I'm not so sure about. I guess we use a delta in both directions not because we expect it to be larger, but just for simplicity?

That's weird, but I'll incorporate them here. Many thanks again 🙏
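For my own understanding, the subTest reset in 2427 amounts to something like this (illustrative names, not the actual test code): without re-running setup inside each subTest iteration, state from one iteration leaks into the next.

```python
import unittest


class BaseTest(unittest.TestCase):
    def setUp(self):
        # Stand-in for the base class clearing the in-memory span exporter.
        self.finished_spans = []


class UserAgentTest(BaseTest):
    def test_variants(self):
        for agent in ("synthetic", "bot"):
            with self.subTest(agent=agent):
                self.setUp()  # reset shared state for each subtest
                self.finished_spans.append(agent)
                # Without the setUp() call above, the second iteration
                # would see two spans and this assertion would fail.
                self.assertEqual(len(self.finished_spans), 1)


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(UserAgentTest).run(result)
print(result.wasSuccessful())  # prints True
```

unittest only calls `setUp` once per test method, not per subTest, which is why the explicit call inside the loop is needed.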

Contributor

"test_user_agent_synthetic_bot_detection". Should that be changed too?

Good point. I think everyone doing "subTest" should get this change.

because we introduced some processing this started suddenly failing

Yes you're right! Response time is going to take a bit longer now because of the added manual context passing. The test change suggests it's going to add up to another 25 ms. To me this is small enough, but wdyt?
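For what it's worth, the tolerance being discussed can be sketched as a symmetric delta check (a hypothetical helper, with milliseconds assumed as in the existing asserts):

```python
# Hypothetical helper; the real tests use assertAlmostEqual-style deltas.
def assert_duration_close(recorded_ms, expected_ms, delta_ms=25):
    # Symmetric tolerance: accepts both slightly shorter and slightly
    # longer durations, which keeps the assertion simple even though
    # added processing only ever makes the duration longer.
    assert abs(recorded_ms - expected_ms) <= delta_ms, (recorded_ms, expected_ms)


assert_duration_close(120, 100)  # within 25 ms: passes
try:
    assert_duration_close(150, 100)  # 50 ms over: fails
except AssertionError:
    print("outside tolerance")  # prints "outside tolerance"
```

The delta only needs to be wide enough for the slowest environment (PyPy here), at the cost of also tolerating durations that come in shorter than expected.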

@felix-hilden force-pushed the 3031-asgi-wsgi-exemplars branch from e76548b to dac1b53 on December 4, 2025 17:20
@felix-hilden force-pushed the 3031-asgi-wsgi-exemplars branch from dac1b53 to ab3d715 on December 4, 2025 17:24


Development

Successfully merging this pull request may close these issues.

Fix exemplar support

3 participants