Add a process span to ConfluentKafka instrumentation #1937

g7ed6e · 2024-07-02T15:28:39Z

Resolves #1932

@CodeBlanch: I adapted the API you suggested a bit.
@lmolkova: Can you confirm that it's OK to emit both a receive and a process span for a single message on the consumer side?

codecov · 2024-07-02T15:32:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.26%. Comparing base (71655ce) to head (6867f71).
Report is 398 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1937      +/-   ##
==========================================
- Coverage   73.91%   71.26%   -2.66%     
==========================================
  Files         267      311      +44     
  Lines        9615    11347    +1732     
==========================================
+ Hits         7107     8086     +979     
- Misses       2508     3261     +753

Flag	Coverage Δ
unittests-Exporter.Geneva	`53.20% <ø> (?)`
unittests-Exporter.InfluxDB	`95.88% <ø> (?)`
unittests-Exporter.Instana	`71.24% <ø> (?)`
unittests-Exporter.OneCollector	`93.07% <ø> (?)`
unittests-Exporter.Stackdriver	`75.73% <ø> (?)`
unittests-Extensions	`88.57% <ø> (?)`
unittests-Extensions.AWS	`77.24% <ø> (?)`
unittests-Extensions.Enrichment	`100.00% <ø> (?)`
unittests-Instrumentation.AWS	`75.77% <ø> (?)`
unittests-Instrumentation.AWSLambda	`87.96% <ø> (?)`
unittests-Instrumentation.AspNet	`77.00% <ø> (?)`
unittests-Instrumentation.AspNetCore	`85.27% <ø> (?)`
unittests-Instrumentation.ElasticsearchClient	`79.87% <ø> (?)`
unittests-Instrumentation.EntityFrameworkCore	`55.49% <ø> (?)`
unittests-Instrumentation.EventCounters	`76.36% <ø> (?)`
unittests-Instrumentation.GrpcNetClient	`79.61% <ø> (?)`
unittests-Instrumentation.Hangfire	`93.58% <ø> (?)`
unittests-Instrumentation.Http	`82.05% <ø> (?)`
unittests-Instrumentation.Owin	`85.79% <ø> (?)`
unittests-Instrumentation.Process	`100.00% <ø> (?)`
unittests-Instrumentation.Quartz	`78.94% <ø> (?)`
unittests-Instrumentation.Runtime	`100.00% <ø> (?)`
unittests-Instrumentation.SqlClient	`91.89% <ø> (?)`
unittests-Instrumentation.StackExchangeRedis	`67.02% <ø> (?)`
unittests-Instrumentation.Wcf	`48.91% <ø> (?)`
unittests-PersistentStorage	`65.44% <ø> (?)`
unittests-Resources.AWS	`75.88% <ø> (?)`
unittests-Resources.Azure	`82.83% <ø> (?)`
unittests-Resources.Container	`72.41% <ø> (?)`
unittests-Resources.Gcp	`72.54% <ø> (?)`
unittests-Resources.Host	`73.94% <ø> (?)`
unittests-Resources.OperatingSystem	`71.87% <ø> (?)`
unittests-Resources.Process	`100.00% <ø> (?)`
unittests-Resources.ProcessRuntime	`94.11% <ø> (?)`
unittests-Sampler.AWS	`88.09% <ø> (?)`

Flags with carried forward coverage won't be shown.

see 337 files with indirect coverage changes

github-actions · 2024-07-11T03:17:06Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2024-07-19T03:16:59Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

CodeBlanch · 2024-07-24T16:43:00Z

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

+        {
+            await handler(consumeResult, cancellationToken).ConfigureAwait(false);
+        }
+        finally


If the handler throws, would we want to flag the Activity with an error status? Also, if the user wants to set the status of the span, would we expect them to do that through Activity.Current? Would it be cleaner if we passed processActivity in to the callback?

Is there any kind of duration metric defined for processing a message in the semantic conventions? Just wondering if we should tick a histogram here or anything as well.

We do need to set the status to error and also report the error.type attribute.

And yep, the messaging semconv defines metrics. We've revised them recently and it's best to add metrics later (once the next semconv version is released or after we freeze the conventions, which should happen relatively soon).

Updated with your suggestions.

When the handler throws, the activity status is set to error and the error.type attribute is set to the exception type (I tried to limit the cardinality here).

The handler delegate type now also receives the activity we may have created.
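For illustration, a minimal sketch of the behaviour described in this reply, not the code from this PR. The names `ProcessSpanSketch`, `RunHandlerAsync`, and the source name `Sketch.ConfluentKafka` are hypothetical; only the `System.Diagnostics` calls are real APIs. It assumes the handler receives the (possibly null) activity and that `error.type` is set to the fully qualified exception type name.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessSpanSketch
{
    // Hypothetical source name, used only for this sketch.
    public static readonly ActivitySource Source = new ActivitySource("Sketch.ConfluentKafka");

    public static async Task RunHandlerAsync<T>(
        T message,
        Func<T, Activity?, CancellationToken, ValueTask> handler,
        CancellationToken cancellationToken = default)
    {
        // StartActivity returns null when no listener is registered or the span is sampled out.
        using Activity? activity = Source.StartActivity("process", ActivityKind.Consumer);
        try
        {
            // The handler gets the activity so it can enrich it or set its own status.
            await handler(message, activity, cancellationToken).ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // Flag the span as failed and record a low-cardinality error.type value.
            activity?.SetStatus(ActivityStatusCode.Error);
            activity?.SetTag("error.type", ex.GetType().FullName);
            throw;
        }
    }
}
```

Passing the activity to the handler avoids relying on `Activity.Current`, and rethrowing keeps the caller's error handling unchanged.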

Regarding metrics: we already implement the metrics from 1.24.0 (https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-metrics.md), and those metrics will be reported.

In 1.24.0 we have [messaging.deliver.duration](https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-metrics.md#metric-messagingdeliverduration) that should cover the same duration as process span.

Still, it'll be renamed and I'd consider waiting for stable messaging semconv to implement it.

Then I will implement this new metric in a PR dedicated to the 1.24.0 - 1.26.0 migration.

CodeBlanch · 2024-07-24T16:43:34Z

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

+        if (TryExtractPropagationContext(consumeResult, out var propagationContext))
+        {
+            TKey key = consumeResult.Message.Key;
+            object? keyAsObject = key;


Just curious why this object? conversion is being performed?

This cast is no longer needed, as I added a generic parameter to the private StartProcessActivity method, to which the key is passed.

CodeBlanch · 2024-07-24T16:44:39Z

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

+using OpenTelemetry.Context.Propagation;
+using OpenTelemetry.Trace;
+
+namespace OpenTelemetry.Instrumentation.ConfluentKafka;


Should this go in the namespace for ConsumeResult<TKey, TValue>? That would make it more discoverable. But we may not want it to be easily discoverable 😄

Good idea. I moved the OpenTelemetryConsumeResultExtensions class to the Confluent.Kafka namespace.

src/OpenTelemetry.Instrumentation.ConfluentKafka/InstrumentedProducer.cs

lmolkova · 2024-07-26T05:02:27Z

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

+            object? keyAsObject = key;
+            processActivity = StartProcessActivity(
+                propagationContext,
+                start,


What's the reason to pass the start time? We don't need to pretend that processing starts before consumption and can use the real start timestamp.

Good point! Updated.

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

lmolkova · 2024-07-26T05:06:45Z

src/OpenTelemetry.Instrumentation.ConfluentKafka/OpenTelemetryConsumeResultExtensions.cs

+            ? new[] { new ActivityLink(propagationContext.ActivityContext) }
+            : Array.Empty<ActivityLink>();
+
+        Activity? activity = ConfluentKafkaCommon.ActivitySource.StartActivity(spanName, kind: ActivityKind.Consumer, links: activityLinks, startTime: start, parentContext: default);


You may (but don't have to) also set the parent to the same context as the message's (if it's a single message).
That's something most users seem to want, but some want the opposite.
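A minimal sketch of the two options mentioned here, assuming a hypothetical `parentToMessage` switch; `ProcessParentingSketch` and its method are illustrative names, not part of this PR. The message's context is always linked, and it is additionally used as the parent only when requested. Note that passing `default` as the parent context lets `StartActivity` fall back to `Activity.Current`, if there is one.

```csharp
using System;
using System.Diagnostics;

public static class ProcessParentingSketch
{
    public static Activity? StartProcessActivity(
        ActivitySource source,
        ActivityContext messageContext,
        bool parentToMessage)
    {
        bool hasContext = messageContext != default;

        // Always link to the producer's context when one was extracted from the message.
        ActivityLink[] links = hasContext
            ? new[] { new ActivityLink(messageContext) }
            : Array.Empty<ActivityLink>();

        // Optionally also continue the producer's trace by using its context as the parent.
        ActivityContext parent = parentToMessage && hasContext ? messageContext : default;

        return source.StartActivity("process", ActivityKind.Consumer, parent, tags: null, links: links);
    }
}
```

With the parent set, the process span joins the producer's trace; with links only, it starts (or continues) the consumer's own trace and merely references the producer.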

lmolkova · 2024-07-26T05:24:54Z

can you confirm that's OK to output both a receive and a process span for a single message at consumer side ?

In the scope of this PR it looks right. If I understand correctly, the ConsumeAndProcessMessageAsync method receives the message with consumer.Consume(...) (this is where the receive operation comes from) and then executes the user-provided handler, which is traced with a process span.

That's the same case as described here: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/kafka.md (but with a later version of semantic conventions, so there are some small changes to naming).

For an API like this:

await using ServiceBusProcessor processor = client.CreateProcessor(queueName, options);

processor.ProcessMessageAsync += MessageHandler;
processor.ProcessErrorAsync += ErrorHandler;

async Task MessageHandler(ProcessMessageEventArgs args)
{
    string body = args.Message.Body.ToString();
    await args.CompleteMessageAsync(args.Message);
}

where there is no user-facing API provided by the original SDK to receive the message, messages are pushed to the user application and the receive operation should not be reported.

For APIs that effectively pull messages and then process them, especially if such APIs are not part of the original client library, it's best to report both receive and process.

src/OpenTelemetry.Instrumentation.ConfluentKafka/ConsumeAndProcessMessageHandler.cs

CodeBlanch

LGTM with a CHANGELOG entry

g7ed6e · 2024-08-20T07:20:55Z

LGTM with a CHANGELOG entry

@CodeBlanch done

g7ed6e requested a review from a team July 2, 2024 15:28

g7ed6e had a problem deploying to external July 2, 2024 15:28 — with GitHub Actions Failure

g7ed6e changed the title ~~Feature/confluent kafka part 2~~ Add a process span to ConfluentKafka instrumentation Jul 2, 2024

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from b587ce5 to 7ede802 Compare July 2, 2024 15:40

g7ed6e had a problem deploying to external July 2, 2024 15:40 — with GitHub Actions Failure

github-actions bot assigned g7ed6e Jul 2, 2024

github-actions bot added the infra (Infra work - CI/CD, code coverage, linters) and comp:instrumentation.confluentkafka (Things related to OpenTelemetry.Instrumentation.ConfluentKafka) labels Jul 2, 2024

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from 7ede802 to a312578 Compare July 3, 2024 07:16

g7ed6e had a problem deploying to external July 3, 2024 07:16 — with GitHub Actions Failure

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from a312578 to c479be6 Compare July 3, 2024 09:21

g7ed6e had a problem deploying to external July 3, 2024 09:21 — with GitHub Actions Failure

github-actions bot added the Stale label Jul 11, 2024

Kielek removed the Stale label Jul 11, 2024

github-actions bot added Stale and removed Stale labels Jul 19, 2024

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from c479be6 to 6d6e73e Compare July 24, 2024 11:40

g7ed6e had a problem deploying to external July 24, 2024 11:40 — with GitHub Actions Failure

CodeBlanch reviewed Jul 24, 2024

lmolkova reviewed Jul 26, 2024

g7ed6e had a problem deploying to external July 26, 2024 07:27 — with GitHub Actions Failure

g7ed6e had a problem deploying to external July 26, 2024 08:01 — with GitHub Actions Failure

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from ccdaf16 to d0a7ab9 Compare July 29, 2024 07:18

g7ed6e had a problem deploying to external July 29, 2024 07:18 — with GitHub Actions Failure

github-actions bot added Stale and removed Stale labels Aug 8, 2024

vishweshbankwar had a problem deploying to external August 15, 2024 23:51 — with GitHub Actions Failure

CodeBlanch reviewed Aug 19, 2024

src/OpenTelemetry.Instrumentation.ConfluentKafka/ConsumeAndProcessMessageHandler.cs Outdated

CodeBlanch approved these changes Aug 19, 2024

vishweshbankwar had a problem deploying to external August 19, 2024 20:45 — with GitHub Actions Failure

g7ed6e had a problem deploying to external August 20, 2024 07:20 — with GitHub Actions Failure

g7ed6e and others added 11 commits August 22, 2024 18:16

Update Confluent.Kafka to 2.4.0

adec204

Code cleanup

df7b660

Add a ConsumeAndProcessMessageAsync extension method to produce process spans

6313659

Apply pr suggestions

6fc9c09

Apply pr suggestions

bd54b9b

Fix nullable warn in UT

8f08823

Update public api

becf82b

Value error.type with fully qualified exception type.

a38fa83

Co-authored-by: Liudmila Molkova <[email protected]>

Move ConsumeAndProcessMessageHandler to Confluent.Kafka namespace

de71186

Update CHANGELOG

1594d73

Fix PublicAPI declarations

7df435c

g7ed6e force-pushed the feature/confluent-kafka-part-2 branch from a8b727a to 7df435c Compare August 22, 2024 16:17

g7ed6e had a problem deploying to external August 22, 2024 16:17 — with GitHub Actions Failure

Merge branch 'main' into feature/confluent-kafka-part-2

7ade470

Kielek had a problem deploying to external August 27, 2024 06:55 — with GitHub Actions Failure

Kielek reviewed Aug 27, 2024

src/OpenTelemetry.Instrumentation.ConfluentKafka/CHANGELOG.md Outdated

Update src/OpenTelemetry.Instrumentation.ConfluentKafka/CHANGELOG.md

619bc6f

Co-authored-by: Piotr Kiełkowicz <[email protected]>

g7ed6e had a problem deploying to external September 3, 2024 09:23 — with GitHub Actions Failure

Merge branch 'main' into feature/confluent-kafka-part-2

6867f71

Kielek had a problem deploying to external September 3, 2024 09:28 — with GitHub Actions Failure

Kielek merged commit 544eb98 into open-telemetry:main Sep 3, 2024
1 check failed

g7ed6e mentioned this pull request Nov 30, 2024

[feature request] Leave Kafka activities active after Consume is called #2351

Closed


5 participants
