Skip to content

Conversation

@sigram
Copy link
Contributor

@sigram sigram commented Jan 19, 2026

This PR replaces Dropwizard JSON metrics with Prometheus metrics in the CrossDC Consumer, using directly the Prometheus client_java API. It also removes the Dropwizard dependency.

Copy link
Contributor

@mlbiscoc mlbiscoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post a sample of all these metrics? Either dump it here or in a txt file? It would be easier to review the names and labels on the metrics.

ow2-asm-tree = { module = "org.ow2.asm:asm-tree", version.ref = "ow2-asm" }
# @keep transitive dependency for version alignment
perfmark-api = { module = "io.perfmark:perfmark-api", version.ref = "perfmark" }
prometheus-metrics-core = { module = "io.prometheus:prometheus-metrics-core", version.ref = "prometheus-metrics" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to do this with OTEL sdk/api which also offers prometheus right out the box. Any reason you just went with re-introducing this prometheus dependency? This is way better than the dropwizard -> prometheus bridge so not going to say this is a blocker. Just having alignment would be nice.


client.commit(COLLECTION);

System.out.println("Sent producer record");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use log.info instead or just don't print this. Doesn't seem like we need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a copy-paste from other test in this suite, I'll clean up all of these.


outputBatchSizeHistogram =
Histogram.builder()
.name("consumer_output_batch_size_histogram")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop the histogram suffix. # TYPE will have the type that it is a histogram and it gets appended with bucket which is the only suffix this will need. Same with ones below.


outputTimeHistogram =
Histogram.builder()
.name("consumer_output_time_histogram")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these times? Milliseconds? Instead of histogram suffix, append the unit of time. Also I would confirm if the buckets of these histograms are useful for whatever the time unit is.

Counter.builder()
.name("consumer_input_total")
.help("Total number of input messages")
.labelNames("type", "subtype")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I question most of these metrics really need type label. What is the cardinality of it and possible different combinations? I see in the test UPDATE is one. Is there also QUERY or something along those lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's ADMIN and CONFIGSET.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm ok. I am not a fan of naming this label being called type. I think it should have some kind of context what it means as type and subtype can be very generic. Is it an operation or message_type maybe? Then what can subtype be? In core, I made it category but it is debateable if we should just remove that label/attribute all together from metrics. If you move type to something more specific then maybe you can just move off subtype to type. Again seeing an sample text output of these metrics would help if you can.

@@ -408,19 +408,19 @@ boolean pollAndProcessRequests() {
List<SolrInputDocument> docs = update.getDocuments();
if (docs != null) {
updateReqBatch.add(docs);
metrics.counter(MetricRegistry.name(type.name(), "add")).inc(docs.size());
metrics.incrementInputCounter(type.name(), "add");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the metric below are only incrementing by 1 instead of doc.size now. Is that right?

@@ -45,6 +45,11 @@ public boolean reject(Thread t) {
return true;
}

// Prometheus Scheduler doesn't provide any method to shut down its worker threads
if (t.isDaemon()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this filter every daemon thread? Does the thread have something like prometheus in its name?

Comment on lines +149 to +151
public void recordOutputFirstAttemptSize(MirroredSolrRequest.Type type, long firstAttemptTimeNs) {
outputFirstAttemptHistogram.labelValues(type.name()).observe(firstAttemptTimeNs);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "size" but it recording a time? But also this is just recording a first attempt time based on the name. Does that mean this records literally one time for this metric and never observed again? If so, I debate if these metrics are worth having. Metrics are for aggregation across a time series and if this records just once, then I think just logging has more value.

}
List<String> deletes = update.getDeleteById();
if (deletes != null) {
updateReqBatch.deleteById(deletes);
metrics.counter(MetricRegistry.name(type.name(), "dbi")).inc(deletes.size());
metrics.incrementInputCounter(type.name(), "dbi");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to actually have this say delete_by_id instead of dbi. Same with below. Feel free to disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants