-
Notifications
You must be signed in to change notification settings - Fork 805
SOLR-18060: Add Prometheus metrics to CrossDC Consumer. #4063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
mlbiscoc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you post a sample of all these metrics? Either dump it here or in a txt file? It would be easier to review the names and labels on the metrics.
| ow2-asm-tree = { module = "org.ow2.asm:asm-tree", version.ref = "ow2-asm" } | ||
| # @keep transitive dependency for version alignment | ||
| perfmark-api = { module = "io.perfmark:perfmark-api", version.ref = "perfmark" } | ||
| prometheus-metrics-core = { module = "io.prometheus:prometheus-metrics-core", version.ref = "prometheus-metrics" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to do this with OTEL sdk/api which also offers prometheus right out the box. Any reason you just went with re-introducing this prometheus dependency? This is way better than the dropwizard -> prometheus bridge so not going to say this is a blocker. Just having alignment would be nice.
|
|
||
| client.commit(COLLECTION); | ||
|
|
||
| System.out.println("Sent producer record"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use log.info instead or just don't print this. Doesn't seem like we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a copy-paste from other test in this suite, I'll clean up all of these.
|
|
||
| outputBatchSizeHistogram = | ||
| Histogram.builder() | ||
| .name("consumer_output_batch_size_histogram") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can drop the histogram suffix. # TYPE will have the type that it is a histogram and it gets appended with bucket which is the only suffix this will need. Same with ones below.
|
|
||
| outputTimeHistogram = | ||
| Histogram.builder() | ||
| .name("consumer_output_time_histogram") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these times? Milliseconds? Instead of histogram suffix, append the unit of time. Also I would confirm if the buckets of these histograms are useful for whatever the time unit is.
| Counter.builder() | ||
| .name("consumer_input_total") | ||
| .help("Total number of input messages") | ||
| .labelNames("type", "subtype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I question most of these metrics really need type label. What is the cardinality of it and possible different combinations? I see in the test UPDATE is one. Is there also QUERY or something along those lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's ADMIN and CONFIGSET.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm ok. I am not a fan of naming this label being called type. I think it should have some kind of context what it means as type and subtype can be very generic. Is it an operation or message_type maybe? Then what can subtype be? In core, I made it category but it is debateable if we should just remove that label/attribute all together from metrics. If you move type to something more specific then maybe you can just move off subtype to type. Again seeing an sample text output of these metrics would help if you can.
| @@ -408,19 +408,19 @@ boolean pollAndProcessRequests() { | |||
| List<SolrInputDocument> docs = update.getDocuments(); | |||
| if (docs != null) { | |||
| updateReqBatch.add(docs); | |||
| metrics.counter(MetricRegistry.name(type.name(), "add")).inc(docs.size()); | |||
| metrics.incrementInputCounter(type.name(), "add"); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the metric below are only incrementing by 1 instead of doc.size now. Is that right?
| @@ -45,6 +45,11 @@ public boolean reject(Thread t) { | |||
| return true; | |||
| } | |||
|
|
|||
| // Prometheus Scheduler doesn't provide any method to shut down its worker threads | |||
| if (t.isDaemon()) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this filter every daemon thread? Does the thread have something like prometheus in its name?
| public void recordOutputFirstAttemptSize(MirroredSolrRequest.Type type, long firstAttemptTimeNs) { | ||
| outputFirstAttemptHistogram.labelValues(type.name()).observe(firstAttemptTimeNs); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says "size" but it recording a time? But also this is just recording a first attempt time based on the name. Does that mean this records literally one time for this metric and never observed again? If so, I debate if these metrics are worth having. Metrics are for aggregation across a time series and if this records just once, then I think just logging has more value.
| } | ||
| List<String> deletes = update.getDeleteById(); | ||
| if (deletes != null) { | ||
| updateReqBatch.deleteById(deletes); | ||
| metrics.counter(MetricRegistry.name(type.name(), "dbi")).inc(deletes.size()); | ||
| metrics.incrementInputCounter(type.name(), "dbi"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to actually have this say delete_by_id instead of dbi. Same with below. Feel free to disagree.
This PR replaces Dropwizard JSON metrics with Prometheus metrics in the CrossDC Consumer, using directly the Prometheus client_java API. It also removes the Dropwizard dependency.