CNDB-8641: Add a metric to count all request errors #1983

cbornet · 2025-09-04T17:30:58Z

What is the issue

https://github.com/riptano/cndb/issues/8641

PR in CNDB: https://github.com/riptano/cndb/pull/15306

What does this PR fix and why was it fixed

This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics` class to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics.

Metrics added:

* org.apache.cassandra.metrics.ClientRequest.Timeouts.All
* org.apache.cassandra.metrics.ClientRequest.Unavailables.All
* org.apache.cassandra.metrics.ClientRequest.Failures.All
* org.apache.cassandra.metrics.ClientRequest.Invalid.All
* org.apache.cassandra.metrics.ClientRequest.OtherErrors.All

Note: only requests for which a tenant can be identified are counted.
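
As a rough illustration of the five categories and the tenant note above, here is a minimal sketch of per-category error counters that skip requests with no identifiable tenant. It is not the `AllRequestsMetrics` code from this PR: `ErrorKind`, `AllRequestsCountersSketch`, `recordError`, and the use of a null keyspace to mean "tenant unknown" are all assumptions made for the example, and plain `LongAdder` counters stand in for whatever metric types the real class uses.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the five error categories listed in the PR description. */
enum ErrorKind { TIMEOUT, UNAVAILABLE, FAILURE, INVALID, OTHER }

/** Illustrative sketch only; names and structure are assumptions, not the PR's code. */
final class AllRequestsCountersSketch {
    // One counter per category, created up front so lookups never miss.
    private final Map<ErrorKind, LongAdder> counters = new EnumMap<>(ErrorKind.class);

    AllRequestsCountersSketch() {
        for (ErrorKind kind : ErrorKind.values()) {
            counters.put(kind, new LongAdder());
        }
    }

    /**
     * Records one failed request in the counter for its category.
     * A null keyspace models "no tenant can be identified" (an assumption
     * of this sketch); such requests are not counted.
     *
     * @return true if the error was counted, false if it was skipped
     */
    boolean recordError(String keyspace, ErrorKind kind) {
        if (keyspace == null) {
            return false;
        }
        counters.get(kind).increment();
        return true;
    }

    /** Current value of the counter for the given category. */
    long count(ErrorKind kind) {
        return counters.get(kind).sum();
    }
}
```

In this sketch, `recordError("some_keyspace", ErrorKind.INVALID)` increments only the invalid counter, while `recordError(null, ErrorKind.INVALID)` leaves every counter unchanged.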

github-actions · 2025-09-04T17:31:16Z

JeremiahDJordan · 2025-09-04T17:45:17Z

src/java/org/apache/cassandra/cql3/QueryEvents.java

What about CQL statements with no keyspace specified? Or where the reason it's an IRE (InvalidRequestException) is that the keyspace name was wrong? Will those be captured somewhere else?

cbornet · 2025-09-05

I think they will be captured in ClientMetrics.unknownException with all the other exceptions (with or without a keyspace).
If we don't have a keyspace, I don't think we can identify a tenant?

eolivelli · 2025-09-09

We cannot identify the tenant for sure.
This feature is also for non-CNDB users.

We should add test cases to cover this case.

JeremiahDJordan · 2025-09-04T17:46:40Z

src/java/org/apache/cassandra/cql3/QueryEvents.java

Is this going to double count all the other exceptions like timeouts or other validation errors besides IRE?

We want to be able to put these metrics onto the Grafana graph with the other errors that are happening. So we need to make sure the things we collect make sense when viewed as a whole on the error chart.

cbornet · 2025-09-04

I added logic to count timeouts, unavailables and failures separately.
But note that these are counted both there and in the read/write metrics.
Basically, allRequestsMetrics.timeouts = readMetrics.timeouts + writeMetrics.timeouts
It should be possible to build a dashboard from allRequestsMetrics alone.

cbornet · 2025-09-10

Note that some errors are already counted twice in the dashboard.
For instance, a read timeout increments both coordinator_read_requests_timeouts_total and coordinator_cas_read_requests_timeouts_total.
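
The relation between the aggregate and the per-operation timeout counters described in this thread can be sketched as below. This is a toy model, not the PR's code: `TimeoutCountersSketch` and its method names are hypothetical, and it assumes that every read or write timeout increments both its own counter and the aggregate one.

```java
import java.util.concurrent.atomic.LongAdder;

/** Toy model of read, write and aggregate timeout counters; all names are hypothetical. */
final class TimeoutCountersSketch {
    private final LongAdder readTimeouts = new LongAdder();
    private final LongAdder writeTimeouts = new LongAdder();
    private final LongAdder allTimeouts = new LongAdder();

    /** A read timeout is counted in the read counter and in the aggregate. */
    void onReadTimeout() {
        readTimeouts.increment();
        allTimeouts.increment();
    }

    /** A write timeout is counted in the write counter and in the aggregate. */
    void onWriteTimeout() {
        writeTimeouts.increment();
        allTimeouts.increment();
    }

    long read()  { return readTimeouts.sum(); }
    long write() { return writeTimeouts.sum(); }
    long all()   { return allTimeouts.sum(); }

    /** Adding all three series on one chart counts every timeout twice. */
    long naiveSumOfAllSeries() {
        return read() + write() + all();
    }
}
```

Under that assumption, `all()` always equals `read() + write()`, so a chart should plot either the aggregate series or the per-operation series. Plotting both and summing them, as `naiveSumOfAllSeries()` does, yields twice the real number of timeouts.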

JeremiahDJordan

Needs tests, and we need to make sure the metrics make sense when viewed as a whole with the other exception metrics that are captured. But I think this is a good starting point.
You might make a dedicated tenant in Astra dev and deploy this so that you can try it out and make sure the metrics make it into the Grafana there.

cbornet · 2025-09-04T22:40:06Z

Yes. Tests are on the TODO. I just wanted to be sure that it's going in the right direction (so much for TDD...)

cbornet · 2025-09-08T16:45:28Z

@JeremiahDJordan I added some tests.

JeremiahDJordan

Looks good. Thanks!

sonarqubecloud · 2025-10-06T10:03:08Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
86.5% Coverage on New Code
0.0% Duplication on New Code

cassci-bot · 2025-10-06T10:06:58Z

✔️ Build ds-cassandra-pr-gate/PR-1983 approved by Butler

cbornet requested review from JeremiahDJordan and sbtourist September 4, 2025 17:31

cbornet mentioned this pull request Sep 4, 2025

CNDB-8641: Add metric to ClientRequestMetrics to count InvalidRequestException #1979

Closed

JeremiahDJordan reviewed Sep 4, 2025

JeremiahDJordan requested changes Sep 4, 2025

cbornet requested review from JeremiahDJordan and eolivelli September 8, 2025 16:45

cbornet force-pushed the all-requests-metric branch from ac6ed74 to 56ef14f Compare September 11, 2025 09:56

cbornet force-pushed the all-requests-metric branch from 56ef14f to 9bd5a3a Compare October 3, 2025 11:47

JeremiahDJordan approved these changes Oct 3, 2025

cbornet added 3 commits October 6, 2025 11:19

CNDB-8641: Add a metric to count all request errors

cd76a3a

Separate timeouts, unavailable and failures from otherErrors

56a480b

Add tests

babef57

cbornet force-pushed the all-requests-metric branch from 9bd5a3a to babef57 Compare October 6, 2025 09:19

cbornet merged commit d6c7b59 into main Oct 6, 2025
494 checks passed

cbornet deleted the all-requests-metric branch October 6, 2025 14:37
