Skip to content

Conversation

@cbornet
Copy link

@cbornet cbornet commented Sep 4, 2025

What is the issue

https://github.com/riptano/cndb/issues/8641

PR in CNDB: https://github.com/riptano/cndb/pull/15306

What does this PR fix and why was it fixed

This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the AllRequestsMetrics class, updates to the ClientRequestsMetrics class to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics.

Metrics added:

  • org.apache.cassandra.metrics.ClientRequest.Timeouts.All
  • org.apache.cassandra.metrics.ClientRequest.Unavailables.All
  • org.apache.cassandra.metrics.ClientRequest.Failures.All
  • org.apache.cassandra.metrics.ClientRequest.Invalid.All
  • org.apache.cassandra.metrics.ClientRequest.OtherErrors.All

Note: only requests for which a tenant can be identified are counted.

@github-actions
Copy link

github-actions bot commented Sep 4, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about CQL statements with no keyspace specified? Or where the reason it's an IRE is the keyspace name was wrong? Will those be captured somewhere else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they will be captured in ClientMetrics.unknownException with all the other exceptions (with keyspace or not).
If we don't have a keyspace, I don't think we can identify a tenant ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot identify the tenant for sure
this feature is also for non-CNDB users

we should add tests cases to cover this case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to double count all the other exceptions like timeouts or other validation errors besides IRE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to be able to put these metrics onto the grafana graph with the other errors that are happening. So we need to make sure things we collect make sense when viewed as a whole on the error chart.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added logic to count separately timeouts, unavailable and failures.
But note that these are counted both there and in read/write metrics.
Basically, allRequestsMetrics.timeouts = readMetrics.timeouts + writeMetrics.timeouts
It should be possible to have a dashboard from allRequestsMetrics alone.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that some errors are already counted twice in the dashboard.
For instance a read timeout increments both coordinator_read_requests_timeouts_total and coordinator_cas_read_requests_timeouts_total.

Copy link
Member

@JeremiahDJordan JeremiahDJordan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs tests, and need to make sure the metrics make sense when viewed as a whole with the other exception metrics that are captured. But I think this is a good starting point.
You might make a dedicated tenant in astra dev and deploy this so that you can try it out and make sure the metrics make it into the grafana there.

@cbornet
Copy link
Author

cbornet commented Sep 4, 2025

Yes. Tests are on the TODO. I just wanted to be sure that it's going in the right direction (so much for TDD...)

@cbornet
Copy link
Author

cbornet commented Sep 8, 2025

@JeremiahDJordan I added some tests.

Copy link
Member

@JeremiahDJordan JeremiahDJordan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@cbornet cbornet force-pushed the all-requests-metric branch from 9bd5a3a to babef57 Compare October 6, 2025 09:19
@sonarqubecloud
Copy link

sonarqubecloud bot commented Oct 6, 2025

@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1983 approved by Butler


Approved by Butler
See build details here

@cbornet cbornet merged commit d6c7b59 into main Oct 6, 2025
494 checks passed
@cbornet cbornet deleted the all-requests-metric branch October 6, 2025 14:37
michaelsembwever pushed a commit that referenced this pull request Oct 9, 2025
### What is the issue

riptano/cndb#8641

PR in CNDB: riptano/cndb#15306

### What does this PR fix and why was it fixed

This pull request introduces new metrics for tracking invalid and other
error requests for CQL statements, and integrates these metrics into the
query failure notification logic. The main changes are the addition of
the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics`
class to include these new metrics, and modifications to the query event
notification methods to record errors using the new metrics.

Metrics added:
* org.apache.cassandra.metrics.ClientRequest.Timeouts.All
* org.apache.cassandra.metrics.ClientRequest.Unavailables.All
* org.apache.cassandra.metrics.ClientRequest.Failures.All
* org.apache.cassandra.metrics.ClientRequest.Invalid.All
* org.apache.cassandra.metrics.ClientRequest.OtherErrors.All

 Note: only requests for which a tenant can be identified are counted.
michaelsembwever pushed a commit that referenced this pull request Oct 10, 2025
### What is the issue

riptano/cndb#8641

PR in CNDB: riptano/cndb#15306

### What does this PR fix and why was it fixed

This pull request introduces new metrics for tracking invalid and other
error requests for CQL statements, and integrates these metrics into the
query failure notification logic. The main changes are the addition of
the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics`
class to include these new metrics, and modifications to the query event
notification methods to record errors using the new metrics.

Metrics added:
* org.apache.cassandra.metrics.ClientRequest.Timeouts.All
* org.apache.cassandra.metrics.ClientRequest.Unavailables.All
* org.apache.cassandra.metrics.ClientRequest.Failures.All
* org.apache.cassandra.metrics.ClientRequest.Invalid.All
* org.apache.cassandra.metrics.ClientRequest.OtherErrors.All

 Note: only requests for which a tenant can be identified are counted.
michaelsembwever pushed a commit that referenced this pull request Nov 10, 2025
riptano/cndb#8641

PR in CNDB: riptano/cndb#15306

This pull request introduces new metrics for tracking invalid and other
error requests for CQL statements, and integrates these metrics into the
query failure notification logic. The main changes are the addition of
the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics`
class to include these new metrics, and modifications to the query event
notification methods to record errors using the new metrics.

Metrics added:
* org.apache.cassandra.metrics.ClientRequest.Timeouts.All
* org.apache.cassandra.metrics.ClientRequest.Unavailables.All
* org.apache.cassandra.metrics.ClientRequest.Failures.All
* org.apache.cassandra.metrics.ClientRequest.Invalid.All
* org.apache.cassandra.metrics.ClientRequest.OtherErrors.All

 Note: only requests for which a tenant can be identified are counted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants