-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-8641: Add a metric to count all request errors #1983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Checklist before you submit for review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about CQL statements with no keyspace specified? Or where the reason it's an IRE is the keyspace name was wrong? Will those be captured somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they will be captured in ClientMetrics.unknownException with all the other exceptions (with keyspace or not).
If we don't have a keyspace, I don't think we can identify a tenant ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot identify the tenant for sure
this feature is also for non-CNDB users
we should add tests cases to cover this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to double count all the other exceptions like timeouts or other validation errors besides IRE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to be able to put these metrics onto the grafana graph with the other errors that are happening. So we need to make sure things we collect make sense when viewed as a whole on the error chart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added logic to count separately timeouts, unavailable and failures.
But note that these are counted both there and in read/write metrics.
Basically, allRequestsMetrics.timeouts = readMetrics.timeouts + writeMetrics.timeouts
It should be possible to have a dashboard from allRequestsMetrics alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that some errors are already counted twice in the dashboard.
For instance a read timeout increments both coordinator_read_requests_timeouts_total and coordinator_cas_read_requests_timeouts_total.
JeremiahDJordan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs tests, and need to make sure the metrics make sense when viewed as a whole with the other exception metrics that are captured. But I think this is a good starting point.
You might make a dedicated tenant in astra dev and deploy this so that you can try it out and make sure the metrics make it into the grafana there.
|
Yes. Tests are on the TODO. I just wanted to be sure that it's going in the right direction (so much for TDD...) |
|
@JeremiahDJordan I added some tests. |
ac6ed74 to
56ef14f
Compare
56ef14f to
9bd5a3a
Compare
JeremiahDJordan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
9bd5a3a to
babef57
Compare
|
✔️ Build ds-cassandra-pr-gate/PR-1983 approved by ButlerApproved by Butler |
### What is the issue riptano/cndb#8641 PR in CNDB: riptano/cndb#15306 ### What does this PR fix and why was it fixed This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics` class to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics. Metrics added: * org.apache.cassandra.metrics.ClientRequest.Timeouts.All * org.apache.cassandra.metrics.ClientRequest.Unavailables.All * org.apache.cassandra.metrics.ClientRequest.Failures.All * org.apache.cassandra.metrics.ClientRequest.Invalid.All * org.apache.cassandra.metrics.ClientRequest.OtherErrors.All Note: only requests for which a tenant can be identified are counted.
### What is the issue riptano/cndb#8641 PR in CNDB: riptano/cndb#15306 ### What does this PR fix and why was it fixed This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics` class to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics. Metrics added: * org.apache.cassandra.metrics.ClientRequest.Timeouts.All * org.apache.cassandra.metrics.ClientRequest.Unavailables.All * org.apache.cassandra.metrics.ClientRequest.Failures.All * org.apache.cassandra.metrics.ClientRequest.Invalid.All * org.apache.cassandra.metrics.ClientRequest.OtherErrors.All Note: only requests for which a tenant can be identified are counted.
riptano/cndb#8641 PR in CNDB: riptano/cndb#15306 This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the `AllRequestsMetrics` class, updates to the `ClientRequestsMetrics` class to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics. Metrics added: * org.apache.cassandra.metrics.ClientRequest.Timeouts.All * org.apache.cassandra.metrics.ClientRequest.Unavailables.All * org.apache.cassandra.metrics.ClientRequest.Failures.All * org.apache.cassandra.metrics.ClientRequest.Invalid.All * org.apache.cassandra.metrics.ClientRequest.OtherErrors.All Note: only requests for which a tenant can be identified are counted.



What is the issue
https://github.com/riptano/cndb/issues/8641
PR in CNDB: https://github.com/riptano/cndb/pull/15306
What does this PR fix and why was it fixed
This pull request introduces new metrics for tracking invalid and other error requests for CQL statements, and integrates these metrics into the query failure notification logic. The main changes are the addition of the
AllRequestsMetricsclass, updates to theClientRequestsMetricsclass to include these new metrics, and modifications to the query event notification methods to record errors using the new metrics.Metrics added:
Note: only requests for which a tenant can be identified are counted.