-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-15280: Fix AbstractReadQuery.toCQLString to produce correct CQL when possible #1985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Checklist before you submit for review
|
ef1ea2c
to
3073576
Compare
3073576
to
8441449
Compare
a84996a
to
f9d1f3d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
The PR "CNDB-15280: Remove user data from AbstractReadQuery.toCQLString" modifies the CQL string generation methods to redact sensitive user data from query logs. The main purpose is to prevent sensitive column values from appearing in slow query logs and other monitoring outputs by replacing them with "?" placeholders.
Key changes:
- Modified
AbstractReadQuery.toCQLString
to accept aredact
parameter for controlling value redaction - Added redaction capabilities throughout the CQL string generation chain
- Updated SAI query plan logging to use redacted strings
Reviewed Changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
test/unit/org/apache/cassandra/index/sai/plan/PlanTest.java |
Updated test mocks to support new redaction parameter and added tests for both redacted and non-redacted plan output |
test/unit/org/apache/cassandra/db/filter/DataLimitsTest.java |
Fixed test assertions to use "LIMIT" instead of "ROWS LIMIT" for CQL compatibility |
test/unit/org/apache/cassandra/db/filter/ColumnFilterTest.java |
Updated all test assertions to test both redacted and non-redacted CQL string generation |
test/unit/org/apache/cassandra/db/SinglePartitionReadCommandCQLTest.java |
Expanded test coverage to verify redaction works for various query types and added tests for quoted identifiers |
test/unit/org/apache/cassandra/db/ReadCommandCQLTester.java |
Modified helper method to test both redacted and non-redacted CQL output |
test/unit/org/apache/cassandra/db/PartitionRangeReadCommandCQLTest.java |
Added comprehensive tests for redaction across different query patterns including index-based queries |
test/unit/org/apache/cassandra/db/MultiRangeReadCommandCQLTest.java |
New test file for multi-range read command CQL string generation with redaction support |
test/unit/org/apache/cassandra/cql3/validation/entities/SecondaryIndexTest.java |
Fixed error message templates to remove placeholder syntax |
test/distributed/org/apache/cassandra/distributed/test/SlowQueryLoggerTest.java |
New integration test verifying that slow query logger redacts sensitive data |
src/java/org/apache/cassandra/schema/ColumnMetadata.java |
Fixed CQL string generation to use proper column name quoting |
src/java/org/apache/cassandra/index/sai/plan/QueryController.java |
Updated tracing to use redacted plan strings |
src/java/org/apache/cassandra/index/sai/plan/Plan.java |
Added redaction support to plan string generation methods |
src/java/org/apache/cassandra/db/rows/AbstractRow.java |
Updated to use redacted clustering string generation |
src/java/org/apache/cassandra/db/marshal/AbstractType.java |
Added new toCQLString method with redaction parameter |
src/java/org/apache/cassandra/db/filter/RowFilter.java |
Enhanced filter expressions to support redacted CQL string generation |
src/java/org/apache/cassandra/db/filter/DataLimits.java |
Fixed limit clause formatting for proper CQL syntax |
Multiple filter and query classes | Updated CQL string generation methods to support redaction parameter throughout the call chain |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR modifies AbstractReadQuery.toCQLString so it doesn't include column values. There is a boolean flag to opt-out from redaction, since seeing the queried values can be useful while debugging.
I would like to request to have two exposed methods with different names. And give better name to the methods, which targets logs and excludes column values.
Also can you go through the PR checklist? Some items can be visited already now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR and improving maitnainability by reducing code duplication. It's great to see such improvements.
I am requesting to separate fixes to producing correct CQL string from the implementation of redaction into a separate PR. There are many practical reasons, and splitting will help to do review and also address review comments. I have plenty comments just on fixing correctness of produced CQL string. And I also have comments to redaction implementation. I think it will be useful to have the redaction PR be based on the correctness PR.
My comments to the first part is maintly about improving code maintainability. I probably noticed few uncrutial misses too.
For the redaction part I have an issue that the decision when to restrict and when not, e.g., in impelemntations of toString
, is arbitrary. It will be great to summarize where and why redaction is applied and where not in the PR discription. I also suggest to bring this summary to Slack for visibility and potential discussion.
Also I would like to see improvements to the names of methods toCQLString()
(without argument) to provide semantics of redaction if they do so.
My review is currently partial as the PR is big.
src/java/org/apache/cassandra/db/filter/AbstractClusteringIndexFilter.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/db/filter/AbstractClusteringIndexFilter.java
Outdated
Show resolved
Hide resolved
" predicate: RANGE(pred4)\n", | ||
limit.toStringRecursive(false)); | ||
|
||
assertEquals("Limit 3 (rows: 3.0, cost/row: 3895.8, cost: 44171.3..55858.7)\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this plan equivalent to the above with the difference on the second line? This is difficult to verify, thus it might be good construct string plans from the common parts and avoid copy-paste.
@@ -288,16 +288,37 @@ enum ControlFlow { Continue, Break } | |||
*/ | |||
public final String toStringRecursive() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it important improve the name of the method to make its new semantics clear, especially this PR changes its semantics. E.g., toRedactedStringRecursive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding toRedactedStringRecursive
and toUnredactedStringRecursive
.
return String.format("ORDER BY %s %s", column.name, operator); | ||
case ANN: | ||
return String.format("ORDER BY %s ANN OF %s", column.name, truncateValue(type.toCQLString(value, redact))); | ||
case LIKE_PREFIX: | ||
return likeToCQLString("'%s%%'", type, redact); | ||
case LIKE_SUFFIX: | ||
return likeToCQLString("'%%%s'", type, redact); | ||
case LIKE_CONTAINS: | ||
return likeToCQLString("'%%%s%%'", type, redact); | ||
case LIKE_MATCHES: | ||
return likeToCQLString("'%s'", type, redact); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixing break
and return
smells to me. However, I haven't got a good idea to suggest here. I thought if likeToCQLString
can return formatting strings to be used in the common return, however, serializing like operators to strings can cause issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything actionable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a suggestion, thus no action to do in this PR. I am wonder what do you think about this mix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have anything to say, that's why I did it this way.
@k-rus this PR is already split into separate commits, one for each thing, which took quite a bit of work to do because generally one realizes things about correctness when working on masking, since both things are tightly related. Should we spend more time in dividing this in two PRs? Also, there is review feedback for both things in this PR. |
d7a1449
to
dfdecf0
Compare
@k-rus thanks for reviewing. I think I have addressed all the feedback on this partial review round. Please let me know when the rest of the review is ready. |
I’m not sure I understand why having two PRs would be more work than keeping two clean commits. My suggestion is to create the first PR focused solely on fixing the CQL string issues (as in the current commit) and then base the redaction PR on top of that one. I specifically don’t suggest basing the second PR on main, since that would introduce unnecessary extra work. Splitting the work into two PRs helps reduce the review size and avoid mixing semantically different changes. It also makes it easier to review each part separately with the existing tooling. Since code quality is a current focus, having two clear, purpose-specific PRs will definitely help with review efficiency and maintainability. |
dfdecf0
to
ae827cd
Compare
I think we have separate commits as a requirement for this kind of thing. This PR already has abundant review discussion on both correctness and refactoring that will be missing in the separate PR. Maybe that feedback should have been hold until the PR was split to help with review efficiency and maintainability. Anyway, I have created that separate PR if that helps unblock this, leaving this one for correctness: #2038 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for splitting into a separate PR as it helps a lot, since the PR is still not small with many changes.
It might be helpful in future if there is a separate ticket for fixing issues with generated CQL strings. Having a separate issue will also allow to remove connection to redaction. What do you think?
The PR looks good to me. I think I found a couple of places where toCQLstring
is missing. Also I am not sure if generated CQL doesn't need to be fixed for ANN queries. (if it's so and it requires work, I don't mind to postpone into a separate issue, and this PR will just introduce tests demonstrating the current issue)
Otherwise, my comments are nits to improve maintainability of the code.
* For multi-column keys: "k1 = 1 AND k2 = 2" | ||
* | ||
* @param metadata the table metadata | ||
* @param redact whether to redact the key value, as in "k1 = ? AND k2 = ?". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to move this comment about param redact
to the follow up PR.
if (ranges.size() == 1) | ||
{ | ||
String rangeString = ranges.get(0).toCQLString(metadata()); | ||
if (!rangeString.isEmpty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this can only happen with single range, can it be useful to have an assert for non-empty range string in the case of many ranges? (this is because I had earlier the question about when it can be empty)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach is suggested in #1985 (comment)
} | ||
else | ||
{ | ||
builder.append(" WHERE ("); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion looks stupid, but it can help to notice the opening parentheses:
builder.append(" WHERE ("); | |
builder.append(" WHERE ").append('('); |
You can ignore the suggestion. And I am not sure if linter will complain about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it, it might help a bit with readability
private boolean appendRanges(CqlBuilder builder) | ||
{ | ||
List<DataRange> ranges = ranges(); | ||
boolean hasRangeRestrictions = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to check if the purpose of this variable to store the result or if it is also read to guide logic. Since it's only to store the result, what do you think about replacing it with return clauses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be replaced by returns, yes.
default void appendCQLWhereClause(CqlBuilder builder) | ||
{ | ||
List<DataRange> ranges = ranges(); | ||
if (ranges.size() == 1 && ranges.get(0).isUnrestricted() && rowFilter().isEmpty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct that the condition ranges.size() == 1 && ranges.get(0).isUnrestricted()
means that there are no ranges?
What do you think about refactoring this class and use this condition already here instead of calling appendRanges
in any case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the condition rowFilter().isEmpty()
correspond to the condition on line 43 if (!filter.isEmpty())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the condition
rowFilter().isEmpty()
correspond to the condition on line 43if (!filter.isEmpty())
?
Not really, because rowFilter().isEmpty()
doesn't include the clustering.
I'm simplifying this a bit, anyway.
return String.format("ORDER BY %s %s", column.name, operator); | ||
case ANN: | ||
return String.format("ORDER BY %s ANN OF %s", column.name, valueAsCQLString(type, value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should two places with column.name
be column.name.toCQLString()
too?
createIndex("CREATE CUSTOM INDEX ON %s(v) USING 'StorageAttachedIndex'"); | ||
createIndex("CREATE CUSTOM INDEX ON %s(n) USING 'StorageAttachedIndex'"); | ||
assertToCQLString("SELECT * FROM %s ORDER BY v ANN OF [1, 2] LIMIT 10", | ||
"SELECT * FROM %s WHERE (token(k) <= -1 OR token(k) > -1) ORDER BY v ANN OF [1.0, ... LIMIT 10 ALLOW FILTERING"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does ...
mean here? Is it correct that there is no closing bracket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to the generic truncation done in RowFilter.SimpleExpression.toCQLString
, which doesn't consider the data type. That might be delegated to the type in a separate ticket, if there is value on it.
createTable("CREATE TABLE %s (k int, c int, n int, PRIMARY KEY (k, c))"); | ||
createIndex("CREATE CUSTOM INDEX ON %s(n) USING 'StorageAttachedIndex'"); | ||
assertToCQLString("SELECT * FROM %s ORDER BY n LIMIT 10", | ||
"SELECT * FROM %s WHERE (token(k) <= -1 OR token(k) > -1) ORDER BY n ASC LIMIT 10 ALLOW FILTERING"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to check the correctness of the expected CQL? Is it executable or parsable?
It's difficult to check the correctness by human eyes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. I have added some validation checks, although toCQLString
doesn't always produces valid data. The cases where it doesn't produce valid data are:
- Truncation of long expression operands
MultRangeReadCommand
s like the ones tested here. This commands don't have a supported CQL representation. They are the result of internally splitting other commands, and I have tried to add a kind of understandable CQL representation of them, even if it's not supported.
// test literals | ||
createTable("CREATE TABLE %s (k text, c text, m map<text, text>, PRIMARY KEY (k, c))"); | ||
assertToCQLString("SELECT m['key'] FROM %s WHERE m['key'] = 'value' ALLOW FILTERING", | ||
"SELECT m FROM %s WHERE m['key'] = 'value' ALLOW FILTERING"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct that projection is missing the collection access, i.e., m
instead of m['key']
? Or is it something to fix in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what is received internally, due to the differences between fetched/queries columns, etc.
"SELECT * FROM %s WHERE k = 0 ALLOW FILTERING"); | ||
assertToCQLString("SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC", | ||
"SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC, c2 DESC ALLOW FILTERING"); | ||
assertToCQLString("SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC, c2 DESC", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a curious question, which, I believe, is outside the scope of this PR. Is it allowed to have mixing ASC
and DESC
for c1
and c2
? E.g., ORDER BY c1 DESC, c2 ASC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not supported, due to how clusterings are built.
Various fixes for toCQLString mainly coming from CASSANDRA-16510, and removal of code duplication.
ae827cd
to
c2702c7
Compare
|
❌ Build ds-cassandra-pr-gate/PR-1985 rejected by Butler1 regressions found Found 1 new test failures
Found 3 known test failures |
The method
AbstractReadQuery.toCQLString
prints commands as CQL queries including any column values. This includes the queried values in theWHERE
part of aSELECT
statement or the written values onINSERT
andUPDATE
statement. This method is used at least by the slow query logger, printing user data into the logs.This PR includes a number of fixes to make
toCQLString
produce correct CQL syntax, mostly coming from CASSANDRA-16510. These are fixes to prevent printing wrong or redundant CQL syntax such asSELECT * FROM t ORDER BY (v ANN OF [...])
orSELECT * FROM t WHERE c1=0 AND c1=0
. Those issues came up while writing tests for testing redaction with various types of queries, and are needed to properly test redaction. There is also some removal of code duplication. These changes are in a separate commit in an attempt to ease review.This PR originally included both those fixes and redaction in separate commit, but they have been separated in two PRs at reviewer request. The PR for redaction is #2038. This PR however contains reviewer feedback regarding changes that have been ported to that PR, so it might be harder to follow.