Skip to content

Conversation

adelapena
Copy link

@adelapena adelapena commented Sep 4, 2025

The method AbstractReadQuery.toCQLString prints commands as CQL queries including any column values. This includes the queried values in the WHERE part of a SELECT statement or the written values on INSERT and UPDATE statement. This method is used at least by the slow query logger, printing user data into the logs.

This PR includes a number of fixes to make toCQLString produce correct CQL syntax, mostly coming from CASSANDRA-16510. These are fixes to prevent printing wrong or redundant CQL syntax such as SELECT * FROM t ORDER BY (v ANN OF [...]) or SELECT * FROM t WHERE c1=0 AND c1=0. Those issues came up while writing tests for testing redaction with various types of queries, and are needed to properly test redaction. There is also some removal of code duplication. These changes are in a separate commit in an attempt to ease review.

This PR originally included both those fixes and redaction in separate commit, but they have been separated in two PRs at reviewer request. The PR for redaction is #2038. This PR however contains reviewer feedback regarding changes that have been ported to that PR, so it might be harder to follow.

@adelapena adelapena self-assigned this Sep 4, 2025
Copy link

github-actions bot commented Sep 4, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@adelapena adelapena marked this pull request as draft September 4, 2025 19:23
@adelapena adelapena marked this pull request as ready for review September 9, 2025 12:51
@adelapena adelapena marked this pull request as draft September 9, 2025 18:40
@adelapena adelapena marked this pull request as ready for review September 10, 2025 11:31
@adelapena adelapena force-pushed the CNDB-15280-main branch 4 times, most recently from a84996a to f9d1f3d Compare September 12, 2025 12:09
@adelapena adelapena requested a review from a team September 12, 2025 13:49
@k-rus k-rus requested a review from Copilot September 15, 2025 05:49
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

The PR "CNDB-15280: Remove user data from AbstractReadQuery.toCQLString" modifies the CQL string generation methods to redact sensitive user data from query logs. The main purpose is to prevent sensitive column values from appearing in slow query logs and other monitoring outputs by replacing them with "?" placeholders.

Key changes:

  • Modified AbstractReadQuery.toCQLString to accept a redact parameter for controlling value redaction
  • Added redaction capabilities throughout the CQL string generation chain
  • Updated SAI query plan logging to use redacted strings

Reviewed Changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/unit/org/apache/cassandra/index/sai/plan/PlanTest.java Updated test mocks to support new redaction parameter and added tests for both redacted and non-redacted plan output
test/unit/org/apache/cassandra/db/filter/DataLimitsTest.java Fixed test assertions to use "LIMIT" instead of "ROWS LIMIT" for CQL compatibility
test/unit/org/apache/cassandra/db/filter/ColumnFilterTest.java Updated all test assertions to test both redacted and non-redacted CQL string generation
test/unit/org/apache/cassandra/db/SinglePartitionReadCommandCQLTest.java Expanded test coverage to verify redaction works for various query types and added tests for quoted identifiers
test/unit/org/apache/cassandra/db/ReadCommandCQLTester.java Modified helper method to test both redacted and non-redacted CQL output
test/unit/org/apache/cassandra/db/PartitionRangeReadCommandCQLTest.java Added comprehensive tests for redaction across different query patterns including index-based queries
test/unit/org/apache/cassandra/db/MultiRangeReadCommandCQLTest.java New test file for multi-range read command CQL string generation with redaction support
test/unit/org/apache/cassandra/cql3/validation/entities/SecondaryIndexTest.java Fixed error message templates to remove placeholder syntax
test/distributed/org/apache/cassandra/distributed/test/SlowQueryLoggerTest.java New integration test verifying that slow query logger redacts sensitive data
src/java/org/apache/cassandra/schema/ColumnMetadata.java Fixed CQL string generation to use proper column name quoting
src/java/org/apache/cassandra/index/sai/plan/QueryController.java Updated tracing to use redacted plan strings
src/java/org/apache/cassandra/index/sai/plan/Plan.java Added redaction support to plan string generation methods
src/java/org/apache/cassandra/db/rows/AbstractRow.java Updated to use redacted clustering string generation
src/java/org/apache/cassandra/db/marshal/AbstractType.java Added new toCQLString method with redaction parameter
src/java/org/apache/cassandra/db/filter/RowFilter.java Enhanced filter expressions to support redacted CQL string generation
src/java/org/apache/cassandra/db/filter/DataLimits.java Fixed limit clause formatting for proper CQL syntax
Multiple filter and query classes Updated CQL string generation methods to support redaction parameter throughout the call chain

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR modifies AbstractReadQuery.toCQLString so it doesn't include column values. There is a boolean flag to opt-out from redaction, since seeing the queried values can be useful while debugging.

I would like to request to have two exposed methods with different names. And give better name to the methods, which targets logs and excludes column values.

Also can you go through the PR checklist? Some items can be visited already now.

Copy link

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR and improving maitnainability by reducing code duplication. It's great to see such improvements.

I am requesting to separate fixes to producing correct CQL string from the implementation of redaction into a separate PR. There are many practical reasons, and splitting will help to do review and also address review comments. I have plenty comments just on fixing correctness of produced CQL string. And I also have comments to redaction implementation. I think it will be useful to have the redaction PR be based on the correctness PR.

My comments to the first part is maintly about improving code maintainability. I probably noticed few uncrutial misses too.

For the redaction part I have an issue that the decision when to restrict and when not, e.g., in impelemntations of toString, is arbitrary. It will be great to summarize where and why redaction is applied and where not in the PR discription. I also suggest to bring this summary to Slack for visibility and potential discussion.
Also I would like to see improvements to the names of methods toCQLString() (without argument) to provide semantics of redaction if they do so.

My review is currently partial as the PR is big.

" predicate: RANGE(pred4)\n",
limit.toStringRecursive(false));

assertEquals("Limit 3 (rows: 3.0, cost/row: 3895.8, cost: 44171.3..55858.7)\n" +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this plan equivalent to the above with the difference on the second line? This is difficult to verify, thus it might be good construct string plans from the common parts and avoid copy-paste.

@@ -288,16 +288,37 @@ enum ControlFlow { Continue, Break }
*/
public final String toStringRecursive()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it important improve the name of the method to make its new semantics clear, especially this PR changes its semantics. E.g., toRedactedStringRecursive.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding toRedactedStringRecursive and toUnredactedStringRecursive.

Comment on lines 1408 to 1418
return String.format("ORDER BY %s %s", column.name, operator);
case ANN:
return String.format("ORDER BY %s ANN OF %s", column.name, truncateValue(type.toCQLString(value, redact)));
case LIKE_PREFIX:
return likeToCQLString("'%s%%'", type, redact);
case LIKE_SUFFIX:
return likeToCQLString("'%%%s'", type, redact);
case LIKE_CONTAINS:
return likeToCQLString("'%%%s%%'", type, redact);
case LIKE_MATCHES:
return likeToCQLString("'%s'", type, redact);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixing break and return smells to me. However, I haven't got a good idea to suggest here. I thought if likeToCQLString can return formatting strings to be used in the common return, however, serializing like operators to strings can cause issues.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything actionable here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a suggestion, thus no action to do in this PR. I am wonder what do you think about this mix?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have anything to say, that's why I did it this way.

@adelapena
Copy link
Author

adelapena commented Oct 1, 2025

I am requesting to separate fixes to producing correct CQL string from the implementation of redaction into a separate PR. There are many practical reasons, and splitting will help to do review and also address review comments. I have plenty comments just on fixing correctness of produced CQL string. And I also have comments to redaction implementation. I think it will be useful to have the redaction PR be based on the correctness PR.

@k-rus this PR is already split into separate commits, one for each thing, which took quite a bit of work to do because generally one realizes things about correctness when working on masking, since both things are tightly related. Should we spend more time in dividing this in two PRs? Also, there is review feedback for both things in this PR.

@adelapena adelapena force-pushed the CNDB-15280-main branch 2 times, most recently from d7a1449 to dfdecf0 Compare October 2, 2025 10:51
@adelapena
Copy link
Author

@k-rus thanks for reviewing. I think I have addressed all the feedback on this partial review round. Please let me know when the rest of the review is ready.

@k-rus
Copy link

k-rus commented Oct 6, 2025

I am requesting to separate fixes to producing correct CQL string from the implementation of redaction into a separate PR. There are many practical reasons, and splitting will help to do review and also address review comments. I have plenty comments just on fixing correctness of produced CQL string. And I also have comments to redaction implementation. I think it will be useful to have the redaction PR be based on the correctness PR.

@k-rus this PR is already split into separate commits, one for each thing, which took quite a bit of work to do because generally one realizes things about correctness when working on masking, since both things are tightly related. Should we spend more time in dividing this in two PRs? Also, there is review feedback for both things in this PR.

I’m not sure I understand why having two PRs would be more work than keeping two clean commits.

My suggestion is to create the first PR focused solely on fixing the CQL string issues (as in the current commit) and then base the redaction PR on top of that one. I specifically don’t suggest basing the second PR on main, since that would introduce unnecessary extra work.

Splitting the work into two PRs helps reduce the review size and avoid mixing semantically different changes. It also makes it easier to review each part separately with the existing tooling.

Since code quality is a current focus, having two clear, purpose-specific PRs will definitely help with review efficiency and maintainability.

@adelapena adelapena changed the title CNDB-15280: Remove user data from AbstractReadQuery.toCQLString CNDB-15280: Remove user data from AbstractReadQuery.toCQLString (fixes and refactoring) Oct 7, 2025
@adelapena
Copy link
Author

I am requesting to separate fixes to producing correct CQL string from the implementation of redaction into a separate PR. There are many practical reasons, and splitting will help to do review and also address review comments. I have plenty comments just on fixing correctness of produced CQL string. And I also have comments to redaction implementation. I think it will be useful to have the redaction PR be based on the correctness PR.

@k-rus this PR is already split into separate commits, one for each thing, which took quite a bit of work to do because generally one realizes things about correctness when working on masking, since both things are tightly related. Should we spend more time in dividing this in two PRs? Also, there is review feedback for both things in this PR.

I’m not sure I understand why having two PRs would be more work than keeping two clean commits.

My suggestion is to create the first PR focused solely on fixing the CQL string issues (as in the current commit) and then base the redaction PR on top of that one. I specifically don’t suggest basing the second PR on main, since that would introduce unnecessary extra work.

Splitting the work into two PRs helps reduce the review size and avoid mixing semantically different changes. It also makes it easier to review each part separately with the existing tooling.

Since code quality is a current focus, having two clear, purpose-specific PRs will definitely help with review efficiency and maintainability.

I think we have separate commits as a requirement for this kind of thing. This PR already has abundant review discussion on both correctness and refactoring that will be missing in the separate PR. Maybe that feedback should have been hold until the PR was split to help with review efficiency and maintainability. Anyway, I have created that separate PR if that helps unblock this, leaving this one for correctness: #2038

@k-rus k-rus self-requested a review October 7, 2025 14:11
Copy link

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for splitting into a separate PR as it helps a lot, since the PR is still not small with many changes.
It might be helpful in future if there is a separate ticket for fixing issues with generated CQL strings. Having a separate issue will also allow to remove connection to redaction. What do you think?

The PR looks good to me. I think I found a couple of places where toCQLstring is missing. Also I am not sure if generated CQL doesn't need to be fixed for ANN queries. (if it's so and it requires work, I don't mind to postpone into a separate issue, and this PR will just introduce tests demonstrating the current issue)
Otherwise, my comments are nits to improve maintainability of the code.

* For multi-column keys: "k1 = 1 AND k2 = 2"
*
* @param metadata the table metadata
* @param redact whether to redact the key value, as in "k1 = ? AND k2 = ?".
Copy link

@k-rus k-rus Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to move this comment about param redact to the follow up PR.

if (ranges.size() == 1)
{
String rangeString = ranges.get(0).toCQLString(metadata());
if (!rangeString.isEmpty())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this can only happen with single range, can it be useful to have an assert for non-empty range string in the case of many ranges? (this is because I had earlier the question about when it can be empty)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach is suggested in #1985 (comment)

}
else
{
builder.append(" WHERE (");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion looks stupid, but it can help to notice the opening parentheses:

Suggested change
builder.append(" WHERE (");
builder.append(" WHERE ").append('(');

You can ignore the suggestion. And I am not sure if linter will complain about it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it, it might help a bit with readability

private boolean appendRanges(CqlBuilder builder)
{
List<DataRange> ranges = ranges();
boolean hasRangeRestrictions = false;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check if the purpose of this variable to store the result or if it is also read to guide logic. Since it's only to store the result, what do you think about replacing it with return clauses?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be replaced by returns, yes.

default void appendCQLWhereClause(CqlBuilder builder)
{
List<DataRange> ranges = ranges();
if (ranges.size() == 1 && ranges.get(0).isUnrestricted() && rowFilter().isEmpty())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that the condition ranges.size() == 1 && ranges.get(0).isUnrestricted() means that there are no ranges?
What do you think about refactoring this class and use this condition already here instead of calling appendRanges in any case?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the condition rowFilter().isEmpty() correspond to the condition on line 43 if (!filter.isEmpty())?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the condition rowFilter().isEmpty() correspond to the condition on line 43 if (!filter.isEmpty())?

Not really, because rowFilter().isEmpty() doesn't include the clustering.

I'm simplifying this a bit, anyway.

Comment on lines 1408 to 1410
return String.format("ORDER BY %s %s", column.name, operator);
case ANN:
return String.format("ORDER BY %s ANN OF %s", column.name, valueAsCQLString(type, value));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should two places with column.name be column.name.toCQLString() too?

createIndex("CREATE CUSTOM INDEX ON %s(v) USING 'StorageAttachedIndex'");
createIndex("CREATE CUSTOM INDEX ON %s(n) USING 'StorageAttachedIndex'");
assertToCQLString("SELECT * FROM %s ORDER BY v ANN OF [1, 2] LIMIT 10",
"SELECT * FROM %s WHERE (token(k) <= -1 OR token(k) > -1) ORDER BY v ANN OF [1.0, ... LIMIT 10 ALLOW FILTERING");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does ... mean here? Is it correct that there is no closing bracket?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the generic truncation done in RowFilter.SimpleExpression.toCQLString, which doesn't consider the data type. That might be delegated to the type in a separate ticket, if there is value on it.

createTable("CREATE TABLE %s (k int, c int, n int, PRIMARY KEY (k, c))");
createIndex("CREATE CUSTOM INDEX ON %s(n) USING 'StorageAttachedIndex'");
assertToCQLString("SELECT * FROM %s ORDER BY n LIMIT 10",
"SELECT * FROM %s WHERE (token(k) <= -1 OR token(k) > -1) ORDER BY n ASC LIMIT 10 ALLOW FILTERING");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to check the correctness of the expected CQL? Is it executable or parsable?
It's difficult to check the correctness by human eyes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. I have added some validation checks, although toCQLString doesn't always produces valid data. The cases where it doesn't produce valid data are:

  • Truncation of long expression operands
  • MultRangeReadCommands like the ones tested here. This commands don't have a supported CQL representation. They are the result of internally splitting other commands, and I have tried to add a kind of understandable CQL representation of them, even if it's not supported.

// test literals
createTable("CREATE TABLE %s (k text, c text, m map<text, text>, PRIMARY KEY (k, c))");
assertToCQLString("SELECT m['key'] FROM %s WHERE m['key'] = 'value' ALLOW FILTERING",
"SELECT m FROM %s WHERE m['key'] = 'value' ALLOW FILTERING");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that projection is missing the collection access, i.e., m instead of m['key']? Or is it something to fix in future?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what is received internally, due to the differences between fetched/queries columns, etc.

"SELECT * FROM %s WHERE k = 0 ALLOW FILTERING");
assertToCQLString("SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC",
"SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC, c2 DESC ALLOW FILTERING");
assertToCQLString("SELECT * FROM %s WHERE k = 0 ORDER BY c1 DESC, c2 DESC",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a curious question, which, I believe, is outside the scope of this PR. Is it allowed to have mixing ASC and DESC for c1 and c2? E.g., ORDER BY c1 DESC, c2 ASC

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not supported, due to how clusterings are built.

Various fixes for toCQLString mainly coming from CASSANDRA-16510,
and removal of code duplication.
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1985 rejected by Butler


1 regressions found
See build details here


Found 1 new test failures

Test Explanation Runs Upstream
o.a.c.sensors.SensorsIndexWriteTest.BeforeFirstTest NEW 🔴 1 / 10

Found 3 known test failures

@adelapena adelapena requested a review from k-rus October 17, 2025 10:45
@adelapena adelapena changed the title CNDB-15280: Remove user data from AbstractReadQuery.toCQLString (fixes and refactoring) CNDB-15280: Fix AbstractReadQuery.toCQLString to produce correct CQL when possible Oct 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants