Skip to content

Conversation

@michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Nov 6, 2025

What is the issue

Fixes: https://github.com/riptano/cndb/issues/15919
Test PR: https://github.com/riptano/cndb/pull/15949

What does this PR fix and why was it fixed

In the original implementation for #820, we introduced the PrimaryKeyMapIterator to iterate all primary keys in an sstable and then do an anti-join on the result of an equality query. That design works, but requires some additional reads from disk to get primary keys that are unnecessary.

There are two possible solutions:

  1. We can use row ids (either sstable or segment) to do the complement of the resulting posting lists. This will be the most performant, since it avoids object allocations. The main issue with this solution is that it is much more complicated to implement and had unaddressed edge cases.
  2. We can use the primaryKeyFromRowId that takes primary key bounds and then uses a row id, when rows are from the same sstable. This will be worse that solution 1 because it creates an object per key and requires comparing sstable ids before comparing sstable row ids, but it is a significant improvement over the current solution, which hits disk to load the primary key.

When testing on my local machine and reviewing the JMH benchmarks, I can see that the current solution is about 16x worse than the minimum solution (2) and 32x worse than the optimal (1) solution. Given that the benchmarks in question are highly specific to the use case, I do no think we have sufficient motivation to introduce the exceedingly complex (1) solution.

Note that the ideal solution to 1, that would have much less complexity, is to convert posting lists into a single iterator of sstable row ids, and then to take the complement of them.

@github-actions
Copy link

github-actions bot commented Nov 6, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@michaeljmarshall michaeljmarshall marked this pull request as ready for review November 7, 2025 04:12
Copy link

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the other minor things I mentioned, we need a test for verifying how NEQ works together with token range / partition key restriction.

Results on current commit:

     [java] Result "org.apache.cassandra.test.microbench.index.sai.NEQQueryBench.queryNEQ":
     [java]   7198946.987 ±(99.9%) 155705.068 ns/op [Average]
     [java]   (min, avg, max) = (7086466.212, 7198946.987, 7464311.708), stdev = 102989.262
     [java]   CI (99.9%): [7043241.919, 7354652.054] (assumes normal distribution)
     [java]
     [java] Benchmark               (numRowsWithinPartition)  Mode  Cnt        Score        Error  Units
     [java] NEQQueryBench.queryNEQ                      1000  avgt   10  1109937.157 ±   2362.813  ns/op
     [java] NEQQueryBench.queryNEQ                     10000  avgt   10  7198946.987 ± 155705.068  ns/op
Benchmark results for baseline.

     [java] Result "org.apache.cassandra.test.microbench.index.sai.NEQQueryBench.queryNEQ":
     [java]   327531893.145 ±(99.9%) 1209098.423 ns/op [Average]
     [java]   (min, avg, max) = (326375963.710, 327531893.145, 329248861.548), stdev = 799743.744
     [java]   CI (99.9%): [326322794.722, 328740991.569] (assumes normal distribution)
     [java]
     [java] Benchmark               (numRowsWithinPartition)  Mode  Cnt          Score         Error  Units
     [java] NEQQueryBench.queryNEQ                      1000  avgt   10   32187371.679 ±   67260.318  ns/op
     [java] NEQQueryBench.queryNEQ                     10000  avgt   10  327531893.145 ± 1209098.423  ns/op
     [java] Result "org.apache.cassandra.test.microbench.index.sai.NEQQueryBench.queryNEQ":
     [java]   17761666.739 ±(99.9%) 260418.205 ns/op [Average]
     [java]   (min, avg, max) = (17644295.340, 17761666.739, 18241636.689), stdev = 172250.519
     [java]   CI (99.9%): [17501248.533, 18022084.944] (assumes normal distribution)
     [java]
     [java] Benchmark               (numRowsWithinPartition)  Mode  Cnt         Score        Error  Units
     [java] NEQQueryBench.queryNEQ                      1000  avgt   10   2169861.695 ±  11355.830  ns/op
     [java] NEQQueryBench.queryNEQ                     10000  avgt   10  17761666.739 ± 260418.205  ns/op
@michaeljmarshall
Copy link
Member Author

@pkolaczk - I added a benchmark and discovered that in the most extreme cases, we can get most of the benefit for an extremely small code change, compared to what I had proposed before. I think we should go with this one for now and leave a deeper re-write for later, if there is demand.

@sonarqubecloud
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-2112 rejected by Butler


1 regressions found
See build details here


Found 1 new test failures

Test Explanation Runs Upstream
o.a.c.index.sai.cql.datamodels.QueryWriteLifecycleWithCompositePartitionKeyTest.terminated successfully NEW 🔴 0 / 17

Found 1 known test failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants