-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-15619: Allow to customize SAI format to write and to consider as current() depending on a keyspace #2041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Checklist before you submit for review
|
/** Cost to begin processing PKs into index ordinals for estimateAnnSortCost */ | ||
// DC introduced the one-to-many ordinal mapping optimization | ||
public final static double ANN_SORT_OPEN_COST = Version.current().onOrAfter(Version.DC) ? 370 : 4200; | ||
public final static double ANN_SORT_OPEN_COST = Version.current().onOrAfter(Version.DC) ? 370 : 4200; // it's okay to make this depending on the main SAI version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eolivelli why is this ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is okay because in production we use 'EC' or 'AA' everywhere
where we use 'AA' there is no vector search
so ALL the production clusters that support vector search use 'EC' or later
we can remove this check and make it a fixed constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can’t let keyspaces on AA use ORDER BY.
is this just Ann order by or also the generic one?
If this is just for ANN then it’s ok. The base will be EC and we won’t let people make a vector index to attempt ANN if their keyspace doesn’t support EC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a quick look, it seems that this and the following ANN_SORT_KEY_COST
definitions are the only places (outside of tests) that use Version.current()
(the version without a keyspace).
I strongly suggest we fix this, remove Version.current()
and have the version selection take String
instead of Optional<String>
. IMO, the concept of a "main" SAI version is confusing and is dangerous (makes it way too easy to rely on the default with debatable justification), so we should eliminate if it's not strictly needed (and it doesn't seem it is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I have removed Version.current()
and Version.CURRENT
, so Version.current(String)
is the only door. VersionSelector
is moved to Version.Selector
, which doesn't have an optional keyspace argument anymore.
Setting as ready to review so CI runs on push, and added |
src/java/org/apache/cassandra/index/sai/disk/format/IndexComponents.java
Outdated
Show resolved
Hide resolved
Preconditions.checkArgument(!sealed, "Should not add components for SSTable %s at this point; the completion marker has already been written", descriptor); | ||
// When a sstable doesn't have any complete group, we use a marker empty one with a generation of -1: | ||
Preconditions.checkArgument(buildId != EMPTY_GROUP_MARKER, "Should not be adding component to empty components"); | ||
// TODO.... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find a good way to implement this assertion, maybe we can remove the commented code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added back the checks using a memoized instance of the empty group marker for the version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of very minor nits, but lgtm otherwise.
IndexComponentsImpl components; | ||
if (buildId == null) | ||
{ | ||
Version version = context == null ? Version.current(descriptor.ksname) : context.version(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line is dead code/unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
}); | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the goal of adding that override? The default implementation in IndexComponents
already returns buildId().version()
, so it seems this override returns the same thing is isn't necessary. Obviously not a big deal even if unnecessary but ...
var indexWriter = new OnDiskGraphIndexWriter.Builder(builder.getGraph(), indexFile.toPath()) | ||
.withStartOffset(termsOffset) | ||
.withVersion(Version.current().onDiskFormat().jvectorFileFormatVersion()) | ||
.withVersion(perIndexComponents.context().version().onDiskFormat().jvectorFileFormatVersion()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the .context()
part can be removed (at least for consistency with other calls in this file :)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, removed.
CassandraOnHeapGraph.writePqHeader(pqOutput.asSequentialWriter(), unitVectors, VectorCompression.CompressionType.PRODUCT_QUANTIZATION); | ||
compressedVectors.write(pqOutput.asSequentialWriter(), Version.current().onDiskFormat().jvectorFileFormatVersion()); | ||
CassandraOnHeapGraph.writePqHeader(pqOutput.asSequentialWriter(), unitVectors, VectorCompression.CompressionType.PRODUCT_QUANTIZATION, perIndexComponents.version()); | ||
compressedVectors.write(pqOutput.asSequentialWriter(), context.version().onDiskFormat().jvectorFileFormatVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we use perIndexComponents.version()
the line above, and context.version()
here. It's obviously not a big deal since the value is always going to be the same, but that also feel a tad confusing for people not as familiar with the code, that on seeing it done differently may expect those 2 things aren't the same. Tl;dr, I think it's worth being consistent to facilitate future readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I have put the context's version in a local var and used it on all the calls.
Thanks for the review. I think I've addressed all the suggestions. Now I'll rebase and squash, and start working on a port to |
… current() depending on a keyspace There is a new cassandra.sai.version.selector.class system property allowing to provide an implementation of the o.a.c.index.sai.disk.format.Version.Selector interface to specify that version of the SAI on-disk index format should be used for each keyspace.
be59b99
to
8eba026
Compare
|
❌ Build ds-cassandra-pr-gate/PR-2041 rejected by Butler1 regressions found Found 1 new test failures
No known test failures found |
The occasional failure of |
… current() depending on a keyspace (#2064) There is a new cassandra.sai.version.selector.class system property allowing to provide an implementation of the o.a.c.index.sai.disk.format.Version.Selector interface to specify that version of the SAI on-disk index format should be used for each keyspace. This is a cherry-pick of #2041 Co-authored-by: Enrico Olivelli <[email protected]>
What is the issue
See https://github.com/riptano/cndb/issues/15619
What does this PR fix and why was it fixed
volatile
variable, now on some hot paths we are using afinal
field (in IndexContext or other places)CNDB side implementation: https://github.com/riptano/cndb/issues/15619