
Conversation

michaeljmarshall
Member

What is the issue

Fixes: https://github.com/riptano/cndb/issues/15640

What does this PR fix and why was it fixed

To lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit-length computation needs to move, so I decided to split that change out into its own PR.

The core idea is that:

  • some models are documented to produce unit-length vectors, and in those cases we can skip the computational check
  • otherwise, we check at runtime until we hit a non-unit-length vector; after that we can skip the check and configure the writePQ method as needed (see the sketch after this list)
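As a rough illustration of the runtime path, here is a minimal sketch of the idea. This is not the PR's actual code; the class name, method names, and the 1e-4 tolerance are assumptions for illustration only.

```java
// Minimal sketch, assuming a hypothetical tracker class and a 1e-4 tolerance.
public final class UnitLengthTracker
{
    private static final double EPSILON = 1e-4;

    // Stays true only while every vector seen so far has unit length.
    // Once a non-unit-length vector shows up, the flag flips permanently,
    // so later writes can skip the check and configure writePQ accordingly.
    private boolean allUnitLength = true;

    public boolean observe(float[] vector)
    {
        if (!allUnitLength)
            return false;

        double squaredSum = 0;
        for (float v : vector)
            squaredSum += (double) v * v;

        if (Math.abs(Math.sqrt(squaredSum) - 1.0) > EPSILON)
            allUnitLength = false;

        return allUnitLength;
    }
}
```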

Embedding normalization notes

(I asked ChatGPT to provide support for the config changes proposed in this PR. Here is its generated description.)

Quick rundown of which models spit out normalized vectors (so cosine == dot product, etc.):

  • OpenAI (ada-002, v3-small, v3-large) → already normalized. OpenAI FAQ literally says embeddings are unit-length.
  • BERT → depends. The SBERT “-cos-” models add a Normalize layer so they’re fine; vanilla BERT doesn’t.
  • Google Gecko → normalized out of the box per Vertex AI docs.
  • NVIDIA QA-4 → nothing in the NVIDIA NIM model card about normalization, so assume not normalized and handle it yourself.
  • Cohere v3 → normalization is not explicitly stated in their API docs, so assume not normalized.

TL;DR: OpenAI + Gecko are definitely safe, Cohere/BERT/NV need manual normalization due to lack of documentation.
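For the models in the "needs manual normalization" bucket, a step like the following sketch would make cosine similarity and dot product coincide. Again, this is just an illustration, not code from this PR.

```java
// Hypothetical helper: scales a vector to unit length so that cosine
// similarity and dot product agree. Not part of this PR's changes.
public static float[] normalize(float[] vector)
{
    double squaredSum = 0;
    for (float v : vector)
        squaredSum += (double) v * v;

    double norm = Math.sqrt(squaredSum);
    if (norm == 0)
        return vector; // all-zero vector: nothing sensible to normalize

    float[] result = new float[vector.length];
    for (int i = 0; i < vector.length; i++)
        result[i] = (float) (vector[i] / norm);
    return result;
}
```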


github-actions bot commented Oct 9, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version (this is unnecessary here)
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@michaeljmarshall
Member Author

VectorKeyRestrictedOnPartitionTest fails with the below message. It passes locally. I am classifying it as flaky.

Expecting actual:
  80
to be close to:
  89
by less than 10% but difference was 10.112359550561797%.
(a difference of exactly 10% being considered valid)


@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2059 rejected by Butler


1 regressions found
See build details here


Found 1 new test failures

Test Explanation Runs Upstream
o.a.c.index.sai.cql.VectorKeyRestrictedOnPartitionTest.partitionRestrictedWidePartitionBqCompressedTest[eb] (compression) REGRESSION 🔵🔴 0 / 9

No known test failures found


@eolivelli eolivelli left a comment


What happens if the user declares that their embeddings use a source_model with unit length vectors but this is not really the case?

This may happen by mistake, or when people are experimenting with the API.

@michaeljmarshall
Member Author

What happens if the user declares that their embeddings use a source_model with unit length vectors but this is not really the case?

If a user configures the wrong source model, we'll do other things wrong too. We use the source model to get optimizations that we can only apply because we know the model that produced the embeddings. I personally do not think we should plan for such scenarios, but if that is required, we can remove the model optimization for unit length check and just do the check on every inserted vector.

