
Conversation

@michaelsembwever
Member

https://github.com/riptano/cndb/issues/15777

Port into main-5.0 commit e967f51

CNDB-15777: CNDB-15640: Determine if vectors are unit length at insert (#2059)
Fixes: https://github.com/riptano/cndb/issues/15640

To lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit-length computation needs to move, so I decided to split it out into its own PR.

The core idea is that:
* some models are documented to produce unit-length vectors, and in those cases we can skip the computational check
* otherwise, we check at runtime until we hit a non-unit-length vector, after which we can skip the check and configure the `writePQ` method accordingly
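
A minimal sketch of the latching runtime check described above (class, method, and threshold are illustrative assumptions, not the actual PR code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: track whether every vector observed so far is unit
// length, and latch the check off permanently once a non-unit vector appears.
final class UnitLengthTracker
{
    // Tolerance for the floating-point norm comparison; the real threshold
    // used by the PR is an assumption here.
    private static final double EPSILON = 1e-4;

    private final AtomicBoolean allUnitLength = new AtomicBoolean(true);

    void observe(float[] vector)
    {
        if (!allUnitLength.get())
            return; // already saw a non-unit vector; skip the O(d) work from now on

        double normSquared = 0;
        for (float v : vector)
            normSquared += (double) v * v;

        if (Math.abs(normSquared - 1.0) > EPSILON)
            allUnitLength.set(false); // latch off; writePQ can be configured accordingly
    }

    boolean allUnitLength()
    {
        return allUnitLength.get();
    }
}
```

With this shape, a model like ada-002 never trips the latch, so each insert pays only one cheap dot product with itself; the first non-unit vector turns the check off for good.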

(I asked ChatGPT to gather supporting documentation for the config changes proposed in this PR. Here is its generated description.)

Quick rundown of which models spit out normalized vectors (so cosine == dot product, etc.):

* **OpenAI (ada-002, v3-small, v3-large)** → already normalized. [OpenAI FAQ](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) literally says embeddings are unit-length.
* **BERT** → depends. The SBERT “-cos-” models add a [`Normalize` layer](https://www.sbert.net/docs/package_reference/layers.html#normalize) so they’re fine; vanilla BERT doesn’t.
* **Google Gecko** → normalized out of the box per [Vertex AI docs](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).
* **NVIDIA QA-4** → nothing in the [NVIDIA NIM model card](https://docs.api.nvidia.com/nim/reference/nvidia-embed-qa-4) about normalization, so assume *not* normalized and handle it yourself.
* **Cohere v3** → normalization is not explicitly covered in their [API docs](https://docs.cohere.com/docs/cohere-embed), so assume *not* normalized.

TL;DR: OpenAI + Gecko are definitely safe; Cohere and NVIDIA need manual normalization because their docs don't guarantee it, and vanilla BERT is known not to normalize.
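
For the models that aren't documented as normalized, a client-side normalization step along these lines (illustrative only, not code from this PR) is enough to make cosine similarity reduce to a plain dot product:

```java
// Illustrative helper: scale a vector to unit length so cosine similarity
// equals the dot product. Names are hypothetical, not from the PR.
final class VectorMath
{
    static float[] normalize(float[] vector)
    {
        double normSquared = 0;
        for (float v : vector)
            normSquared += (double) v * v;

        double norm = Math.sqrt(normSquared);
        if (norm == 0)
            return vector.clone(); // a zero vector has no direction to preserve

        float[] unit = new float[vector.length];
        for (int i = 0; i < vector.length; i++)
            unit[i] = (float) (vector[i] / norm);
        return unit;
    }
}
```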


github-actions bot commented Nov 3, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one


sonarqubecloud bot commented Nov 3, 2025

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2097 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

| Test | Explanation | Runs | Upstream |
| --- | --- | --- | --- |
| paxos_test.TestPaxos.test_contention_many_threads (offheap-bti) | REGRESSION | 🔴🔵 | 0 / 14 |
| o.a.c.cql3.validation.operations.AggregationQueriesTest.testAggregationQueryShouldNotTimeoutWhenItExceedesReadTimeout (compression) | REGRESSION | 🔴🔴 | 2 / 14 |

Found 5 known test failures

