
Conversation

@michaelsembwever
Member

https://github.com/riptano/cndb/issues/15777

Port into main-5.0 commit e967f51

CNDB-15777: CNDB-15640: Determine if vectors are unit length at insert (#2059)
Fixes: https://github.com/riptano/cndb/issues/15640

To lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit-length computation needs to move, so I decided to split it out into its own PR.

The core idea is that:
* some models are documented to produce unit-length vectors, and in those cases we can skip the computational check
* otherwise, we check at runtime until we hit a non-unit-length vector, after which we can skip the check and configure the `writePQ` method accordingly
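
A minimal sketch of the latching runtime check described above (class, method, and threshold are illustrative assumptions, not the actual PR code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: track whether every vector observed so far is unit
// length, and latch the check off permanently once a non-unit vector appears.
final class UnitLengthTracker
{
    // Tolerance for the floating-point norm comparison; the real threshold
    // used by the PR is an assumption here.
    private static final double EPSILON = 1e-4;

    private final AtomicBoolean allUnitLength = new AtomicBoolean(true);

    void observe(float[] vector)
    {
        if (!allUnitLength.get())
            return; // already saw a non-unit vector; skip the O(d) work from now on

        double normSquared = 0;
        for (float v : vector)
            normSquared += (double) v * v;

        if (Math.abs(normSquared - 1.0) > EPSILON)
            allUnitLength.set(false); // latch off; writePQ can be configured accordingly
    }

    boolean allUnitLength()
    {
        return allUnitLength.get();
    }
}
```

With this shape, a model like ada-002 never trips the latch, so each insert pays only one cheap dot product with itself; the first non-unit vector turns the check off for good.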

(I asked ChatGPT to gather supporting documentation for the config changes proposed in this PR. Here is its generated description.)

Quick rundown of which models spit out normalized vectors (so cosine == dot product, etc.):

* **OpenAI (ada-002, v3-small, v3-large)** → already normalized. [OpenAI FAQ](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) literally says embeddings are unit-length.
* **BERT** → depends. The SBERT “-cos-” models add a [`Normalize` layer](https://www.sbert.net/docs/package_reference/layers.html#normalize) so they’re fine; vanilla BERT doesn’t.
* **Google Gecko** → normalized out of the box per [Vertex AI docs](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).
* **NVIDIA QA-4** → nothing in the [NVIDIA NIM model card](https://docs.api.nvidia.com/nim/reference/nvidia-embed-qa-4) about normalization, so assume *not* normalized and handle it yourself.
* **Cohere v3** → normalization is not explicitly covered in their [API docs](https://docs.cohere.com/docs/cohere-embed), so assume *not* normalized.

TL;DR: OpenAI + Gecko are definitely safe; Cohere and NVIDIA need manual normalization because their docs don't guarantee it, and vanilla BERT is known not to normalize.
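
For the models that aren't documented as normalized, a client-side normalization step along these lines (illustrative only, not code from this PR) is enough to make cosine similarity reduce to a plain dot product:

```java
// Illustrative helper: scale a vector to unit length so cosine similarity
// equals the dot product. Names are hypothetical, not from the PR.
final class VectorMath
{
    static float[] normalize(float[] vector)
    {
        double normSquared = 0;
        for (float v : vector)
            normSquared += (double) v * v;

        double norm = Math.sqrt(normSquared);
        if (norm == 0)
            return vector.clone(); // a zero vector has no direction to preserve

        float[] unit = new float[vector.length];
        for (int i = 0; i < vector.length; i++)
            unit[i] = (float) (vector[i] / norm);
        return unit;
    }
}
```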


github-actions bot commented Nov 3, 2025

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one


sonarqubecloud bot commented Nov 3, 2025

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2097 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

| Test | Explanation | Runs | Upstream |
| --- | --- | --- | --- |
| paxos_test.TestPaxos.test_contention_many_threads (offheap-bti) | REGRESSION | 🔴🔵 | 0 / 14 |
| o.a.c.cql3.validation.operations.AggregationQueriesTest.testAggregationQueryShouldNotTimeoutWhenItExceedesReadTimeout (compression) | REGRESSION | 🔴🔴 | 2 / 14 |

Found 5 known test failures

