
Preserve input memory location / dtype for NN Descent#1928

Open
jinsolp wants to merge 11 commits into rapidsai:main from jinsolp:nnd-keep-input-data-mem

Conversation

@jinsolp
Contributor

@jinsolp jinsolp commented Mar 18, 2026

Closes #1901

Previous Code

  • We almost always allocated device-side fp16 arrays. This was for...
    • allowing wmma usage
    • allowing data modification for CosineExpanded preprocessing

Current PR Changes

  • No logical changes apart from removing the dispatch of fp32 input to either fp32 or fp16 distance computation. That dispatch is gone; we now default to computing in the input type (e.g. fp32 stays fp32). The one exception is when compress_to_fp16=True and the input type is fp32, in which case we convert to fp16 to exploit wmma.
  • Reducing redundant memory:
    • We only allocate device-side arrays matching the input dtype if the input is not device-accessible (for fp32 input with compress_to_fp16=True, we allocate half-type arrays).
    • Remove the preprocessing step for the CosineExpanded metric (so that we no longer allocate an additional device-side data array) and do the computation inside the calculate_metric function instead.
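The dtype rule described above can be sketched as follows. This is a hypothetical illustration, not the actual cuVS implementation; the function and string tags are assumptions for the sake of the example:

```cpp
#include <string>

// Hypothetical sketch of the new rule: distances are computed in the input
// dtype, except that fp32 input is compressed to fp16 when the user opts in
// via compress_to_fp16 (to exploit wmma tensor-core instructions).
std::string distance_compute_dtype(const std::string& input_dtype,
                                   bool compress_to_fp16) {
  if (input_dtype == "fp32" && compress_to_fp16) {
    return "fp16";  // the single exception: opt-in compression of fp32
  }
  return input_dtype;  // default: keep fp32 as fp32, fp16 as fp16, ...
}
```

In other words, there is no longer any automatic fp32-vs-fp16 dispatch; only the explicit opt-in changes the compute type.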

Peak memory usage Changes

  • food data (5M x 384) = 7.25GiB

  • sports data (13M x 284) = 18.55GiB

  • Notice how for FP32->FP16 Device (meaning the data is already on device), the previous code allocated a new fp16 array, resulting in higher GPU memory usage. This PR instead converts to fp16 on-the-fly (at the cost of some conversion time) rather than allocating new fp16 memory.

[Image: performance_metrics]

Performance Changes

  • Conversion Overhead: On-the-fly conversion introduces negligible overhead.
  • Cosine Metric: Now reads L2 norms inside the calculate_metric function, aligning with the access pattern used by the L2 distance metric. This adds minimal overhead (e.g. previously 18.2937s vs. 18.7598s now for 5M x 384 data).

@jinsolp jinsolp self-assigned this Mar 18, 2026
@jinsolp jinsolp requested review from a team as code owners March 18, 2026 01:54
@jinsolp jinsolp added breaking Introduces a breaking change improvement Improves an existing functionality labels Mar 18, 2026
@jinsolp jinsolp marked this pull request as draft March 18, 2026 01:55
@copy-pr-bot

copy-pr-bot bot commented Mar 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@jinsolp jinsolp marked this pull request as ready for review March 20, 2026 00:15
@jinsolp jinsolp changed the title [WIP] Preserve input memory location / dtype for NN Descent Preserve input memory location / dtype for NN Descent Mar 20, 2026
size_t max_iterations;
float termination_threshold;
bool return_distances;
bool compress_to_fp16;
Member


I think we should remove this. If the user wants fp16, they should just convert the data themselves. The problem is that this flips the ownership model (albeit only temporarily, it still leads to unexpected behavior when we have to copy the data in fp16 form). Better if the user just converts it themselves. We could offer this for every index type, but it's not really necessary when the user could just convert the dtype and call the index-building process with it. Then they wouldn't have to deal with the additional copy in device memory at all.

Contributor Author


That could be an option, but if we do so, downstream ML algorithms will experience a slowdown.
For example, UMAP and HDBSCAN only support fp32, so users would be forced into a 2x slowdown in the knn computation step because NN Descent would always use fp32 distance computation.
Are we okay with this?

* performance and memory usage.
* - `NND_DIST_COMP_FP16`: Use fp16 distance computation.
*
* @deprecated To be removed in 26.08. Use cuvsNNDescentIndexParams_v6 with compress_to_fp16
Member


Is the compress_to_fp16 the only thing that's different between the old API and the new one? If that's the case, I suggest we remove the compress_to_fp16 option altogether and never copy the dataset. I think setting the distance type is useful, but I don't think copying the dataset is useful.

Contributor Author


Essentially the existing distance types (NND_DIST_COMP_x) and the new compress_to_fp16 do the same thing. I think the name could be a bit misleading, but it means "compress fp32 to fp16 to use fp16 distance computation". I'll change the name to use_fp16_dist_comp.

The reason it's changed from having 3 different dist comp options is because now the default behavior would be to use the original dtype.
Previously with the three distance computation types:

  • NND_DIST_COMP_AUTO: if fp32, dispatch to fp32 or fp16 computation depending on dim; no effect for other input dtypes.
  • NND_DIST_COMP_FP32: force fp32 input to fp32 distance computation
  • NND_DIST_COMP_FP16: force fp32 input to fp16 distance computation

Since now we want fp32 input to always compute distance in fp32, having the AUTO and the FP32 options doesn't make sense. So I decided to use a single boolean instead to decide whether to use fp32 distance computation OR fp16 distance computation for fp32 input.
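The collapse of the three-way enum into a single boolean can be sketched like this (hypothetical mapping function; the enum values follow the deprecated C API quoted above):

```cpp
// Old three-way option (deprecated in favor of a single boolean).
enum NNDDistComp { NND_DIST_COMP_AUTO, NND_DIST_COMP_FP32, NND_DIST_COMP_FP16 };

// Hypothetical sketch of the migration: only the FP16 option still changes
// behavior, since fp32 input now defaults to fp32 distance computation
// (so AUTO and FP32 both collapse into "use the input dtype").
bool to_use_fp16_dist_comp(NNDDistComp old_option) {
  return old_option == NND_DIST_COMP_FP16;
}
```

This makes the single-boolean design explicit: two of the three old options map to the same default behavior, so a bool carries exactly the remaining distinction.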


Labels

breaking Introduces a breaking change improvement Improves an existing functionality

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Compute distances in NN Descent kernels in native types

2 participants