Skip to content

[Cleanup] Combine Batched and Regular KMeans Impl#2015

Open
tarang-jain wants to merge 16 commits intorapidsai:mainfrom
tarang-jain:combine-batch
Open

[Cleanup] Combine Batched and Regular KMeans Impl#2015
tarang-jain wants to merge 16 commits intorapidsai:mainfrom
tarang-jain:combine-batch

Conversation

@tarang-jain
Copy link
Copy Markdown
Contributor

@tarang-jain tarang-jain commented Apr 10, 2026

Combine batched and regular k-means implementations

  • Unified the batched (host-data) and regular (device-data) k-means fit into a single kmeans_fit template that works with both host and device mdspans via batch_load_iterator
  • Unified the device and host initialization paths in init_centroids — both now use raft::matrix::sample_rows which handles host/device transparently
  • Removed the inertia_check parameter — inertia-based convergence checking now always runs. Zero clustering cost (perfect fit) logs a warning instead of asserting. This is needed because spectral clustering can cause all points to converge on the cluster centroids itself.
  • Added init_size parameter to control how many samples are drawn for KMeansPlusPlus initialization. Defaults to n_samples for device data, min(3 * n_clusters, n_samples) for host data
  • Replaced per-iteration centroid raft::copy with std::swap of buffer pointers
  • For streaming fit, precompute data norms once and cache them: host norms cached to a host buffer on the first iteration and copied back for subsequent iterations. process_batch no longer computes norms internally
  • Replaced raw cudaPointerGetAttributes call with raft::memory_type_from_pointer
  • Updated compute_weight_scale to use raft handle and mdspan-based raft::copy
  • Precompute centroid norms once per Lloyd iteration and pass to minClusterAndDistanceCompute via a new optional precomputed_centroid_norms parameter, avoiding redundant recomputation across batches

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tarang-jain tarang-jain self-assigned this Apr 10, 2026
@tarang-jain tarang-jain added improvement Improves an existing functionality non-breaking Introduces a non-breaking change cpp labels Apr 10, 2026
@tarang-jain tarang-jain marked this pull request as ready for review April 14, 2026 01:10
@tarang-jain tarang-jain requested review from a team as code owners April 14, 2026 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant