Bug: UMAP fails with cuDF DataFrames containing string node columns #765

@lmeyerov

Summary

UMAP with engine='cuml' fails when the graph's nodes are a cuDF DataFrame whose node ID column is string-typed. The error "String arrays are not supported by cupy" is raised before featurization begins, so the call fails regardless of the feature_engine setting.

Environment

  • PyGraphistry: v0.43.2
  • cuDF: 24.12.0
  • cuML: 24.12.0
  • Python: 3.10.16

Reproduction

import pandas as pd
import cudf
import graphistry

# Create graph with string node IDs
nodes_df = pd.DataFrame([
    ['A', 'town1', 'Park'],
    ['B', 'town1', 'Store'],
], columns=['name', 'town', 'type'])
edges_df = pd.DataFrame([['A', 'B']], columns=['src', 'dst'])

# Convert to cuDF
nodes_cudf = cudf.from_pandas(nodes_df)
edges_cudf = cudf.from_pandas(edges_df)

g = graphistry.nodes(nodes_cudf, 'name').edges(edges_cudf, 'src', 'dst')

# This fails with "String arrays are not supported by cupy"
g.umap(n_components=2, n_neighbors=2, engine='cuml')

Error Traceback

TypeError: String arrays are not supported by cupy
  File "graphistry/umap_utils.py", line 827, in umap
    nodes = res._nodes[res._node].values
  File "cudf/core/single_column_frame.py", line 109, in values
    return self._column.values
  File "cudf/core/column/string.py", line 5932, in values
    raise TypeError("String arrays are not supported by cupy")

Root Cause

Location: graphistry/umap_utils.py:827

nodes = res._nodes[res._node].values  # ❌ Fails when res._nodes is cuDF with string dtype

When res._nodes[res._node] is a cuDF Series with string dtype (dtype='object'), calling .values attempts to convert to a cupy array, which raises the error because cupy doesn't support string arrays.

This happens before featurization, so the issue is not in feature_engine resolution or featurize() - it's in extracting the node IDs themselves.
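To confirm that the failure is independent of PyGraphistry, the `.values` call can be probed on a bare cuDF Series. The sketch below is illustrative only: the function name `probe_cudf_string_values` is hypothetical, and the code returns `None` on machines where cuDF cannot be imported, so it is safe to run without a GPU.

```python
def probe_cudf_string_values():
    """Return the error message raised by .values on a cuDF string Series.

    Returns None when cuDF is not usable on this machine, and an empty
    string when the call unexpectedly succeeds.
    """
    try:
        import cudf  # needs a RAPIDS install and a CUDA-capable GPU
    except Exception:
        return None  # cuDF unavailable here, nothing to probe

    ids = cudf.Series(['A', 'B'])  # string node IDs, as in the reproduction
    try:
        ids.values  # asks cuDF for a cupy array
    except TypeError as err:
        return str(err)
    return ''  # no error: this cuDF version handled the call


message = probe_cudf_string_values()
print(message)
```

On the versions listed under Environment, the traceback above suggests the printed message would be the cupy string-array error. Other cuDF versions may behave differently.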

Proposed Fix

from inspect import getmodule  # only needed if umap_utils.py does not already import it

# Handle cuDF string columns that can't convert to cupy arrays directly
node_series = res._nodes[res._node]
if 'cudf' in str(getmodule(node_series)):
    # cuDF string columns (dtype='object') raise a TypeError on .values,
    # so convert to pandas first, then take the NumPy array
    if str(node_series.dtype) == 'object':
        logger.debug('Converting cuDF string column to pandas for node extraction')
        nodes = node_series.to_pandas().values
    else:
        # Non-string columns: copy the cupy array back to host memory
        import cupy as cp
        nodes = cp.asnumpy(node_series.values)
else:
    # pandas: .values is already a NumPy array
    nodes = node_series.values

Testing Evidence

Test 1: pandas DataFrame → UMAP cuml ✅ WORKS
Test 2: cuDF DataFrame → UMAP cuml ❌ FAILS (all feature_engine settings)

The fix converts the cuDF string node ID column to pandas before extracting its values. This avoids the cupy limitation, and because only the node IDs move to the host, the UMAP computation itself still runs on the GPU.
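For regression testing without a GPU, the conversion rule can be isolated into a small duck-typed helper. This is a minimal sketch, not the code in graphistry/umap_utils.py: the name `node_ids_to_host` is hypothetical, and it assumes cuDF objects can be recognized by the top-level module of their type and that cuDF reports string columns as dtype 'object', as described above.

```python
def node_ids_to_host(series):
    """Return node IDs as a host-side array, avoiding cupy for cuDF strings.

    Assumption: `series` is a pandas or cuDF Series. cuDF is detected by the
    module name of the object's type, so cudf itself is never imported here.
    """
    is_cudf = type(series).__module__.split('.')[0] == 'cudf'
    if not is_cudf:
        return series.values  # pandas: already a NumPy array
    if str(series.dtype) == 'object':
        # String column: .values would raise, so go through pandas instead
        return series.to_pandas().values
    # Numeric column: .values is a cupy array, .get() copies it to NumPy
    return series.values.get()
```

Because the helper never imports cudf or cupy, a unit test can exercise the string branch with a stub object whose type reports a `cudf` module.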

Impact

This affects any workflow using:

  • cuDF DataFrames as input
  • String-typed node ID columns (very common)
  • UMAP with engine='cuml' or engine='auto' (when cuML is available)
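A quick way to tell whether a given graph meets the first two conditions is sketched below. The function name `hits_string_id_bug` is hypothetical. It relies only on the `_nodes` and `_node` attributes that appear in the traceback above, and it does not inspect which engine will be selected.

```python
def hits_string_id_bug(g):
    """Return True when g has cuDF nodes whose ID column is string-typed.

    Assumption: g exposes `_nodes` (a DataFrame or None) and `_node`
    (the node ID column name or None), as PyGraphistry plotters do.
    """
    nodes, node_col = g._nodes, g._node
    if nodes is None or node_col is None:
        return False  # no bound node table, so nothing to extract
    if type(nodes).__module__.split('.')[0] != 'cudf':
        return False  # pandas input takes the working path
    # cuDF reports string columns with dtype 'object'
    return str(nodes[node_col].dtype) == 'object'
```

A caller could use this check to choose one of the workarounds below before invoking umap().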

Workaround (before fix)

Use engine='umap_learn' to force CPU mode:

g.umap(n_components=2, n_neighbors=2, engine='umap_learn')

Or convert to pandas before UMAP:

g_pandas = g.nodes(g._nodes.to_pandas())
g_pandas.umap(n_components=2, n_neighbors=2, engine='cuml')
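The first workaround can also be applied automatically, so that pipelines keep using the GPU wherever it works. The wrapper below is a sketch under stated assumptions: `umap_with_cpu_fallback` is a hypothetical name, and it matches on the exact error text quoted in this report, which may change between cuDF versions.

```python
CUPY_STRING_ERROR = 'String arrays are not supported by cupy'


def umap_with_cpu_fallback(g, **umap_kwargs):
    """Try g.umap with engine='cuml', retrying on CPU for the string-ID error.

    Any other exception, including unrelated TypeErrors, is re-raised.
    """
    try:
        return g.umap(engine='cuml', **umap_kwargs)
    except TypeError as err:
        if CUPY_STRING_ERROR not in str(err):
            raise  # a different problem, do not mask it
        # Known failure from this report: fall back to the CPU engine
        return g.umap(engine='umap_learn', **umap_kwargs)
```

With the graph from the reproduction, the call would be umap_with_cpu_fallback(g, n_components=2, n_neighbors=2).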
