Bug: UMAP fails with cuDF DataFrames containing string node columns
Summary
UMAP with engine='cuml' fails when the graph has cuDF DataFrames with string-typed node ID columns. The error "String arrays are not supported by cupy" occurs before featurization, so every feature_engine setting fails.
Environment
- PyGraphistry: v0.43.2
- cuDF: 24.12.0
- cuML: 24.12.0
- Python: 3.10.16
Reproduction
import pandas as pd
import cudf
import graphistry
# Create graph with string node IDs
nodes_df = pd.DataFrame([
    ['A', 'town1', 'Park'],
    ['B', 'town1', 'Store'],
], columns=['name', 'town', 'type'])
edges_df = pd.DataFrame([['A', 'B']], columns=['src', 'dst'])
# Convert to cuDF
nodes_cudf = cudf.from_pandas(nodes_df)
edges_cudf = cudf.from_pandas(edges_df)
g = graphistry.nodes(nodes_cudf, 'name').edges(edges_cudf, 'src', 'dst')
# This fails with "String arrays are not supported by cupy"
g.umap(n_components=2, n_neighbors=2, engine='cuml')
Error Traceback
TypeError: String arrays are not supported by cupy
File "graphistry/umap_utils.py", line 827, in umap
nodes = res._nodes[res._node].values
File "cudf/core/single_column_frame.py", line 109, in values
return self._column.values
File "cudf/core/column/string.py", line 5932, in values
raise TypeError("String arrays are not supported by cupy")
Root Cause
Location: graphistry/umap_utils.py:827
nodes = res._nodes[res._node].values # ❌ Fails when res._nodes is cuDF with string dtype
When res._nodes[res._node] is a cuDF Series with string dtype (dtype='object'), calling .values attempts to convert it to a cupy array, which raises the error because cupy doesn't support string arrays.
This happens before featurization, so the issue is not in feature_engine resolution or featurize() - it's in extracting the node IDs themselves.
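The failure can be isolated from PyGraphistry entirely. The sketch below assumes a working RAPIDS install; describe_values_access is a hypothetical helper introduced only for this illustration, and the cuDF part is skipped when cudf cannot be imported.

```python
def describe_values_access(series):
    """Try series.values and report the outcome.

    Returns (True, values) on success, or (False, message) when the
    accessor raises TypeError, as cuDF does for string columns.
    """
    try:
        return True, series.values
    except TypeError as exc:
        return False, str(exc)


try:
    import cudf  # needs a RAPIDS install and a CUDA-capable GPU
except Exception:  # missing package or any other import-time failure
    cudf = None

if cudf is not None:
    ids = cudf.Series(['A', 'B'])  # string node IDs, as in the reproduction
    ok, info = describe_values_access(ids)
    print('string .values succeeded?', ok, '|', info)

    # Going through pandas gives a host-side NumPy object array instead
    print(ids.to_pandas().values)
```

If this reproduces on the versions listed under Environment, the problem is confirmed to be independent of featurization and of graphistry itself.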
Proposed Fix
# Handle cuDF string columns that can't convert to cupy arrays directly
node_series = res._nodes[res._node]
if 'cudf' in str(getmodule(node_series)):
    # cuDF string columns (dtype='object') can't use .values (raises cupy error)
    # Convert to pandas first, then get numpy array
    if node_series.dtype == 'object' or str(node_series.dtype) == 'object':
        logger.debug('Converting cuDF string column to pandas for node extraction')
        nodes = node_series.to_pandas().values
    else:
        import cupy as cp
        nodes = cp.asnumpy(node_series.values)
else:
    nodes = node_series.values
Testing Evidence
Test 1: pandas DataFrame → UMAP cuml ✅ WORKS
Test 2: cuDF DataFrame → UMAP cuml ❌ FAILS (all feature_engine settings)
The fix converts cuDF string columns to pandas before extracting values, avoiding the cupy limitation while preserving GPU acceleration for the actual UMAP computation.
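A regression test for Test 2 could look roughly like the following. It is a sketch built from the reproduction above; the class and test names are hypothetical, and the test is skipped when cudf, cuml, or graphistry is not installed. It only checks that the call no longer raises.

```python
import importlib.util
import unittest

# True only when the full GPU stack can be found on this machine
HAS_GPU_STACK = all(
    importlib.util.find_spec(name) is not None
    for name in ('cudf', 'cuml', 'graphistry')
)


@unittest.skipUnless(HAS_GPU_STACK, 'requires cudf, cuml, and graphistry')
class CudfStringNodeIdUmapTest(unittest.TestCase):
    def test_umap_cuml_accepts_string_node_ids(self):
        import cudf
        import graphistry
        import pandas as pd

        nodes_df = pd.DataFrame(
            [['A', 'town1', 'Park'], ['B', 'town1', 'Store']],
            columns=['name', 'town', 'type'],
        )
        edges_df = pd.DataFrame([['A', 'B']], columns=['src', 'dst'])

        g = graphistry.nodes(cudf.from_pandas(nodes_df), 'name').edges(
            cudf.from_pandas(edges_df), 'src', 'dst'
        )

        # Before the fix this raised:
        # TypeError: String arrays are not supported by cupy
        g.umap(n_components=2, n_neighbors=2, engine='cuml')
```

Saved to a file, it can be run with python -m unittest on a machine with a GPU.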
Impact
This affects any workflow using:
- cuDF DataFrames as input
- String-typed node ID columns (very common)
- UMAP with engine='cuml' or engine='auto' (when cuML is available)
Workaround (before fix)
Use engine='umap_learn' to force CPU mode:
g.umap(n_components=2, n_neighbors=2, engine='umap_learn')
Or convert to pandas before UMAP:
g_pandas = g.nodes(g._nodes.to_pandas())
g_pandas.umap(n_components=2, n_neighbors=2, engine='cuml')