Description
UMAP fails with cryptic error when no numeric columns available
Environment
- PyGraphistry version: Latest (checked 2025-10-10)
- Python version: 3.8/3.10
- Engine: umap_learn (CPU mode)
Summary
When .umap() is called on a graph whose node table has only string/object columns and skrub is not installed, PyGraphistry drops all non-numeric columns, as intended. When that leaves zero features, however, the call fails with a cryptic sklearn error, ValueError: at least one array or dtype is required, instead of an actionable message that explains the situation and suggests solutions.
Minimal Reproduction
import pandas as pd
import graphistry

# Create graph with only string columns
nodes = pd.DataFrame([
    ['node_1', 'TypeA', 'CategoryX'],
    ['node_2', 'TypeB', 'CategoryY'],
    ['node_3', 'TypeA', 'CategoryX'],
    ['node_4', 'TypeC', 'CategoryZ'],
    ['node_5', 'TypeB', 'CategoryY'],
    ['node_6', 'TypeA', 'CategoryX'],
], columns=['id', 'type', 'category'])

edges = pd.DataFrame([
    ['node_1', 'node_2'],
    ['node_2', 'node_3'],
    ['node_3', 'node_4'],
    ['node_4', 'node_5'],
    ['node_5', 'node_6'],
], columns=['src', 'dst'])

g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')

# This fails with unclear error
g2 = g.umap(n_components=2, n_neighbors=3, engine='umap_learn')
Error Traceback
-*-*- DataFrame is not numeric and no skrub, dropping non-numeric
* Ignoring target column of shape (6, 0) in UMAP fit, as it is not one dimensional
Traceback (most recent call last):
File "/tmp/test.py", line 18, in umap
res = res._process_umap(
File ".../graphistry/umap_utils.py", line 573, in _process_umap
emb = res._umap_fit_transform(X_, y_, umap_fit_kwargs, umap_transform_kwargs)
File ".../graphistry/umap_utils.py", line 376, in _umap_fit_transform
self.umap_fit(X, y, umap_fit_kwargs)
File ".../graphistry/umap_utils.py", line 353, in umap_fit
self._umap.fit(X, y, **umap_fit_kwargs)
File ".../umap/umap_.py", line 2372, in fit
X = check_array(...)
File ".../sklearn/utils/validation.py", line 778, in check_array
dtype_orig = np.result_type(*dtypes_orig)
ValueError: at least one array or dtype is required
Root Cause
File: graphistry/feature_utils.py (lines 1030-1033)
PyGraphistry's intended behavior when feature_engine='auto' and skrub is not installed:
- Drop all non-numeric columns with the warning "DataFrame is not numeric and no skrub, dropping non-numeric"
- Continue with numeric columns only
The bug: When ALL columns are non-numeric, the code drops everything but doesn't validate that features remain:
# feature_utils.py:1030-1033
elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")
    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
    # numeric_ndf can be empty (0 columns)! No validation here.
When numeric_ndf is empty, it is passed through to UMAP's fit, where sklearn's check_array fails with:
ValueError: at least one array or dtype is required
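The failure can be reproduced without PyGraphistry or UMAP. The sketch below uses a small made-up node table (not the one from the reproduction) to show that select_dtypes leaves a frame with rows but zero columns. It then calls np.result_type on the resulting empty list of dtypes, which appears to be what the check_array line in the traceback ends up doing.

```python
import numpy as np
import pandas as pd

# Illustrative node table containing only string columns
nodes = pd.DataFrame({
    "id": ["node_1", "node_2", "node_3"],
    "type": ["TypeA", "TypeB", "TypeA"],
})

# Same selection as in feature_utils.py: keep numeric columns only
numeric_ndf = nodes.select_dtypes(include=[np.number])
print(numeric_ndf.shape)  # rows survive, columns do not

# With no columns there are no dtypes to combine, so np.result_type
# is called with no arguments at all
dtypes = list(numeric_ndf.dtypes)
try:
    np.result_type(*dtypes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Because the frame still has rows, a check on len(numeric_ndf) would not catch this case; the column count is what needs validating.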
Expected Behavior
Should raise a clear, actionable error message that:
- Explains what happened (zero numeric columns)
- Explains why (skrub not available)
- Provides actionable solutions
raise ValueError(
    f"UMAP requires numeric features for dimensionality reduction. "
    f"All {original_column_count} columns were non-numeric (dtype=object) and dropped.\n\n"
    "To fix this, you can:\n"
    "1. Install skrub for automatic categorical encoding:\n"
    " pip install skrub\n"
    " Then use: g.umap(feature_engine='auto') # or 'skrub'\n\n"
    "2. Add numeric feature columns to your DataFrame\n\n"
    "3. Specify feature columns explicitly:\n"
    " g.umap(X=['numeric_col1', 'numeric_col2'])\n\n"
    "4. Pre-encode categorical data using sklearn or pandas:\n"
    " from sklearn.preprocessing import LabelEncoder\n"
    " nodes['category_encoded'] = LabelEncoder().fit_transform(nodes['category'])"
)
Suggested Fix
Location: graphistry/feature_utils.py, after line 1033
Add validation after selecting numeric columns:
elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")
    # ADD THIS VALIDATION:
    if len(numeric_ndf.columns) == 0:
        raise ValueError(
            f"UMAP requires numeric features for dimensionality reduction. "
            f"All {len(ndf.columns)} columns were non-numeric (dtype=object) and dropped.\n\n"
            "To fix this, you can:\n"
            "1. Install skrub for automatic categorical encoding:\n"
            " pip install skrub\n"
            " Then use: g.umap(feature_engine='auto') or g.umap(feature_engine='skrub')\n\n"
            "2. Add numeric feature columns to your DataFrame\n\n"
            "3. Specify feature columns explicitly: g.umap(X=['col1', 'col2'])\n\n"
            "4. Pre-encode categorical data using sklearn or pandas"
        )
    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
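The validation can be tried out in isolation before touching feature_utils.py. The following is a minimal sketch, assuming a standalone helper named select_numeric_or_raise; that name and the shortened message are hypothetical and not part of PyGraphistry.

```python
import numpy as np
import pandas as pd


def select_numeric_or_raise(ndf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep numeric columns, fail early if none remain."""
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    if numeric_ndf.shape[1] == 0:
        raise ValueError(
            "UMAP requires numeric features, but all "
            f"{ndf.shape[1]} columns were non-numeric and dropped. "
            "Install skrub, add numeric columns, or pre-encode categorical data."
        )
    return numeric_ndf


# Illustrative inputs: one frame with only strings, one with a numeric column
strings_only = pd.DataFrame({"type": ["A", "B"], "category": ["X", "Y"]})
mixed = strings_only.assign(score=[0.1, 0.9])

# The mixed frame keeps its numeric column
print(list(select_numeric_or_raise(mixed).columns))

# The strings-only frame now fails with the actionable message
try:
    select_numeric_or_raise(strings_only)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

Raising before get_numeric_transformers is called keeps the error next to the code that dropped the columns, so the traceback points at the cause, not at sklearn.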
Workaround
Until this is fixed, users encountering this error should:
Option 1: Install skrub for automatic encoding
pip install skrub
g2 = g.umap(n_components=2, n_neighbors=3, feature_engine='auto')
Option 2: Add numeric features
# Manual encoding
nodes['type_encoded'] = nodes['type'].astype('category').cat.codes
nodes['category_encoded'] = nodes['category'].astype('category').cat.codes
g = graphistry.nodes(nodes, 'id')
g2 = g.umap(n_components=2, n_neighbors=3)
Option 3: Pre-encode with sklearn
from sklearn.preprocessing import LabelEncoder
for col in ['type', 'category']:
nodes[f'{col}_encoded'] = LabelEncoder().fit_transform(nodes[col])
g = graphistry.nodes(nodes, 'id')
g2 = g.umap(n_components=2, n_neighbors=3)
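Integer codes from cat.codes or LabelEncoder impose an arbitrary order on the categories, which distance-based methods such as UMAP will treat as meaningful. One-hot encoding is a possible alternative; whether it suits a given dataset is a judgment call. The sketch below uses pandas' get_dummies on an illustrative node table. Passing dtype=float is deliberate: recent pandas versions produce boolean indicator columns by default, and select_dtypes(include=[np.number]) does not count booleans as numeric.

```python
import numpy as np
import pandas as pd

# Illustrative node table with only string columns
nodes = pd.DataFrame({
    "id": ["node_1", "node_2", "node_3"],
    "type": ["TypeA", "TypeB", "TypeA"],
    "category": ["CategoryX", "CategoryY", "CategoryX"],
})

# One indicator column per category value, stored as floats
one_hot = pd.get_dummies(nodes[["type", "category"]], dtype=float)
encoded_nodes = pd.concat([nodes, one_hot], axis=1)

# These are the columns that survive the numeric-only selection
numeric_cols = encoded_nodes.select_dtypes(include=[np.number]).columns
print(list(numeric_cols))
```

The resulting encoded_nodes frame can then be passed to graphistry.nodes(encoded_nodes, 'id') in the same way as in the options above.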
Impact
Severity: Medium (UX issue, not a functional bug)
- Poor user experience - cryptic sklearn error instead of helpful guidance
- Users unfamiliar with the feature_engine system may struggle to understand what went wrong
- Blocks usage of UMAP on purely categorical data without clear guidance on solutions
- The actual functionality (dropping non-numeric columns) is working as designed
Scope:
- Affects users calling .umap() on graphs with zero numeric node features
  - Particularly impacts users working with purely categorical/qualitative data
- Only affects environments where skrub is not installed
  - With skrub installed, categorical encoding happens automatically (no error)
Frequency: Uncommon but frustrating when encountered
- Most users have at least some numeric features
- Users who encounter it would benefit greatly from better error messaging
Related Issues
None directly related. This is specifically about improving error messaging when PyGraphistry's intentional behavior (dropping non-numeric columns without skrub) results in zero features.