UMAP fails with cryptic error when no numeric columns available #770

@lmeyerov

Environment

  • PyGraphistry version: Latest (checked 2025-10-10)
  • Python version: 3.8/3.10
  • Engine: umap_learn (CPU mode)

Summary

When .umap() is called on a graph whose node columns are all string/object typed and skrub is not installed, PyGraphistry drops every non-numeric column, as intended. When that leaves zero features, however, the call fails with the cryptic sklearn error ValueError: at least one array or dtype is required, rather than a clear, actionable message that explains the situation and suggests fixes.

Minimal Reproduction

import pandas as pd
import graphistry

# Create graph with only string columns
nodes = pd.DataFrame([
    ['node_1', 'TypeA', 'CategoryX'],
    ['node_2', 'TypeB', 'CategoryY'],
    ['node_3', 'TypeA', 'CategoryX'],
    ['node_4', 'TypeC', 'CategoryZ'],
    ['node_5', 'TypeB', 'CategoryY'],
    ['node_6', 'TypeA', 'CategoryX'],
], columns=['id', 'type', 'category'])

edges = pd.DataFrame([
    ['node_1', 'node_2'],
    ['node_2', 'node_3'],
    ['node_3', 'node_4'],
    ['node_4', 'node_5'],
    ['node_5', 'node_6'],
], columns=['src', 'dst'])

g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')

# This fails with unclear error
g2 = g.umap(n_components=2, n_neighbors=3, engine='umap_learn')

Error Traceback

-*-*- DataFrame is not numeric and no skrub, dropping non-numeric
* Ignoring target column of shape (6, 0) in UMAP fit, as it is not one dimensional
Traceback (most recent call last):
  File "/tmp/test.py", line 18, in umap
    res = res._process_umap(
  File ".../graphistry/umap_utils.py", line 573, in _process_umap
    emb = res._umap_fit_transform(X_, y_, umap_fit_kwargs, umap_transform_kwargs)
  File ".../graphistry/umap_utils.py", line 376, in _umap_fit_transform
    self.umap_fit(X, y, umap_fit_kwargs)
  File ".../graphistry/umap_utils.py", line 353, in umap_fit
    self._umap.fit(X, y, **umap_fit_kwargs)
  File ".../umap/umap_.py", line 2372, in fit
    X = check_array(...)
  File ".../sklearn/utils/validation.py", line 778, in check_array
    dtype_orig = np.result_type(*dtypes_orig)
ValueError: at least one array or dtype is required

Root Cause

File: graphistry/feature_utils.py (lines 1030-1033)

PyGraphistry's intended behavior when feature_engine='auto' and skrub is not installed:

  • Drop all non-numeric columns with warning: "DataFrame is not numeric and no skrub, dropping non-numeric"
  • Continue with numeric columns only

The bug: When ALL columns are non-numeric, the code drops everything but doesn't validate that features remain:

# feature_utils.py:1030-1033
elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")
    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
    # numeric_ndf can be empty (0 columns)! No validation here.

When numeric_ndf is empty, it is passed on to umap-learn's UMAP.fit, whose call to sklearn's check_array fails with:

ValueError: at least one array or dtype is required
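
The empty feature matrix can be observed without graphistry or umap-learn installed. The following is a minimal sketch, assuming only pandas and NumPy, that applies the same select_dtypes call to a shortened copy of the repro's node table:

```python
import numpy as np
import pandas as pd

# Shortened copy of the repro's node table: string columns only
nodes = pd.DataFrame(
    [
        ["node_1", "TypeA", "CategoryX"],
        ["node_2", "TypeB", "CategoryY"],
        ["node_3", "TypeA", "CategoryX"],
    ],
    columns=["id", "type", "category"],
)

# Same selection as the snippet above: keep only numeric dtypes
numeric_ndf = nodes.select_dtypes(include=[np.number])

# The rows survive, but no columns do
print(numeric_ndf.shape)
print(len(numeric_ndf.columns))
```

The result keeps its three rows but has zero columns, which matches the (6, 0) shape reported in the log output for the six-row repro.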

Expected Behavior

Should raise a clear, actionable error message that:

  1. Explains what happened (zero numeric columns)
  2. Explains why (skrub not available)
  3. Provides actionable solutions

For example (original_column_count stands for the number of columns before the drop):

raise ValueError(
    "UMAP requires numeric features for dimensionality reduction. "
    f"All {original_column_count} columns were non-numeric (dtype=object) and dropped.\n\n"
    "To fix this, you can:\n"
    "1. Install skrub for automatic categorical encoding:\n"
    "   pip install skrub\n"
    "   Then use: g.umap(feature_engine='auto')  # or 'skrub'\n\n"
    "2. Add numeric feature columns to your DataFrame\n\n"
    "3. Specify feature columns explicitly:\n"
    "   g.umap(X=['numeric_col1', 'numeric_col2'])\n\n"
    "4. Pre-encode categorical data using sklearn or pandas:\n"
    "   from sklearn.preprocessing import LabelEncoder\n"
    "   nodes['category_encoded'] = LabelEncoder().fit_transform(nodes['category'])"
)

Suggested Fix

Location: graphistry/feature_utils.py, inside the branch at lines 1030-1033, before the call to get_numeric_transformers

Add validation after selecting numeric columns:

elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")

    # ADD THIS VALIDATION:
    if len(numeric_ndf.columns) == 0:
        raise ValueError(
            "UMAP requires numeric features for dimensionality reduction. "
            f"All {len(ndf.columns)} columns were non-numeric (dtype=object) and dropped.\n\n"
            "To fix this, you can:\n"
            "1. Install skrub for automatic categorical encoding:\n"
            "   pip install skrub\n"
            "   Then use: g.umap(feature_engine='auto') or g.umap(feature_engine='skrub')\n\n"
            "2. Add numeric feature columns to your DataFrame\n\n"
            "3. Specify feature columns explicitly: g.umap(X=['col1', 'col2'])\n\n"
            "4. Pre-encode categorical data using sklearn or pandas"
        )

    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
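
The guard can be exercised in isolation before touching feature_utils.py. The sketch below is an assumption-laden stand-in, not PyGraphistry code: select_numeric_or_raise is a hypothetical helper that mirrors only the column selection and the proposed check, with a shortened error message.

```python
import numpy as np
import pandas as pd


def select_numeric_or_raise(ndf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of PyGraphistry).

    Keeps numeric columns and fails early if none remain.
    """
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    if len(numeric_ndf.columns) == 0:
        raise ValueError(
            "UMAP requires numeric features; "
            f"all {len(ndf.columns)} columns were non-numeric and dropped."
        )
    return numeric_ndf


# Case 1: only string columns, so the guard should raise
strings_only = pd.DataFrame({"type": ["TypeA", "TypeB"], "category": ["X", "Y"]})
try:
    select_numeric_or_raise(strings_only)
    raised = False
    message = ""
except ValueError as err:
    raised = True
    message = str(err)

# Case 2: one numeric column, so it should pass through unchanged
mixed = pd.DataFrame({"type": ["TypeA", "TypeB"], "degree": [3, 1]})
kept = list(select_numeric_or_raise(mixed).columns)

print(raised, message)
print(kept)
```

A regression test for the real fix could follow the same two cases: a strings-only frame must raise the new ValueError, and a mixed frame must still reach get_numeric_transformers with its numeric columns.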

Workaround

Until this is fixed, users who hit this error can use any one of the following options:

Option 1: Install skrub for automatic encoding

# In a shell:
pip install skrub

# Then, in Python:
g2 = g.umap(n_components=2, n_neighbors=3, feature_engine='auto')

Option 2: Add numeric features

# Manual encoding
nodes['type_encoded'] = nodes['type'].astype('category').cat.codes
nodes['category_encoded'] = nodes['category'].astype('category').cat.codes

# Rebind the updated nodes; keep the original edges on the graph
g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')
g2 = g.umap(n_components=2, n_neighbors=3)

Option 3: Pre-encode with sklearn

from sklearn.preprocessing import LabelEncoder

for col in ['type', 'category']:
    nodes[f'{col}_encoded'] = LabelEncoder().fit_transform(nodes[col])

# Rebind the updated nodes; keep the original edges on the graph
g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')
g2 = g.umap(n_components=2, n_neighbors=3)
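
One caveat on Options 2 and 3: integer codes give the categories an arbitrary numeric order, and a distance-based method such as UMAP will treat that order as meaningful. One-hot encoding avoids this. The following is a minimal sketch using pandas' get_dummies on a shortened copy of the repro's node table; how PyGraphistry treats the remaining string columns is assumed to match the drop-with-warning behavior described above.

```python
import pandas as pd

# Shortened copy of the repro's node table
nodes = pd.DataFrame(
    [
        ["node_1", "TypeA", "CategoryX"],
        ["node_2", "TypeB", "CategoryY"],
        ["node_3", "TypeA", "CategoryX"],
    ],
    columns=["id", "type", "category"],
)

# One 0/1 float column per distinct value of each categorical column
one_hot = pd.get_dummies(nodes[["type", "category"]], dtype=float)

# Keep the original columns alongside the new numeric ones
nodes_encoded = pd.concat([nodes, one_hot], axis=1)

print(list(one_hot.columns))
```

The resulting nodes_encoded frame can then be bound with graphistry.nodes(nodes_encoded, 'id') and passed to .umap() as in the options above.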

Impact

Severity: Medium (UX issue, not a functional bug)

  • Poor user experience - cryptic sklearn error instead of helpful guidance
  • Users unfamiliar with the feature_engine system may struggle to understand what went wrong
  • Blocks usage of UMAP on purely categorical data without clear guidance on solutions
  • The actual functionality (dropping non-numeric columns) is working as designed

Scope:

  • Affects users calling .umap() on graphs with zero numeric node features
  • Particularly impacts users working with purely categorical/qualitative data
  • Only affects environments where skrub is not installed
  • With skrub installed, categorical encoding happens automatically (no error)

Frequency: Uncommon but frustrating when encountered

  • Most users have at least some numeric features
  • Users who encounter it would benefit greatly from better error messaging

Related Issues

None directly related. This is specifically about improving error messaging when PyGraphistry's intentional behavior (dropping non-numeric columns without skrub) results in zero features.
