UMAP fails with cryptic error when no numeric columns available #770

# UMAP fails with cryptic error when no numeric columns available

## Environment
- **PyGraphistry version**: Latest (checked 2025-10-10)
- **Python version**: 3.8/3.10
- **Engine**: `umap_learn` (CPU mode)

## Summary
When `.umap()` is called on a graph whose node table has only string/object columns and `skrub` is not installed, PyGraphistry drops all non-numeric columns, as intended. When that leaves zero features, however, the call fails with `ValueError: at least one array or dtype is required`, raised inside sklearn's input validation, instead of an actionable error message that explains the situation and suggests solutions.

## Minimal Reproduction

```python
import pandas as pd
import graphistry

# Create graph with only string columns
nodes = pd.DataFrame([
    ['node_1', 'TypeA', 'CategoryX'],
    ['node_2', 'TypeB', 'CategoryY'],
    ['node_3', 'TypeA', 'CategoryX'],
    ['node_4', 'TypeC', 'CategoryZ'],
    ['node_5', 'TypeB', 'CategoryY'],
    ['node_6', 'TypeA', 'CategoryX'],
], columns=['id', 'type', 'category'])

edges = pd.DataFrame([
    ['node_1', 'node_2'],
    ['node_2', 'node_3'],
    ['node_3', 'node_4'],
    ['node_4', 'node_5'],
    ['node_5', 'node_6'],
], columns=['src', 'dst'])

g = graphistry.nodes(nodes, 'id').edges(edges, 'src', 'dst')

# This fails with unclear error
g2 = g.umap(n_components=2, n_neighbors=3, engine='umap_learn')
```

## Error Traceback

```
-*-*- DataFrame is not numeric and no skrub, dropping non-numeric
* Ignoring target column of shape (6, 0) in UMAP fit, as it is not one dimensional
Traceback (most recent call last):
  File "/tmp/test.py", line 18, in umap
    res = res._process_umap(
  File ".../graphistry/umap_utils.py", line 573, in _process_umap
    emb = res._umap_fit_transform(X_, y_, umap_fit_kwargs, umap_transform_kwargs)
  File ".../graphistry/umap_utils.py", line 376, in _umap_fit_transform
    self.umap_fit(X, y, umap_fit_kwargs)
  File ".../graphistry/umap_utils.py", line 353, in umap_fit
    self._umap.fit(X, y, **umap_fit_kwargs)
  File ".../umap/umap_.py", line 2372, in fit
    X = check_array(...)
  File ".../sklearn/utils/validation.py", line 778, in check_array
    dtype_orig = np.result_type(*dtypes_orig)
ValueError: at least one array or dtype is required
```

## Root Cause

**File**: `graphistry/feature_utils.py` (lines 1030-1033)

PyGraphistry's **intended behavior** when `feature_engine='auto'` and `skrub` is not installed:
- Drop all non-numeric columns with warning: `"DataFrame is not numeric and no skrub, dropping non-numeric"`
- Continue with numeric columns only

**The bug**: When ALL columns are non-numeric, the code drops everything but doesn't validate that features remain:

```python
# feature_utils.py:1030-1033
elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")
    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
    # numeric_ndf can be empty (0 columns)! No validation here.
```

When `numeric_ndf` is empty, it is passed on to umap-learn's `UMAP.fit`, where sklearn's `check_array` input validation fails with:
```
ValueError: at least one array or dtype is required
```
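
Both halves of the failure can be reproduced without PyGraphistry or UMAP. The sketch below assumes only pandas and NumPy. The direct `np.result_type(*dtypes_orig)` call stands in for the line the traceback shows inside `check_array`; it is an illustration of the mechanism, not a copy of the sklearn code.

```python
import numpy as np
import pandas as pd

# All-string node table, similar to the reproduction above
ndf = pd.DataFrame({
    "type": ["TypeA", "TypeB", "TypeA"],
    "category": ["CategoryX", "CategoryY", "CategoryX"],
})

# Selecting numeric columns keeps the rows but leaves zero columns
numeric_ndf = ndf.select_dtypes(include=[np.number])
print(numeric_ndf.shape)

# With no columns there are no dtypes to combine,
# so np.result_type ends up being called with no arguments
dtypes_orig = list(numeric_ndf.dtypes)
try:
    np.result_type(*dtypes_orig)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The empty column set, rather than anything UMAP-specific, is what triggers the error, which is why a check on the column count is enough to catch it early.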

## Expected Behavior

Should raise a clear, actionable error message that:
1. Explains what happened (zero numeric columns)
2. Explains why (skrub not available)
3. Provides actionable solutions

```python
# ndf is the feature DataFrame before non-numeric columns are dropped
raise ValueError(
    "UMAP requires numeric features for dimensionality reduction. "
    f"All {len(ndf.columns)} columns were non-numeric and were dropped "
    "because skrub encoding is unavailable or disabled.\n\n"
    "To fix this, you can:\n"
    "1. Install skrub for automatic categorical encoding:\n"
    "   pip install skrub\n"
    "   Then use: g.umap(feature_engine='auto')  # or 'skrub'\n\n"
    "2. Add numeric feature columns to your DataFrame\n\n"
    "3. Specify feature columns explicitly:\n"
    "   g.umap(X=['numeric_col1', 'numeric_col2'])\n\n"
    "4. Pre-encode categorical data using sklearn or pandas:\n"
    "   from sklearn.preprocessing import LabelEncoder\n"
    "   nodes['category_encoded'] = LabelEncoder().fit_transform(nodes['category'])"
)
```

## Suggested Fix

**Location**: `graphistry/feature_utils.py`, in the branch at lines 1030-1033, before the call to `get_numeric_transformers`

**Add validation after selecting numeric columns**:

```python
elif not all_numeric and (not has_skrub or feature_engine in ["pandas", "none"]):
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    logger.warning("-*-*- DataFrame is not numeric and no skrub, dropping non-numeric")

    # ADD THIS VALIDATION:
    if len(numeric_ndf.columns) == 0:
        raise ValueError(
            "UMAP requires numeric features for dimensionality reduction. "
            f"All {len(ndf.columns)} columns were non-numeric and were dropped.\n\n"
            "To fix this, you can:\n"
            "1. Install skrub for automatic categorical encoding:\n"
            "   pip install skrub\n"
            "   Then use: g.umap(feature_engine='auto') or g.umap(feature_engine='skrub')\n\n"
            "2. Add numeric feature columns to your DataFrame\n\n"
            "3. Specify feature columns explicitly: g.umap(X=['col1', 'col2'])\n\n"
            "4. Pre-encode categorical data using sklearn or pandas"
        )

    X_enc, _, data_encoder, _ = get_numeric_transformers(numeric_ndf, None)
```
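
The check can also be exercised in isolation. The sketch below wraps it in a hypothetical helper, `select_numeric_or_raise`, which does not exist in PyGraphistry and uses a shortened message; it only illustrates that the guard lets mixed tables through and stops all-string tables before any encoder or UMAP call.

```python
import numpy as np
import pandas as pd


def select_numeric_or_raise(ndf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep numeric columns, fail early if none remain."""
    numeric_ndf = ndf.select_dtypes(include=[np.number])
    if numeric_ndf.shape[1] == 0:
        raise ValueError(
            "UMAP requires numeric features for dimensionality reduction. "
            f"All {len(ndf.columns)} columns were non-numeric and were dropped."
        )
    return numeric_ndf


# Mixed table: the numeric column survives
mixed = pd.DataFrame({"type": ["TypeA", "TypeB"], "degree": [1, 2]})
kept_columns = list(select_numeric_or_raise(mixed).columns)

# All-string table: the check raises instead of returning an empty frame
strings_only = pd.DataFrame({"type": ["TypeA", "TypeB"]})
try:
    select_numeric_or_raise(strings_only)
    raised = False
except ValueError:
    raised = True
```

A regression test for the real fix could follow the same two cases: one mixed table that still embeds, and one all-string table that raises the new `ValueError`.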

## Workaround

Until this is fixed, users who hit this error can use any one of the following options:

**Option 1: Install skrub for automatic encoding**
```bash
pip install skrub
```
```python
g2 = g.umap(n_components=2, n_neighbors=3, feature_engine='auto')
```

**Option 2: Encode categorical columns with pandas**
```python
# Map each distinct value to an integer code
nodes['type_encoded'] = nodes['type'].astype('category').cat.codes
nodes['category_encoded'] = nodes['category'].astype('category').cat.codes

g = graphistry.nodes(nodes, 'id')
g2 = g.umap(n_components=2, n_neighbors=3)
```

**Option 3: Pre-encode with sklearn**
```python
from sklearn.preprocessing import LabelEncoder

for col in ['type', 'category']:
    nodes[f'{col}_encoded'] = LabelEncoder().fit_transform(nodes[col])

g = graphistry.nodes(nodes, 'id')
g2 = g.umap(n_components=2, n_neighbors=3)
```
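
A variant of Options 2 and 3: integer codes give the categories an arbitrary numeric order, which distance-based methods such as UMAP may treat as meaningful. One-hot encoding avoids that at the cost of more columns. The sketch below uses `pd.get_dummies` on the same node table; the `features` name is only illustrative.

```python
import pandas as pd

nodes = pd.DataFrame({
    "id": ["node_1", "node_2", "node_3", "node_4", "node_5", "node_6"],
    "type": ["TypeA", "TypeB", "TypeA", "TypeC", "TypeB", "TypeA"],
    "category": ["CategoryX", "CategoryY", "CategoryX",
                 "CategoryZ", "CategoryY", "CategoryX"],
})

# One indicator column per distinct value, stored as floats
onehot = pd.get_dummies(nodes[["type", "category"]], dtype=float)

# Keep the id column alongside the numeric indicator columns
features = pd.concat([nodes[["id"]], onehot], axis=1)
print(features.columns.tolist())
```

`features` could then be bound with `graphistry.nodes(features, 'id')` and passed to `.umap()` in the same way as in the options above.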

## Impact

**Severity**: Medium (UX issue, not a functional bug)
- Poor user experience: a cryptic sklearn error instead of helpful guidance
- Users unfamiliar with the feature_engine system may struggle to understand what went wrong
- Blocks usage of UMAP on purely categorical data without clear guidance on solutions
- The actual functionality (dropping non-numeric columns) is **working as designed**

**Scope**:
- Affects users calling `.umap()` on graphs with zero numeric node features
- Particularly impacts users working with purely categorical/qualitative data
- Only affects environments where `skrub` is not installed
- With `skrub` installed, categorical encoding happens automatically (no error)

**Frequency**: Uncommon but frustrating when encountered
- Most users have at least some numeric features
- Users who encounter it would benefit greatly from better error messaging

## Related Issues

None directly related. This is specifically about improving error messaging when PyGraphistry's intentional behavior (dropping non-numeric columns without skrub) results in zero features.

