Skip to content

Explore minimal GSL C extension to replace rb-gsl dependency #87

@cardmagic

Description

@cardmagic

Summary

The classifier gem currently depends on rb-gsl for 10x faster LSI operations, but rb-gsl is outdated and difficult to maintain. This issue implements a zero-dependency C extension by porting the existing Ruby SVD implementation to C.

Benefits

  • No install requirements for users
  • Simple cross-platform support
  • Full control over the code
  • ~5-10x faster than pure Ruby
  • Algorithm already works (just translating Ruby → C)

Execution Plan

Phase 1: Project Setup

1.1 Create extension directory structure
```
ext/classifier/
├── extconf.rb
├── classifier_ext.c # Main entry point, Ruby bindings
├── linalg.h # Shared header
├── vector.c # Vector implementation
├── matrix.c # Matrix implementation
└── svd.c # Jacobi SVD implementation
```

1.2 Create extconf.rb

  • Basic mkmf configuration
  • No external library detection needed
  • Set appropriate compiler flags (-O3, -ffast-math)

1.3 Update gemspec

  • Add `s.extensions = ['ext/classifier/extconf.rb']`
  • Add `ext/**/*` to file list
  • Add rake-compiler as dev dependency

1.4 Update Rakefile

  • Add compile task for development
  • Ensure tests run against compiled extension

Phase 2: Vector Implementation

2.1 Core Vector struct and allocation
```c
typedef struct {
size_t size;
double *data;
} ClassifierVector;
```

2.2 Vector methods to implement

Method Description
`alloc(size)` Allocate zero-filled vector
`alloc(array)` Allocate from Ruby array
`[]` / `[]=` Element access
`size` Return size
`to_a` Convert to Ruby array
`sum` Sum all elements
`each` Yield each element
`collect` Map with block, return new vector
`normalize` Return unit vector (v /
`row` / `col` Return self (orientation marker)
`*` Dot product (vector) or scale (scalar)

2.3 Memory management

  • Use Ruby's TypedData API for proper GC integration
  • Implement `_dump` / `_load` for Marshal support

Phase 3: Matrix Implementation

3.1 Core Matrix struct
```c
typedef struct {
size_t rows;
size_t cols;
double *data; // Row-major storage
} ClassifierMatrix;
```

3.2 Matrix methods to implement

Method Description
`alloc(*rows)` Allocate from nested arrays
`[]` / `[]=` Element access
`size` Return [rows, cols]
`row_size` / `column_size` Dimensions
`trans` Return transposed copy
`column(n)` Return column as Vector
`diag(vector)` Class method: diagonal matrix from vector
`*` Matrix multiplication
`SV_decomp` SVD (Phase 4)

Phase 4: SVD Implementation

4.1 Port Jacobi SVD from vector.rb

The existing Ruby implementation uses one-sided Jacobi rotations:

```c
// Pseudocode - port from lib/classifier/extensions/vector.rb
void jacobi_svd(Matrix *a, Matrix *u, Matrix *v, Vector *s) {
// 1. Compute Q = A^T * A (or A * A^T)
// 2. Initialize V = Identity
// 3. Iterate Jacobi rotations until convergence
// - For each off-diagonal element
// - Compute rotation angle
// - Apply Givens rotation to Q and V
// 4. Extract singular values as sqrt(diag(Q))
// 5. Compute U = A * V * S^-1
}
```

4.2 Key functions

  • `apply_jacobi_rotation()` - Apply 2x2 rotation
  • `matrix_multiply()` - Used internally
  • `check_convergence()` - Compare diagonal changes

4.3 Return format
Return Ruby array: `[U, V, singular_values]` matching GSL convention


Phase 5: Ruby Integration

5.1 Create wrapper module
```ruby

lib/classifier/extensions/native_linalg.rb

module Classifier
module NativeLinalg
class Vector
# C-backed implementation
end
class Matrix
# C-backed implementation
end
end
end
```

5.2 Update detection logic in lsi.rb
```ruby
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true'

Try native extension first (new)

require 'classifier/classifier_ext'
Classifier::LSI.backend = :native
rescue LoadError
begin
# Fall back to rb-gsl (deprecated)
require 'gsl'
Classifier::LSI.backend = :gsl
rescue LoadError
# Pure Ruby fallback
Classifier::LSI.backend = :ruby
end
end
```

5.3 Update usage sites

  • `lsi.rb` - Use backend-agnostic calls
  • `content_node.rb` - Use backend-agnostic calls
  • Consider creating adapter layer to reduce conditionals

Phase 6: Testing

6.1 Unit tests for C extension

  • Vector allocation and operations
  • Matrix allocation and operations
  • SVD correctness (compare with known results)
  • Edge cases (empty, single element, non-square)

6.2 Integration tests

  • Run existing LSI tests with native backend
  • Ensure identical results to pure Ruby
  • Test Marshal serialization

6.3 Test matrix

Backend Environment Variable
Native C (default)
rb-gsl `CLASSIFIER_BACKEND=gsl`
Pure Ruby `NATIVE_VECTOR=true`

Phase 7: Benchmarking

7.1 Create benchmark script
```ruby

benchmark/svd_benchmark.rb

Compare: Native C vs rb-gsl vs Pure Ruby

Varying matrix sizes: 10x10, 50x50, 100x100, 500x500

```

7.2 Expected results

Implementation Relative Speed
Pure Ruby 1x
Native C ext 5-10x
rb-gsl ~10x

Phase 8: Documentation & Cleanup

8.1 Update README

  • Document native extension
  • Installation instructions
  • Backend selection

8.2 Update CLAUDE.md

  • Add test commands for native extension

8.3 Deprecation notices

  • Add deprecation warning for rb-gsl backend
  • Plan removal in future major version

File Structure (Final)

```
classifier/
├── ext/
│ └── classifier/
│ ├── extconf.rb (~20 lines)
│ ├── classifier_ext.c (~150 lines)
│ ├── linalg.h (~50 lines)
│ ├── vector.c (~200 lines)
│ ├── matrix.c (~250 lines)
│ └── svd.c (~200 lines)
├── lib/
│ └── classifier/
│ ├── lsi.rb (modified)
│ ├── lsi/
│ │ └── content_node.rb (modified)
│ └── extensions/
│ ├── vector.rb (kept as fallback)
│ └── native_linalg.rb (new, optional adapter)
└── test/
└── native_ext_test.rb (new)
```

Estimated Size

Component Lines
C code total ~850
Ruby changes ~50
Tests ~200

Tasks

  • Phase 1: Project setup (extconf.rb, gemspec, Rakefile)
  • Phase 2: Vector implementation in C
  • Phase 3: Matrix implementation in C
  • Phase 4: Port Jacobi SVD to C
  • Phase 5: Ruby integration and detection logic
  • Phase 6: Testing (unit + integration)
  • Phase 7: Benchmarking
  • Phase 8: Documentation and cleanup

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions