-
Notifications
You must be signed in to change notification settings - Fork 122
Description
Summary
The classifier gem currently depends on rb-gsl for 10x faster LSI operations, but rb-gsl is outdated and difficult to maintain. This issue implements a zero-dependency C extension by porting the existing Ruby SVD implementation to C.
Benefits
- No install requirements for users
- Simple cross-platform support
- Full control over the code
- ~5-10x faster than pure Ruby
- Algorithm already works (just translating Ruby → C)
Execution Plan
Phase 1: Project Setup
1.1 Create extension directory structure
```
ext/classifier/
├── extconf.rb
├── classifier_ext.c # Main entry point, Ruby bindings
├── linalg.h # Shared header
├── vector.c # Vector implementation
├── matrix.c # Matrix implementation
└── svd.c # Jacobi SVD implementation
```
1.2 Create extconf.rb
- Basic mkmf configuration
- No external library detection needed
- Set appropriate compiler flags (-O3, -ffast-math)
1.3 Update gemspec
- Add `s.extensions = ['ext/classifier/extconf.rb']`
- Add `ext/**/*` to file list
- Add rake-compiler as dev dependency
1.4 Update Rakefile
- Add compile task for development
- Ensure tests run against compiled extension
Phase 2: Vector Implementation
2.1 Core Vector struct and allocation
```c
typedef struct {
size_t size;
double *data;
} ClassifierVector;
```
2.2 Vector methods to implement
| Method | Description |
|---|---|
| `alloc(size)` | Allocate zero-filled vector |
| `alloc(array)` | Allocate from Ruby array |
| `[]` / `[]=` | Element access |
| `size` | Return size |
| `to_a` | Convert to Ruby array |
| `sum` | Sum all elements |
| `each` | Yield each element |
| `collect` | Map with block, return new vector |
| `normalize` | Return unit vector (v / |
| `row` / `col` | Return self (orientation marker) |
| `*` | Dot product (vector) or scale (scalar) |
2.3 Memory management
- Use Ruby's TypedData API for proper GC integration
- Implement `_dump` / `_load` for Marshal support
Phase 3: Matrix Implementation
3.1 Core Matrix struct
```c
typedef struct {
size_t rows;
size_t cols;
double *data; // Row-major storage
} ClassifierMatrix;
```
3.2 Matrix methods to implement
| Method | Description |
|---|---|
| `alloc(*rows)` | Allocate from nested arrays |
| `[]` / `[]=` | Element access |
| `size` | Return [rows, cols] |
| `row_size` / `column_size` | Dimensions |
| `trans` | Return transposed copy |
| `column(n)` | Return column as Vector |
| `diag(vector)` | Class method: diagonal matrix from vector |
| `*` | Matrix multiplication |
| `SV_decomp` | SVD (Phase 4) |
Phase 4: SVD Implementation
4.1 Port Jacobi SVD from vector.rb
The existing Ruby implementation uses one-sided Jacobi rotations:
```c
// Pseudocode - port from lib/classifier/extensions/vector.rb
void jacobi_svd(Matrix *a, Matrix *u, Matrix *v, Vector *s) {
// 1. Compute Q = A^T * A (or A * A^T)
// 2. Initialize V = Identity
// 3. Iterate Jacobi rotations until convergence
// - For each off-diagonal element
// - Compute rotation angle
// - Apply Givens rotation to Q and V
// 4. Extract singular values as sqrt(diag(Q))
// 5. Compute U = A * V * S^-1
}
```
4.2 Key functions
- `apply_jacobi_rotation()` - Apply 2x2 rotation
- `matrix_multiply()` - Used internally
- `check_convergence()` - Compare diagonal changes
4.3 Return format
Return Ruby array: `[U, V, singular_values]` matching GSL convention
Phase 5: Ruby Integration
5.1 Create wrapper module
```ruby
lib/classifier/extensions/native_linalg.rb
module Classifier
module NativeLinalg
class Vector
# C-backed implementation
end
class Matrix
# C-backed implementation
end
end
end
```
5.2 Update detection logic in lsi.rb
```ruby
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true'
Try native extension first (new)
require 'classifier/classifier_ext'
Classifier::LSI.backend = :native
rescue LoadError
begin
# Fall back to rb-gsl (deprecated)
require 'gsl'
Classifier::LSI.backend = :gsl
rescue LoadError
# Pure Ruby fallback
Classifier::LSI.backend = :ruby
end
end
```
5.3 Update usage sites
- `lsi.rb` - Use backend-agnostic calls
- `content_node.rb` - Use backend-agnostic calls
- Consider creating adapter layer to reduce conditionals
Phase 6: Testing
6.1 Unit tests for C extension
- Vector allocation and operations
- Matrix allocation and operations
- SVD correctness (compare with known results)
- Edge cases (empty, single element, non-square)
6.2 Integration tests
- Run existing LSI tests with native backend
- Ensure identical results to pure Ruby
- Test Marshal serialization
6.3 Test matrix
| Backend | Environment Variable |
|---|---|
| Native C | (default) |
| rb-gsl | `CLASSIFIER_BACKEND=gsl` |
| Pure Ruby | `NATIVE_VECTOR=true` |
Phase 7: Benchmarking
7.1 Create benchmark script
```ruby
benchmark/svd_benchmark.rb
Compare: Native C vs rb-gsl vs Pure Ruby
Varying matrix sizes: 10x10, 50x50, 100x100, 500x500
```
7.2 Expected results
| Implementation | Relative Speed |
|---|---|
| Pure Ruby | 1x |
| Native C ext | 5-10x |
| rb-gsl | ~10x |
Phase 8: Documentation & Cleanup
8.1 Update README
- Document native extension
- Installation instructions
- Backend selection
8.2 Update CLAUDE.md
- Add test commands for native extension
8.3 Deprecation notices
- Add deprecation warning for rb-gsl backend
- Plan removal in future major version
File Structure (Final)
```
classifier/
├── ext/
│ └── classifier/
│ ├── extconf.rb (~20 lines)
│ ├── classifier_ext.c (~150 lines)
│ ├── linalg.h (~50 lines)
│ ├── vector.c (~200 lines)
│ ├── matrix.c (~250 lines)
│ └── svd.c (~200 lines)
├── lib/
│ └── classifier/
│ ├── lsi.rb (modified)
│ ├── lsi/
│ │ └── content_node.rb (modified)
│ └── extensions/
│ ├── vector.rb (kept as fallback)
│ └── native_linalg.rb (new, optional adapter)
└── test/
└── native_ext_test.rb (new)
```
Estimated Size
| Component | Lines |
|---|---|
| C code total | ~850 |
| Ruby changes | ~50 |
| Tests | ~200 |
Tasks
- Phase 1: Project setup (extconf.rb, gemspec, Rakefile)
- Phase 2: Vector implementation in C
- Phase 3: Matrix implementation in C
- Phase 4: Port Jacobi SVD to C
- Phase 5: Ruby integration and detection logic
- Phase 6: Testing (unit + integration)
- Phase 7: Benchmarking
- Phase 8: Documentation and cleanup