
[AI-010b] Embedding Cache & Multi-Model Support #86

@quaid

[AI-010b] Embedding Cache & Multi-Model Support

Story Points: 2
Epic: AI Integration
Dependencies: AI-010a (Real Embedding Generation) - MUST COMPLETE FIRST
Branch: feature/AI-010b
Related: Split from original AI-010 (5 points) - Part 2 of 2


Description

Add a caching layer and multi-model support to the VectorGenerator. This story covers performance optimization through an LRU cache of generated embeddings and flexibility through support for multiple embedding models with different dimensions.

Prerequisites: AI-010a must be complete (real embedding generation working)

Target State: VectorGenerator with LRU cache, <100ms for cached embeddings, and support for multiple embedding models.


User Stories

  • As a system, I need caching for frequently used embeddings to improve performance
  • As a developer, I need support for multiple embedding models for flexibility
  • As an admin, I need cache metrics to monitor performance

BDD Scenarios

Feature: Embedding Cache & Multi-Model Support

Scenario: Cache frequently used embeddings
  Given I have previously embedded text
  When I request embeddings again
  Then cached embeddings are returned
  And response time is under 100ms

Scenario: Multiple embedding models
  Given I have different embedding models available
  When I select a specific model
  Then embeddings use that model
  And dimensions match model specifications

Scenario: Cache invalidation
  Given I have cached embeddings
  When the cache size limit is reached
  Then least recently used embeddings are evicted
  And cache remains performant

Scenario: Cache metrics
  Given the cache is in use
  When I query cache statistics
  Then I see hit rate, miss rate, and size
  And metrics are accurate

Acceptance Criteria

  • Caching for frequently used embeddings
  • <100ms response time for cached embeddings
  • Support for multiple embedding models (nomic-embed-text, all-minilm, etc.)
  • Dynamic dimension handling per model
  • Cache hit/miss metrics tracked
  • LRU eviction policy implemented
  • Performance SLAs met
  • Tests passing (container-first TDD)

Technical Approach

Add Caching Layer

from typing import Any, Dict, List

import numpy as np

# ModelManager and OllamaService are assumed available from AI-010a (real embedding generation)

class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
    
    def record_hit(self):
        self.hits += 1
    
    def record_miss(self):
        self.misses += 1
    
    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

class VectorGenerator:
    def __init__(self, model_manager: ModelManager, cache_size: int = 1000):
        self.model_manager = model_manager
        self.ollama_service = OllamaService()
        self.cache: Dict[str, np.ndarray] = {}  # Simple dict cache
        self.cache_size = cache_size
        self.metrics = CacheMetrics()
        self.access_order: List[str] = []  # Keys in access order, oldest first (for LRU eviction)
    
    async def generate(self, text: str, model: str = "nomic-embed-text") -> np.ndarray:
        """Generate embeddings with caching"""
        # Create cache key
        cache_key = f"{model}:{hash(text)}"
        
        # Check cache
        if cache_key in self.cache:
            self.metrics.record_hit()
            self._update_access(cache_key)
            return self.cache[cache_key]
        
        # Cache miss - generate embedding
        self.metrics.record_miss()
        embedding = await self._generate_embedding(text, model)
        
        # Add to cache with LRU eviction
        self._add_to_cache(cache_key, embedding)
        
        return embedding
    
    def _add_to_cache(self, key: str, value: np.ndarray):
        """Add to cache with LRU eviction"""
        if len(self.cache) >= self.cache_size:
            # Evict least recently used
            lru_key = self.access_order.pop(0)
            del self.cache[lru_key]
        
        self.cache[key] = value
        self.access_order.append(key)
    
    def _update_access(self, key: str):
        """Update LRU access order"""
        self.access_order.remove(key)
        self.access_order.append(key)
    
    async def _generate_embedding(self, text: str, model: str) -> np.ndarray:
        """Generate embedding for specific model"""
        response = await self.ollama_service.embed(model, text)
        return np.array(response['embedding'], dtype=np.float32)
    
    def get_cache_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        return {
            'size': len(self.cache),
            'max_size': self.cache_size,
            'hits': self.metrics.hits,
            'misses': self.metrics.misses,
            'hit_rate': self.metrics.hit_rate
        }
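
A minimal usage sketch of the cached path (the ModelManager() construction and OllamaService wiring are assumed from AI-010a; the exact constructor signature may differ):

import asyncio

async def main():
    generator = VectorGenerator(ModelManager(), cache_size=1000)

    # First call misses the cache and goes to Ollama
    v1 = await generator.generate("hybrid memory search", model="nomic-embed-text")
    # Second call for the same text/model is served from the cache
    v2 = await generator.generate("hybrid memory search", model="nomic-embed-text")

    assert (v1 == v2).all()
    print(generator.get_cache_stats())
    # e.g. {'size': 1, 'max_size': 1000, 'hits': 1, 'misses': 1, 'hit_rate': 0.5}

asyncio.run(main())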

Multi-Model Support

SUPPORTED_MODELS = {
    'nomic-embed-text': {'dimensions': 768, 'type': 'text'},
    'all-minilm-L6-v2': {'dimensions': 384, 'type': 'text'},
    'all-mpnet-base-v2': {'dimensions': 768, 'type': 'text'}
}

# Method on VectorGenerator (continues the class above)
async def get_model_dimensions(self, model: str) -> int:
    """Get embedding dimensions for a specific model"""
    if model in SUPPORTED_MODELS:
        return SUPPORTED_MODELS[model]['dimensions']
    
    # Query Ollama for model info
    info = await self.ollama_service.model_info(model)
    return info.get('dimensions', 768)  # Default to 768

Test Strategy

Cache Tests

  • Cache hit scenarios
  • Cache miss scenarios
  • LRU eviction
  • Cache statistics accuracy
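
A hedged sketch of the cache-hit/LRU tests above (pytest with pytest-asyncio assumed; the real Ollama call is stubbed so the test exercises only the cache logic):

import numpy as np
import pytest

@pytest.mark.asyncio
async def test_lru_eviction_and_metrics(monkeypatch):
    generator = VectorGenerator(ModelManager(), cache_size=2)

    # Stub the Ollama-backed call so the test stays container-local
    async def fake_embed(text, model):
        return np.zeros(768, dtype=np.float32)
    monkeypatch.setattr(generator, "_generate_embedding", fake_embed)

    await generator.generate("a")   # miss
    await generator.generate("b")   # miss
    await generator.generate("a")   # hit, refreshes "a"
    await generator.generate("c")   # miss, evicts "b" (least recently used)

    stats = generator.get_cache_stats()
    assert stats["size"] == 2
    assert stats["hits"] == 1
    assert stats["misses"] == 3
    assert f"nomic-embed-text:{hash('b')}" not in generator.cache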

Performance Tests

  • <100ms for cached embeddings
  • Cache hit rate >80% for repeated text
  • Memory usage within limits
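
A rough sketch of the cached-latency check (assumes a generator fixture wired to a running Ollama instance in the test container; exact thresholds and fixtures belong in the container suite):

import time
import pytest

@pytest.mark.asyncio
async def test_cached_embedding_latency(generator):
    text = "repeated query"
    await generator.generate(text)                    # warm the cache

    start = time.perf_counter()
    await generator.generate(text)                    # expected cache hit
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert elapsed_ms < 100, f"cached lookup took {elapsed_ms:.1f}ms"
    assert generator.get_cache_stats()['hits'] >= 1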

Multi-Model Tests

  • Different embedding models
  • Correct dimensions per model
  • Model switching
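
A hedged sketch of the per-model dimension check, parametrized over SUPPORTED_MODELS (assumes the same generator fixture and that the listed models are pulled in Ollama):

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("model,spec", list(SUPPORTED_MODELS.items()))
async def test_dimensions_match_model_spec(generator, model, spec):
    embedding = await generator.generate("dimension check", model=model)
    assert embedding.shape == (spec['dimensions'],)
    assert await generator.get_model_dimensions(model) == spec['dimensions']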

Integration Tests

  • End-to-end with caching
  • Multiple models in sequence
  • Cache persistence across requests

Implementation Notes

Cache Configuration

  • Default Size: 1000 embeddings
  • Eviction Policy: LRU (Least Recently Used)
  • Cache Key: {model}:{hash(text)}
  • Memory Estimate: ~3MB for 1000 embeddings (768 dims × 4 bytes float32 ≈ 3KB per embedding)
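
One caveat on the cache key above: Python's built-in hash() is salted per process for strings, so {model}:{hash(text)} keys are only stable within a single run. That is fine for an in-memory cache, but if keys ever need to survive a restart or be shared, a content hash is a safer sketch:

import hashlib

def make_cache_key(model: str, text: str) -> str:
    """Deterministic cache key, stable across processes and restarts."""
    digest = hashlib.sha256(text.encode('utf-8')).hexdigest()[:16]
    return f"{model}:{digest}"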

Supported Models (Initial)

  1. nomic-embed-text (768 dims) - Primary
  2. all-minilm-L6-v2 (384 dims) - Lightweight
  3. all-mpnet-base-v2 (768 dims) - High quality

Performance Targets

  • Cached: <100ms
  • Uncached: <1s (from AI-010a)
  • Cache Hit Rate: >80% for typical usage

Definition of Done

  • Code reviewed and approved (2 reviewers)
  • All tests passing (container environment)
  • Performance SLAs met (<100ms cached, <1s uncached)
  • Cache metrics working
  • Multi-model support verified
  • Documentation updated
  • No breaking changes to AI-010a

Related Issues

  • Depends On: AI-010a (Real Embedding Generation) - MUST BE COMPLETE
  • Original: This is part 2 of original AI-010 (5 points split into 3+2)

Estimated Effort

Story Points: 2
Time Estimate: 2-3 days
Complexity: Low-Medium

Breakdown

  • Day 1: Cache implementation and LRU eviction
  • Day 2: Multi-model support and testing
  • Day 3: Performance optimization and documentation

Priority: High
Type: Feature
Component: AI
Epic: AI Integration
