
[AI-010b] Embedding Cache & Multi-Model Support #86

@quaid

[AI-010b] Embedding Cache & Multi-Model Support

Story Points: 2
Epic: AI Integration
Dependencies: AI-010a (Real Embedding Generation) - MUST COMPLETE FIRST
Branch: feature/AI-010b
Related: Split from original AI-010 (5 points) - Part 2 of 2


Description

Add a caching layer and multi-model support to the VectorGenerator. This story covers performance optimization through an LRU cache of generated embeddings and flexibility through support for multiple embedding models with different dimensions.

Prerequisites: AI-010a must be complete (real embedding generation working)

Target State: VectorGenerator with LRU cache, <100ms for cached embeddings, and support for multiple embedding models.


User Stories

  • As a system, I need caching for frequently used embeddings to improve performance
  • As a developer, I need support for multiple embedding models for flexibility
  • As an admin, I need cache metrics to monitor performance

BDD Scenarios

Feature: Embedding Cache & Multi-Model Support

Scenario: Cache frequently used embeddings
  Given I have previously embedded text
  When I request embeddings again
  Then cached embeddings are returned
  And response time is under 100ms

Scenario: Multiple embedding models
  Given I have different embedding models available
  When I select a specific model
  Then embeddings use that model
  And dimensions match model specifications

Scenario: Cache invalidation
  Given I have cached embeddings
  When the cache size limit is reached
  Then least recently used embeddings are evicted
  And cache remains performant

Scenario: Cache metrics
  Given the cache is in use
  When I query cache statistics
  Then I see hit rate, miss rate, and size
  And metrics are accurate

Acceptance Criteria

  • Caching for frequently used embeddings
  • <100ms response time for cached embeddings
  • Support for multiple embedding models (nomic-embed-text, all-minilm, etc.)
  • Dynamic dimension handling per model
  • Cache hit/miss metrics tracked
  • LRU eviction policy implemented
  • Performance SLAs met
  • Tests passing (container-first TDD)

Technical Approach

Add Caching Layer

from typing import Any, Dict, List

import numpy as np

# ModelManager and OllamaService are assumed available from AI-010a (real embedding generation)

class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
    
    def record_hit(self):
        self.hits += 1
    
    def record_miss(self):
        self.misses += 1
    
    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

class VectorGenerator:
    def __init__(self, model_manager: ModelManager, cache_size: int = 1000):
        self.model_manager = model_manager
        self.ollama_service = OllamaService()
        self.cache: Dict[str, np.ndarray] = {}  # Simple dict cache
        self.cache_size = cache_size
        self.metrics = CacheMetrics()
        self.access_order: List[str] = []  # Keys in access order, oldest first (for LRU eviction)
    
    async def generate(self, text: str, model: str = "nomic-embed-text") -> np.ndarray:
        """Generate embeddings with caching"""
        # Create cache key
        cache_key = f"{model}:{hash(text)}"
        
        # Check cache
        if cache_key in self.cache:
            self.metrics.record_hit()
            self._update_access(cache_key)
            return self.cache[cache_key]
        
        # Cache miss - generate embedding
        self.metrics.record_miss()
        embedding = await self._generate_embedding(text, model)
        
        # Add to cache with LRU eviction
        self._add_to_cache(cache_key, embedding)
        
        return embedding
    
    def _add_to_cache(self, key: str, value: np.ndarray):
        """Add to cache with LRU eviction"""
        if len(self.cache) >= self.cache_size:
            # Evict least recently used
            lru_key = self.access_order.pop(0)
            del self.cache[lru_key]
        
        self.cache[key] = value
        self.access_order.append(key)
    
    def _update_access(self, key: str):
        """Update LRU access order"""
        self.access_order.remove(key)
        self.access_order.append(key)
    
    async def _generate_embedding(self, text: str, model: str) -> np.ndarray:
        """Generate embedding for specific model"""
        response = await self.ollama_service.embed(model, text)
        return np.array(response['embedding'], dtype=np.float32)
    
    def get_cache_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        return {
            'size': len(self.cache),
            'max_size': self.cache_size,
            'hits': self.metrics.hits,
            'misses': self.metrics.misses,
            'hit_rate': self.metrics.hit_rate
        }
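
A minimal usage sketch of the cached path (the ModelManager() construction and OllamaService wiring are assumed from AI-010a; the exact constructor signature may differ):

import asyncio

async def main():
    generator = VectorGenerator(ModelManager(), cache_size=1000)

    # First call misses the cache and goes to Ollama
    v1 = await generator.generate("hybrid memory search", model="nomic-embed-text")
    # Second call for the same text/model is served from the cache
    v2 = await generator.generate("hybrid memory search", model="nomic-embed-text")

    assert (v1 == v2).all()
    print(generator.get_cache_stats())
    # e.g. {'size': 1, 'max_size': 1000, 'hits': 1, 'misses': 1, 'hit_rate': 0.5}

asyncio.run(main())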

Multi-Model Support

SUPPORTED_MODELS = {
    'nomic-embed-text': {'dimensions': 768, 'type': 'text'},
    'all-minilm-L6-v2': {'dimensions': 384, 'type': 'text'},
    'all-mpnet-base-v2': {'dimensions': 768, 'type': 'text'}
}

# Method on VectorGenerator (continues the class above)
async def get_model_dimensions(self, model: str) -> int:
    """Get embedding dimensions for a specific model"""
    if model in SUPPORTED_MODELS:
        return SUPPORTED_MODELS[model]['dimensions']
    
    # Query Ollama for model info
    info = await self.ollama_service.model_info(model)
    return info.get('dimensions', 768)  # Default to 768

Test Strategy

Cache Tests

  • Cache hit scenarios
  • Cache miss scenarios
  • LRU eviction
  • Cache statistics accuracy
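
A hedged sketch of the cache-hit/LRU tests above (pytest with pytest-asyncio assumed; the real Ollama call is stubbed so the test exercises only the cache logic):

import numpy as np
import pytest

@pytest.mark.asyncio
async def test_lru_eviction_and_metrics(monkeypatch):
    generator = VectorGenerator(ModelManager(), cache_size=2)

    # Stub the Ollama-backed call so the test stays container-local
    async def fake_embed(text, model):
        return np.zeros(768, dtype=np.float32)
    monkeypatch.setattr(generator, "_generate_embedding", fake_embed)

    await generator.generate("a")   # miss
    await generator.generate("b")   # miss
    await generator.generate("a")   # hit, refreshes "a"
    await generator.generate("c")   # miss, evicts "b" (least recently used)

    stats = generator.get_cache_stats()
    assert stats["size"] == 2
    assert stats["hits"] == 1
    assert stats["misses"] == 3
    assert f"nomic-embed-text:{hash('b')}" not in generator.cache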

Performance Tests

  • <100ms for cached embeddings
  • Cache hit rate >80% for repeated text
  • Memory usage within limits
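
A rough sketch of the cached-latency check (assumes a generator fixture wired to a running Ollama instance in the test container; exact thresholds and fixtures belong in the container suite):

import time
import pytest

@pytest.mark.asyncio
async def test_cached_embedding_latency(generator):
    text = "repeated query"
    await generator.generate(text)                    # warm the cache

    start = time.perf_counter()
    await generator.generate(text)                    # expected cache hit
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert elapsed_ms < 100, f"cached lookup took {elapsed_ms:.1f}ms"
    assert generator.get_cache_stats()['hits'] >= 1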

Multi-Model Tests

  • Different embedding models
  • Correct dimensions per model
  • Model switching
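
A hedged sketch of the per-model dimension check, parametrized over SUPPORTED_MODELS (assumes the same generator fixture and that the listed models are pulled in Ollama):

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("model,spec", list(SUPPORTED_MODELS.items()))
async def test_dimensions_match_model_spec(generator, model, spec):
    embedding = await generator.generate("dimension check", model=model)
    assert embedding.shape == (spec['dimensions'],)
    assert await generator.get_model_dimensions(model) == spec['dimensions']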

Integration Tests

  • End-to-end with caching
  • Multiple models in sequence
  • Cache persistence across requests

Implementation Notes

Cache Configuration

  • Default Size: 1000 embeddings
  • Eviction Policy: LRU (Least Recently Used)
  • Cache Key: {model}:{hash(text)}
  • Memory Estimate: ~3MB for 1000 embeddings (768 dims × 4 bytes float32 ≈ 3KB per embedding)
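
One caveat on the cache key above: Python's built-in hash() is salted per process for strings, so {model}:{hash(text)} keys are only stable within a single run. That is fine for an in-memory cache, but if keys ever need to survive a restart or be shared, a content hash is a safer sketch:

import hashlib

def make_cache_key(model: str, text: str) -> str:
    """Deterministic cache key, stable across processes and restarts."""
    digest = hashlib.sha256(text.encode('utf-8')).hexdigest()[:16]
    return f"{model}:{digest}"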

Supported Models (Initial)

  1. nomic-embed-text (768 dims) - Primary
  2. all-minilm-L6-v2 (384 dims) - Lightweight
  3. all-mpnet-base-v2 (768 dims) - High quality

Performance Targets

  • Cached: <100ms
  • Uncached: <1s (from AI-010a)
  • Cache Hit Rate: >80% for typical usage

Definition of Done

  • Code reviewed and approved (2 reviewers)
  • All tests passing (container environment)
  • Performance SLAs met (<100ms cached, <1s uncached)
  • Cache metrics working
  • Multi-model support verified
  • Documentation updated
  • No breaking changes to AI-010a

Related Issues

  • Depends On: AI-010a (Real Embedding Generation) - MUST BE COMPLETE
  • Original: This is part 2 of original AI-010 (5 points split into 3+2)

Estimated Effort

Story Points: 2
Time Estimate: 2-3 days
Complexity: Low-Medium

Breakdown

  • Day 1: Cache implementation and LRU eviction
  • Day 2: Multi-model support and testing
  • Day 3: Performance optimization and documentation

Priority: High
Type: Feature
Component: AI
Epic: AI Integration
