[AI-010b] Embedding Cache & Multi-Model Support
Story Points: 2
Epic: AI Integration
Dependencies: AI-010a (Real Embedding Generation) - MUST COMPLETE FIRST
Branch: feature/AI-010b
Related: Split from original AI-010 (5 points) - Part 2 of 2
Description
Add a caching layer and multi-model support to the VectorGenerator. This story focuses on performance (via an LRU cache for frequently used embeddings) and flexibility (via support for multiple embedding models with different dimensions).
Prerequisites: AI-010a must be complete (real embedding generation working)
Target State: a VectorGenerator with an LRU cache, <100ms response time for cached embeddings, and support for multiple embedding models.
User Stories
- As a system, I need caching for frequently used embeddings to improve performance
- As a developer, I need support for multiple embedding models for flexibility
- As an admin, I need cache metrics to monitor performance
BDD Scenarios
Feature: Embedding Cache & Multi-Model Support
Scenario: Cache frequently used embeddings
Given I have previously embedded text
When I request embeddings again
Then cached embeddings are returned
And response time is under 100ms
Scenario: Multiple embedding models
Given I have different embedding models available
When I select a specific model
Then embeddings use that model
And dimensions match model specifications
Scenario: Cache invalidation
Given I have cached embeddings
When the cache size limit is reached
Then least recently used embeddings are evicted
And cache remains performant
Scenario: Cache metrics
Given the cache is in use
When I query cache statistics
Then I see hit rate, miss rate, and size
And metrics are accurate
Acceptance Criteria
- Caching for frequently used embeddings
- <100ms response time for cached embeddings
- Support for multiple embedding models (nomic-embed-text, all-minilm, etc.)
- Dynamic dimension handling per model
- Cache hit/miss metrics tracked
- LRU eviction policy implemented
- Performance SLAs met
- Tests passing (container-first TDD)
Technical Approach
Add Caching Layer
import hashlib
from typing import Any, Dict, List

import numpy as np

# ModelManager and OllamaService are provided by AI-010a

class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record_hit(self):
        self.hits += 1

    def record_miss(self):
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

class VectorGenerator:
    def __init__(self, model_manager: ModelManager, cache_size: int = 1000):
        self.model_manager = model_manager
        self.ollama_service = OllamaService()
        self.cache: Dict[str, np.ndarray] = {}  # Simple dict cache
        self.cache_size = cache_size
        self.metrics = CacheMetrics()
        self.access_order: List[str] = []  # For LRU tracking

    async def generate(self, text: str, model: str = "nomic-embed-text") -> np.ndarray:
        """Generate embeddings with caching."""
        # Create a stable, collision-resistant cache key (the built-in
        # hash() is salted per process and can collide across strings)
        cache_key = f"{model}:{hashlib.sha256(text.encode('utf-8')).hexdigest()}"
        # Check cache
        if cache_key in self.cache:
            self.metrics.record_hit()
            self._update_access(cache_key)
            return self.cache[cache_key]
        # Cache miss - generate embedding
        self.metrics.record_miss()
        embedding = await self._generate_embedding(text, model)
        # Add to cache with LRU eviction
        self._add_to_cache(cache_key, embedding)
        return embedding

    def _add_to_cache(self, key: str, value: np.ndarray):
        """Add to cache, evicting the least recently used entry when full."""
        if len(self.cache) >= self.cache_size:
            lru_key = self.access_order.pop(0)
            del self.cache[lru_key]
        self.cache[key] = value
        self.access_order.append(key)

    def _update_access(self, key: str):
        """Move key to the most-recently-used end of the access order."""
        self.access_order.remove(key)
        self.access_order.append(key)

    async def _generate_embedding(self, text: str, model: str) -> np.ndarray:
        """Generate an embedding with the given model (from AI-010a)."""
        response = await self.ollama_service.embed(model, text)
        return np.array(response['embedding'], dtype=np.float32)

    def get_cache_stats(self) -> Dict[str, Any]:
        """Get cache statistics."""
        return {
            'size': len(self.cache),
            'max_size': self.cache_size,
            'hits': self.metrics.hits,
            'misses': self.metrics.misses,
            'hit_rate': self.metrics.hit_rate,
        }
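A minimal usage sketch, assuming the AI-010a ModelManager can be constructed without arguments and an Ollama instance is reachable; the second generate call should be served from the cache:

import asyncio

async def main():
    generator = VectorGenerator(ModelManager(), cache_size=1000)
    # First call misses the cache and hits Ollama; second call is cached
    first = await generator.generate("hello world")
    second = await generator.generate("hello world")
    assert (first == second).all()
    print(generator.get_cache_stats())  # e.g. hits=1, misses=1, hit_rate=0.5

asyncio.run(main())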
Multi-Model Support
SUPPORTED_MODELS = {
    'nomic-embed-text': {'dimensions': 768, 'type': 'text'},
    'all-minilm-L6-v2': {'dimensions': 384, 'type': 'text'},
    'all-mpnet-base-v2': {'dimensions': 768, 'type': 'text'},
}

async def get_model_dimensions(self, model: str) -> int:
    """Get dimensions for a specific model (method on VectorGenerator)."""
    if model in SUPPORTED_MODELS:
        return SUPPORTED_MODELS[model]['dimensions']
    # Unknown model: query Ollama for model info
    info = await self.ollama_service.model_info(model)
    return info.get('dimensions', 768)  # Default to 768
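To make "dynamic dimension handling per model" concrete, a hedged sketch of a validation helper on VectorGenerator; validate_dimensions is hypothetical and not part of the AI-010a interface:

async def validate_dimensions(self, text: str, model: str) -> np.ndarray:
    """Hypothetical helper: generate, then check dims match the model spec."""
    embedding = await self.generate(text, model)
    expected = await self.get_model_dimensions(model)
    if embedding.shape != (expected,):
        raise ValueError(
            f"{model} returned {embedding.shape[0]} dims, expected {expected}"
        )
    return embedding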
Test Strategy
Cache Tests
- Cache hit scenarios
- Cache miss scenarios
- LRU eviction
- Cache statistics accuracy (see the pytest sketch after this list)
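A minimal pytest sketch for these cache scenarios, assuming pytest-asyncio is installed, ModelManager() takes no arguments, and _generate_embedding is stubbed so no Ollama call is made:

import numpy as np
import pytest

@pytest.fixture
def generator(monkeypatch):
    gen = VectorGenerator(ModelManager(), cache_size=2)

    async def fake_embed(text, model):
        return np.random.rand(768).astype(np.float32)

    monkeypatch.setattr(gen, "_generate_embedding", fake_embed)
    return gen

@pytest.mark.asyncio
async def test_cache_hit_and_metrics(generator):
    await generator.generate("a")
    await generator.generate("a")  # second call is served from cache
    stats = generator.get_cache_stats()
    assert stats["hits"] == 1 and stats["misses"] == 1

@pytest.mark.asyncio
async def test_lru_eviction(generator):
    await generator.generate("a")
    await generator.generate("b")
    await generator.generate("c")  # cache_size=2, so "a" is evicted
    assert len(generator.cache) == 2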
Performance Tests
- <100ms for cached embeddings (see the timing sketch after this list)
- Cache hit rate >80% for repeated text
- Memory usage within limits
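A rough timing sketch for the cached-path SLA, reusing the stubbed generator fixture above; wall-clock asserts like this can be flaky in CI, so treat the threshold as indicative:

import time

@pytest.mark.asyncio
async def test_cached_embedding_under_100ms(generator):
    await generator.generate("repeat me")  # warm the cache
    start = time.perf_counter()
    await generator.generate("repeat me")  # cached path
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 100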
Multi-Model Tests
- Different embedding models
- Correct dimensions per model
- Model switching
Integration Tests
- End-to-end with caching
- Multiple models in sequence
- Cache persistence across requests
Implementation Notes
Cache Configuration
- Default Size: 1000 embeddings
- Eviction Policy: LRU (Least Recently Used)
- Cache Key: {model}:{sha256(text)} (the built-in hash() is process-salted and can collide, so a SHA-256 digest is used)
- Memory Estimate: ~3MB for 1000 embeddings (1000 × 768 float32 values × 4 bytes ≈ 2.9MB)
Supported Models (Initial)
- nomic-embed-text (768 dims) - Primary
- all-minilm-L6-v2 (384 dims) - Lightweight
- all-mpnet-base-v2 (768 dims) - High quality
Performance Targets
- Cached: <100ms
- Uncached: <1s (from AI-010a)
- Cache Hit Rate: >80% for typical usage
Definition of Done
- Code reviewed and approved (2 reviewers)
- All tests passing (container environment)
- Performance SLAs met (<100ms cached, <1s uncached)
- Cache metrics working
- Multi-model support verified
- Documentation updated
- No breaking changes to AI-010a
Related Issues
- Depends On: AI-010a (Real Embedding Generation) - MUST BE COMPLETE
- Original: This is part 2 of original AI-010 (5 points split into 3+2)
Estimated Effort
Story Points: 2
Time Estimate: 2-3 days
Complexity: Low-Medium
Breakdown
- Day 1: Cache implementation and LRU eviction
- Day 2: Multi-model support and testing
- Day 3: Performance optimization and documentation
Priority: High
Type: Feature
Component: AI
Epic: AI Integration