feat: add Ollama as chat completion and embedding provider #238
Conversation
…very and optimized batch processing

Details:
- Added OLLAMA to ModelProvider enum in model catalog
- Created comprehensive Ollama provider with async client support and dynamic model discovery using /api/tags and /api/show endpoints
- Implemented OllamaModelManager for automatic model classification (chat vs embedding) and metadata extraction (context length, parameters)
- Built optimized batch clients (OllamaBatchChatCompletionsClient, OllamaBatchEmbeddingsClient) leveraging Ollama's native parallel processing
- Added OllamaLanguageModel and OllamaEmbeddingModel configuration classes with auto_pull support
- Created ResolvedOllamaModelConfig for session configuration resolution
- Updated model registry to support Ollama model initialization with high RPM/TPM limits for local models
- Fixed validation logic in utils.py to recognize ResolvedOllamaModelConfig
- Enhanced test infrastructure with Ollama provider support and increased RPM to 100 for testing
- Added ollama Python package dependency

Impact:
- Replaces LiteLLM dependency for Ollama models with direct native integration
- Enables automatic model discovery without pre-registration
- Provides optimized performance for local Ollama deployments
- Successfully passes semantic classification tests with qwen3:4b and embeddinggemma:latest models
```python
def catalog(self) -> LocalCatalog:
    """Get the catalog object."""
    return LocalCatalog(self.duckdb_conn)
```
need to reset changes to this file
Yes, please rebase!
…-through rate limiting
Based on feedback, simplified Ollama integration:
Changed multiple Ollama model validation from a hard error to a warning, giving users more flexibility while still informing them of potential issues.

Changes:
- Replaced ConfigurationError with logger.warning for multiple Ollama models
- Removed FENIC_ALLOW_MULTIPLE_OLLAMA_MODELS environment variable override
- Updated warning message to explain VRAM/RAM constraints and model unloading
- Updated tests to verify multiple models load successfully with warnings
- Removed `import os` (no longer needed)

Users can now configure multiple Ollama models in a session but will receive a warning that models will be unloaded/reloaded if they don't fit in VRAM/RAM simultaneously, causing performance degradation. Mixed providers (e.g., Ollama + OpenAI) work without warnings as expected.
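A minimal sketch of the warn-instead-of-raise behavior described above; the function name and message wording are illustrative, not fenic's actual code:

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_multiple_ollama_models(model_names: list[str]) -> None:
    """Warn, rather than raise, when more than one Ollama model is configured."""
    if len(model_names) > 1:
        logger.warning(
            "Multiple Ollama models configured (%s). Models that do not fit "
            "in VRAM/RAM simultaneously will be unloaded and reloaded "
            "between requests, degrading performance.",
            ", ".join(model_names),
        )
```

Configuring a single model, or a mix of providers, emits no warning; only the multiple-Ollama-models case logs one.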
Hey, thanks so much for putting this together! 🙏 Since this is a bigger PR, I'm going to break my review into a couple of passes. Mind knocking out some cleanup items first? Then we can focus on the meaty stuff in round two. Makes it easier to give good feedback on both!
```python
        default_profile_name=model_config.default_profile
    )
elif isinstance(model_config, ResolvedOllamaModelConfig):
    from fenic._inference.ollama.ollama_batch_embeddings_client import OllamaBatchEmbeddingsClient
```
We should follow the same pattern as other model providers and wrap the imports for OllamaBatchEmbeddingsClient and OllamaBatchCompletionsClient in a try/except. This will allow Ollama to be an optional dependency, so users who don’t have it installed won’t encounter import errors at runtime.
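A sketch of the optional-import pattern the reviewer suggests; module paths and class names are taken from this PR's description and may not match the final code:

```python
# Guard the optional Ollama imports so fenic works without the extra installed.
try:
    from fenic._inference.ollama.ollama_batch_chat_completions_client import (
        OllamaBatchChatCompletionsClient,
    )
    from fenic._inference.ollama.ollama_batch_embeddings_client import (
        OllamaBatchEmbeddingsClient,
    )
    OLLAMA_AVAILABLE = True
except ImportError:
    # The ollama extra is not installed; defer the error until a user
    # actually configures an Ollama model.
    OLLAMA_AVAILABLE = False


def require_ollama() -> None:
    """Raise a helpful error if the optional Ollama dependency is missing."""
    if not OLLAMA_AVAILABLE:
        raise ImportError(
            "Ollama support is an optional dependency; install it with "
            "`pip install fenic[ollama]`."
        )
```

The model-registry branch for `ResolvedOllamaModelConfig` would then call `require_ollama()` before instantiating either client, so the failure mode is a clear message rather than a raw `ImportError` at runtime.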
```python
class RateLimitBucket:
    """Manages a token bucket for rate limiting."""
```
Please revert the changes to the comments.
```python
    OpenAILanguageModel,
)
from fenic.api.session.config import SemanticConfig
```
We should use pytest.importorskip("ollama") here so these tests are skipped if Ollama isn’t installed. This aligns with our goal of making Ollama an optional dependency.
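A sketch of how such a guard can work; the helper name here is illustrative, and the canonical pytest one-liner is shown in the comment:

```python
import importlib.util


def ollama_installed() -> bool:
    """Cheap check for the optional ollama package without importing it."""
    return importlib.util.find_spec("ollama") is not None


# In the actual test module the guard would be a single top-level line,
# which skips every test in the file when ollama is absent:
#
#     ollama = pytest.importorskip("ollama")
```

Placed at the top of the module, `pytest.importorskip` replaces an import-time `ImportError` with a clean skip report for the whole file.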
```toml
"zstandard>=0.23.0",
"json-schema-to-pydantic>=0.4.1",
"pymupdf>=1.26.4",
"ollama>=0.5.4",
```
Can we make Ollama an optional dependency, similar to how we handle Anthropic, Google, and Cohere?
That way, users who don’t have Ollama installed won’t need to pull it in by default, but those who want to use it can enable it through an extra (e.g. pip install fenic[ollama]).
```diff
-def main(config: Optional[fc.SessionConfig] = None):
+def main(config: Optional[fc.SessionConfig] = None, language_model_provider: str = "openai", language_model_name: str = "gpt-4o-mini"):
```
We should keep OpenAI as the default provider for all examples, so let's revert the recent changes to the example scripts.
For testing different providers, we already use `examples_session_config` (see tests/conftest.py#L171). This allows the example scripts to be run with Ollama, or any other provider, during unit tests without needing to modify the examples themselves.
Also, do all unit tests pass using your choice of local models?
```python
self._metrics = LMMetrics()

# Ollama-specific optimizations
self._ollama_parallel = int(os.getenv("OLLAMA_NUM_PARALLEL", "4"))
```
Since we are using the passthrough strategy, do we still need the _ollama_parallel and _ollama_max_queue members? I believe we can safely remove them.
As an aside, because Ollama runs in a separate process, there’s no guarantee that these environment variables will be set for the process running the Fenic script.
```python
def parse_openrouter_rate_limit_headers(
    headers: dict | None,
) -> tuple[int | None, float | None]:
    """Parse OpenRouter rate limit headers into (rpm_hint, retry_at_epoch_seconds).
```
Please revert the change here.
```python
class ResolvedOllamaModelConfig:
    model_name: str
    host: str
    rpm: int
```
No need for rpm here anymore.
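A minimal sketch of the config after this change, keeping the remaining fields from the excerpt above (decorator and docstring are assumptions):

```python
from dataclasses import dataclass


@dataclass
class ResolvedOllamaModelConfig:
    """Resolved Ollama model config; the rpm field is dropped because the
    pass-through rate limiting strategy makes a per-model RPM unnecessary."""
    model_name: str
    host: str
```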
```python
return FatalException(Exception(f"Embedding model '{self.model}' could not be loaded or pulled"))

# Create async client for this request
client = ollama.AsyncClient(host=self._host)
```
Instead of constructing a new ollama.AsyncClient for each inference request, we should cache it and reuse a single instance (both here and for the completions client). For reference, see this example: AnthropicBatchChatCompletionsClient.
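A hypothetical sketch of the caching the reviewer describes; the class and attribute names follow this PR's naming, not fenic's actual code:

```python
class OllamaClientCache:
    """Holds one AsyncClient per client instance instead of one per request."""

    def __init__(self, host: str):
        self._host = host
        self._client = None  # created lazily on first use

    @property
    def client(self):
        # Construct the AsyncClient once, then reuse it for every inference
        # request, avoiding per-request connection setup overhead.
        if self._client is None:
            import ollama  # optional dependency, imported lazily

            self._client = ollama.AsyncClient(host=self._host)
        return self._client
```

Both the embeddings and completions clients could share this shape, so each request body becomes `await self.client.embed(...)` or `await self.client.chat(...)` against the cached instance.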
TL;DR

Adds native Ollama provider support to fenic, enabling local LLM inference with automatic model discovery, dynamic embedding dimensions, and optimized batch processing.

What changed?

This PR introduces comprehensive Ollama integration as a first-class model provider in fenic, allowing users to run semantic operations with locally hosted models. The implementation includes:
1. Complete Ollama Provider Implementation (~856 new lines):
- `src/fenic/_inference/ollama/ollama_provider.py` (45 lines): core provider class with connection management
- `src/fenic/_inference/ollama/ollama_batch_chat_completions_client.py` (365 lines): batch completion client with structured output support and completion time tracking
- `src/fenic/_inference/ollama/ollama_batch_embeddings_client.py` (268 lines): batch embedding client with dynamic dimension detection
- `src/fenic/_inference/ollama/ollama_model_manager.py` (178 lines): model metadata management using Ollama's `/api/show` endpoint for capabilities detection

2. Enhanced Rate Limiting Strategy:
- Extended `src/fenic/_inference/rate_limit_strategy.py` with `OllamaQueueAwareRateLimitStrategy`, which respects local model constraints (`OLLAMA_NUM_PARALLEL`, queue limits)

3. Dynamic Model Management:
- `src/fenic/core/_inference/model_catalog.py`: added `OLLAMA` to the `ModelProvider` enum

4. API Integration:
- `OllamaLanguageModel` and `OllamaEmbeddingModel` configuration classes in `src/fenic/api/session/config.py` (exported via `src/fenic/api/__init__.py`)

5. Example Updates:
- Added `--language-model-provider ollama` and `--embedding-model-provider ollama` CLI arguments to `examples/hello_world/hello_world.py`, `examples/feedback_clustering/feedback_clustering.py`, `examples/news_analysis/news_analysis.py`, `examples/semantic_joins/semantic_joins.py`, and `examples/enrichment/enrichment.py`

6. Bug Fixes:
- Fixed validation logic in `utils.py` to recognize `ResolvedOllamaModelConfig`
How to test?
1. Install and start Ollama:
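The remaining steps are truncated here; a typical setup, assuming the models mentioned above (`qwen3:4b`, `embeddinggemma:latest`), might look like:

```shell
# Install Ollama (macOS/Linux; see https://ollama.com for other options)
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server (listens on http://localhost:11434 by default)
ollama serve &

# Pull the models used in this PR's tests
ollama pull qwen3:4b
ollama pull embeddinggemma:latest
```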