
Conversation

@iceysteel

TL;DR

Adds native Ollama provider support to fenic, enabling local LLM inference with automatic model discovery, dynamic embedding dimensions,
and optimized batch processing.

What changed?

This PR introduces comprehensive Ollama integration as a first-class model provider in fenic, allowing users to run semantic operations
with locally hosted models. The implementation includes:

1. Complete Ollama Provider Implementation (~856 new lines):

  • src/fenic/_inference/ollama/ollama_provider.py (45 lines): Core provider class with connection management
  • src/fenic/_inference/ollama/ollama_batch_chat_completions_client.py (365 lines): Batch completion client with structured output
    support and completion time tracking
  • src/fenic/_inference/ollama/ollama_batch_embeddings_client.py (268 lines): Batch embedding client with dynamic dimension detection
  • src/fenic/_inference/ollama/ollama_model_manager.py (178 lines): Model metadata management using Ollama's /api/show endpoint for
    capabilities detection

2. Enhanced Rate Limiting Strategy:

  • Extended src/fenic/_inference/rate_limit_strategy.py with OllamaQueueAwareRateLimitStrategy that respects local model constraints
    (OLLAMA_NUM_PARALLEL, queue limits)

3. Dynamic Model Management:

  • Auto-pull functionality for missing models during validation in src/fenic/core/_inference/model_catalog.py
  • Dynamic embedding dimension detection from model metadata instead of hardcoded values
  • Capabilities-based model type detection (embedding vs completion models)
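A minimal sketch of the two detection steps above (function names and the stubbed embed call are illustrative; the `capabilities` list mirrors the shape of Ollama's /api/show payload):

```python
# Hypothetical sketch, not fenic's actual code.
def classify_model(show_payload: dict) -> str:
    """Classify a model as 'embedding' or 'completion' from /api/show metadata."""
    caps = show_payload.get("capabilities", [])
    return "embedding" if "embedding" in caps else "completion"

def detect_embedding_dimensions(embed_fn, model: str) -> int:
    """Probe the model once and read the vector length instead of hardcoding it."""
    vector = embed_fn(model=model, input="probe")["embeddings"][0]
    return len(vector)

# Usage with a stubbed embed call (a real call would hit the Ollama API):
fake_embed = lambda model, input: {"embeddings": [[0.0] * 768]}
dims = detect_embedding_dimensions(fake_embed, "embeddinggemma:latest")
print(classify_model({"capabilities": ["embedding"]}), dims)
```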

4. API Integration:

  • Added OllamaLanguageModel and OllamaEmbeddingModel configuration classes in src/fenic/api/session/config.py
  • Exported Ollama models in main API (src/fenic/api/__init__.py)
  • Updated session config to support Ollama provider

5. Example Updates:

  • All examples now support --language-model-provider ollama and --embedding-model-provider ollama CLI arguments
  • Updated examples/hello_world/hello_world.py, examples/feedback_clustering/feedback_clustering.py,
    examples/news_analysis/news_analysis.py, examples/semantic_joins/semantic_joins.py, and examples/enrichment/enrichment.py

6. Bug Fixes:

  • Fixed JSON structured output for local models by simplifying prompts to avoid schema confusion
  • Fixed embedding dimension mismatches in tests by using dynamic detection
  • Improved model validation error handling

How to test?

1. Install and start Ollama:

curl -fsSL https://ollama.ai/install.sh | sh
ollama serve

2. Run examples with Ollama models:

cd examples/hello_world
OPENAI_API_KEY=dummy-key uv run hello_world.py --language-model-provider ollama --language-model-name gemma3:4b

cd ../feedback_clustering
OPENAI_API_KEY=dummy-key uv run feedback_clustering.py \
  --language-model-provider ollama --language-model-name gemma3:4b \
  --embedding-model-provider ollama --embedding-model-name embeddinggemma:latest

3. Test configuration directly:

import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={"gemma3": fc.OllamaLanguageModel(model_name="gemma3:4b", auto_pull=True)},
            embedding_models={"embeddinggemma": fc.OllamaEmbeddingModel(model_name="embeddinggemma:latest", auto_pull=True)}
        )
    )
)

Not in scope of this PR

- Cloud deployment of Ollama models
- Advanced Ollama configuration options (model parameters, stop sequences)
- Integration with Ollama's streaming API
- Performance benchmarking against cloud providers

…very and optimized batch processing

  Details:
  - Added OLLAMA to ModelProvider enum in model catalog
  - Created comprehensive Ollama provider with async client support and dynamic model discovery using /api/tags and /api/show endpoints
  - Implemented OllamaModelManager for automatic model classification (chat vs embedding) and metadata extraction (context length,
  parameters)
  - Built optimized batch clients (OllamaBatchChatCompletionsClient, OllamaBatchEmbeddingsClient) leveraging Ollama's native parallel
  processing
  - Added OllamaLanguageModel and OllamaEmbeddingModel configuration classes with auto_pull support
  - Created ResolvedOllamaModelConfig for session configuration resolution
  - Updated model registry to support Ollama model initialization with high RPM/TPM limits for local models
  - Fixed validation logic in utils.py to recognize ResolvedOllamaModelConfig
  - Enhanced test infrastructure with Ollama provider support and increased RPM to 100 for testing
  - Added ollama Python package dependency

  Impact:
  - Replaces LiteLLM dependency for Ollama models with direct native integration
  - Enables automatic model discovery without pre-registration
  - Provides optimized performance for local Ollama deployments
  - Successfully passes semantic classification tests with qwen3:4b and embeddinggemma:latest models
def catalog(self) -> LocalCatalog:
"""Get the catalog object."""
return LocalCatalog(self.duckdb_conn)

Author

need to reset changes to this file


Yes, please rebase!

@iceysteel

Based on feedback, simplified Ollama integration:

  1. Block multiple Ollama models per session due to memory constraints

    • Added validation with clear error message explaining VRAM/RAM sharing
    • Environment variable override (FENIC_ALLOW_MULTIPLE_OLLAMA_MODELS=true) for testing
    • Comprehensive test coverage (9 validation tests)
  2. Replaced complex rate limiting with pass-through strategy

    • Removed OllamaQueueAwareRateLimitStrategy
    • Added OllamaPassThroughRateLimitStrategy
    • Relies on Ollama's server-side queue management
  3. Fixed structured output for semantic.extract

    • Removed duplicate JSON schema injection in Ollama client
    • Models were returning schema instead of extracted data
    • Now properly extracts data with both gemma3 and qwen3 models
    • Still fails occasionally: small models emit incorrect JSON keys (e.g. "answer" instead of the expected "output")
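A pass-through strategy (item 2 above) is essentially a no-op admission check. A minimal sketch, with the surrounding base-class interface assumed rather than taken from fenic:

```python
# Illustrative sketch; the class name comes from this PR, the method
# signatures are assumed.
class OllamaPassThroughRateLimitStrategy:
    """Admit every request immediately; Ollama's server-side queue
    (OLLAMA_MAX_QUEUE) provides the real back-pressure."""

    def backoff(self, *_args, **_kwargs) -> float:
        return 0.0  # never wait client-side

    def check_admission(self, *_args, **_kwargs) -> bool:
        return True  # always admit; the local server queues excess work

strategy = OllamaPassThroughRateLimitStrategy()
print(strategy.check_admission(), strategy.backoff())
```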

  Changed multiple Ollama model validation from hard error to warning to allow
  users more flexibility while still informing them of potential issues.

  Changes:
  - Replaced ConfigurationError with logger.warning for multiple Ollama models
  - Removed FENIC_ALLOW_MULTIPLE_OLLAMA_MODELS environment variable override
  - Updated warning message to explain VRAM/RAM constraints and model unloading
  - Updated tests to verify multiple models load successfully with warnings
  - Removed import os (no longer needed)

  Users can now configure multiple Ollama models in a session but will receive
  a warning that models will be unloaded/reloaded if they don't fit in VRAM/RAM
  simultaneously, causing performance degradation. Mixed providers (e.g., Ollama + OpenAI)
  work without warnings as expected.
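The warn-instead-of-error validation described above can be sketched roughly like this (helper and logger names are hypothetical):

```python
import logging

logger = logging.getLogger("fenic.ollama")

# Hypothetical sketch: count Ollama model configs and warn once when more
# than one model would share local VRAM/RAM.
def warn_on_multiple_ollama_models(model_providers: list) -> bool:
    ollama_count = sum(1 for p in model_providers if p == "ollama")
    if ollama_count > 1:
        logger.warning(
            "%d Ollama models configured; models that do not fit in VRAM/RAM "
            "simultaneously will be unloaded and reloaded between requests, "
            "degrading performance.", ollama_count,
        )
        return True
    return False

# Mixed providers (e.g. Ollama + OpenAI) produce no warning:
print(warn_on_multiple_ollama_models(["ollama", "openai"]))  # False
print(warn_on_multiple_ollama_models(["ollama", "ollama"]))  # True
```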

@rohitrastogi left a comment


Hey, thanks so much for putting this together! 🙏 Since this is a bigger PR, I'm going to break my review into a couple of passes. Mind knocking out some cleanup items first? Then we can focus on the meaty stuff in round two. Makes it easier to give good feedback on both!

default_profile_name=model_config.default_profile
)
elif isinstance(model_config, ResolvedOllamaModelConfig):
from fenic._inference.ollama.ollama_batch_embeddings_client import OllamaBatchEmbeddingsClient


We should follow the same pattern as other model providers and wrap the imports for OllamaBatchEmbeddingsClient and OllamaBatchCompletionsClient in a try/except. This will allow Ollama to be an optional dependency, so users who don’t have it installed won’t encounter import errors at runtime.
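The pattern being requested looks roughly like this (the None sentinel and factory wrapper are illustrative, not the exact fenic code): import at module load, fall back to a sentinel, and raise only when someone actually configures an Ollama model.

```python
# Illustrative sketch of the optional-import pattern.
try:
    from fenic._inference.ollama.ollama_batch_embeddings_client import (
        OllamaBatchEmbeddingsClient,
    )
except ImportError:
    OllamaBatchEmbeddingsClient = None  # optional dependency not installed

def make_embeddings_client(*args, **kwargs):
    if OllamaBatchEmbeddingsClient is None:
        raise ImportError(
            "Ollama support is not installed; run `pip install fenic[ollama]`."
        )
    return OllamaBatchEmbeddingsClient(*args, **kwargs)
```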




class RateLimitBucket:
"""Manages a token bucket for rate limiting."""


Please revert the changes to the comments.

OpenAILanguageModel,
)
from fenic.api.session.config import SemanticConfig


We should use pytest.importorskip("ollama") here so these tests are skipped if Ollama isn’t installed. This aligns with our goal of making Ollama an optional dependency.

"zstandard>=0.23.0",
"json-schema-to-pydantic>=0.4.1",
"pymupdf>=1.26.4",
"ollama>=0.5.4",


Can we make Ollama an optional dependency, similar to how we handle Anthropic, Google, and Cohere?
That way, users who don’t have Ollama installed won’t need to pull it in by default, but those who want to use it can enable it through an extra (e.g. pip install fenic[ollama]).
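Assuming fenic declares extras via standard pyproject.toml optional dependencies (the exact layout isn't shown in this PR), moving the pin from the main dependency list would look something like:

```toml
[project.optional-dependencies]
ollama = ["ollama>=0.5.4"]
```

Users would then opt in with `pip install "fenic[ollama]"`.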



def main(config: Optional[fc.SessionConfig] = None):
def main(config: Optional[fc.SessionConfig] = None, language_model_provider: str = "openai", language_model_name: str = "gpt-4o-mini"):


We should keep OpenAI as the default provider for all examples, so let's revert the recent changes to the example scripts.

For testing different providers, we already use examples_session_config (see tests/conftest.py#L171
). This allows the example scripts to be run with Ollama—or any other provider—during unit tests, without needing to modify the examples themselves.


Also, do all unit tests pass using your choice of local models?

self._metrics = LMMetrics()

# Ollama-specific optimizations
self._ollama_parallel = int(os.getenv("OLLAMA_NUM_PARALLEL", "4"))


Since we are using the passthrough strategy, do we still need the _ollama_parallel and _ollama_max_queue members? I believe we can safely remove them.

As an aside, because Ollama runs in a separate process, there’s no guarantee that these environment variables will be set for the process running the Fenic script.

def parse_openrouter_rate_limit_headers(
headers: dict | None,
) -> tuple[int | None, float | None]:
"""Parse OpenRouter rate limit headers into (rpm_hint, retry_at_epoch_seconds).


Please revert the change here.

class ResolvedOllamaModelConfig:
model_name: str
host: str
rpm: int


No need for rpm here anymore.

return FatalException(Exception(f"Embedding model '{self.model}' could not be loaded or pulled"))

# Create async client for this request
client = ollama.AsyncClient(host=self._host)


Instead of constructing a new ollama.AsyncClient for each inference request, we should cache it and reuse a single instance (both here and for the completions client). For reference, see this example: AnthropicBatchChatCompletionsClient.
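The caching being described can be sketched with a lazily initialized attribute; `client_factory` stands in for `ollama.AsyncClient` so the sketch stays self-contained, and the class name is illustrative:

```python
# Illustrative sketch: build the client once on first use, then reuse it
# across every batch request instead of constructing a new one per call.
class BatchClientBase:
    def __init__(self, host: str, client_factory):
        self._host = host
        self._client_factory = client_factory
        self._client = None  # created once, on first access

    @property
    def client(self):
        if self._client is None:
            self._client = self._client_factory(host=self._host)
        return self._client

# Verify the factory runs exactly once across many "requests":
calls = []
def fake_factory(host):
    calls.append(host)
    return object()

bc = BatchClientBase("http://localhost:11434", fake_factory)
clients = {bc.client for _ in range(5)}
print(len(calls), len(clients))  # 1 1
```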
