
Conversation

@iceysteel

TL;DR

Adds native Ollama provider support to fenic, enabling local LLM inference with automatic model discovery, dynamic embedding dimensions,
and optimized batch processing.

What changed?

This PR introduces comprehensive Ollama integration as a first-class model provider in fenic, allowing users to run semantic operations
with locally hosted models. The implementation includes:

1. Complete Ollama Provider Implementation (~856 new lines):

  • src/fenic/_inference/ollama/ollama_provider.py (45 lines): Core provider class with connection management
  • src/fenic/_inference/ollama/ollama_batch_chat_completions_client.py (365 lines): Batch completion client with structured output
    support and completion time tracking
  • src/fenic/_inference/ollama/ollama_batch_embeddings_client.py (268 lines): Batch embedding client with dynamic dimension detection
  • src/fenic/_inference/ollama/ollama_model_manager.py (178 lines): Model metadata management using Ollama's /api/show endpoint for
    capabilities detection

2. Enhanced Rate Limiting Strategy:

  • Extended src/fenic/_inference/rate_limit_strategy.py with OllamaQueueAwareRateLimitStrategy that respects local model constraints
    (OLLAMA_NUM_PARALLEL, queue limits)

3. Dynamic Model Management:

  • Auto-pull functionality for missing models during validation in src/fenic/core/_inference/model_catalog.py
  • Dynamic embedding dimension detection from model metadata instead of hardcoded values
  • Capabilities-based model type detection (embedding vs completion models)
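A minimal sketch of the two detection steps above (function names and the stubbed embed call are illustrative; the `capabilities` list mirrors the shape of Ollama's /api/show payload):

```python
# Hypothetical sketch, not fenic's actual code.
def classify_model(show_payload: dict) -> str:
    """Classify a model as 'embedding' or 'completion' from /api/show metadata."""
    caps = show_payload.get("capabilities", [])
    return "embedding" if "embedding" in caps else "completion"

def detect_embedding_dimensions(embed_fn, model: str) -> int:
    """Probe the model once and read the vector length instead of hardcoding it."""
    vector = embed_fn(model=model, input="probe")["embeddings"][0]
    return len(vector)

# Usage with a stubbed embed call (a real call would hit the Ollama API):
fake_embed = lambda model, input: {"embeddings": [[0.0] * 768]}
dims = detect_embedding_dimensions(fake_embed, "embeddinggemma:latest")
print(classify_model({"capabilities": ["embedding"]}), dims)
```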

4. API Integration:

  • Added OllamaLanguageModel and OllamaEmbeddingModel configuration classes in src/fenic/api/session/config.py
  • Exported Ollama models in main API (src/fenic/api/__init__.py)
  • Updated session config to support Ollama provider

5. Example Updates:

  • All examples now support --language-model-provider ollama and --embedding-model-provider ollama CLI arguments
  • Updated examples/hello_world/hello_world.py, examples/feedback_clustering/feedback_clustering.py,
    examples/news_analysis/news_analysis.py, examples/semantic_joins/semantic_joins.py, and examples/enrichment/enrichment.py

6. Bug Fixes:

  • Fixed JSON structured output for local models by simplifying prompts to avoid schema confusion
  • Fixed embedding dimension mismatches in tests by using dynamic detection
  • Improved model validation error handling

How to test?

1. Install and start Ollama:

curl -fsSL https://ollama.ai/install.sh | sh
ollama serve

2. Run examples with Ollama models:

cd examples/hello_world
OPENAI_API_KEY=dummy-key uv run hello_world.py --language-model-provider ollama --language-model-name gemma3:4b

cd ../feedback_clustering
OPENAI_API_KEY=dummy-key uv run feedback_clustering.py \
  --language-model-provider ollama --language-model-name gemma3:4b \
  --embedding-model-provider ollama --embedding-model-name embeddinggemma:latest

3. Test configuration directly:

import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={"gemma3": fc.OllamaLanguageModel(model_name="gemma3:4b", auto_pull=True)},
            embedding_models={"embeddinggemma": fc.OllamaEmbeddingModel(model_name="embeddinggemma:latest", auto_pull=True)}
        )
    )
)

Not in scope of this PR

- Cloud deployment of Ollama models
- Advanced Ollama configuration options (model parameters, stop sequences)
- Integration with Ollama's streaming API
- Performance benchmarking against cloud providers

…very and optimized batch processing

  Details:
  - Added OLLAMA to ModelProvider enum in model catalog
  - Created comprehensive Ollama provider with async client support and dynamic model discovery using /api/tags and /api/show endpoints
  - Implemented OllamaModelManager for automatic model classification (chat vs embedding) and metadata extraction (context length,
  parameters)
  - Built optimized batch clients (OllamaBatchChatCompletionsClient, OllamaBatchEmbeddingsClient) leveraging Ollama's native parallel
  processing
  - Added OllamaLanguageModel and OllamaEmbeddingModel configuration classes with auto_pull support
  - Created ResolvedOllamaModelConfig for session configuration resolution
  - Updated model registry to support Ollama model initialization with high RPM/TPM limits for local models
  - Fixed validation logic in utils.py to recognize ResolvedOllamaModelConfig
  - Enhanced test infrastructure with Ollama provider support and increased RPM to 100 for testing
  - Added ollama Python package dependency

  Impact:
  - Replaces LiteLLM dependency for Ollama models with direct native integration
  - Enables automatic model discovery without pre-registration
  - Provides optimized performance for local Ollama deployments
  - Successfully passes semantic classification tests with qwen3:4b and embeddinggemma:latest models
def catalog(self) -> LocalCatalog:
"""Get the catalog object."""
return LocalCatalog(self.duckdb_conn)

Author

need to reset changes to this file


Yes, please rebase!

@iceysteel

Based on feedback, simplified Ollama integration:

  1. Block multiple Ollama models per session due to memory constraints

    • Added validation with clear error message explaining VRAM/RAM sharing
    • Environment variable override (FENIC_ALLOW_MULTIPLE_OLLAMA_MODELS=true) for testing
    • Comprehensive test coverage (9 validation tests)
  2. Replaced complex rate limiting with pass-through strategy

    • Removed OllamaQueueAwareRateLimitStrategy
    • Added OllamaPassThroughRateLimitStrategy
    • Relies on Ollama's server-side queue management
  3. Fixed structured output for semantic.extract

    • Removed duplicate JSON schema injection in Ollama client
    • Models were returning schema instead of extracted data
    • Now properly extracts data with both gemma3 and qwen3 models
    • Still fails occasionally: small models emit incorrect JSON keys (e.g. "answer" instead of the expected "output")
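A pass-through strategy (item 2 above) is essentially a no-op admission check. A minimal sketch, with the surrounding base-class interface assumed rather than taken from fenic:

```python
# Illustrative sketch; the class name comes from this PR, the method
# signatures are assumed.
class OllamaPassThroughRateLimitStrategy:
    """Admit every request immediately; Ollama's server-side queue
    (OLLAMA_MAX_QUEUE) provides the real back-pressure."""

    def backoff(self, *_args, **_kwargs) -> float:
        return 0.0  # never wait client-side

    def check_admission(self, *_args, **_kwargs) -> bool:
        return True  # always admit; the local server queues excess work

strategy = OllamaPassThroughRateLimitStrategy()
print(strategy.check_admission(), strategy.backoff())
```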

  Changed multiple Ollama model validation from hard error to warning to allow
  users more flexibility while still informing them of potential issues.

  Changes:
  - Replaced ConfigurationError with logger.warning for multiple Ollama models
  - Removed FENIC_ALLOW_MULTIPLE_OLLAMA_MODELS environment variable override
  - Updated warning message to explain VRAM/RAM constraints and model unloading
  - Updated tests to verify multiple models load successfully with warnings
  - Removed import os (no longer needed)

  Users can now configure multiple Ollama models in a session but will receive
  a warning that models will be unloaded/reloaded if they don't fit in VRAM/RAM
  simultaneously, causing performance degradation. Mixed providers (e.g., Ollama + OpenAI)
  work without warnings as expected.
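The warn-instead-of-error validation described above can be sketched roughly like this (helper and logger names are hypothetical):

```python
import logging

logger = logging.getLogger("fenic.ollama")

# Hypothetical sketch: count Ollama model configs and warn once when more
# than one model would share local VRAM/RAM.
def warn_on_multiple_ollama_models(model_providers: list) -> bool:
    ollama_count = sum(1 for p in model_providers if p == "ollama")
    if ollama_count > 1:
        logger.warning(
            "%d Ollama models configured; models that do not fit in VRAM/RAM "
            "simultaneously will be unloaded and reloaded between requests, "
            "degrading performance.", ollama_count,
        )
        return True
    return False

# Mixed providers (e.g. Ollama + OpenAI) produce no warning:
print(warn_on_multiple_ollama_models(["ollama", "openai"]))  # False
print(warn_on_multiple_ollama_models(["ollama", "ollama"]))  # True
```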

@rohitrastogi left a comment


Hey, thanks so much for putting this together! 🙏 Since this is a bigger PR, I'm going to break my review into a couple of passes. Mind knocking out some cleanup items first? Then we can focus on the meaty stuff in round two. Makes it easier to give good feedback on both!

default_profile_name=model_config.default_profile
)
elif isinstance(model_config, ResolvedOllamaModelConfig):
from fenic._inference.ollama.ollama_batch_embeddings_client import OllamaBatchEmbeddingsClient


We should follow the same pattern as other model providers and wrap the imports for OllamaBatchEmbeddingsClient and OllamaBatchCompletionsClient in a try/except. This will allow Ollama to be an optional dependency, so users who don’t have it installed won’t encounter import errors at runtime.
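The pattern being requested looks roughly like this (the None sentinel and factory wrapper are illustrative, not the exact fenic code): import at module load, fall back to a sentinel, and raise only when someone actually configures an Ollama model.

```python
# Illustrative sketch of the optional-import pattern.
try:
    from fenic._inference.ollama.ollama_batch_embeddings_client import (
        OllamaBatchEmbeddingsClient,
    )
except ImportError:
    OllamaBatchEmbeddingsClient = None  # optional dependency not installed

def make_embeddings_client(*args, **kwargs):
    if OllamaBatchEmbeddingsClient is None:
        raise ImportError(
            "Ollama support is not installed; run `pip install fenic[ollama]`."
        )
    return OllamaBatchEmbeddingsClient(*args, **kwargs)
```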




class RateLimitBucket:
"""Manages a token bucket for rate limiting."""


Please revert the changes to the comments.

OpenAILanguageModel,
)
from fenic.api.session.config import SemanticConfig


We should use pytest.importorskip("ollama") here so these tests are skipped if Ollama isn’t installed. This aligns with our goal of making Ollama an optional dependency.

"zstandard>=0.23.0",
"json-schema-to-pydantic>=0.4.1",
"pymupdf>=1.26.4",
"ollama>=0.5.4",


Can we make Ollama an optional dependency, similar to how we handle Anthropic, Google, and Cohere?
That way, users who don’t have Ollama installed won’t need to pull it in by default, but those who want to use it can enable it through an extra (e.g. pip install fenic[ollama]).
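Assuming fenic declares extras via standard pyproject.toml optional dependencies (the exact layout isn't shown in this PR), moving the pin from the main dependency list would look something like:

```toml
[project.optional-dependencies]
ollama = ["ollama>=0.5.4"]
```

Users would then opt in with `pip install "fenic[ollama]"`.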



def main(config: Optional[fc.SessionConfig] = None):
def main(config: Optional[fc.SessionConfig] = None, language_model_provider: str = "openai", language_model_name: str = "gpt-4o-mini"):


We should keep OpenAI as the default provider for all examples, so let's revert the recent changes to the example scripts.

For testing different providers, we already use examples_session_config (see tests/conftest.py#L171
). This allows the example scripts to be run with Ollama—or any other provider—during unit tests, without needing to modify the examples themselves.


Also, do all unit tests pass using your choice of local models?

self._metrics = LMMetrics()

# Ollama-specific optimizations
self._ollama_parallel = int(os.getenv("OLLAMA_NUM_PARALLEL", "4"))


Since we are using the passthrough strategy, do we still need the _ollama_parallel and _ollama_max_queue members? I believe we can safely remove them.

As an aside, because Ollama runs in a separate process, there’s no guarantee that these environment variables will be set for the process running the Fenic script.

def parse_openrouter_rate_limit_headers(
headers: dict | None,
) -> tuple[int | None, float | None]:
"""Parse OpenRouter rate limit headers into (rpm_hint, retry_at_epoch_seconds).


Please revert the change here.

class ResolvedOllamaModelConfig:
model_name: str
host: str
rpm: int


No need for rpm here anymore.

return FatalException(Exception(f"Embedding model '{self.model}' could not be loaded or pulled"))

# Create async client for this request
client = ollama.AsyncClient(host=self._host)


Instead of constructing a new ollama.AsyncClient for each inference request, we should cache it and reuse a single instance (both here and for the completions client). For reference, see this example: AnthropicBatchChatCompletionsClient.
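The caching being described can be sketched with a lazily initialized attribute; `client_factory` stands in for `ollama.AsyncClient` so the sketch stays self-contained, and the class name is illustrative:

```python
# Illustrative sketch: build the client once on first use, then reuse it
# across every batch request instead of constructing a new one per call.
class BatchClientBase:
    def __init__(self, host: str, client_factory):
        self._host = host
        self._client_factory = client_factory
        self._client = None  # created once, on first access

    @property
    def client(self):
        if self._client is None:
            self._client = self._client_factory(host=self._host)
        return self._client

# Verify the factory runs exactly once across many "requests":
calls = []
def fake_factory(host):
    calls.append(host)
    return object()

bc = BatchClientBase("http://localhost:11434", fake_factory)
clients = {bc.client for _ in range(5)}
print(len(calls), len(clients))  # 1 1
```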
