Skip to content

Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13

Draft
gkostkowski wants to merge 21 commits intodevelopfrom
feature/ERS1-124/implement-ere
Draft

Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13
gkostkowski wants to merge 21 commits intodevelopfrom
feature/ERS1-124/implement-ere

Conversation

@gkostkowski
Copy link
Collaborator

Summary

This PR completes the resolver implementation with comprehensive test coverage and architectural refinements. It includes 37+ passing tests spanning unit tests, BDD scenarios, and end-to-end integration, plus significant refactoring to clarify naming, eliminate global state, and establish proper dependency injection patterns.

What's included

Architecture Refinements

Factory Pattern & Dependency Injection

  • fixed dependencies to follow Cosmic Python contract
  • services/factories.py: Centralized service instantiation (new)
    • build_entity_resolution_service(): Constructs resolver, mapper, and service in one call
    • Eliminates repeated factory calls per request
  • adapters/factories.py: Simplified to RDF mapper instantiation
    • Centralizes DuckDB and Splink instantiation
    • Prevents services from importing implementation details
    • Enables adapter swapping without service changes
  • Fixture-based DI in conftest.py: Removes global state from production code
    • Each test gets isolated, fresh service instance
    • Services no longer aware of test-specific cleanup logic

Service Naming Clarification

  • EntityResolver: Core algorithm (was EntityResolutionService)
  • EntityResolutionService: Public API service (was EntityResolutionResolver)
    • Eliminates semantic confusion between algorithm and service
    • Improves clarity and aligns with Clean Architecture patterns

Code Cleanup

  • Merged resolution.py and resolver_adapter.py logic into entity_resolution_service.py
  • Moved rdf_mapper_port.py to adapters/ (infrastructure boundary)
  • Deleted resolution.py and resolver_adapter.py (logic consolidated)
  • Updated all imports and docstrings for consistency

Test Coverage: 37+ Passing Tests

Service Unit Tests (13 tests)

Comprehensive coverage of core algorithm:

  • Singleton initialization
  • Match scoring and ranking
  • Cluster formation and merging
  • Training and warm-start behavior
  • State accumulation
  • Edge cases (empty clusters, below-threshold links)

Adapter Integration Tests (6 tests)

DuckDB repositories + Splink linker integration:

  • Real table creation and schema validation
  • Mention and similarity persistence
  • Cluster membership tracking
  • End-to-end adapter lifecycle

E2E Integration Tests (14 tests)

Resolver stack tests (13 tests):

  • Blocking rule enforcement (standard, compound, multi-rule)
  • Warm-start capability across requests
  • Clustering correctness
  • State accumulation validation
  • RDF output generation

App entrypoint test (1 test):

  • test/e2e/test_app.py: Complete Redis-based flow
    • Request parsing from Redis queue
    • Entity resolution via app.py handler
    • Response serialization and output to Redis
    • Validates multi-entity clustering and RDF generation

BDD Scenarios (5 scenarios)

Algorithm behavior specs in Gherkin:

  • EntityResolutionAlgorithm.feature with configurable thresholds
  • Step definitions using pytest-bdd + in-memory stubs
  • Covers singleton creation, matching, clustering, ordering, filtering

Test Organization

  • Consolidated tests/test/ (single test root)
  • Layer-aligned subdirectories:
    • test/e2e/: App entrypoint flows (Redis-based flows)
    • test/integration/: Adapter/service integration (real implementations)
    • test/service/: Service layer unit tests (with stubs)
    • test/adapters/: Adapter-level tests
    • test/features/: BDD Gherkin scenarios
    • test/steps/: BDD step definitions
  • Added .gitignore patterns to exclude planning artifacts

Entrypoint Updates (app.py)

  • Build resolver, mapper, and service once before main loop
  • Pass constructed service to each handle invocation
  • Eliminates factory overhead on every message

Architecture Compliance

  • Dependency direction preserved (entrypoints → services → adapters → models)
  • No global mutable state in production code
  • Factory pattern isolates concrete implementations
  • Clear test dependencies via pytest fixtures
  • Clean layer separation with explicit ports/adapters
  • 37+ tests passing, no regressions

TODO

  • Review and update Gherkin test scenarios
  • Resolve the 5 currently failing tests
  • Set up proper test infrastructure (Redis)
    - Note: test_app.py currently requires the Redis password via an environment variable
  • Verify full compliance with the ERE contract
  • Execute stress tests, evaluate performance, and optimise where necessary
  • Analyse and improve code coverage
  • Prepare project documentation (Markdown files in the repository)
  • Review and standardise docstrings for consistency
  • Remove outdated files (_test* and other unused artefacts)
  • Review and enforce code style compliance

- Add core resolver models: MentionId, ClusterId, Mention, MentionLink, CandidateCluster, ClusterMembership, ResolutionResult, ResolverState
- Create src/ere/models/resolver/ package with clean layer separation
- Models support backward compatibility with flat-dict representation
- All imports verified; smoke tests passing
…stration)

- Add SimilarityLinker port in services (abstract interface for pairwise similarity scoring)
- Add repository ports in adapters (MentionRepository, SimilarityRepository, ClusterRepository)
- Add ResolverConfig in services (typed YAML configuration)
- Add EntityResolutionService in services (core orchestration service)
- Export all resolver ports and services from respective __init__ files
- All files compile; import paths verified
- Add duckdb_schema.py: dynamic table creation for mentions, similarities, clusters
- Add duckdb_repositories.py: DuckDB implementations of MentionRepository, SimilarityRepository, ClusterRepository
- Add splink_linker_impl.py: Splink-backed SimilarityLinker with cold-start defaults and thread-safe training
- Add build_tf_df() helper for converting mentions to Splink TF DataFrames
- Export all implementations from adapters/__init__.py
- All files compile; import paths verified
- Move all resolver adapter imports to top (standard library → third-party → local)
- Sort local imports alphabetically by module
- Sort __all__ entries alphabetically
- Remove duplicate import comments
- Add config/resolver.yaml: standard blocking (country_code only)
- Add config/resolver_compound.yaml: compound blocking (country_code AND city)
- Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name)
- All configs include Splink settings, cold-start defaults, and threshold calibration notes
- Add config/README.md: explains three blocking strategy variants and when to use each
- Add config/resolver.yaml: standard single-field blocking (country_code)
- Add config/resolver_compound.yaml: compound blocking (country_code AND city)
- Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name)
- All configs include Splink settings and cold-start defaults
- Removed project-specific references; kept only succinct, self-contained guidance
- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage)
- Add splink >=4.0,<5.0 (Splink record linkage engine)
- Both required for resolver adapter implementations
- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage)
- Add splink >=4.0,<5.0 (Splink record linkage engine)
- Both required for resolver adapter implementations
Implement config-driven RDF parsing, domain object mapping, and resolve
pipeline to connect EntityResolutionService to the ERE contract. Replace
stub resolve_entity_mention() with full implementation supporting:

- Config-driven RDF Turtle parsing with namespace prefix resolution
- Deterministic mention ID derivation per ERE spec
- Idempotency guard to prevent duplicate DB writes on re-resolution
- Per-entity-type service isolation for multi-entity scenarios
- Explicit threshold vs confidence semantics

Also includes:
- EntityResolutionResolver adapter for pub/sub service path
- Test isolation via _reset_services() in test layer
- Updated BDD scenarios to match actual resolver capabilities
- RDF mapping config for ORGANISATION entity type (PROCEDURE deferred)

Tests: 7 passed, 1 xfailed (conflict detection), no regressions
…y injection

Remove global _services and _entity_mappings state from resolution.py and
replace with pytest fixture in conftest.py. This eliminates test-aware code
from the production path and makes service isolation explicit.

Changes:
- Extract EntityResolutionService creation to conftest fixture
- Make resolve_entity_mention() require explicit service parameter
- Update BDD steps to inject service via fixture
- Simplify resolution.py: stateless RDF mapping + domain construction
- Entity fields sourced from config files (resolver.yaml) as single source of truth

Benefits:
- No global mutable state or test-specific cleanup logic in production code
- Clear test dependencies via pytest fixture mechanism
- Each BDD scenario gets isolated fresh service instance
- Easier to test and reason about service lifecycle

Tests: 7 passed, 1 xfailed, no regressions
Add back resolve_to_result() which was inadvertently removed and used by:
- EntityResolutionResolver adapter (pub/sub service path)
- resolve_entity_mention() for core resolution logic

Also add build_resolution_service() factory for creating service instances
without global state (used by adapter for fresh instances, fixture for tests).

Changes:
- resolve_to_result(): core pipeline (RDF mapping + idempotency + resolution)
- build_resolution_service(): factory for EntityResolutionService
- EntityResolutionResolver: uses factory to build fresh service per request
- resolve_entity_mention(): uses resolve_to_result() internally

Tests: 7 passed, 1 xfailed, no regressions
Migrate 13 comprehensive service tests covering core algorithm,
training, state management, and edge cases. All tests pass with
in-memory stubs and no external dependencies.

Also add .gitignore patterns to prevent planning artifacts from
being committed.
Migrate 13 comprehensive E2E tests covering full resolver stack
integration: real DuckDB repositories + SpLinkSimilarityLinker + service.
Tests validate realistic scenarios including clustering, blocking rules,
warm-start capability, and state accumulation. All tests pass.

Combined test suite now includes:
- 6 adapter integration tests (DuckDB + service)
- 13 service unit tests (algorithm, training, state)
- 13 E2E tests (full resolver stack)
Total: 32/32 tests passing with no regressions.
Migrate 5 Gherkin scenarios covering core algorithm with configurable
thresholds: singleton creation, matching, clustering, candidate ordering,
and below-threshold link handling. Create corresponding step definitions
using pytest-bdd and simple in-memory mention stubs.

All 5 BDD scenarios pass.

Combined test migration now complete:
- 6 adapter integration tests (DuckDB + service)
- 13 service unit tests (algorithm, training, state)
- 13 E2E tests (full resolver stack)
- 5 BDD scenarios (algorithm behavior)
Total: 37/37 tests passing, modules 1-4 complete.
- Create adapters/factories.py to centralize DuckDB and Splink instantiation
- Remove concrete exports from adapters/__init__.py (only abstracts now)
- Prevents services from accidentally importing implementation details
- Enables adapter swapping without changing service layer
Key changes:
- EntityResolutionResolver: core algorithm (was EntityResolutionService)
- EntityResolutionService: public API service (was EntityResolutionResolver)
- Added build_entity_resolution_service() factory
- app.py now builds resolver, mapper, and service once before main loop
- Updated all imports, docstrings, and test fixtures

This fixes semantic confusion and eliminates repeated factory calls
per request, improving performance and clarity.

Files modified:
- src/ere/services/entity_resolution_service.py: merged resolution.py + resolver_adapter.py
- src/ere/services/factories.py: extracted from adapters
- src/ere/adapters/factories.py: simplified to RDF mapper only
- src/ere/adapters/rdf_mapper_port.py: moved from services
- src/ere/entrypoints/app.py: use factories, build once
- All test files: updated imports and class names
- Docstrings: updated to reflect correct naming

Tests pass: 39+
- Add test/e2e/test_app.py: complete app.py flow with Redis
  - Validates request parsing, entity resolution, response output
  - Tests realistic multi-entity clustering and RDF generation
  - Requires Redis with password env variable

- Reorganize integration tests:
  - Rename test_entity_resolution_service_e2e.py → test_entity_resolver.py
  - Move from test/e2e/ to test/integration/ (adapter integration responsibility)
  - No logic changes, clearer layer separation

- Remove outdated test_p8_3_resolver_integration.py (POC artifact)

Test layout now reflects architecture boundaries:
- test/e2e/: app entrypoint flows (Redis)
- test/integration/: adapter/service integration (real implementations)
- test/service/: service layer unit tests (stubs)
- test/features/: BDD algorithm scenarios
Move request-response cycle logic from app.py into RedisQueueWorker entrypoint
driver. This improves testability by:
- Separating concerns: orchestration (app.py) vs. queue processing (worker)
- Enabling dependency injection: worker receives Redis client and service
- Simplifying e2e tests: directly test worker instead of simulating queue ops
- Reducing app.py complexity: from 123 to 56 lines, keeping only orchestration

RedisQueueWorker is now the single testable unit for queue-based request
processing, while app.py remains the thin composition root.

All e2e tests refactored to use the worker fixture directly.
@gkostkowski gkostkowski requested a review from costezki February 27, 2026 19:13
- Move config/ directory to infra/config/ for better Docker context
- Parametrize rdf_mapping_path in TurtleRDFMapper.__init__
- Parametrize resolver_config_path in build_entity_resolver()
- Update build_rdf_mapper() to accept rdf_mapping_path
- Add CLI args to app.py: --rdf-mapping-path, --resolver-config-path
- Add env var support: RDF_MAPPING_PATH, RESOLVER_CONFIG_PATH
- Update docker-compose.yml to pass config paths as env vars
- Update Dockerfile to copy config from infra/config
- Update conftest.py to use new infra/config path

All factories now support both parametrized paths (for testing/Docker)
and default fallback paths (for backwards compatibility).

Tests: 53 passed, 2 skipped
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant