Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage by gkostkowski · Pull Request #13 · meaningfy-ws/entity-resolution-engine-basic

gkostkowski · 2026-02-27T19:13:57Z

Summary

This PR completes the resolver implementation with comprehensive test coverage and architectural refinements. It includes 37+ passing tests spanning unit tests, BDD scenarios, and end-to-end integration, plus significant refactoring to clarify naming, eliminate global state, and establish proper dependency injection patterns.

What's included

Architecture Refinements

Factory Pattern & Dependency Injection

fixed dependencies to follow Cosmic Python contract
services/factories.py: Centralized service instantiation (new)
- build_entity_resolution_service(): Constructs resolver, mapper, and service in one call
- Eliminates repeated factory calls per request
adapters/factories.py: Simplified to RDF mapper instantiation
- Centralizes DuckDB and Splink instantiation
- Prevents services from importing implementation details
- Enables adapter swapping without service changes
Fixture-based DI in conftest.py: Removes global state from production code
- Each test gets isolated, fresh service instance
- Services no longer aware of test-specific cleanup logic

Service Naming Clarification

EntityResolver: Core algorithm (was EntityResolutionService)
EntityResolutionService: Public API service (was EntityResolutionResolver)
- Eliminates semantic confusion between algorithm and service
- Improves clarity and aligns with Clean Architecture patterns

Code Cleanup

Merged resolution.py and resolver_adapter.py logic into entity_resolution_service.py
Moved rdf_mapper_port.py to adapters/ (infrastructure boundary)
Deleted resolution.py and resolver_adapter.py (logic consolidated)
Updated all imports and docstrings for consistency

Test Coverage: 37+ Passing Tests

Service Unit Tests (13 tests)

Comprehensive coverage of core algorithm:

Singleton initialization
Match scoring and ranking
Cluster formation and merging
Training and warm-start behavior
State accumulation
Edge cases (empty clusters, below-threshold links)

Adapter Integration Tests (6 tests)

DuckDB repositories + Splink linker integration:

Real table creation and schema validation
Mention and similarity persistence
Cluster membership tracking
End-to-end adapter lifecycle

E2E Integration Tests (14 tests)

Resolver stack tests (13 tests):

Blocking rule enforcement (standard, compound, multi-rule)
Warm-start capability across requests
Clustering correctness
State accumulation validation
RDF output generation

App entrypoint test (1 test):

test/e2e/test_app.py: Complete Redis-based flow
- Request parsing from Redis queue
- Entity resolution via app.py handler
- Response serialization and output to Redis
- Validates multi-entity clustering and RDF generation

BDD Scenarios (5 scenarios)

Algorithm behavior specs in Gherkin:

EntityResolutionAlgorithm.feature with configurable thresholds
Step definitions using pytest-bdd + in-memory stubs
Covers singleton creation, matching, clustering, ordering, filtering

Test Organization

Consolidated tests/ → test/ (single test root)
Layer-aligned subdirectories:
- test/e2e/: App entrypoint flows (Redis-based flows)
- test/integration/: Adapter/service integration (real implementations)
- test/service/: Service layer unit tests (with stubs)
- test/adapters/: Adapter-level tests
- test/features/: BDD Gherkin scenarios
- test/steps/: BDD step definitions
Added .gitignore patterns to exclude planning artifacts

Entrypoint Updates (`app.py`)

Build resolver, mapper, and service once before main loop
Pass constructed service to each handle invocation
Eliminates factory overhead on every message

Architecture Compliance

Dependency direction preserved (entrypoints → services → adapters → models)
No global mutable state in production code
Factory pattern isolates concrete implementations
Clear test dependencies via pytest fixtures
Clean layer separation with explicit ports/adapters
37+ tests passing, no regressions

TODO

- Add core resolver models: MentionId, ClusterId, Mention, MentionLink, CandidateCluster, ClusterMembership, ResolutionResult, ResolverState - Create src/ere/models/resolver/ package with clean layer separation - Models support backward compatibility with flat-dict representation - All imports verified; smoke tests passing

…stration) - Add SimilarityLinker port in services (abstract interface for pairwise similarity scoring) - Add repository ports in adapters (MentionRepository, SimilarityRepository, ClusterRepository) - Add ResolverConfig in services (typed YAML configuration) - Add EntityResolutionService in services (core orchestration service) - Export all resolver ports and services from respective __init__ files - All files compile; import paths verified

- Add duckdb_schema.py: dynamic table creation for mentions, similarities, clusters - Add duckdb_repositories.py: DuckDB implementations of MentionRepository, SimilarityRepository, ClusterRepository - Add splink_linker_impl.py: Splink-backed SimilarityLinker with cold-start defaults and thread-safe training - Add build_tf_df() helper for converting mentions to Splink TF DataFrames - Export all implementations from adapters/__init__.py - All files compile; import paths verified

- Move all resolver adapter imports to top (standard library → third-party → local) - Sort local imports alphabetically by module - Sort __all__ entries alphabetically - Remove duplicate import comments

- Add config/resolver.yaml: standard blocking (country_code only) - Add config/resolver_compound.yaml: compound blocking (country_code AND city) - Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name) - All configs include Splink settings, cold-start defaults, and threshold calibration notes

- Add config/README.md: explains three blocking strategy variants and when to use each - Add config/resolver.yaml: standard single-field blocking (country_code) - Add config/resolver_compound.yaml: compound blocking (country_code AND city) - Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name) - All configs include Splink settings and cold-start defaults - Removed project-specific references; kept only succinct, self-contained guidance

- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage) - Add splink >=4.0,<5.0 (Splink record linkage engine) - Both required for resolver adapter implementations

Implement config-driven RDF parsing, domain object mapping, and resolve pipeline to connect EntityResolutionService to the ERE contract. Replace stub resolve_entity_mention() with full implementation supporting: - Config-driven RDF Turtle parsing with namespace prefix resolution - Deterministic mention ID derivation per ERE spec - Idempotency guard to prevent duplicate DB writes on re-resolution - Per-entity-type service isolation for multi-entity scenarios - Explicit threshold vs confidence semantics Also includes: - EntityResolutionResolver adapter for pub/sub service path - Test isolation via _reset_services() in test layer - Updated BDD scenarios to match actual resolver capabilities - RDF mapping config for ORGANISATION entity type (PROCEDURE deferred) Tests: 7 passed, 1 xfailed (conflict detection), no regressions

…y injection Remove global _services and _entity_mappings state from resolution.py and replace with pytest fixture in conftest.py. This eliminates test-aware code from the production path and makes service isolation explicit. Changes: - Extract EntityResolutionService creation to conftest fixture - Make resolve_entity_mention() require explicit service parameter - Update BDD steps to inject service via fixture - Simplify resolution.py: stateless RDF mapping + domain construction - Entity fields sourced from config files (resolver.yaml) as single source of truth Benefits: - No global mutable state or test-specific cleanup logic in production code - Clear test dependencies via pytest fixture mechanism - Each BDD scenario gets isolated fresh service instance - Easier to test and reason about service lifecycle Tests: 7 passed, 1 xfailed, no regressions

Add back resolve_to_result() which was inadvertently removed and used by: - EntityResolutionResolver adapter (pub/sub service path) - resolve_entity_mention() for core resolution logic Also add build_resolution_service() factory for creating service instances without global state (used by adapter for fresh instances, fixture for tests). Changes: - resolve_to_result(): core pipeline (RDF mapping + idempotency + resolution) - build_resolution_service(): factory for EntityResolutionService - EntityResolutionResolver: uses factory to build fresh service per request - resolve_entity_mention(): uses resolve_to_result() internally Tests: 7 passed, 1 xfailed, no regressions

Migrate 13 comprehensive service tests covering core algorithm, training, state management, and edge cases. All tests pass with in-memory stubs and no external dependencies. Also add .gitignore patterns to prevent planning artifacts from being committed.

Migrate 13 comprehensive E2E tests covering full resolver stack integration: real DuckDB repositories + SpLinkSimilarityLinker + service. Tests validate realistic scenarios including clustering, blocking rules, warm-start capability, and state accumulation. All tests pass. Combined test suite now includes: - 6 adapter integration tests (DuckDB + service) - 13 service unit tests (algorithm, training, state) - 13 E2E tests (full resolver stack) Total: 32/32 tests passing with no regressions.

Migrate 5 Gherkin scenarios covering core algorithm with configurable thresholds: singleton creation, matching, clustering, candidate ordering, and below-threshold link handling. Create corresponding step definitions using pytest-bdd and simple in-memory mention stubs. All 5 BDD scenarios pass. Combined test migration now complete: - 6 adapter integration tests (DuckDB + service) - 13 service unit tests (algorithm, training, state) - 13 E2E tests (full resolver stack) - 5 BDD scenarios (algorithm behavior) Total: 37/37 tests passing, modules 1-4 complete.

- Create adapters/factories.py to centralize DuckDB and Splink instantiation - Remove concrete exports from adapters/__init__.py (only abstracts now) - Prevents services from accidentally importing implementation details - Enables adapter swapping without changing service layer

Key changes: - EntityResolutionResolver: core algorithm (was EntityResolutionService) - EntityResolutionService: public API service (was EntityResolutionResolver) - Added build_entity_resolution_service() factory - app.py now builds resolver, mapper, and service once before main loop - Updated all imports, docstrings, and test fixtures This fixes semantic confusion and eliminates repeated factory calls per request, improving performance and clarity. Files modified: - src/ere/services/entity_resolution_service.py: merged resolution.py + resolver_adapter.py - src/ere/services/factories.py: extracted from adapters - src/ere/adapters/factories.py: simplified to RDF mapper only - src/ere/adapters/rdf_mapper_port.py: moved from services - src/ere/entrypoints/app.py: use factories, build once - All test files: updated imports and class names - Docstrings: updated to reflect correct naming Tests pass: 39+

- Add test/e2e/test_app.py: complete app.py flow with Redis - Validates request parsing, entity resolution, response output - Tests realistic multi-entity clustering and RDF generation - Requires Redis with password env variable - Reorganize integration tests: - Rename test_entity_resolution_service_e2e.py → test_entity_resolver.py - Move from test/e2e/ to test/integration/ (adapter integration responsibility) - No logic changes, clearer layer separation - Remove outdated test_p8_3_resolver_integration.py (POC artifact) Test layout now reflects architecture boundaries: - test/e2e/: app entrypoint flows (Redis) - test/integration/: adapter/service integration (real implementations) - test/service/: service layer unit tests (stubs) - test/features/: BDD algorithm scenarios

Move request-response cycle logic from app.py into RedisQueueWorker entrypoint driver. This improves testability by: - Separating concerns: orchestration (app.py) vs. queue processing (worker) - Enabling dependency injection: worker receives Redis client and service - Simplifying e2e tests: directly test worker instead of simulating queue ops - Reducing app.py complexity: from 123 to 56 lines, keeping only orchestration RedisQueueWorker is now the single testable unit for queue-based request processing, while app.py remains the thin composition root. All e2e tests refactored to use the worker fixture directly.

- Move config/ directory to infra/config/ for better Docker context - Parametrize rdf_mapping_path in TurtleRDFMapper.__init__ - Parametrize resolver_config_path in build_entity_resolver() - Update build_rdf_mapper() to accept rdf_mapping_path - Add CLI args to app.py: --rdf-mapping-path, --resolver-config-path - Add env var support: RDF_MAPPING_PATH, RESOLVER_CONFIG_PATH - Update docker-compose.yml to pass config paths as env vars - Update Dockerfile to copy config from infra/config - Update conftest.py to use new infra/config path All factories now support both parametrized paths (for testing/Docker) and default fallback paths (for backwards compatibility). Tests: 53 passed, 2 skipped

gkostkowski added 19 commits February 26, 2026 16:42

refactor(adapters): organize imports following Python conventions

387d97b

- Move all resolver adapter imports to top (standard library → third-party → local) - Sort local imports alphabetically by module - Sort __all__ entries alphabetically - Remove duplicate import comments

refactor(pyproject): add resolver dependencies (pandas, splink)

7727232

- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage) - Add splink >=4.0,<5.0 (Splink record linkage engine) - Both required for resolver adapter implementations

refactor(pyproject): add resolver dependencies (pandas, splink)

8245cef

- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage) - Add splink >=4.0,<5.0 (Splink record linkage engine) - Both required for resolver adapter implementations

chore(test): Reconcile test directories

2d9a5d7

gkostkowski requested a review from costezki February 27, 2026 19:13

gkostkowski added 2 commits February 27, 2026 20:23

Clean up test artefacts

56bbc85

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13

Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13
gkostkowski wants to merge 21 commits intodevelopfrom
feature/ERS1-124/implement-ere

gkostkowski commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gkostkowski commented Feb 27, 2026

Summary

What's included

Architecture Refinements

Factory Pattern & Dependency Injection

Service Naming Clarification

Code Cleanup

Test Coverage: 37+ Passing Tests

Service Unit Tests (13 tests)

Adapter Integration Tests (6 tests)

E2E Integration Tests (14 tests)

BDD Scenarios (5 scenarios)

Test Organization

Entrypoint Updates (app.py)

Architecture Compliance

TODO

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Entrypoint Updates (`app.py`)