Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13
Draft
gkostkowski wants to merge 21 commits intodevelopfrom
Draft
Complete Basic Entity Resolution Engine Implementation with Extended Test Coverage#13gkostkowski wants to merge 21 commits intodevelopfrom
gkostkowski wants to merge 21 commits intodevelopfrom
Conversation
- Add core resolver models: MentionId, ClusterId, Mention, MentionLink, CandidateCluster, ClusterMembership, ResolutionResult, ResolverState - Create src/ere/models/resolver/ package with clean layer separation - Models support backward compatibility with flat-dict representation - All imports verified; smoke tests passing
…stration) - Add SimilarityLinker port in services (abstract interface for pairwise similarity scoring) - Add repository ports in adapters (MentionRepository, SimilarityRepository, ClusterRepository) - Add ResolverConfig in services (typed YAML configuration) - Add EntityResolutionService in services (core orchestration service) - Export all resolver ports and services from respective __init__ files - All files compile; import paths verified
- Add duckdb_schema.py: dynamic table creation for mentions, similarities, clusters - Add duckdb_repositories.py: DuckDB implementations of MentionRepository, SimilarityRepository, ClusterRepository - Add splink_linker_impl.py: Splink-backed SimilarityLinker with cold-start defaults and thread-safe training - Add build_tf_df() helper for converting mentions to Splink TF DataFrames - Export all implementations from adapters/__init__.py - All files compile; import paths verified
- Move all resolver adapter imports to top (standard library → third-party → local) - Sort local imports alphabetically by module - Sort __all__ entries alphabetically - Remove duplicate import comments
- Add config/resolver.yaml: standard blocking (country_code only) - Add config/resolver_compound.yaml: compound blocking (country_code AND city) - Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name) - All configs include Splink settings, cold-start defaults, and threshold calibration notes
- Add config/README.md: explains three blocking strategy variants and when to use each - Add config/resolver.yaml: standard single-field blocking (country_code) - Add config/resolver_compound.yaml: compound blocking (country_code AND city) - Add config/resolver_multirule.yaml: multi-rule blocking (country OR city OR name) - All configs include Splink settings and cold-start defaults - Removed project-specific references; kept only succinct, self-contained guidance
- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage) - Add splink >=4.0,<5.0 (Splink record linkage engine) - Both required for resolver adapter implementations
- Add pandas >=2.0,<3.0 (DataFrame operations for similarity storage) - Add splink >=4.0,<5.0 (Splink record linkage engine) - Both required for resolver adapter implementations
Implement config-driven RDF parsing, domain object mapping, and resolve pipeline to connect EntityResolutionService to the ERE contract. Replace stub resolve_entity_mention() with full implementation supporting: - Config-driven RDF Turtle parsing with namespace prefix resolution - Deterministic mention ID derivation per ERE spec - Idempotency guard to prevent duplicate DB writes on re-resolution - Per-entity-type service isolation for multi-entity scenarios - Explicit threshold vs confidence semantics Also includes: - EntityResolutionResolver adapter for pub/sub service path - Test isolation via _reset_services() in test layer - Updated BDD scenarios to match actual resolver capabilities - RDF mapping config for ORGANISATION entity type (PROCEDURE deferred) Tests: 7 passed, 1 xfailed (conflict detection), no regressions
…y injection Remove global _services and _entity_mappings state from resolution.py and replace with pytest fixture in conftest.py. This eliminates test-aware code from the production path and makes service isolation explicit. Changes: - Extract EntityResolutionService creation to conftest fixture - Make resolve_entity_mention() require explicit service parameter - Update BDD steps to inject service via fixture - Simplify resolution.py: stateless RDF mapping + domain construction - Entity fields sourced from config files (resolver.yaml) as single source of truth Benefits: - No global mutable state or test-specific cleanup logic in production code - Clear test dependencies via pytest fixture mechanism - Each BDD scenario gets isolated fresh service instance - Easier to test and reason about service lifecycle Tests: 7 passed, 1 xfailed, no regressions
Add back resolve_to_result() which was inadvertently removed and used by: - EntityResolutionResolver adapter (pub/sub service path) - resolve_entity_mention() for core resolution logic Also add build_resolution_service() factory for creating service instances without global state (used by adapter for fresh instances, fixture for tests). Changes: - resolve_to_result(): core pipeline (RDF mapping + idempotency + resolution) - build_resolution_service(): factory for EntityResolutionService - EntityResolutionResolver: uses factory to build fresh service per request - resolve_entity_mention(): uses resolve_to_result() internally Tests: 7 passed, 1 xfailed, no regressions
Migrate 13 comprehensive service tests covering core algorithm, training, state management, and edge cases. All tests pass with in-memory stubs and no external dependencies. Also add .gitignore patterns to prevent planning artifacts from being committed.
Migrate 13 comprehensive E2E tests covering full resolver stack integration: real DuckDB repositories + SpLinkSimilarityLinker + service. Tests validate realistic scenarios including clustering, blocking rules, warm-start capability, and state accumulation. All tests pass. Combined test suite now includes: - 6 adapter integration tests (DuckDB + service) - 13 service unit tests (algorithm, training, state) - 13 E2E tests (full resolver stack) Total: 32/32 tests passing with no regressions.
Migrate 5 Gherkin scenarios covering core algorithm with configurable thresholds: singleton creation, matching, clustering, candidate ordering, and below-threshold link handling. Create corresponding step definitions using pytest-bdd and simple in-memory mention stubs. All 5 BDD scenarios pass. Combined test migration now complete: - 6 adapter integration tests (DuckDB + service) - 13 service unit tests (algorithm, training, state) - 13 E2E tests (full resolver stack) - 5 BDD scenarios (algorithm behavior) Total: 37/37 tests passing, modules 1-4 complete.
- Create adapters/factories.py to centralize DuckDB and Splink instantiation - Remove concrete exports from adapters/__init__.py (only abstracts now) - Prevents services from accidentally importing implementation details - Enables adapter swapping without changing service layer
Key changes: - EntityResolutionResolver: core algorithm (was EntityResolutionService) - EntityResolutionService: public API service (was EntityResolutionResolver) - Added build_entity_resolution_service() factory - app.py now builds resolver, mapper, and service once before main loop - Updated all imports, docstrings, and test fixtures This fixes semantic confusion and eliminates repeated factory calls per request, improving performance and clarity. Files modified: - src/ere/services/entity_resolution_service.py: merged resolution.py + resolver_adapter.py - src/ere/services/factories.py: extracted from adapters - src/ere/adapters/factories.py: simplified to RDF mapper only - src/ere/adapters/rdf_mapper_port.py: moved from services - src/ere/entrypoints/app.py: use factories, build once - All test files: updated imports and class names - Docstrings: updated to reflect correct naming Tests pass: 39+
- Add test/e2e/test_app.py: complete app.py flow with Redis - Validates request parsing, entity resolution, response output - Tests realistic multi-entity clustering and RDF generation - Requires Redis with password env variable - Reorganize integration tests: - Rename test_entity_resolution_service_e2e.py → test_entity_resolver.py - Move from test/e2e/ to test/integration/ (adapter integration responsibility) - No logic changes, clearer layer separation - Remove outdated test_p8_3_resolver_integration.py (POC artifact) Test layout now reflects architecture boundaries: - test/e2e/: app entrypoint flows (Redis) - test/integration/: adapter/service integration (real implementations) - test/service/: service layer unit tests (stubs) - test/features/: BDD algorithm scenarios
Move request-response cycle logic from app.py into RedisQueueWorker entrypoint driver. This improves testability by: - Separating concerns: orchestration (app.py) vs. queue processing (worker) - Enabling dependency injection: worker receives Redis client and service - Simplifying e2e tests: directly test worker instead of simulating queue ops - Reducing app.py complexity: from 123 to 56 lines, keeping only orchestration RedisQueueWorker is now the single testable unit for queue-based request processing, while app.py remains the thin composition root. All e2e tests refactored to use the worker fixture directly.
- Move config/ directory to infra/config/ for better Docker context - Parametrize rdf_mapping_path in TurtleRDFMapper.__init__ - Parametrize resolver_config_path in build_entity_resolver() - Update build_rdf_mapper() to accept rdf_mapping_path - Add CLI args to app.py: --rdf-mapping-path, --resolver-config-path - Add env var support: RDF_MAPPING_PATH, RESOLVER_CONFIG_PATH - Update docker-compose.yml to pass config paths as env vars - Update Dockerfile to copy config from infra/config - Update conftest.py to use new infra/config path All factories now support both parametrized paths (for testing/Docker) and default fallback paths (for backwards compatibility). Tests: 53 passed, 2 skipped
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR completes the resolver implementation with comprehensive test coverage and architectural refinements. It includes 37+ passing tests spanning unit tests, BDD scenarios, and end-to-end integration, plus significant refactoring to clarify naming, eliminate global state, and establish proper dependency injection patterns.
What's included
Architecture Refinements
Factory Pattern & Dependency Injection
services/factories.py: Centralized service instantiation (new)build_entity_resolution_service(): Constructs resolver, mapper, and service in one calladapters/factories.py: Simplified to RDF mapper instantiationconftest.py: Removes global state from production codeService Naming Clarification
EntityResolver: Core algorithm (wasEntityResolutionService)EntityResolutionService: Public API service (wasEntityResolutionResolver)Code Cleanup
resolution.pyandresolver_adapter.pylogic intoentity_resolution_service.pyrdf_mapper_port.pytoadapters/(infrastructure boundary)resolution.pyandresolver_adapter.py(logic consolidated)Test Coverage: 37+ Passing Tests
Service Unit Tests (13 tests)
Comprehensive coverage of core algorithm:
Adapter Integration Tests (6 tests)
DuckDB repositories + Splink linker integration:
E2E Integration Tests (14 tests)
Resolver stack tests (13 tests):
App entrypoint test (1 test):
test/e2e/test_app.py: Complete Redis-based flowBDD Scenarios (5 scenarios)
Algorithm behavior specs in Gherkin:
EntityResolutionAlgorithm.featurewith configurable thresholdsTest Organization
tests/→test/(single test root)test/e2e/: App entrypoint flows (Redis-based flows)test/integration/: Adapter/service integration (real implementations)test/service/: Service layer unit tests (with stubs)test/adapters/: Adapter-level teststest/features/: BDD Gherkin scenariostest/steps/: BDD step definitions.gitignorepatterns to exclude planning artifactsEntrypoint Updates (
app.py)Architecture Compliance
TODO
- Note:
test_app.pycurrently requires the Redis password via an environment variable_test*and other unused artefacts)