Open
Description
Describe the task
Investigate and fix the deduplication functionality in the `spatial_join` utility that is not properly removing duplicate `opa_id` records after spatial joins. During pipeline validation development, it was discovered that the expected deduplication by `opa_id` in the spatial join utility is failing, leading to duplicate records in the output. As a temporary workaround, individual deduplication logic was added to several pipeline services, but this addresses the symptom rather than the root cause. This ticket involves identifying why the spatial join deduplication is not working as designed, fixing the underlying issue, and then refactoring the affected services to remove the temporary workaround code.
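For context, a one-to-many spatial match is one common way duplicate `opa_id` rows can appear after a join: a single parcel that falls inside two overlapping polygons comes back as two rows. The sketch below illustrates that effect and the deduplication the utility is expected to perform. It is a pure-Python illustration under stated assumptions: the sample data, the bounding-box `contains` check, and the names `naive_spatial_join` and `dedupe_by_opa_id` are hypothetical and are not taken from the actual `spatial_join` utility, whose root cause is still to be investigated.

```python
from typing import Dict, List, Tuple

# Hypothetical bounding box: (min_x, min_y, max_x, max_y)
BBox = Tuple[float, float, float, float]


def contains(box: BBox, x: float, y: float) -> bool:
    """Return True if the point (x, y) lies inside the bounding box."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def naive_spatial_join(parcels: List[Dict], zones: List[Dict]) -> List[Dict]:
    """Emit one output row per (parcel, zone) match, so a parcel that
    falls inside several zones appears several times."""
    joined = []
    for parcel in parcels:
        for zone in zones:
            if contains(zone["bbox"], parcel["x"], parcel["y"]):
                joined.append({**parcel, "zone": zone["name"]})
    return joined


def dedupe_by_opa_id(rows: List[Dict]) -> List[Dict]:
    """Keep only the first row seen for each opa_id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["opa_id"] not in seen:
            seen.add(row["opa_id"])
            unique.append(row)
    return unique


# Made-up sample data: parcel "001" sits inside both overlapping zones.
parcels = [
    {"opa_id": "001", "x": 1.0, "y": 1.0},
    {"opa_id": "002", "x": 5.0, "y": 5.0},
]
zones = [
    {"name": "A", "bbox": (0.0, 0.0, 2.0, 2.0)},
    {"name": "B", "bbox": (0.5, 0.5, 6.0, 6.0)},  # overlaps zone A
]

joined = naive_spatial_join(parcels, zones)   # parcel "001" appears twice
deduped = dedupe_by_opa_id(joined)            # one row per opa_id
```

In a pandas or geopandas pipeline, the equivalent step would typically be `drop_duplicates(subset="opa_id")` on the joined frame. Whether the utility already does this, and where it goes wrong, is exactly what this ticket needs to establish.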
Acceptance Criteria
- Investigate the `spatial_join` utility to identify why `opa_id` deduplication is not functioning correctly
- Debug and analyze the spatial join process to understand where duplicate records are being introduced or preserved
- Fix the root cause of the deduplication failure in the spatial join utility
- Test the fixed spatial join utility with sample data to verify proper `opa_id` deduplication
- Identify all pipeline services that have temporary individual deduplication workarounds
- Refactor identified services to remove temporary deduplication code once spatial join is fixed
- Ensure all unit tests continue to pass after fixing the spatial join utility
- Verify that all validation checks pass with the refactored services
- Run the complete pipeline to ensure it executes successfully from start to finish
- Add or enhance tests for the spatial join utility to prevent regression of this issue
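One possible shape for the regression test in the last criterion is sketched below. It is an assumption-laden outline, not the project's test suite: `find_duplicate_opa_ids` is a hypothetical helper, and `spatial_join_stand_in` is a placeholder that would be swapped for a call to the real `spatial_join` utility, whose signature is not described in this ticket.

```python
import unittest
from collections import Counter
from typing import Dict, List


def find_duplicate_opa_ids(rows: List[Dict]) -> List[str]:
    """Return the sorted opa_id values that occur more than once."""
    counts = Counter(row["opa_id"] for row in rows)
    return sorted(opa_id for opa_id, n in counts.items() if n > 1)


def spatial_join_stand_in(rows: List[Dict]) -> List[Dict]:
    """Hypothetical placeholder for the real spatial_join utility.
    It keeps the first row per opa_id so the sketch is runnable."""
    first_seen: Dict[str, Dict] = {}
    for row in rows:
        first_seen.setdefault(row["opa_id"], row)
    return list(first_seen.values())


class SpatialJoinDedupTest(unittest.TestCase):
    # Made-up input where opa_id "001" matched two polygons.
    rows = [
        {"opa_id": "001", "zone": "A"},
        {"opa_id": "001", "zone": "B"},
        {"opa_id": "002", "zone": "B"},
    ]

    def test_helper_detects_duplicates_in_raw_input(self):
        self.assertEqual(find_duplicate_opa_ids(self.rows), ["001"])

    def test_join_output_has_unique_opa_ids(self):
        result = spatial_join_stand_in(self.rows)
        self.assertEqual(find_duplicate_opa_ids(result), [])
        self.assertEqual(len(result), 2)
```

Saved as a test module, this can be run with `python -m unittest`. Asserting on the list of offending `opa_id` values rather than only on a row count makes a failure message point directly at the records that were duplicated.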