@alinakbase
Collaborator

Summary

This PR introduces a new PySpark-based pipeline that processes Gene Ontology (GO) annotation files and exports the results in Delta Lake format. It replaces pandas with PySpark.

Changes Included

  • New parser module: src/parsers/association_update.py

    • Parses GAF-like CSV annotation files
    • Handles qualifiers, evidence codes, publications, and metadata
    • Joins with ECO mapping for evidence type inference
    • Outputs a structured Delta table for downstream use
    • Supports optional table registration (--register) and temporary view mode (--temp)
    • Built with modular functions and a Click CLI entry point
  • Test suite: tests/test_association_update.py

    • Unit tests for all core functions. The suite:
      • Uses pytest.mark.parametrize to test key transformation logic
      • Checks the correctness of date parsing, predicate normalization, and evidence joins
      • Runs in Spark local mode for lightweight CI validation
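
For readers unfamiliar with the transformations listed above, here is a minimal sketch of the row-level behaviour such tests might check. It is plain Python using only the standard library, not the PySpark code from this PR. The function names, the lowercasing of predicates, and the two-entry ECO lookup are assumptions made for illustration. The PR itself joins against an ECO mapping table rather than a dict.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# Hypothetical stand-in for the ECO mapping table (illustrative subset only).
ECO_BY_EVIDENCE_CODE = {
    "IDA": "ECO:0000314",
    "IEA": "ECO:0000501",
}


def parse_gaf_date(raw: str) -> Optional[date]:
    """Parse a GAF-style YYYYMMDD date; return None if it is malformed."""
    raw = raw.strip()
    if len(raw) != 8 or not raw.isdigit():
        return None
    try:
        return datetime.strptime(raw, "%Y%m%d").date()
    except ValueError:  # e.g. month 13 or day 40
        return None


def normalize_predicate(qualifier: str) -> Tuple[bool, str]:
    """Split a qualifier such as 'NOT|enables' into (negated, predicate)."""
    parts = [p.strip() for p in qualifier.split("|") if p.strip()]
    negated = any(p.upper() == "NOT" for p in parts)
    predicates = [p.lower() for p in parts if p.upper() != "NOT"]
    return negated, (predicates[0] if predicates else "")


def evidence_to_eco(code: str) -> Optional[str]:
    """Look up an ECO ID; None mimics an unmatched row in a left join."""
    return ECO_BY_EVIDENCE_CODE.get(code.strip().upper())


# Table-driven cases: (input, expected output).
DATE_CASES = [
    ("20251113", date(2025, 11, 13)),
    ("20251340", None),    # invalid month
    ("2025-11-13", None),  # wrong format
]
PREDICATE_CASES = [
    ("enables", (False, "enables")),
    ("NOT|involved_in", (True, "involved_in")),
    (" Located_In ", (False, "located_in")),
]

for raw, expected in DATE_CASES:
    assert parse_gaf_date(raw) == expected
for raw, expected in PREDICATE_CASES:
    assert normalize_predicate(raw) == expected
```

In a pytest suite, each case list could be passed to `pytest.mark.parametrize("raw, expected", DATE_CASES)` so that every row is reported as a separate test.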

@ialarmedalien force-pushed the alinakbase-association branch 2 times, most recently from 75c3872 to 3ef25c0 (November 13, 2025 15:42)
@codecov

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 63.55932% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.12%. Comparing base (2522e03) to head (d0745f5).

Files with missing lines | Patch % | Lines
...dm_data_loader_utils/parsers/association_update.py | 63.55% | 43 Missing ⚠️

@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
- Coverage   86.97%   83.12%   -3.86%     
==========================================
  Files           9       10       +1     
  Lines         599      717     +118     
==========================================
+ Hits          521      596      +75     
- Misses         78      121      +43     

Files with missing lines | Coverage Δ
...dm_data_loader_utils/parsers/association_update.py | 63.55% <63.55%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2522e03...d0745f5.

@ialarmedalien force-pushed the alinakbase-association branch from 73a2b40 to 0c8b78e (December 2, 2025 16:36)
@ialarmedalien force-pushed the alinakbase-association branch from 0c8b78e to d0745f5 (December 2, 2025 22:40)