refactor(pyspark): restructure pyspark components #2007
base: main
ELC commented on May 24, 2025
- Remove deprecated ColumnSchema and replace with ComponentSchema.
- Update DataFrameSchema to inherit from the new base class.
- Adjust type hints to use new DataFrame types for better clarity.
- Clean up imports and registration of backend components.
Signed-off-by: Ezequiel Leonardo Castaño <[email protected]>
- Introduce Self type for better type hinting in models.
- Refactor DataFrameModel and related classes to use Self.
- Update validation methods to enhance type safety.
- Simplify error handling in backend implementations.
- Remove unused parameters and improve code readability.
Signed-off-by: Ezequiel Leonardo Castaño <[email protected]>
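For context on the `Self` change listed above, here is a minimal sketch of what `Self` buys in a class hierarchy. The classes `SketchSchema` and `SketchColumn` are hypothetical stand-ins, not pandera's actual classes, and the import is guarded so the snippet also runs on Python versions older than 3.11.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    # Python 3.11+; older versions can import Self from typing_extensions.
    from typing import Self


class SketchSchema:
    """Hypothetical stand-in for a schema base class (not pandera's API)."""

    def __init__(self, name: str | None = None) -> None:
        self.name = name
        self.metadata: Dict[str, Any] = {}

    def set_name(self, name: str) -> Self:
        # Annotating with Self means subclasses keep their own type when chaining.
        self.name = name
        return self

    @classmethod
    def from_name(cls, name: str) -> Self:
        # cls is the class the method was called on, so subclasses get subclass instances.
        return cls(name)


class SketchColumn(SketchSchema):
    """Hypothetical subclass; chained calls are still typed as SketchColumn."""

    def set_nullable(self, nullable: bool) -> Self:
        self.metadata["nullable"] = nullable
        return self


# A type checker infers SketchColumn for the whole chain, not SketchSchema.
column = SketchColumn.from_name("price").set_name("unit_price").set_nullable(False)
```

Without `Self`, `set_name` would have to be annotated as returning `SketchSchema`, and a type checker would reject the trailing `set_nullable` call.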
This PR aims to bring the pyspark implementation in line with the new class structure used in Pandas and Polars.

@cosmicBboy this is still WIP, but I need approval to run the CI/CD tests so I can fix any failing tests.
- Introduce dtype property in Column class for better type management.
- Set default lazy parameter to False in DataFrameSchema methods.
- Remove error_handler parameter from several methods to simplify error handling.
- Update tests to reflect changes in schema validation logic.
- Disable ANSI SQL mode in Spark sessions for compatibility.
Signed-off-by: Ezequiel Leonardo Castaño <[email protected]>
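The last bullet above disables ANSI SQL mode in the Spark sessions. A minimal sketch of how that could look, assuming Spark's standard `spark.sql.ansi.enabled` setting; the names `SPARK_TEST_CONF` and `build_test_session` and the app name are hypothetical and not taken from this PR.

```python
from typing import Dict

# Spark configuration for test sessions. "spark.sql.ansi.enabled" is the
# standard Spark SQL setting that toggles ANSI-compliant behavior.
SPARK_TEST_CONF: Dict[str, str] = {
    "spark.sql.ansi.enabled": "false",
}


def build_test_session(app_name: str = "pyspark-schema-tests"):
    """Create (or reuse) a SparkSession with the test configuration applied.

    Hypothetical helper: pyspark is imported lazily so the configuration
    above can be inspected without a Spark installation.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_TEST_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_test_session()` in a test fixture would give every test the same non-ANSI behavior regardless of the default in the installed Spark version.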
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #2007       +/-   ##
===========================================
- Coverage   94.28%   83.29%   -10.99%
===========================================
  Files          91      134       +43
  Lines        7013    10226     +3213
===========================================
+ Hits         6612     8518     +1906
- Misses        401     1708     +1307
@ELC looking into reviving this PR now. There are a bunch of conflicts I need to resolve.

Hey @cosmicBboy, I'm a heavy pyspark user and I'd love to help here. I believe the PR already has most of the needed changes, but I will need help fixing the tests: when I tried to run them on Codespaces, it died before finishing. Would you be able to help me there?

Hi @ELC, apologies for the delay... going to look into this now.