[Feature]: Dask-based Parallelisation for CodeEntropy #306
Open
Description
Problem / Motivation
CodeEntropy currently executes workflows sequentially, which limits performance for large or computationally intensive datasets. This results in longer runtimes and underutilisation of available CPU resources.
Proposed Solution
Introduce parallel execution using Dask by leveraging the existing DAG-based architecture.
The codebase has already been refactored to represent the workflow as a DAG, which provides a natural mapping to a Dask task graph. Each DAG node can be treated as a Dask task, with dependencies represented as edges.
The implementation should:
- Map DAG nodes to Dask tasks (e.g. using `dask.delayed`)
- Preserve existing DAG dependencies as task dependencies
- Execute the DAG via a Dask scheduler (initially local)
- Provide a configuration option to toggle between sequential and parallel execution
- Keep the DAG definition separate from the execution backend
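The requirements above could be sketched roughly as follows. This is a minimal illustration, not CodeEntropy's actual code: the `DAG` dictionary layout, the node functions (`load`, `scale`, `shift`, `combine`) and the `run_sequential`, `run_dask` and `run_dag` helpers are all hypothetical names invented for this example. It assumes each node is a plain callable whose positional arguments are the results of its upstream nodes.

```python
from graphlib import TopologicalSorter


# Hypothetical node functions standing in for real CodeEntropy DAG nodes.
def load():
    return [1.0, 2.0, 3.0]


def scale(values):
    return [2.0 * v for v in values]


def shift(values):
    return [v + 1.0 for v in values]


def combine(a, b):
    return sum(a) + sum(b)


# Hypothetical DAG definition: node name -> (callable, upstream node names).
# It contains no Dask-specific code, so either backend can execute it.
DAG = {
    "load": (load, []),
    "scale": (scale, ["load"]),
    "shift": (shift, ["load"]),
    "combine": (combine, ["scale", "shift"]),
}


def _topological_order(dag):
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: deps for name, (_, deps) in dag.items()}
    return list(TopologicalSorter(graph).static_order())


def run_sequential(dag):
    """Run every node in dependency order in the current process."""
    results = {}
    for name in _topological_order(dag):
        func, deps = dag[name]
        results[name] = func(*(results[d] for d in deps))
    return results


def run_dask(dag, scheduler="threads"):
    """Wrap each node in dask.delayed and let a local scheduler run the graph."""
    import dask  # imported lazily so the sequential path works without Dask
    from dask import delayed

    tasks = {}
    for name in _topological_order(dag):
        func, deps = dag[name]
        # Passing Delayed objects as arguments records the DAG edges.
        tasks[name] = delayed(func)(*(tasks[d] for d in deps))

    names = list(tasks)
    # "threads" keeps the sketch simple; CPU-bound nodes may need
    # scheduler="processes" or a distributed client instead.
    values = dask.compute(*(tasks[n] for n in names), scheduler=scheduler)
    return dict(zip(names, values))


def run_dag(dag, parallel=False):
    """Configuration toggle between the sequential and Dask backends."""
    return run_dask(dag) if parallel else run_sequential(dag)
```

Calling `run_dag(DAG)` uses the sequential path and `run_dag(DAG, parallel=True)` uses Dask. In this toy graph `scale` and `shift` depend only on `load`, so they are the two tasks a scheduler is free to run concurrently.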
Alternatives Considered
- `multiprocessing`/`concurrent.futures`: simpler but less flexible and harder to scale
- Threading: limited effectiveness for CPU-bound workloads due to the GIL
- Joblib: suitable for simple parallelism but less appropriate for DAG-based workflows
Dask is preferred because it supports task graphs natively and can scale from a local scheduler to a distributed cluster.
Expected Impact
- Reduced runtime for large workloads
- Improved CPU utilisation
- Better scalability of the CodeEntropy pipeline
- Minimal disruption to existing DAG-based architecture
- Pathway for future distributed execution support
Additional Context
- The existing DAG refactor aligns well with Dask’s execution model and should minimise required changes.
- Initial implementation should focus on wrapping existing DAG nodes as Dask tasks and validating correctness against the sequential version.
- Benchmarking should be performed to compare performance before and after parallelisation.
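One possible shape for the validation and benchmarking step is sketched below. It is only an assumption about how such a check might look: `timed`, `results_match` and `compare_backends` are hypothetical helpers, and `fake_sequential` and `fake_parallel` are stand-ins for the real sequential and Dask runners, assumed here to return a dictionary of numeric results per node.

```python
import math
import time


def timed(func, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def results_match(reference, candidate, rel_tol=1e-9):
    """Compare two {node name: float} result dictionaries within a tolerance."""
    if reference.keys() != candidate.keys():
        return False
    return all(
        math.isclose(reference[key], candidate[key], rel_tol=rel_tol)
        for key in reference
    )


def compare_backends(sequential_run, parallel_run, workload, repeats=3):
    """Check that both backends agree and report the best wall-clock time of each."""
    seq_times, par_times = [], []
    match = True
    for _ in range(repeats):
        seq_result, seq_elapsed = timed(sequential_run, workload)
        par_result, par_elapsed = timed(parallel_run, workload)
        seq_times.append(seq_elapsed)
        par_times.append(par_elapsed)
        match = match and results_match(seq_result, par_result)
    best_seq, best_par = min(seq_times), min(par_times)
    return {
        "match": match,
        "sequential_s": best_seq,
        "parallel_s": best_par,
        # Speedup is undefined if the parallel timing rounds to zero.
        "speedup": best_seq / best_par if best_par > 0 else None,
    }


# Stand-in runners (hypothetical); real ones would execute the CodeEntropy DAG.
def fake_sequential(values):
    return {"total": sum(values), "mean": sum(values) / len(values)}


def fake_parallel(values):
    return {"total": math.fsum(values), "mean": math.fsum(values) / len(values)}


report = compare_backends(fake_sequential, fake_parallel, [1.0, 2.0, 3.0])
print(report["match"])
```

Comparing with a relative tolerance rather than exact equality is a deliberate choice here: a parallel backend may combine floating-point partial results in a different order than the sequential one.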
Labels
feature request