[Feature]: Dask-based Parallelisation for CodeEntropy #306

@harryswift01

Problem / Motivation

CodeEntropy currently executes its workflow sequentially, which limits performance on large datasets and computationally intensive analyses. Runs take longer than necessary and leave available CPU cores idle.

Proposed Solution

Introduce parallel execution using Dask by leveraging the existing DAG-based architecture.

The codebase has already been refactored to represent the workflow as a DAG, which provides a natural mapping to a Dask task graph. Each DAG node can be treated as a Dask task, with dependencies represented as edges.

The implementation should:

  • Map DAG nodes to Dask tasks (e.g. using dask.delayed)
  • Preserve existing DAG dependencies as task dependencies
  • Execute the DAG via a Dask scheduler (initially local)
  • Provide a configuration option to toggle between sequential and parallel execution
  • Keep the DAG definition separate from the execution backend
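
The points above can be sketched as follows. This is a minimal illustration, not CodeEntropy's actual API: the `Dag` dictionary layout, the function names `run_sequential`, `run_dask` and `run`, and the toy nodes are all hypothetical stand-ins for whatever the refactored DAG classes expose. The only Dask calls used are `dask.delayed` and `dask.compute`.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical DAG layout: node name -> (callable, names of upstream nodes).
Dag = Dict[str, Tuple[Callable[..., Any], List[str]]]


def topological_order(dag: Dag) -> List[str]:
    """Return node names so that every node appears after its dependencies."""
    order: List[str] = []
    state: Dict[str, int] = {}  # 1 = currently being visited, 2 = finished

    def visit(name: str) -> None:
        if state.get(name) == 2:
            return
        if state.get(name) == 1:
            raise ValueError(f"cycle detected at node {name!r}")
        state[name] = 1
        for dep in dag[name][1]:
            visit(dep)
        state[name] = 2
        order.append(name)

    for name in dag:
        visit(name)
    return order


def run_sequential(dag: Dag) -> Dict[str, Any]:
    """Execute every node in dependency order in the current process."""
    results: Dict[str, Any] = {}
    for name in topological_order(dag):
        func, deps = dag[name]
        results[name] = func(*(results[d] for d in deps))
    return results


def run_dask(dag: Dag, scheduler: str = "threads") -> Dict[str, Any]:
    """Execute the same DAG through a local Dask scheduler."""
    import dask  # imported lazily so the sequential path does not need Dask
    from dask import delayed

    tasks: Dict[str, Any] = {}
    for name in topological_order(dag):
        func, deps = dag[name]
        # Passing Delayed objects as arguments records the DAG edges
        # as dependencies in the Dask task graph.
        tasks[name] = delayed(func)(*(tasks[d] for d in deps))
    names = list(tasks)
    values = dask.compute(*(tasks[n] for n in names), scheduler=scheduler)
    return dict(zip(names, values))


def run(dag: Dag, parallel: bool = False) -> Dict[str, Any]:
    """Configuration toggle: the DAG definition is the same for both backends."""
    return run_dask(dag) if parallel else run_sequential(dag)


# Toy nodes standing in for real workflow steps.
def load() -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]


def total(xs: List[float]) -> float:
    return sum(xs)


def count(xs: List[float]) -> int:
    return len(xs)


def mean(t: float, c: int) -> float:
    return t / c


example_dag: Dag = {
    "load": (load, []),
    "total": (total, ["load"]),
    "count": (count, ["load"]),  # independent of "total", so both may run concurrently
    "mean": (mean, ["total", "count"]),
}

print(run(example_dag)["mean"])  # 2.5
```

The sketch defaults to the threaded scheduler for simplicity. For CPU-bound nodes that hold the GIL, passing `scheduler="processes"` to `dask.compute` is likely the more useful setting, at the cost of serialising inputs and outputs between processes.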

Alternatives Considered

  • multiprocessing / concurrent.futures: simpler, but dependencies between tasks must be managed by hand and scaling beyond a single machine is harder
  • Threading: limited effectiveness for CPU-bound workloads due to the GIL
  • Joblib: suitable for simple parallelism but less appropriate for DAG-based workflows

Dask is preferred because it supports task graphs natively and the same graph can later be moved from a local scheduler to a distributed one.

Expected Impact

  • Reduced runtime for large workloads
  • Improved CPU utilisation
  • Better scalability of the CodeEntropy pipeline
  • Minimal disruption to existing DAG-based architecture
  • Pathway for future distributed execution support

Additional Context

  • The existing DAG refactor aligns well with Dask’s execution model and should minimise required changes.
  • Initial implementation should focus on wrapping existing DAG nodes as Dask tasks and validating correctness against the sequential version.
  • Benchmarking should be performed to compare performance before and after parallelisation.
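
One possible shape for the validation and benchmarking step is sketched below. It assumes each backend returns a dictionary of per-node results; the helper names `results_match` and `best_time` are hypothetical, and the stand-in backends are placeholders for the real sequential and Dask runs. Floats are compared with a relative tolerance on the assumption that a parallel run may reorder floating-point operations.

```python
import math
import time
from typing import Any, Callable, Dict


def results_match(
    expected: Dict[str, Any], actual: Dict[str, Any], rel_tol: float = 1e-9
) -> bool:
    """Compare per-node results, allowing small floating-point differences."""
    if expected.keys() != actual.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        if isinstance(want, float) and isinstance(got, float):
            if not math.isclose(want, got, rel_tol=rel_tol):
                return False
        elif want != got:
            return False
    return True


def best_time(fn: Callable[[], Any], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over several calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-ins for the two execution backends (placeholders, not real workflows).
def sequential_backend() -> Dict[str, Any]:
    return {"entropy": 0.1 + 0.2, "frames": 100}


def parallel_backend() -> Dict[str, Any]:
    return {"entropy": 0.3, "frames": 100}


if results_match(sequential_backend(), parallel_backend()):
    t_seq = best_time(sequential_backend)
    t_par = best_time(parallel_backend)
    print(f"sequential: {t_seq:.6f} s, parallel: {t_par:.6f} s")
else:
    print("backends disagree; fix correctness before benchmarking")
```

Taking the minimum over a few repeats reduces noise from other processes; for a fair comparison the parallel timing should include scheduler start-up, since that overhead is part of what a user would experience.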
