[EPIC] Remove CoalesceBatchesExec operator #18779

@alamb

Description

DataFusion currently has a standalone CoalesceBatchesExec operator that ensures batches are large enough to take advantage of vectorized execution.

This operator is inserted after "filter-like" operations that can produce small batches, such as FilterExec, HashJoinExec, and RepartitionExec. Its doc comment describes it as follows:

/// `CoalesceBatchesExec` combines small batches into larger batches for more
/// efficient vectorized processing by later operators.
///
/// The operator buffers batches until it collects `target_batch_size` rows and
/// then emits a single concatenated batch. When only a limited number of rows
/// are necessary (specified by the `fetch` parameter), the operator will stop
/// buffering and returns the final batch once the number of collected rows
/// reaches the `fetch` value.
///
/// See [`LimitedBatchCoalescer`] for more information
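
To make the buffering behavior described in that doc comment concrete, here is a minimal sketch in plain Rust. It is not DataFusion's implementation: `SketchCoalescer` and `Batch` are hypothetical names, a batch is modelled as a `Vec<i32>` rather than an Arrow `RecordBatch`, and the choice to emit batches of exactly `target_batch_size` rows is an assumption of this sketch.

```rust
/// Simplified stand-in for an Arrow `RecordBatch`: a single column of rows.
/// (Assumption for illustration; the real operator works on `RecordBatch`.)
type Batch = Vec<i32>;

/// Hypothetical coalescer modelled on the doc comment above.
struct SketchCoalescer {
    target_batch_size: usize,
    /// Optional row limit, mirroring the `fetch` parameter.
    fetch: Option<usize>,
    /// Rows buffered but not yet emitted.
    buffered: Vec<i32>,
    /// Rows accepted so far (buffered plus emitted).
    total_rows: usize,
    /// Batches that are ready for the downstream operator.
    completed: Vec<Batch>,
}

impl SketchCoalescer {
    fn new(target_batch_size: usize, fetch: Option<usize>) -> Self {
        assert!(target_batch_size > 0, "target_batch_size must be positive");
        Self {
            target_batch_size,
            fetch,
            buffered: Vec::new(),
            total_rows: 0,
            completed: Vec::new(),
        }
    }

    /// Buffers `batch` and emits a batch whenever `target_batch_size` rows
    /// are available. Returns `true` once the `fetch` limit is reached, at
    /// which point any remaining buffered rows are flushed.
    fn push(&mut self, batch: &[i32]) -> bool {
        // Never accept more rows than the limit allows.
        let remaining = match self.fetch {
            Some(limit) => limit - self.total_rows,
            None => batch.len(),
        };
        let take = batch.len().min(remaining);
        self.buffered.extend_from_slice(&batch[..take]);
        self.total_rows += take;

        // Emit as many full-size batches as the buffer holds.
        while self.buffered.len() >= self.target_batch_size {
            let rest = self.buffered.split_off(self.target_batch_size);
            let full = std::mem::replace(&mut self.buffered, rest);
            self.completed.push(full);
        }

        let done = self.fetch.map_or(false, |limit| self.total_rows >= limit);
        if done {
            self.finish();
        }
        done
    }

    /// Flushes whatever is left in the buffer as a final, possibly small, batch.
    fn finish(&mut self) {
        if !self.buffered.is_empty() {
            self.completed.push(std::mem::take(&mut self.buffered));
        }
    }
}
```

In this sketch a caller pushes input batches until `push` returns `true` or the input is exhausted, calls `finish`, and then reads the output from `completed`. Folding this kind of logic into the operators that produce small batches is one way a standalone operator could become unnecessary.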

CoalesceBatchesExec is not ideal because it can prevent other optimizations from happening, or make them more complicated than they would otherwise need to be, since they must know how to look through the operator. For example, see here or pushing limit through joins.

A longstanding effort has been to make filter+concat faster, and we have introduced the coalesce kernels upstream.

Related issues
