-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
DataFusion currently has a standalone CoalesceBatchesExec operator that ensures batches are large enough to take advantage of vectorized execution
This operator is inserted after "filter-like" operations that can produce small batches, such as FilterExec, HashJoinExec, and RepartitionExec here:
datafusion/datafusion/physical-plan/src/coalesce_batches.rs
Lines 47 to 56 in af22336
| /// `CoalesceBatchesExec` combines small batches into larger batches for more | |
| /// efficient vectorized processing by later operators. | |
| /// | |
| /// The operator buffers batches until it collects `target_batch_size` rows and | |
| /// then emits a single concatenated batch. When only a limited number of rows | |
| /// are necessary (specified by the `fetch` parameter), the operator will stop | |
| /// buffering and returns the final batch once the number of collected rows | |
| /// reaches the `fetch` value. | |
| /// | |
| /// See [`LimitedBatchCoalescer`] for more information |
CoalesceBatchesExec is non ideal as it can prevent other optimizations from happening (or they are more complicated than they otherwise would need to be as they need to know how to look through the operator). For example, see here or pushing limit through joins
A longstanding effort, has been trying to make the filter+concat faster, and we have introduced the coalesce kernels upstream
Related issues
- Use the upstream arrow-rs coalesce kernel #17193
- Integrate Batch coalescing inside FilterExec #18606
- Remove FilterExec from CoalesceBatches optimization rule #18646
- Integrate
BatchCoalescerintoHashJoinExecand remove fromCoalesceBatchesoptimization rule #18781 - Improve RepartitionExec for better query performance #7001
-
CoalesceBatchesphysical optimizer rule says it should be last but it isn't #18148 - Idea: Avoid planning CoalesceBatches in front of blocking operators. #15478
- Integrate
BatchCoalescerintoRepartitionExecand remove fromCoalesceBatchesoptimization rule #18782