Beat the limitations of EE in terms of singular elements pushed into batch inputs #1504
Conversation
…terms-of-singular-elements-pushed-into-batch-inputs
(`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves an 18% speedup through several targeted micro-optimizations:

**1. Direct OrderedDict Construction**
The most significant improvement eliminates the intermediate list allocation in `retrieve_selectors_from_schema`. Instead of building a list and then converting it to an OrderedDict with a generator expression, selectors are added directly to the OrderedDict during iteration. This saves memory allocation and reduces the final conversion overhead.

**2. Reduced Dictionary Access Overhead**
In `retrieve_selectors_from_simple_property`, the `property_definition` parameter is aliased to `pd` to avoid repeated dictionary name lookups. While seemingly minor, this reduces attribute resolution overhead in the function's hot path.

**3. Optimized Set Membership Testing**
The dynamic points-to-batch logic now caches set membership results in local variables (`in_batches_and_scalars`, `in_batches`, `in_auto_cast`) rather than performing the same set membership tests multiple times.

**4. Conditional List Comprehension**
When processing KIND_KEY values, the code now checks if the list is empty before creating the list comprehension, avoiding unnecessary iterator creation for empty cases.

**Performance Analysis from Tests:**
The optimizations show consistent improvements across all test scenarios, with particularly strong gains (20-30%) on simpler schemas and smaller but meaningful gains (6-11%) on complex union cases. The optimizations are most effective for schemas with many properties, where the direct dictionary construction and reduced lookups compound their benefits. Edge cases like empty schemas show the highest relative improvements (50%+) due to reduced overhead in the main loop structure.
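As an illustration of points 1-3, the sketch below contrasts the two construction styles. It is not the actual `retrieve_selectors_from_schema` implementation; the schema layout and the `is_selector` predicate are simplified placeholders.

```python
from collections import OrderedDict

def retrieve_selectors_original_style(schema_properties: dict) -> OrderedDict:
    # builds an intermediate list, then converts it via a generator expression
    found = [
        (name, definition)
        for name, definition in schema_properties.items()
        if definition.get("is_selector")  # placeholder predicate
    ]
    return OrderedDict((name, definition) for name, definition in found)

def retrieve_selectors_optimized_style(schema_properties: dict) -> OrderedDict:
    # adds entries straight into the OrderedDict during iteration,
    # skipping the intermediate list and the final conversion
    selectors = OrderedDict()
    for name, definition in schema_properties.items():
        pd = definition  # local alias, mirroring the `pd` trick described above
        if pd.get("is_selector"):
            selectors[name] = pd
    return selectors
```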
⚡️ Codeflash found optimizations for this PR 📄 19% (0.19x) speedup for
…ting` by 13% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimization achieves a 12% speedup by applying two key changes:

**1. Function Call Inlining (Primary Optimization)**
The main performance gain comes from inlining the `get_lineage_for_input_property` function logic directly into the main loop of `get_input_data_lineage_excluding_auto_batch_casting`. This eliminates ~2,342 function calls (as shown in the profiler), reducing the overhead from 79.6% to 31.6% of total time spent in the `identify_lineage` call. The inlined logic checks `input_definition.is_compound_input()` directly in the loop and handles both compound and simple inputs inline, avoiding the function call overhead entirely for the common case of simple batch-oriented inputs.

**2. Dictionary Implementation Change**
In `verify_lineages`, replaced `defaultdict(list)` with a plain dictionary using explicit key existence checks. This reduces the overhead of defaultdict's factory function calls and provides more predictable performance characteristics, especially beneficial when processing large numbers of lineages.

**Performance Impact by Test Type:**
- **Large-scale tests** (500+ properties): ~17-18% improvement due to reduced per-iteration overhead
- **Basic tests** (few properties): ~14-22% improvement from eliminating function call overhead
- **Compound inputs**: ~7-20% improvement, with better gains for simpler compound structures
- **Edge cases** (empty/scalar): Minimal impact as expected, since less computation occurs

The optimization maintains identical behavior and error handling while significantly reducing the computational overhead in the hot path where most properties are processed.
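A minimal sketch of the dictionary change described in point 2; the grouping key and the lineage representation are placeholders, not the actual `verify_lineages` logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Lineage = Tuple[str, ...]  # placeholder representation of a lineage

def group_with_defaultdict(lineages: Iterable[Lineage]) -> Dict[int, List[Lineage]]:
    grouped = defaultdict(list)  # missing keys trigger the list() factory call
    for lineage in lineages:
        grouped[len(lineage)].append(lineage)
    return grouped

def group_with_plain_dict(lineages: Iterable[Lineage]) -> Dict[int, List[Lineage]]:
    grouped: Dict[int, List[Lineage]] = {}
    for lineage in lineages:
        key = len(lineage)
        if key not in grouped:  # explicit existence check, as in the optimized code
            grouped[key] = []
        grouped[key].append(lineage)
    return grouped
```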
⚡️ Codeflash found optimizations for this PR 📄 13% (0.13x) speedup for
…(`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code replaces a manual linear search with Python's built-in `max()` function, delivering a **26% speedup** by eliminating redundant operations.

**Key optimizations:**
1. **Single-pass iteration**: The original code performs 12,601 iterations with 12,550 length comparisons. The optimized version uses `max(all_lineages_of_batch_parameters, key=len, default=[])`, which iterates once and delegates the comparison logic to highly optimized C code.
2. **Eliminates repeated `len()` calls**: The original code calls `len(longest_longest_lineage_support)` on every comparison (12,550 times), recalculating the same length repeatedly. The optimized version calculates each lineage's length exactly once.
3. **Removes variable assignments**: The original code performs 3,104 assignment operations when updating the longest lineage. The optimized version eliminates these assignments entirely.

**Performance characteristics by test case:**
- **Small inputs (< 10 lineages)**: The optimization shows 50-60% slower performance due to function call overhead, but these cases run in microseconds where the difference is negligible.
- **Large inputs (1000+ lineages)**: Shows 30-55% speedup, where the optimization truly shines. For example, `test_large_with_varying_lengths` improves from 62.1μs to 40.4μs (54% faster).
- **Best case scenarios**: When the longest lineage appears early or when many lineages share the maximum length, the original code still must scan the entire list, while `max()` maintains consistent performance.

The optimization is most effective for workflows processing large batches of lineage data, which appears to be the primary use case based on the test suite.
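The before/after pattern as a hedged sketch; the variable name is taken from the description above, but the surrounding function is not reproduced here.

```python
from typing import List

def longest_lineage_manual(all_lineages_of_batch_parameters: List[list]) -> list:
    # original pattern: linear scan that calls len() on the current best
    # candidate in every iteration and reassigns it whenever a longer one appears
    longest_lineage: list = []
    for lineage in all_lineages_of_batch_parameters:
        if len(lineage) > len(longest_lineage):
            longest_lineage = lineage
    return longest_lineage

def longest_lineage_builtin(all_lineages_of_batch_parameters: List[list]) -> list:
    # optimized pattern: a single pass delegated to the C implementation of max();
    # default=[] preserves the manual loop's behaviour for empty input
    return max(all_lineages_of_batch_parameters, key=len, default=[])
```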
inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
…` by 30% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves a **29% speedup** through three key optimizations that reduce overhead in the inner loop:

**Key optimizations:**
1. **Eliminates repeated attribute lookups**: Caches `parsed_selector.definition.property_name` in a local variable instead of accessing it twice per inner loop iteration
2. **Reduces dictionary access overhead**: Stores a reference to the target set (`batch_compatibility_of_properties[property_name]`) and reuses it, avoiding repeated dictionary lookups
3. **Uses in-place set union (`|=`)** instead of the `update()` method, which has slightly less overhead for set operations

**Performance impact by test case:**
- **Small inputs (1-10 selectors)**: Modest 1-10% improvements due to reduced method call overhead
- **Medium inputs (100-500 selectors)**: 12-25% speedups as the optimizations compound with more iterations
- **Large inputs with many references**: Up to 149% improvement in cases with many references per selector, where the inner loop dominates runtime

The line profiler shows the optimization moves expensive work (attribute lookups and dictionary access) from the inner loop to the outer loop. The original code performed the `parsed_selector.definition.property_name` lookup 12,672 times, while the optimized version does it only 3,432 times - exactly once per selector instead of once per reference.

This optimization is particularly effective for workflows with selectors containing many allowed references, which is common in batch processing scenarios.
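A simplified sketch of the loop restructuring described above; `ParsedSelector`, `allowed_references`, and the set-valued dictionary are stand-ins modelled on the description, not the real compiler types from `graph_constructor.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class ReferenceDefinition:  # stand-in
    points_to_batch: Set[bool]

@dataclass
class SelectorDefinition:  # stand-in
    property_name: str
    allowed_references: List[ReferenceDefinition]

@dataclass
class ParsedSelector:  # stand-in
    definition: SelectorDefinition

def collect_batch_compatibility(
    parsed_selectors: List[ParsedSelector],
    batch_compatibility_of_properties: Dict[str, Set[bool]],
) -> Dict[str, Set[bool]]:
    for parsed_selector in parsed_selectors:
        # attribute lookup hoisted out of the inner loop
        property_name = parsed_selector.definition.property_name
        # dictionary access cached as a direct reference to the target set
        target_set = batch_compatibility_of_properties.setdefault(property_name, set())
        for reference in parsed_selector.definition.allowed_references:
            # in-place union instead of target_set.update(...)
            target_set |= reference.points_to_batch
    return batch_compatibility_of_properties
```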
⚡️ Codeflash found optimizations for this PR 📄 30% (0.30x) speedup for
…ting batch-oriented steps
STILL TODO - DOCS!
Workflows UI updated to allow removing all image inputs ✅
cool
…`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)
The optimized code achieves a **36% speedup** through a single but impactful conditional check optimization in the `prepare_parameters` function.
**Key Optimization:**
The main performance improvement comes from adding an `if empty_indices:` check before executing expensive list comprehension and data removal operations:
```python
# Original: Always executes these expensive operations
indices = [e for e in indices if e not in empty_indices]
result = remove_indices(value=result, indices=empty_indices)
# Optimized: Only executes when empty_indices is non-empty
if empty_indices:
indices = [e for e in indices if e not in empty_indices]
result = remove_indices(value=result, indices=empty_indices)
```
**Why this optimization works:**
- In many test cases, `empty_indices` is an empty set, making the filtering operations unnecessary
- The list comprehension `[e for e in indices if e not in empty_indices]` has O(n*m) complexity where n=len(indices) and m=len(empty_indices)
- `remove_indices()` recursively processes nested data structures, which is expensive even for empty removal sets
- By avoiding these operations when `empty_indices` is empty, we eliminate significant computational overhead
**Performance impact by test case type:**
- **Large batch inputs** see the biggest gains (43-107% faster) because they avoid expensive O(n) operations on large datasets when no filtering is needed
- **Basic test cases** show consistent 15-25% improvements from avoiding unnecessary operations
- **Edge cases with actual empty elements** may see minimal or slightly negative impact (0.5% slower) due to the additional conditional check, but this is negligible compared to the gains in common cases
This optimization is particularly effective because most workflow executions don't have empty batch elements that need filtering, making the conditional check a highly beneficial guard against unnecessary work.
⚡️ Codeflash found optimizations for this PR 📄 37% (0.37x) speedup for
…terms-of-singular-elements-pushed-into-batch-inputs
…-08-25T10.24.18 ⚡️ Speed up function `construct_simd_step_input` by 37% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)
This PR is now faster! 🚀 @PawelPeczek-Roboflow accepted my optimizations from:
…terms-of-singular-elements-pushed-into-batch-inputs
Non-SIMD steps, by contrast, are expected to deliver a single result for the input data. In the case of non-SIMD
flow-control steps, they affect all downstream steps as a whole, rather than individually for each element in a batch.

Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
Suggested change:
- Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
+ Historically, Execution Engine could not handle well all scenarios when non-SIMD steps' outputs were fed into SIMD steps
batch order**. With Auto Batch Casting, batches may also be generated dynamically, and no deterministic ordering
can be guaranteed (imagine scenario when you feed batch of 4 images, and there is a block generating dynamic batch
with 3 images - when results are to be returned, Execution Engine is unable to determine a single input batch which
would dictate output order alignment, which is a hard requirement caused by falty design choices).
Suggested change:
- would dictate output order alignment, which is a hard requirement caused by falty design choices).
+ would dictate output order alignment, which is a hard requirement caused by previous design choices).
if property_name in inputs_accepting_batches_and_scalars:
    points_to_batch = {True, False}
if property_name in inputs_enforcing_auto_batch_casting:
    points_to_batch = {True}
why only True here?
this is for enforced auto-batch casting - which turns all parameters into batches when there is a mix (compound fields), or in the special case of a non-batch-oriented block downgrading the output dim (then we need to add this new class method to the manifest; otherwise the only way to judge which input params are to be wrapped [to make it possible to reduce across the last dim] is to analyse the signature annotations, which we avoided doing as this is very flaky)
hansent left a comment
LGTM
Description
In this PR I am adding extensions to the Workflows Execution Engine which are going to make the EE much more flexible:
The main feature added here is Auto Batch Casting - which is mainly responsible for bridging the gap between scalar parameters and SIMD blocks. So far, when blocks suited to process batch-oriented inputs were fed with scalar parameters, we would get compilation errors.
This was exceptionally problematic when we wanted to combine steps that, let's say, don't take image inputs, but rather produce images that we expect to be processed by model blocks down the line. This PR makes the above possible.
Additionally, it also breaks the artificial boundary of dimensionality collapse.
The migration to the new version of the EE is free, up to the point of blocks which decrease dimensionality - for those of them which did not accept batched inputs, we cannot tell whether the provided scalar should be auto-batch-cast or not. In such scenarios, without decorating the manifest with
`get_parameters_enforcing_auto_batch_casting(...)` (see the sketch below) we will not be able to auto-cast parameters which are potentially intended to be cast. In such a case, instead of a compilation error (as in the previous EE version), users may occasionally see block runtime errors, but according to my investigation, this only happens for workflows that previously were failing. When merged with #1498
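A hedged sketch of what opting in might look like for such a manifest; the base class, field layout, and return value are illustrative only - the classmethod name is the one mentioned above, but its exact signature should be taken from the PR code.

```python
from typing import List

class ExampleDimensionReducingManifest:
    # in a real block this class would inherit from the Workflows manifest base
    # class and declare its pydantic fields; "predictions" is a made-up
    # parameter name used purely for illustration
    type: str = "my_plugin/example_reduction@v1"

    @classmethod
    def get_parameters_enforcing_auto_batch_casting(cls) -> List[str]:
        # names of scalar parameters the Execution Engine should wrap into
        # batches so the block can reduce across the last dimension
        return ["predictions"]
```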
But there is a limitation related to the way we construct outputs - a really important limitation. We can only have one source of leading batch dimensionality - usually this is meant to be the input image, but with auto-batch casting, some other block in the flow may create the first level of dimensionality, increasing the dim of the auto-cast batch. We handle that gracefully in the EE, up to the moment of output construction, which unfortunately cannot be handled reasonably without a breaking change for our clients and probably MUST wait until EE 2.0.
Type of change
Please delete options that are not relevant.
How has this change been tested, please provide a testcase or example of how you tested the change?
Any specific deployment considerations
NOW, UI DOES NOT ALLOW US TO DELETE THE INPUT IMAGE, WHICH MUST BE DONE FOR BLOCKS SUCH AS THE ONE PROPOSED IN #1498 TO MAKE SENSE - CC @hansent @brunopicinin
Docs