Beat the limitations of EE in terms of singular elements pushed into batch inputs #1504
Conversation
…terms-of-singular-elements-pushed-into-batch-inputs
(`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves an 18% speedup through several targeted micro-optimizations:

**1. Direct OrderedDict Construction**
The most significant improvement eliminates the intermediate list allocation in `retrieve_selectors_from_schema`. Instead of building a list and then converting it to an OrderedDict with a generator expression, selectors are added directly to the OrderedDict during iteration. This saves memory allocation and reduces the final conversion overhead.

**2. Reduced Dictionary Access Overhead**
In `retrieve_selectors_from_simple_property`, the `property_definition` parameter is aliased to `pd` to avoid repeated dictionary name lookups. While seemingly minor, this reduces attribute resolution overhead in the function's hot path.

**3. Optimized Set Membership Testing**
The dynamic points-to-batch logic now caches set membership results in local variables (`in_batches_and_scalars`, `in_batches`, `in_auto_cast`) rather than performing the same set membership tests multiple times.

**4. Conditional List Comprehension**
When processing KIND_KEY values, the code now checks if the list is empty before creating the list comprehension, avoiding unnecessary iterator creation for empty cases.

**Performance Analysis from Tests:**
The optimizations show consistent improvements across all test scenarios, with particularly strong gains (20-30%) on simpler schemas and smaller but meaningful gains (6-11%) on complex union cases. The optimizations are most effective for schemas with many properties, where the direct dictionary construction and reduced lookups compound their benefits. Edge cases like empty schemas show the highest relative improvements (50%+) due to reduced overhead in the main loop structure.
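As an illustration of points 1-3, the sketch below contrasts the two construction styles. It is not the actual `retrieve_selectors_from_schema` implementation; the schema layout and the `is_selector` predicate are simplified placeholders.

```python
from collections import OrderedDict

def retrieve_selectors_original_style(schema_properties: dict) -> OrderedDict:
    # builds an intermediate list, then converts it via a generator expression
    found = [
        (name, definition)
        for name, definition in schema_properties.items()
        if definition.get("is_selector")  # placeholder predicate
    ]
    return OrderedDict((name, definition) for name, definition in found)

def retrieve_selectors_optimized_style(schema_properties: dict) -> OrderedDict:
    # adds entries straight into the OrderedDict during iteration,
    # skipping the intermediate list and the final conversion
    selectors = OrderedDict()
    for name, definition in schema_properties.items():
        pd = definition  # local alias, mirroring the `pd` trick described above
        if pd.get("is_selector"):
            selectors[name] = pd
    return selectors
```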
⚡️ Codeflash found optimizations for this PR 📄 19% (0.19x) speedup for
…ting` by 13% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimization achieves a 12% speedup by applying two key changes:

**1. Function Call Inlining (Primary Optimization)**
The main performance gain comes from inlining the `get_lineage_for_input_property` function logic directly into the main loop of `get_input_data_lineage_excluding_auto_batch_casting`. This eliminates ~2,342 function calls (as shown in the profiler), reducing the overhead from 79.6% to 31.6% of total time spent in the `identify_lineage` call. The inlined logic checks `input_definition.is_compound_input()` directly in the loop and handles both compound and simple inputs inline, avoiding the function call overhead entirely for the common case of simple batch-oriented inputs.

**2. Dictionary Implementation Change**
In `verify_lineages`, replaced `defaultdict(list)` with a plain dictionary using explicit key existence checks. This reduces the overhead of defaultdict's factory function calls and provides more predictable performance characteristics, especially beneficial when processing large numbers of lineages.

**Performance Impact by Test Type:**
- **Large-scale tests** (500+ properties): ~17-18% improvement due to reduced per-iteration overhead
- **Basic tests** (few properties): ~14-22% improvement from eliminating function call overhead
- **Compound inputs**: ~7-20% improvement, with better gains for simpler compound structures
- **Edge cases** (empty/scalar): Minimal impact as expected, since less computation occurs

The optimization maintains identical behavior and error handling while significantly reducing the computational overhead in the hot path where most properties are processed.
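A minimal sketch of the dictionary change described in point 2; the grouping key and the lineage representation are placeholders, not the actual `verify_lineages` logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Lineage = Tuple[str, ...]  # placeholder representation of a lineage

def group_with_defaultdict(lineages: Iterable[Lineage]) -> Dict[int, List[Lineage]]:
    grouped = defaultdict(list)  # missing keys trigger the list() factory call
    for lineage in lineages:
        grouped[len(lineage)].append(lineage)
    return grouped

def group_with_plain_dict(lineages: Iterable[Lineage]) -> Dict[int, List[Lineage]]:
    grouped: Dict[int, List[Lineage]] = {}
    for lineage in lineages:
        key = len(lineage)
        if key not in grouped:  # explicit existence check, as in the optimized code
            grouped[key] = []
        grouped[key].append(lineage)
    return grouped
```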
⚡️ Codeflash found optimizations for this PR 📄 13% (0.13x) speedup for
…(`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code replaces a manual linear search with Python's built-in `max()` function, delivering a **26% speedup** by eliminating redundant operations.

**Key optimizations:**
1. **Single-pass iteration**: The original code performs 12,601 iterations with 12,550 length comparisons. The optimized version uses `max(all_lineages_of_batch_parameters, key=len, default=[])`, which iterates once and delegates the comparison logic to highly optimized C code.
2. **Eliminates repeated `len()` calls**: The original code calls `len(longest_longest_lineage_support)` on every comparison (12,550 times), recalculating the same length repeatedly. The optimized version calculates each lineage's length exactly once.
3. **Removes variable assignments**: The original code performs 3,104 assignment operations when updating the longest lineage. The optimized version eliminates these assignments entirely.

**Performance characteristics by test case:**
- **Small inputs (< 10 lineages)**: The optimization shows 50-60% slower performance due to function call overhead, but these cases run in microseconds where the difference is negligible.
- **Large inputs (1000+ lineages)**: Shows 30-55% speedup, where the optimization truly shines. For example, `test_large_with_varying_lengths` improves from 62.1μs to 40.4μs (54% faster).
- **Best case scenarios**: When the longest lineage appears early or when many lineages share the maximum length, the original code still must scan the entire list, while `max()` maintains consistent performance.

The optimization is most effective for workflows processing large batches of lineage data, which appears to be the primary use case based on the test suite.
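The before/after pattern as a hedged sketch; the variable name is taken from the description above, but the surrounding function is not reproduced here.

```python
from typing import List

def longest_lineage_manual(all_lineages_of_batch_parameters: List[list]) -> list:
    # original pattern: linear scan that calls len() on the current best
    # candidate in every iteration and reassigns it whenever a longer one appears
    longest_lineage: list = []
    for lineage in all_lineages_of_batch_parameters:
        if len(lineage) > len(longest_lineage):
            longest_lineage = lineage
    return longest_lineage

def longest_lineage_builtin(all_lineages_of_batch_parameters: List[list]) -> list:
    # optimized pattern: a single pass delegated to the C implementation of max();
    # default=[] preserves the manual loop's behaviour for empty input
    return max(all_lineages_of_batch_parameters, key=len, default=[])
```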
inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py
…` by 30% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves a **29% speedup** through three key optimizations that reduce overhead in the inner loop:

**Key optimizations:**
1. **Eliminates repeated attribute lookups**: Caches `parsed_selector.definition.property_name` in a local variable instead of accessing it twice per inner loop iteration
2. **Reduces dictionary access overhead**: Stores a reference to the target set (`batch_compatibility_of_properties[property_name]`) and reuses it, avoiding repeated dictionary lookups
3. **Uses in-place set union (`|=`)** instead of the `update()` method, which has slightly less overhead for set operations

**Performance impact by test case:**
- **Small inputs (1-10 selectors)**: Modest 1-10% improvements due to reduced method call overhead
- **Medium inputs (100-500 selectors)**: 12-25% speedups as the optimizations compound with more iterations
- **Large inputs with many references**: Up to 149% improvement in cases with many references per selector, where the inner loop dominates runtime

The line profiler shows the optimization moves expensive work (attribute lookups and dictionary access) from the inner loop to the outer loop. The original code performed the `parsed_selector.definition.property_name` lookup 12,672 times, while the optimized version does it only 3,432 times - exactly once per selector instead of once per reference.

This optimization is particularly effective for workflows with selectors containing many allowed references, which is common in batch processing scenarios.
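A simplified sketch of the loop restructuring described above; `ParsedSelector`, `allowed_references`, and the set-valued dictionary are stand-ins modelled on the description, not the real compiler types from `graph_constructor.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class ReferenceDefinition:  # stand-in
    points_to_batch: Set[bool]

@dataclass
class SelectorDefinition:  # stand-in
    property_name: str
    allowed_references: List[ReferenceDefinition]

@dataclass
class ParsedSelector:  # stand-in
    definition: SelectorDefinition

def collect_batch_compatibility(
    parsed_selectors: List[ParsedSelector],
    batch_compatibility_of_properties: Dict[str, Set[bool]],
) -> Dict[str, Set[bool]]:
    for parsed_selector in parsed_selectors:
        # attribute lookup hoisted out of the inner loop
        property_name = parsed_selector.definition.property_name
        # dictionary access cached as a direct reference to the target set
        target_set = batch_compatibility_of_properties.setdefault(property_name, set())
        for reference in parsed_selector.definition.allowed_references:
            # in-place union instead of target_set.update(...)
            target_set |= reference.points_to_batch
    return batch_compatibility_of_properties
```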
⚡️ Codeflash found optimizations for this PR 📄 30% (0.30x) speedup for
…ting batch-oriented steps
STILL TODO - DOCS!
Workflows UI updated to allow removing all image inputs ✅
cool
…`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)
The optimized code achieves a **36% speedup** through a single but impactful conditional check optimization in the `prepare_parameters` function.
**Key Optimization:**
The main performance improvement comes from adding an `if empty_indices:` check before executing expensive list comprehension and data removal operations:
```python
# Original: Always executes these expensive operations
indices = [e for e in indices if e not in empty_indices]
result = remove_indices(value=result, indices=empty_indices)
# Optimized: Only executes when empty_indices is non-empty
if empty_indices:
indices = [e for e in indices if e not in empty_indices]
result = remove_indices(value=result, indices=empty_indices)
```
**Why this optimization works:**
- In many test cases, `empty_indices` is an empty set, making the filtering operations unnecessary
- The list comprehension `[e for e in indices if e not in empty_indices]` has O(n*m) complexity where n=len(indices) and m=len(empty_indices)
- `remove_indices()` recursively processes nested data structures, which is expensive even for empty removal sets
- By avoiding these operations when `empty_indices` is empty, we eliminate significant computational overhead
**Performance impact by test case type:**
- **Large batch inputs** see the biggest gains (43-107% faster) because they avoid expensive O(n) operations on large datasets when no filtering is needed
- **Basic test cases** show consistent 15-25% improvements from avoiding unnecessary operations
- **Edge cases with actual empty elements** may see minimal or slightly negative impact (0.5% slower) due to the additional conditional check, but this is negligible compared to the gains in common cases
This optimization is particularly effective because most workflow executions don't have empty batch elements that need filtering, making the conditional check a highly beneficial guard against unnecessary work.
⚡️ Codeflash found optimizations for this PR 📄 37% (0.37x) speedup for
…terms-of-singular-elements-pushed-into-batch-inputs
…-08-25T10.24.18 ⚡️ Speed up function `construct_simd_step_input` by 37% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)
This PR is now faster! 🚀 @PawelPeczek-Roboflow accepted my optimizations from:
…terms-of-singular-elements-pushed-into-batch-inputs
Non-SIMD steps, by contrast, are expected to deliver a single result for the input data. In the case of non-SIMD
flow-control steps, they affect all downstream steps as a whole, rather than individually for each element in a batch.

Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
Suggested change:
- Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
+ Historically, Execution Engine could not handle well all scenarios when non-SIMD steps' outputs were fed into SIMD steps
batch order**. With Auto Batch Casting, batches may also be generated dynamically, and no deterministic ordering
can be guaranteed (imagine scenario when you feed batch of 4 images, and there is a block generating dynamic batch
with 3 images - when results are to be returned, Execution Engine is unable to determine a single input batch which
would dictate output order alignment, which is a hard requirement caused by falty design choices).
Suggested change:
- would dictate output order alignment, which is a hard requirement caused by falty design choices).
+ would dictate output order alignment, which is a hard requirement caused by previous design choices).
if property_name in inputs_accepting_batches_and_scalars:
    points_to_batch = {True, False}
if property_name in inputs_enforcing_auto_batch_casting:
    points_to_batch = {True}
why only True here?
this is for enforced auto-batch casting - which turns all parameters into batches when there is a mix (compound fields), or in the special case of a non-batch-oriented block downgrading the output dim (then we need to add this new class method to the manifest; otherwise the only way to judge which input params are to be wrapped [to make it possible to reduce across the last dim] is to analyse the signature annotations, which we avoided doing as this is very flaky)
hansent left a comment
LGTM
Description
In this PR I am adding extensions to the Workflows Execution Engine which are going to make the EE much more flexible:
The main feature added here is Auto Batch Casting - which is mainly responsible for bridging the gap between scalar parameters and SIMD blocks. So far, when blocks suited to process batch-oriented inputs were fed with scalar parameters, we would get compilation errors.
This was exceptionally problematic when we wanted to combine steps that, let's say, don't take image inputs, but rather produce images that we expect to be processed by model blocks down the line. This PR makes the above possible.
Additionally, it also breaks the artificial boundary of dimensionality collapse.
The migration to the new version of the EE is free, up to the point of blocks which decrease dimensionality - for those of them which did not accept batched inputs, we cannot tell whether the provided scalar should be auto-batch-cast or not. In such scenarios, without decorating the manifest with
`get_parameters_enforcing_auto_batch_casting(...)` (see the sketch below) we will not be able to auto-cast parameters which are potentially intended to be cast. In such a case, instead of a compilation error (as in the previous EE version), users may occasionally see block runtime errors, but according to my investigation, this only happens for workflows that previously were failing. When merged with #1498
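A hedged sketch of what opting in might look like for such a manifest; the base class, field layout, and return value are illustrative only - the classmethod name is the one mentioned above, but its exact signature should be taken from the PR code.

```python
from typing import List

class ExampleDimensionReducingManifest:
    # in a real block this class would inherit from the Workflows manifest base
    # class and declare its pydantic fields; "predictions" is a made-up
    # parameter name used purely for illustration
    type: str = "my_plugin/example_reduction@v1"

    @classmethod
    def get_parameters_enforcing_auto_batch_casting(cls) -> List[str]:
        # names of scalar parameters the Execution Engine should wrap into
        # batches so the block can reduce across the last dimension
        return ["predictions"]
```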
But there is a limitation related to the way we construct outputs - a really important limitation. We can only have one source of leading batch dimensionality - usually this is meant to be the input image, but with auto-batch casting, some other block in the flow may create the first level of dimensionality, increasing the dim of the auto-cast batch. We handle that gracefully in the EE, up to the moment of output construction, which unfortunately cannot be handled reasonably without a breaking change for our clients and probably MUST wait until EE 2.0.
Type of change
Please delete options that are not relevant.
How has this change been tested, please provide a testcase or example of how you tested the change?
Any specific deployment considerations
NOW, UI DOES NOT ALLOW US TO DELETE THE INPUT IMAGE, WHICH MUST BE DONE FOR BLOCKS SUCH AS THE ONE PROPOSED IN #1498 TO MAKE SENSE - CC @hansent @brunopicinin
Docs