Conversation

Collaborator

PawelPeczek-Roboflow commented Aug 22, 2025

Description

In this PR I am adding extensions to the Workflows Execution Engine which are going to make the EE much more flexible:

The main feature added here is Auto Batch Casting - which is mainly responsible for bridging the gap between scalar parameters and SIMD blocks. So far, when blocks suited to process batch-oriented inputs were fed with scalar parameters, we had errors like:

Detected invalid reference plugged into property images of step $steps.model - the step property strictly requires batch-oriented inputs, yet the input selector holds non-batch oriented input - this indicates the problem with construction of your Workflow - usually the problem occurs when non-batch oriented step inputs are filled with outputs of non batch-oriented steps or non batch-oriented inputs.

This was exceptionally problematic when we wanted to combine steps that, say, don't take image inputs but rather produce images, and we expect those images to be processed by model blocks down the line. This PR makes the above possible.
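
To make that concrete, below is a minimal, hypothetical workflow definition of the topology this PR unlocks. The `my_plugin/image_generator@v1` block type is invented for illustration; the model block type and selector syntax follow the standard Workflows conventions:

```python
# Hypothetical workflow: a non-batch step produces an image which a
# batch-oriented model step consumes; Auto Batch Casting wraps the scalar
# image into a batch behind the scenes.
WORKFLOW_DEFINITION = {
    "version": "1.0",
    "inputs": [
        {"type": "WorkflowParameter", "name": "prompt"},
    ],
    "steps": [
        {
            "type": "my_plugin/image_generator@v1",  # scalar (non-batch) output
            "name": "generator",
            "prompt": "$inputs.prompt",
        },
        {
            "type": "roboflow_core/roboflow_object_detection_model@v1",
            "name": "model",
            "images": "$steps.generator.image",  # previously a compilation error
            "model_id": "my-project/1",
        },
    ],
    "outputs": [
        {
            "type": "JsonField",
            "name": "predictions",
            "selector": "$steps.model.predictions",
        },
    ],
}
```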

Additionally, it breaks the artificial boundary of dimensionality collapse.

The migration to the new version of the EE is free, up to the point of blocks which decrease dimensionality. For those of them which did not accept batched inputs, we cannot tell whether a provided scalar should be auto-batch-cast or not. In such scenarios, without decorating the manifest with get_parameters_enforcing_auto_batch_casting(...), we will not be able to auto-cast parameters which are potentially intended to be cast. In such a case, instead of a compilation error (as in the previous EE version), users may occasionally see a block runtime error, but according to my investigation, this only happens for workflows that were previously failing anyway.
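
For block authors, opting in looks roughly like the sketch below - a hypothetical manifest for a dimensionality-decreasing block. The block type and field are illustrative; only the `get_parameters_enforcing_auto_batch_casting(...)` hook is the new piece added by this PR:

```python
# Hypothetical manifest sketch - the block type and field names are
# illustrative, not taken from this PR.
from typing import List

from inference.core.workflows.prototypes.block import WorkflowBlockManifest


class BlockManifest(WorkflowBlockManifest):
    type: str = "my_plugin/reduce_across_batch@v1"
    predictions: str  # selector pointing to batch-oriented predictions

    @classmethod
    def get_parameters_enforcing_auto_batch_casting(cls) -> List[str]:
        # Scalars plugged into "predictions" get wrapped into batches, so the
        # block can reduce across the last dimension instead of erroring out.
        return ["predictions"]
```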

When merged with #1498


But there is a limitation related to the way we construct outputs - a really important limitation. We can only have one source of leading batch dimensionality - usually this is meant to be the input image, but with Auto Batch Casting, some other block in the flow may create the first level of dimensionality, increasing the dim of the auto-cast batch. We handle that gracefully in the EE, up to the moment of output construction, which unfortunately cannot be handled reasonably without a breaking change for our clients and probably MUST wait until EE 2.0.


Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a testcase or example of how you tested the change.

  • extensive suite of new automated tests of EE
  • CI still 🟢

Any specific deployment considerations

NOW, THE UI DOES NOT ALLOW US TO DELETE THE INPUT IMAGE, WHICH MUST BE POSSIBLE FOR BLOCKS SUCH AS THE ONE PROPOSED IN #1498 TO MAKE SENSE - CC @hansent @brunopicinin

Docs

  • Docs updated? What were the changes:

codeflash-ai bot added a commit that referenced this pull request Aug 22, 2025
 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves an 18% speedup through several targeted micro-optimizations:

**1. Direct OrderedDict Construction**
The most significant improvement eliminates the intermediate list allocation in `retrieve_selectors_from_schema`. Instead of building a list and then converting it to an OrderedDict with a generator expression, selectors are added directly to the OrderedDict during iteration. This saves memory allocation and reduces the final conversion overhead.
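
A minimal sketch of the pattern described above (the `parse_property` helper is a stand-in for the real per-property parsing logic):

```python
from collections import OrderedDict


def parse_property(name, definition):
    # Stand-in for the real per-property selector parsing; returns None when
    # the property declares no selector.
    return {"property_name": name} if "selector" in str(definition) else None


# Before: accumulate into a list, then convert to an OrderedDict at the end.
def retrieve_selectors_before(schema: dict) -> OrderedDict:
    parsed = []
    for name, definition in schema.get("properties", {}).items():
        selector = parse_property(name, definition)
        if selector is not None:
            parsed.append(selector)
    return OrderedDict((s["property_name"], s) for s in parsed)


# After: insert into the OrderedDict directly during iteration, skipping the
# intermediate list allocation and the final conversion pass.
def retrieve_selectors_after(schema: dict) -> OrderedDict:
    result = OrderedDict()
    for name, definition in schema.get("properties", {}).items():
        selector = parse_property(name, definition)
        if selector is not None:
            result[name] = selector
    return result
```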

**2. Reduced Dictionary Access Overhead**
In `retrieve_selectors_from_simple_property`, the `property_definition` parameter is aliased to `pd` to avoid repeated dictionary name lookups. While seemingly minor, this reduces attribute resolution overhead in the function's hot path.

**3. Optimized Set Membership Testing**
The dynamic points-to-batch logic now caches set membership results in local variables (`in_batches_and_scalars`, `in_batches`, `in_auto_cast`) rather than performing the same set membership tests multiple times.

**4. Conditional List Comprehension**
When processing KIND_KEY values, the code now checks if the list is empty before creating the list comprehension, avoiding unnecessary iterator creation for empty cases.
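
A sketch of optimizations 3 and 4 combined. The set names mirror the vocabulary of the code under review (`inputs_accepting_batches_and_scalars`, `inputs_enforcing_auto_batch_casting`); the branch structure and the rest are simplified illustrations:

```python
def resolve_points_to_batch(
    property_name: str,
    inputs_accepting_batches_and_scalars: set,
    inputs_accepting_batches: set,
    inputs_enforcing_auto_batch_casting: set,
    kind_values: list,
):
    # Each membership test is evaluated once and cached in a local,
    # instead of being repeated wherever the result is needed.
    in_batches_and_scalars = property_name in inputs_accepting_batches_and_scalars
    in_batches = property_name in inputs_accepting_batches
    in_auto_cast = property_name in inputs_enforcing_auto_batch_casting

    if in_batches_and_scalars:
        points_to_batch = {True, False}
    elif in_batches or in_auto_cast:
        points_to_batch = {True}
    else:
        points_to_batch = {False}

    # Guard the comprehension: skip iterator setup entirely for the empty case.
    kinds = [str(k) for k in kind_values] if kind_values else []
    return points_to_batch, kinds
```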

**Performance Analysis from Tests:**
The optimizations show consistent improvements across all test scenarios, with particularly strong gains (20-30%) on simpler schemas and smaller but meaningful gains (6-11%) on complex union cases. The optimizations are most effective for schemas with many properties, where the direct dictionary construction and reduced lookups compound their benefits. Edge cases like empty schemas show the highest relative improvements (50%+) due to reduced overhead in the main loop structure.
Contributor

codeflash-ai bot commented Aug 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 19% (0.19x) speedup for retrieve_selectors_from_schema in inference/core/workflows/execution_engine/introspection/schema_parser.py

⏱️ Runtime : 186 microseconds → 157 microseconds (best of 86 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs).

codeflash-ai bot added a commit that referenced this pull request Aug 22, 2025
…ting` by 13% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimization achieves a 12% speedup by applying two key changes:

**1. Function Call Inlining (Primary Optimization)**
The main performance gain comes from inlining the `get_lineage_for_input_property` function logic directly into the main loop of `get_input_data_lineage_excluding_auto_batch_casting`. This eliminates ~2,342 function calls (as shown in the profiler), reducing the overhead from 79.6% to 31.6% of total time spent in the `identify_lineage` call.

The inlined logic checks `input_definition.is_compound_input()` directly in the loop and handles both compound and simple inputs inline, avoiding the function call overhead entirely for the common case of simple batch-oriented inputs.
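
Schematically, the change looks like the sketch below. Only `is_compound_input()` is named in the real code; the `resolve_*` helpers and the `definition` attributes are hypothetical stand-ins:

```python
def resolve_simple_lineage(definition, lineage_index):
    # Hypothetical stand-in: look up the lineage recorded for this input.
    return lineage_index.get(definition.name, [])


def resolve_compound_lineage(definition, lineage_index):
    # Hypothetical stand-in for the compound-input branch.
    return [lineage_index.get(n, []) for n in definition.nested_names]


# Before: a helper invoked once per input definition in the hot loop.
def get_lineage_for_input(definition, lineage_index):
    if definition.is_compound_input():
        return resolve_compound_lineage(definition, lineage_index)
    return resolve_simple_lineage(definition, lineage_index)


def collect_lineages_before(definitions, lineage_index):
    return [get_lineage_for_input(d, lineage_index) for d in definitions]


# After: the is_compound_input() check is inlined into the loop, removing one
# call frame per iteration for the common simple, batch-oriented case.
def collect_lineages_after(definitions, lineage_index):
    lineages = []
    for definition in definitions:
        if definition.is_compound_input():
            lineages.append(resolve_compound_lineage(definition, lineage_index))
        else:
            lineages.append(resolve_simple_lineage(definition, lineage_index))
    return lineages
```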

**2. Dictionary Implementation Change**
In `verify_lineages`, replaced `defaultdict(list)` with a plain dictionary using explicit key existence checks. This reduces the overhead of defaultdict's factory function calls and provides more predictable performance characteristics, especially beneficial when processing large numbers of lineages.
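
A sketch of that second change (the grouping key is illustrative):

```python
from collections import defaultdict


# Before: defaultdict triggers its list factory for every missing key.
def group_lineages_before(lineages):
    grouped = defaultdict(list)
    for lineage in lineages:
        grouped[tuple(lineage)].append(lineage)
    return grouped


# After: a plain dict with an explicit existence check avoids the factory
# machinery and keeps lookup behavior predictable.
def group_lineages_after(lineages):
    grouped = {}
    for lineage in lineages:
        key = tuple(lineage)
        if key not in grouped:
            grouped[key] = []
        grouped[key].append(lineage)
    return grouped
```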

**Performance Impact by Test Type:**
- **Large-scale tests** (500+ properties): ~17-18% improvement due to reduced per-iteration overhead
- **Basic tests** (few properties): ~14-22% improvement from eliminating function call overhead  
- **Compound inputs**: ~7-20% improvement, with better gains for simpler compound structures
- **Edge cases** (empty/scalar): Minimal impact as expected, since less computation occurs

The optimization maintains identical behavior and error handling while significantly reducing the computational overhead in the hot path where most properties are processed.
Contributor

codeflash-ai bot commented Aug 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 13% (0.13x) speedup for get_input_data_lineage_excluding_auto_batch_casting in inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py

⏱️ Runtime : 1.46 milliseconds → 1.29 milliseconds (best of 18 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs).

codeflash-ai bot added a commit that referenced this pull request Aug 22, 2025
…(`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code replaces a manual linear search with Python's built-in `max()` function, delivering a **26% speedup** by eliminating redundant operations.

**Key optimizations:**

1. **Single-pass iteration**: The original code performs 12,601 iterations with 12,550 length comparisons. The optimized version uses `max(all_lineages_of_batch_parameters, key=len, default=[])` which iterates once and delegates the comparison logic to highly optimized C code.

2. **Eliminates repeated `len()` calls**: The original code calls `len(longest_longest_lineage_support)` on every comparison (12,550 times), recalculating the same length repeatedly. The optimized version calculates each lineage's length exactly once.

3. **Removes variable assignments**: The original code performs 3,104 assignment operations when updating the longest lineage. The optimized version eliminates these assignments entirely.
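
In sketch form (`all_lineages_of_batch_parameters` is the parameter name quoted in point 1 above; the surrounding function names are illustrative):

```python
# Before: manual scan; len() of the running maximum is recomputed on every
# comparison and the winner is reassigned repeatedly.
def find_longest_before(all_lineages_of_batch_parameters):
    longest_lineage = []
    for lineage in all_lineages_of_batch_parameters:
        if len(lineage) > len(longest_lineage):
            longest_lineage = lineage
    return longest_lineage


# After: a single pass delegated to the C-implemented built-in, with each
# lineage's length computed exactly once.
def find_longest_after(all_lineages_of_batch_parameters):
    return max(all_lineages_of_batch_parameters, key=len, default=[])
```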

**Performance characteristics by test case:**
- **Small inputs (< 10 lineages)**: The optimized version runs 50-60% slower due to function call overhead, but these cases complete in microseconds, where the difference is negligible.
- **Large inputs (1000+ lineages)**: Shows 30-55% speedup, where the optimization truly shines. For example, `test_large_with_varying_lengths` improves from 62.1μs to 40.4μs (54% faster).
- **Best case scenarios**: When the longest lineage appears early or when many lineages share the maximum length, the original code still must scan the entire list, while `max()` maintains consistent performance.

The optimization is most effective for workflows processing large batches of lineage data, which appears to be the primary use case based on the test suite.
PawelPeczek-Roboflow marked this pull request as ready for review August 22, 2025 15:31
codeflash-ai bot added a commit that referenced this pull request Aug 22, 2025
…` by 30% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves a **29% speedup** through two key optimizations that reduce overhead in the inner loop:

**Key optimizations:**
1. **Eliminates repeated attribute lookups**: Caches `parsed_selector.definition.property_name` in a local variable instead of accessing it twice per inner loop iteration
2. **Reduces dictionary access overhead**: Stores a reference to the target set (`batch_compatibility_of_properties[property_name]`) and reuses it, avoiding repeated dictionary lookups
3. **Uses in-place set union (`|=`)** instead of the `update()` method, which has slightly less overhead for set operations
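
A sketch of the before/after shape. The attribute names (`definition.property_name`, `points_to_batch`) and the `batch_compatibility_of_properties` mapping follow the description above; treat the exact structures as illustrative:

```python
# Before: the attribute chain and dictionary lookup are repeated for every
# reference of every selector.
def collect_before(parsed_selectors, batch_compatibility_of_properties):
    for parsed_selector in parsed_selectors:
        for reference in parsed_selector.definition.allowed_references:
            batch_compatibility_of_properties[
                parsed_selector.definition.property_name
            ].update(reference.points_to_batch)


# After: lookups hoisted out of the inner loop - one attribute resolution and
# one dictionary access per selector - plus in-place set union on the cached
# target set (set.__ior__ mutates the set stored in the dict).
def collect_after(parsed_selectors, batch_compatibility_of_properties):
    for parsed_selector in parsed_selectors:
        property_name = parsed_selector.definition.property_name
        target = batch_compatibility_of_properties[property_name]
        for reference in parsed_selector.definition.allowed_references:
            target |= reference.points_to_batch
```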

**Performance impact by test case:**
- **Small inputs (1-10 selectors)**: Modest 1-10% improvements due to reduced method call overhead
- **Medium inputs (100-500 selectors)**: 12-25% speedups as the optimizations compound with more iterations  
- **Large inputs with many references**: Up to 149% improvement in cases with many references per selector, where the inner loop dominates runtime

The line profiler shows the optimization moves expensive work (attribute lookups and dictionary access) from the inner loop to the outer loop. The original code performed `parsed_selector.definition.property_name` lookup 12,672 times, while the optimized version does it only 3,432 times - exactly once per selector instead of once per reference.

This optimization is particularly effective for workflows with selectors containing many allowed references, which is common in batch processing scenarios.
Contributor

codeflash-ai bot commented Aug 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 30% (0.30x) speedup for retrieve_batch_compatibility_of_input_selectors in inference/core/workflows/execution_engine/v1/compiler/graph_constructor.py

⏱️ Runtime : 1.28 milliseconds → 987 microseconds (best of 274 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs).

PawelPeczek-Roboflow (Collaborator, Author)

STILL TODO - DOCS!

brunopicinin (Contributor)

NOW, THE UI DOES NOT ALLOW US TO DELETE THE INPUT IMAGE, WHICH MUST BE POSSIBLE FOR BLOCKS SUCH AS THE ONE PROPOSED IN #1498 TO MAKE SENSE - CC @hansent @brunopicinin

Workflows UI updated to allow removing all image inputs ✅

PawelPeczek-Roboflow (Collaborator, Author)

NOW, THE UI DOES NOT ALLOW US TO DELETE THE INPUT IMAGE, WHICH MUST BE POSSIBLE FOR BLOCKS SUCH AS THE ONE PROPOSED IN #1498 TO MAKE SENSE - CC @hansent @brunopicinin

Workflows UI updated to allow removing all image inputs ✅

cool

…`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)

The optimized code achieves a **36% speedup** through a single but impactful conditional check optimization in the `prepare_parameters` function.

**Key Optimization:**
The main performance improvement comes from adding an `if empty_indices:` check before executing expensive list comprehension and data removal operations:

```python
# Original: Always executes these expensive operations
indices = [e for e in indices if e not in empty_indices]
result = remove_indices(value=result, indices=empty_indices)

# Optimized: Only executes when empty_indices is non-empty
if empty_indices:
    indices = [e for e in indices if e not in empty_indices]
    result = remove_indices(value=result, indices=empty_indices)
```

**Why this optimization works:**
- In many test cases, `empty_indices` is an empty set, making the filtering operations unnecessary
- The list comprehension `[e for e in indices if e not in empty_indices]` has O(n*m) complexity where n=len(indices) and m=len(empty_indices)
- `remove_indices()` recursively processes nested data structures, which is expensive even for empty removal sets
- By avoiding these operations when `empty_indices` is empty, we eliminate significant computational overhead

**Performance impact by test case type:**
- **Large batch inputs** see the biggest gains (43-107% faster) because they avoid expensive O(n) operations on large datasets when no filtering is needed
- **Basic test cases** show consistent 15-25% improvements from avoiding unnecessary operations
- **Edge cases with actual empty elements** may see minimal or slightly negative impact (0.5% slower) due to the additional conditional check, but this is negligible compared to the gains in common cases

This optimization is particularly effective because most workflow executions don't have empty batch elements that need filtering, making the conditional check a highly beneficial guard against unnecessary work.
Contributor

codeflash-ai bot commented Aug 25, 2025

⚡️ Codeflash found optimizations for this PR

📄 37% (0.37x) speedup for construct_simd_step_input in inference/core/workflows/execution_engine/v1/executor/execution_data_manager/step_input_assembler.py

⏱️ Runtime : 1.99 milliseconds → 1.46 milliseconds (best of 40 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs).

PawelPeczek-Roboflow and others added 3 commits August 25, 2025 16:31
…terms-of-singular-elements-pushed-into-batch-inputs
…-08-25T10.24.18

⚡️ Speed up function `construct_simd_step_input` by 37% in PR #1504 (`feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs`)
Contributor

codeflash-ai bot commented Aug 25, 2025

…terms-of-singular-elements-pushed-into-batch-inputs
Non-SIMD steps, by contrast, are expected to deliver a single result for the input data. In the case of non-SIMD
flow-control steps, they affect all downstream steps as a whole, rather than individually for each element in a batch.

Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
Collaborator

Suggested change:

    - Historically, Execution Engine could not handle well al scenarios when non-SIMD steps' outputs were fed into SIMD steps
    + Historically, Execution Engine could not handle well all scenarios when non-SIMD steps' outputs were fed into SIMD steps

batch order**. With Auto Batch Casting, batches may also be generated dynamically, and no deterministic ordering
can be guaranteed (imagine scenario when you feed batch of 4 images, and there is a block generating dynamic batch
with 3 images - when results are to be returned, Execution Engine is unable to determine a single input batch which
would dictate output order alignment, which is a hard requirement caused by falty design choices).
Collaborator

Suggested change:

    - would dictate output order alignment, which is a hard requirement caused by falty design choices).
    + would dictate output order alignment, which is a hard requirement caused by previous design choices).

    if property_name in inputs_accepting_batches_and_scalars:
        points_to_batch = {True, False}
    if property_name in inputs_enforcing_auto_batch_casting:
        points_to_batch = {True}
Collaborator

why only True here?

Collaborator Author

this is for enforced auto-batch casting - which turns all parameters into batches in case there is a mix (compound fields) or when we have this special case of a non-batch oriented block downgrading the output dim (then we need to add this new class method to the manifest; otherwise, the only way to judge which input params are to be wrapped [to make it possible to reduce across the last dim] is to analyse the signature annotations, which we avoided doing as this is very flaky)

hansent (Collaborator) left a comment

LGTM

PawelPeczek-Roboflow merged commit a8ba225 into main Aug 26, 2025
43 of 44 checks passed
PawelPeczek-Roboflow deleted the feature/try-to-beat-the-limitation-of-ee-in-terms-of-singular-elements-pushed-into-batch-inputs branch August 26, 2025 14:20