
fix: dynamically adjust NLJ output batch size and optimize NLJ with single build vector#402

Open
zhangxffff wants to merge 5 commits into bytedance:main from zhangxffff:fix/nlj-dynamic-batch-size

Conversation


@zhangxffff zhangxffff commented Mar 17, 2026

What problem does this PR solve?

This issue occurs in a Spark instance when the NLJ (Nested Loop Join) build side contains only one row with an array<string> column, the probe side also contains an array<string> column, both columns have large data sizes, and a join filter is present.

In the latest NLJ code, the probe dictionary vector is allocated with a fixed size (32768 in Spark on Bolt) and filled with zeros. When the average row size on the probe side is large, the flat size is computed based on the probe dictionary's size, resulting in an excessively large flat size. Furthermore, during exportToArrow, the inner probe vector is converted to an Arrow array with its full length, which causes OOM.

Additionally, when a filter is present, the join probe selects one row at a time to perform a cross product with the build vector. It constructs a single-row vector, applies the filter on it, and then copies this single-row vector into the final output vector. This row-by-row approach significantly degrades performance.

Issue Number: close #401

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

This PR addresses the OOM and performance issues in NLJ when handling wide probe/build rows with a filter.

  1. Compute the average row size of probe and build inputs to determine a proper batch size, avoiding OOM.
  2. Generate larger cross-product batches for NLJ with filters to improve performance.

Batch Size Estimator

During prepareOutput, the probe creates a dictionary vector based on outputBatchSize_. Before vector creation, we now update outputBatchSize_ using the average row size (sum of probe row size and build row size). This keeps the output vector's memory consumption under control and avoids OOM in downstream operators.
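The estimator described above can be sketched as follows. This is a minimal illustration of the idea (fit the batch size to a memory budget given the observed average row size); the names `estimateBatchSize`, `kPreferredOutputBytes`, and the constants are hypothetical, not the actual Bolt identifiers:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative constants: target output memory and the old fixed batch size.
constexpr int64_t kPreferredOutputBytes = 16 << 20; // ~16 MiB output budget
constexpr int32_t kMaxBatchSize = 32768;            // previous fixed size
constexpr int32_t kMinBatchSize = 16;

// Pick a batch size so that batchSize * avgOutputRowBytes stays near the
// budget. Each output row carries one probe row plus one build row, so the
// average output row size is the sum of the two averages.
int32_t estimateBatchSize(int64_t avgProbeRowBytes, int64_t avgBuildRowBytes) {
  const int64_t avgOutputRowBytes =
      std::max<int64_t>(1, avgProbeRowBytes + avgBuildRowBytes);
  const int64_t fitted = kPreferredOutputBytes / avgOutputRowBytes;
  return static_cast<int32_t>(
      std::clamp<int64_t>(fitted, kMinBatchSize, kMaxBatchSize));
}
```

With narrow rows the estimate clamps to the old fixed size, so behavior is unchanged; with megabyte-wide array<string> rows the batch shrinks sharply, keeping the flat size and the exportToArrow conversion bounded.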

Single Build Row and Single Build Vector Optimization for NLJ with Filter

In the current NLJ implementation, when a filter is present, getNextCrossProductBatch returns (1 probe row) × (build vector) for subsequent filter processing. However, when the build vector is small (or even a single row), the filter ends up being evaluated on tiny, often single-row vectors, which severely degrades performance.

To address this, we now return (batchSize / buildSize probe rows) × (build vector) from getNextCrossProductBatch. In the subsequent evaluation loop, we replace the single probe row loop with one that supports iterating over a probe × build dictionary. Since there is only one probe vector and one build vector, we can also avoid the copy in CopyBuildValues and simply return a dictionary vector instead.
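The dictionary mapping above can be sketched with plain index vectors. This is an assumption-laden illustration (the struct and function names are hypothetical, and the real code wraps engine-native dictionary vectors rather than std::vector): output row k of the cross product maps to probe row k / buildSize and build row k % buildSize, so neither side needs a physical copy.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the dictionary indices for a
// (numProbeRows x buildSize) cross product. Probe indices repeat each probe
// row buildSize times; build indices cycle through the build vector.
struct CrossProductIndices {
  std::vector<int32_t> probe; // probe-vector index per output row
  std::vector<int32_t> build; // build-vector index per output row
};

CrossProductIndices makeCrossProductIndices(int32_t numProbeRows,
                                            int32_t buildSize) {
  CrossProductIndices out;
  const int64_t total = static_cast<int64_t>(numProbeRows) * buildSize;
  out.probe.reserve(total);
  out.build.reserve(total);
  for (int32_t p = 0; p < numProbeRows; ++p) {
    for (int32_t b = 0; b < buildSize; ++b) {
      out.probe.push_back(p);
      out.build.push_back(b);
    }
  }
  return out;
}
```

In the single-build-row case (buildSize == 1) the build indices are all zero, so the batch degenerates to a dictionary over the probe vector alone, which is exactly why the copy in CopyBuildValues can be skipped.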

Performance Impact

  • Positive Impact: I have run benchmarks.

    Benchmark results:
    NLJ before optimization: 4.68 h
    NLJ after optimization: 1.4 min
    

Release Note

Release Note:
- Fixed an OOM in `NestedLoopJoinProbe` by dynamically adjusting the output batch size based on the average probe and build row sizes.
- Optimized NLJ with a filter when the build side is a single vector by returning dictionary vectors instead of copying rows one at a time.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No


@zhangxffff zhangxffff requested a review from fzhedu March 17, 2026 07:10
@zhangxffff zhangxffff marked this pull request as draft March 19, 2026 18:27
@zhangxffff zhangxffff marked this pull request as ready for review March 20, 2026 09:04
@zhangxffff zhangxffff changed the title fix: Dynamically adjust NestedLoopJoinProbe output batch size to prevent OOM fix: Dynamically adjust NLJ output batch size and optimize NLJ with single build vector Mar 20, 2026
@zhangxffff zhangxffff changed the title fix: Dynamically adjust NLJ output batch size and optimize NLJ with single build vector fix: dynamically adjust NLJ output batch size and optimize NLJ with single build vector Mar 20, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] NestedLoopJoinProbe OOM with wide build-side rows
