
fix: dynamically adjust NLJ output batch size and optimize NLJ with single build vector#402

Open
zhangxffff wants to merge 5 commits into bytedance:main from zhangxffff:fix/nlj-dynamic-batch-size

Conversation


@zhangxffff zhangxffff commented Mar 17, 2026

What problem does this PR solve?

This issue occurs in a Spark instance when the NLJ (Nested Loop Join) build side contains only one row with an array<string> column, the probe side also contains an array<string> column, both columns have large data sizes, and a join filter is present.

In the latest NLJ code, the probe dictionary vector is allocated with a fixed size (32768 in Spark on Bolt) and filled with zeros. When the average row size on the probe side is large, the flat size is computed based on the probe dictionary's size, resulting in an excessively large flat size. Furthermore, during exportToArrow, the inner probe vector is converted to an Arrow array with its full length, which causes OOM.

Additionally, when a filter is present, the join probe selects one row at a time to perform a cross product with the build vector. It constructs a single-row vector, applies the filter on it, and then copies this single-row vector into the final output vector. This row-by-row approach significantly degrades performance.

Issue Number: close #401

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

This PR addresses the OOM and performance issues in NLJ when handling wide probe/build rows with a filter.

  1. Compute the average row size of probe and build inputs to determine a proper batch size, avoiding OOM.
  2. Generate larger cross-product batches for NLJ with filters to improve performance.

Batch Size Estimator

During prepareOutput, the probe creates a dictionary vector based on outputBatchSize_. Before vector creation, we now update outputBatchSize_ using the average row size (sum of probe row size and build row size). This keeps the output vector's memory consumption under control and avoids OOM in downstream operators.
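The estimator described above can be sketched as follows. This is a minimal illustration of the idea (fit the batch size to a memory budget given the observed average row size); the names `estimateBatchSize`, `kPreferredOutputBytes`, and the constants are hypothetical, not the actual Bolt identifiers:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative constants: target output memory and the old fixed batch size.
constexpr int64_t kPreferredOutputBytes = 16 << 20; // ~16 MiB output budget
constexpr int32_t kMaxBatchSize = 32768;            // previous fixed size
constexpr int32_t kMinBatchSize = 16;

// Pick a batch size so that batchSize * avgOutputRowBytes stays near the
// budget. Each output row carries one probe row plus one build row, so the
// average output row size is the sum of the two averages.
int32_t estimateBatchSize(int64_t avgProbeRowBytes, int64_t avgBuildRowBytes) {
  const int64_t avgOutputRowBytes =
      std::max<int64_t>(1, avgProbeRowBytes + avgBuildRowBytes);
  const int64_t fitted = kPreferredOutputBytes / avgOutputRowBytes;
  return static_cast<int32_t>(
      std::clamp<int64_t>(fitted, kMinBatchSize, kMaxBatchSize));
}
```

With narrow rows the estimate clamps to the old fixed size, so behavior is unchanged; with megabyte-wide array<string> rows the batch shrinks sharply, keeping the flat size and the exportToArrow conversion bounded.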

Single Build Row and Single Build Vector Optimization for NLJ with Filter

In the current NLJ implementation, when a filter is present, getNextCrossProductBatch returns (1 probe row) × (build vector) for subsequent filter processing. However, when the build vector is small (or even a single row), the filter ends up being evaluated on tiny, often single-row vectors, which severely degrades performance.

To address this, we now return (batchSize / buildSize probe rows) × (build vector) from getNextCrossProductBatch. In the subsequent evaluation loop, we replace the single probe row loop with one that supports iterating over a probe × build dictionary. Since there is only one probe vector and one build vector, we can also avoid the copy in CopyBuildValues and simply return a dictionary vector instead.
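The dictionary mapping above can be sketched with plain index vectors. This is an assumption-laden illustration (the struct and function names are hypothetical, and the real code wraps engine-native dictionary vectors rather than std::vector): output row k of the cross product maps to probe row k / buildSize and build row k % buildSize, so neither side needs a physical copy.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the dictionary indices for a
// (numProbeRows x buildSize) cross product. Probe indices repeat each probe
// row buildSize times; build indices cycle through the build vector.
struct CrossProductIndices {
  std::vector<int32_t> probe; // probe-vector index per output row
  std::vector<int32_t> build; // build-vector index per output row
};

CrossProductIndices makeCrossProductIndices(int32_t numProbeRows,
                                            int32_t buildSize) {
  CrossProductIndices out;
  const int64_t total = static_cast<int64_t>(numProbeRows) * buildSize;
  out.probe.reserve(total);
  out.build.reserve(total);
  for (int32_t p = 0; p < numProbeRows; ++p) {
    for (int32_t b = 0; b < buildSize; ++b) {
      out.probe.push_back(p);
      out.build.push_back(b);
    }
  }
  return out;
}
```

In the single-build-row case (buildSize == 1) the build indices are all zero, so the batch degenerates to a dictionary over the probe vector alone, which is exactly why the copy in CopyBuildValues can be skipped.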

Performance Impact

  • Positive Impact: I have run benchmarks.

    Benchmark results:
    NLJ before optimization: 4.68 h
    NLJ after optimization: 1.4 min
    

Release Note

Release Note:
- Fixed an OOM in `NestedLoopJoinProbe` by dynamically adjusting the output batch size based on the average probe and build row sizes.
- Optimized NLJ with a filter when the build side is a single vector by returning dictionary vectors instead of copying rows one at a time.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No


@zhangxffff zhangxffff requested a review from fzhedu March 17, 2026 07:10
@zhangxffff zhangxffff marked this pull request as draft March 19, 2026 18:27
@zhangxffff zhangxffff marked this pull request as ready for review March 20, 2026 09:04
@zhangxffff zhangxffff changed the title fix: Dynamically adjust NestedLoopJoinProbe output batch size to prevent OOM fix: Dynamically adjust NLJ output batch size and optimize NLJ with single build vector Mar 20, 2026
@zhangxffff zhangxffff changed the title fix: Dynamically adjust NLJ output batch size and optimize NLJ with single build vector fix: dynamically adjust NLJ output batch size and optimize NLJ with single build vector Mar 20, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] NestedLoopJoinProbe OOM with wide build-side rows
