Thanks for this PR @ariel-miculas. Do you have any benchmark results for this change? Even some example queries. @Weijun-H do you know of any benchmarks to run?
No, I'm having trouble coming up with a realistic benchmark. The previous benchmark https://github.com/apache/datafusion/pull/19687/changes#diff-5358b38b6265d769b66b614f7ba88ed9320f7a9fce5197330b7c01c2a8a3ed3b incorrectly assumes that all the requested bytes (via get_opts) will be read, while you can actually request a 10GiB stream of bytes and read only 16KiB from it. As a result, the benchmark of the previous PR for reducing the read amplification shows impressive improvements, but it hides the fact that it breaks the parallelization between data fetching and json decoding (by doing all the data fetching in the JsonOpener instead of allowing FileStream to do its magic). So I'm not sure how to write a benchmark that can prove at the same time that:
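The distinction above (requested range vs. bytes actually read) can be sketched with a toy lazy source. This is purely illustrative and not object_store's API; LazySource and its methods are hypothetical names:

```rust
// Toy model (hypothetical, not object_store's API): opening a ranged
// "stream" costs nothing up front; only the bytes the consumer actually
// polls are transferred from the source.
struct LazySource {
    range_len: usize,
    fetched: usize,
}

impl LazySource {
    fn request(range_len: usize) -> Self {
        // No bytes move yet, no matter how large the range is.
        LazySource { range_len, fetched: 0 }
    }

    // Pull up to `n` bytes; returns how many were actually transferred.
    fn read(&mut self, n: usize) -> usize {
        let n = n.min(self.range_len - self.fetched);
        self.fetched += n;
        n
    }
}
```

A benchmark that charges the full requested range as "bytes read" would report 10GiB here, even though only 16KiB ever moves.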
terminator: u8,
/// Effective end boundary. Set to `u64::MAX` when `end >= file_size`
/// (last partition), so `FetchingChunks` never transitions to
/// `ScanningLastTerminator` and simply streams to EOF.
"... streams to EOF" is not clear to me. What do you mean?
It means we pass through all the chunks to the json decoder (the caller which polls AlignedBoundaryStream), staying in the FetchingChunks phase until we consume the entire inner stream. This only happens when raw_end >= file_size, i.e. for the last range in a file, in which case there's nothing else to scan past raw_end for a terminator (nor is there any need to do so). So we consume only the initial stream, but since that one includes the end of the file, we pass through all the remaining chunks until end of file (EOF) is reached.
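The rule being described can be condensed into a small sketch. The function and field names here are illustrative, not the PR's actual code: for the last range of a file there is no terminator to find past the end, so the effective end is pushed to u64::MAX and every chunk passes through until the inner stream is exhausted.

```rust
// Hypothetical sketch of the boundary rule described above
// (illustrative names, not the PR's actual fields).
fn effective_end(raw_end: u64, file_size: u64) -> u64 {
    // Last partition: nothing to scan for past the end of the file,
    // so disable the end-boundary transition entirely.
    if raw_end >= file_size { u64::MAX } else { raw_end }
}

// A chunk keeps passing through while its absolute position stays
// below the effective end; with u64::MAX this is always true, so the
// stream is simply drained to EOF.
fn passes_through(pos_after: u64, end: u64) -> bool {
    pos_after < end
}
```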
Thanks!
How about changing it to this?
... and simply streams until EOF is reached
I am not a native English speaker and "verb to EOF" does not sound correct to me.
You're right, "until" is the right word here.
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
)
.await?;

// Last partition reads to EOF — no end-boundary scanning needed.
// Last partition reads until EOF is reached — no end-boundary scanning needed.
let pos_after = this.abs_pos();

// When end == u64::MAX (last partition), this is always
// true and we stream straight through to EOF.
// true and we stream straight through until EOF is reached.
async fn test_no_trailing_newline() {
// Last partition of a file that does not end with a newline.
// end >= file_size → this.end = u64::MAX, so Passthrough streams
// straight to EOF and yields the final incomplete line as-is.
// straight until EOF is reached and yields the final incomplete line as-is.
|
Thanks for reviewing, @martin-g!
Which issue does this PR close?
Rationale for this change
This is an alternative approach to apache#19687.
Instead of reading the entire range in the json FileOpener, implement an
AlignedBoundaryStream which scans the range for newlines as the FileStream
requests data from the stream, by wrapping the original stream returned by the
ObjectStore.
This eliminates the overhead of the extra two get_opts requests needed by
calculate_range and, more importantly, it allows for efficient read-ahead
implementations by the underlying ObjectStore. Previously this was inefficient
because the streams opened by calculate_range included a stream from
(start - 1) to file_size and another one from (end - 1) to end_of_file, just to
find the two relevant newlines.
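The start-boundary part of the scan can be illustrated with a small sketch. This is a toy, not the PR's AlignedBoundaryStream implementation, and align_start is a hypothetical name: when a range starts mid-record, bytes up to and including the first terminator are dropped so decoding begins on a record boundary, using only bytes the stream already delivered.

```rust
// Illustrative sketch only (not the PR's code): trim a streamed chunk to
// the first record boundary, with no extra get_opts probe needed.
fn align_start(buf: &[u8], terminator: u8) -> &[u8] {
    match buf.iter().position(|&b| b == terminator) {
        // Skip past the terminator itself; decoding starts on the next record.
        Some(i) => &buf[i + 1..],
        // No terminator in this chunk yet; the caller keeps scanning the next one.
        None => &[],
    }
}
```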
What changes are included in this PR?
Added the AlignedBoundaryStream which wraps a stream returned by the object
store and finds the delimiting newlines for a particular file range. Notably it doesn't
do any standalone reads (unlike the calculate_range function), eliminating two calls
to get_opts.
Are these changes tested?
Yes, added unit tests.
Are there any user-facing changes?
No