[feature][cp] feat: Add Spark to_json function #390

Merged
markjin1990 merged 2 commits into bytedance:main from markjin1990:cherry-pick-spark-func-to-json
Mar 23, 2026


@markjin1990
Collaborator

@markjin1990 markjin1990 commented Mar 12, 2026

What problem does this PR solve?

Issue Number: close #391

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

  • Cherry-picked the Spark function implementation of to_json from facebookincubator/velox@a871a75.
  • Modified the unit test ToJsonTest.longDecimal to be consistent with Spark, which renders the decimal '0.0000000000' as '0E-10' (scientific notation) when Bolt supports Spark.
  • Added the missing components needed by the cherry-pick.
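
For reviewers wondering where '0E-10' comes from: the sketch below assumes the Spark output follows the string rules documented for Java's BigDecimal.toString(), where scientific notation is used once the adjusted exponent drops below -6. `formatDecimal` is a hypothetical standalone helper, not code from this PR, and it uses int64_t for brevity rather than a 128-bit long decimal.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: formats unscaledValue * 10^(-scale) using the rules
// documented for Java's BigDecimal.toString(). Not part of Bolt or Velox.
std::string formatDecimal(int64_t unscaledValue, int32_t scale) {
  std::string digits = std::to_string(unscaledValue);
  std::string sign;
  if (digits[0] == '-') {
    sign = "-";
    digits.erase(0, 1);
  }
  const int64_t numDigits = static_cast<int64_t>(digits.size());
  // Exponent of the leading digit when written as d.ddd x 10^n.
  const int64_t adjusted = -static_cast<int64_t>(scale) + (numDigits - 1);

  if (scale >= 0 && adjusted >= -6) {
    // Plain notation, no exponent.
    if (scale == 0) {
      return sign + digits;
    }
    if (numDigits > scale) {
      digits.insert(static_cast<std::size_t>(numDigits - scale), ".");
      return sign + digits;
    }
    // Pad with zeros between the decimal point and the digits.
    return sign + "0." +
        std::string(static_cast<std::size_t>(scale - numDigits), '0') + digits;
  }

  // Scientific notation: one digit before the point, then the exponent.
  std::string out = sign + digits.substr(0, 1);
  if (numDigits > 1) {
    out += "." + digits.substr(1);
  }
  out += "E";
  if (adjusted > 0) {
    out += "+";
  }
  out += std::to_string(adjusted);
  return out;
}
```

Under these rules an unscaled value of 0 with scale 10 has an adjusted exponent of -10, so `formatDecimal(0, 10)` returns "0E-10", while `formatDecimal(0, 2)` stays in plain notation as "0.00".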

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- [cherry-pick] feat: Add Spark to_json function

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


@markjin1990 markjin1990 force-pushed the cherry-pick-spark-func-to-json branch from 9af9d0e to 26f39f0 March 12, 2026 22:07
@@ -0,0 +1,82 @@
/*
Collaborator

Incorrect license

@markjin1990 markjin1990 force-pushed the cherry-pick-spark-func-to-json branch 2 times, most recently from 65a3c5d to c3cb26c March 13, 2026 16:56
Cherry-picked from facebookincubator/velox@a871a75

Original-author: Wechar Yu <yuwq1996@gmail.com>
Cherry-picked-by: Zhongjun Jin <markjin1990@gmail.com>

Original Commit Message:
------------------------------------------------------------
Summary:
The `to_json` function converts a Json object (ROW, ARRAY or MAP) into a JSON string.

Spark's implementation: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L672

https://docs.databricks.com/en/sql/language-manual/functions/to_json.html

Pull Request resolved: #11995

Reviewed By: xiaoxmeng

Differential Revision: D79266717

Pulled By: kKPulla

fbshipit-source-id: da5308e663f1149dbfa5a95f6b61ee1c4ab86d7c
------------------------------------------------------------

Source: facebookincubator/velox@a871a75
@markjin1990 markjin1990 force-pushed the cherry-pick-spark-func-to-json branch from c3cb26c to 9ce2c7f March 13, 2026 17:38
@markjin1990 markjin1990 changed the title from "WIP: [Cherry-pick][Velox] feat: Add Spark to_json function" to "[Cherry-pick][Velox] feat: Add Spark to_json function" Mar 13, 2026
@frankobe frankobe changed the title from "[Cherry-pick][Velox] feat: Add Spark to_json function" to "[feature][cp] feat: Add Spark to_json function" Mar 13, 2026
@wangxinshuo-bolt
Collaborator

Great work, I have some questions:

  1. Before open-sourcing, we internally implemented our own to_json function. After open-sourcing, the to_json function support was removed. Why is this?
  2. If the to_json function is cherry-picked, were the original UT cases retained? Those examples came from our online cases.

@markjin1990
Collaborator Author

@wangxinshuo-bolt Excellent questions! @frankobe has the same concerns.

  1. Our internal (Presto) to_json implementation can only take one argument. After the rebase, Gluten requires two arguments for to_json (input + timeZone), and Bolt fails as described in [Bug] "to_json" fails on gluten with error "Scalar function to_json not registered with arguments: (ARRAY<VARCHAR>, VARCHAR)" #391. So we either need to cherry-pick the Spark to_json implementation to meet the new requirement, or rewrite Gluten's ToJsonTransformer.
  2. I have already run tests on 150 internal tasks, and the results all match. I am now running the original UT cases to see whether we missed anything. I will update you with the final result.
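
To make the failure mode in point 1 concrete, here is a minimal sketch of signature-based lookup. `FunctionRegistry` is a hypothetical stand-in written for this comment; it is not the registration API of Bolt, Velox, or Gluten, and it only assumes that resolution needs an overload whose argument types match exactly.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified registry keyed by function name; each entry holds
// the argument type lists of the registered overloads.
class FunctionRegistry {
 public:
  void add(const std::string& name, std::vector<std::string> argTypes) {
    signatures_[name].push_back(std::move(argTypes));
  }

  // Throws unless an overload with exactly these argument types exists.
  void resolve(
      const std::string& name,
      const std::vector<std::string>& argTypes) const {
    auto it = signatures_.find(name);
    if (it != signatures_.end()) {
      for (const auto& candidate : it->second) {
        if (candidate == argTypes) {
          return;
        }
      }
    }
    throw std::invalid_argument(
        "Scalar function " + name + " not registered with arguments: (" +
        join(argTypes) + ")");
  }

 private:
  static std::string join(const std::vector<std::string>& parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
      if (i > 0) {
        out += ", ";
      }
      out += parts[i];
    }
    return out;
  }

  std::map<std::string, std::vector<std::vector<std::string>>> signatures_;
};
```

With only a one-argument overload added, resolving `to_json` with `{"ARRAY<VARCHAR>", "VARCHAR"}` throws; adding a two-argument overload (input + timeZone) makes the same call succeed, which is the gap the cherry-pick closes.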

…cpp, 1) support varbinary type in Spark to_json, 2) fix Array with single empty ROW case in Spark to_json function.

auto input = makeRowVector({mapVector, arrayVector});
auto expected = makeNullableFlatVector<std::string>(
{R"({"c0":{"blue":[1,2],"red":[null,4]},"c1":[{"blue":1,"red":2},{"green":null}]})",
Collaborator Author

@wangxinshuo-bolt So far, the only difference between this cherry-picked Spark to_json function and the existing to_json is in nested JSON objects: the Spark to_json function adds quotes around the nested keys, but the existing to_json does not.

Collaborator

I think "c0":{"blue":[1,2],"red":[null,4]} seems more reasonable, and we can use Spark to test the actual return value of the function.

Collaborator Author

@wangxinshuo-bolt Spark does have quotes around key strings.

Collaborator

Excellent!
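
The key-quoting behaviour discussed in this thread can be illustrated with a small standalone serializer. This is only a sketch: `Value`, its helper constructors, and `toJson` are hypothetical names, the model covers just null, integer, array, and object values, and string escaping handles only quotes and backslashes. It is not the Velox or Bolt implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal value model; not Bolt's or Velox's vector types.
struct Value {
  enum class Kind { kNull, kInt, kArray, kObject };
  Kind kind = Kind::kNull;
  int64_t number = 0;
  std::vector<Value> items;                           // used by kArray
  std::vector<std::pair<std::string, Value>> fields;  // used by kObject
};

Value nullValue() {
  return Value{};
}

Value intValue(int64_t v) {
  Value x;
  x.kind = Value::Kind::kInt;
  x.number = v;
  return x;
}

Value arrayValue(std::vector<Value> items) {
  Value x;
  x.kind = Value::Kind::kArray;
  x.items = std::move(items);
  return x;
}

Value objectValue(std::vector<std::pair<std::string, Value>> fields) {
  Value x;
  x.kind = Value::Kind::kObject;
  x.fields = std::move(fields);
  return x;
}

// Wraps a key in double quotes, escaping embedded quotes and backslashes.
std::string quote(const std::string& s) {
  std::string out = "\"";
  for (char c : s) {
    if (c == '"' || c == '\\') {
      out += '\\';
    }
    out += c;
  }
  out += '"';
  return out;
}

// Serializes recursively; keys are quoted at every nesting level.
std::string toJson(const Value& v) {
  std::string out;
  switch (v.kind) {
    case Value::Kind::kNull:
      return "null";
    case Value::Kind::kInt:
      return std::to_string(v.number);
    case Value::Kind::kArray:
      out = "[";
      for (std::size_t i = 0; i < v.items.size(); ++i) {
        if (i > 0) {
          out += ",";
        }
        out += toJson(v.items[i]);
      }
      return out + "]";
    case Value::Kind::kObject:
      out = "{";
      for (std::size_t i = 0; i < v.fields.size(); ++i) {
        if (i > 0) {
          out += ",";
        }
        out += quote(v.fields[i].first) + ":" + toJson(v.fields[i].second);
      }
      return out + "}";
  }
  return "null";  // Unreachable; keeps compilers quiet about the switch.
}
```

Building a row with a map column c0 and an array-of-maps column c1 that mirror the test input above produces the same string as the expected value in the test, with "blue", "red", and "green" quoted inside the nested objects.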

Collaborator

@wangxinshuo-bolt wangxinshuo-bolt left a comment

LGTM

@markjin1990 markjin1990 added this pull request to the merge queue Mar 23, 2026
@markjin1990 markjin1990 removed this pull request from the merge queue due to a manual request Mar 23, 2026
@markjin1990 markjin1990 added this pull request to the merge queue Mar 23, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 23, 2026
@markjin1990 markjin1990 added this pull request to the merge queue Mar 23, 2026
Merged via the queue into bytedance:main with commit ab89ad8 Mar 23, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] "to_json" fails on gluten with error "Scalar function to_json not registered with arguments: (ARRAY<VARCHAR>, VARCHAR)"

3 participants