
[feature]: Add Spark 4.0 VARIANT type and Parquet read support #355

Open
oarap wants to merge 1 commit into bytedance:main from oarap:oarap_variant_spark

Conversation


@oarap oarap commented Mar 6, 2026

Summary

Implements end-to-end support for the Spark 4.0 VARIANT type in Bolt — from the type system and vector representation through Parquet I/O, serialization, expression evaluation, and Arrow interop.

The VARIANT type stores semi-structured JSON-like data as two VARBINARY fields: value (binary payload) and metadata (string dictionary).

What problem does this PR solve?

Issue Number: close #420

Type of Change

  • ✨ New feature (non-breaking change which adds functionality)

Description

Changes

Type System

  • Added TypeKind::VARIANT, VariantType, VariantValue (view) and OwnedVariantValue (owning) with full folly::hasher / std::hash support.
  • VARIANT is marked non-orderable and non-comparable — byte-level comparison does not imply semantic equality, so GROUP BY, JOIN, and ORDER BY on VARIANT are blocked at the planner level.

Vector Layer

  • New VariantVector (VectorEncoding::Simple::VARIANT): a BaseVector subclass backed by two child FlatVector<StringView> columns for value and metadata.
  • Integrated into DecodedVector, BaseVector::createInternal, VectorEncoding, and related utilities.

Parquet Reader

  • New VariantColumnReader (subclass of StructColumnReader) promotes STRUCT<value BINARY, metadata BINARY> Parquet columns into VariantVector outputs.
  • Auto-detection heuristic in getParquetColumnInfo; explicit requestedType = VARIANT always takes precedence. DWRF footer parsing deliberately does not auto-promote to avoid false positives on legitimate structs.

Spark VARIANT Encoding/Decoding

  • SparkVariantEncoder: two-pass JSON → VARIANT binary encoder.
  • SparkVariantReader: decoder supporting Spark bit-coded, compact, and raw JSON formats with auto-detection. decode() returns std::optional<std::string>: std::nullopt = decode failure, "null" = valid JSON null.
  • JSONPath navigation (.key, [index], quoted keys) on binary payloads without a full decode.

SQL Functions

  • parse_json(VARCHAR) → VARIANT: encodes a JSON string into Spark binary VARIANT.
  • variant_get(VARIANT, VARCHAR) → VARCHAR: extracts a value at a JSONPath via three strategies: binary nav → compact nav → full decode + simdjson fallback.

Tests

  • VariantTypeTest, VariantVectorTest, VariantFunctionTest — type, vector, and function correctness.
  • 5 new ParquetReaderTest cases covering end-to-end reads, ScanSpec order mismatches, all Spark primitive types, nested containers (ARRAY<VARIANT>, MAP<VARCHAR, VARIANT>, STRUCT<VARIANT, VARIANT>), and raw JSON payloads. Real Parquet fixture files included.

Performance Impact

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Before: 10.5s
    After:   8.2s  (~22% faster)

Release Note

Release Note:
- Read `VARIANT` columns from Parquet files written by Spark 4.0. Bolt automatically detects the physical `STRUCT<value BINARY, metadata BINARY>` encoding and promotes it to a first-class `VARIANT` type.

- New SQL functions:
  - `parse_json(VARCHAR) → VARIANT` — parses a JSON string into Spark binary VARIANT format.
  - `variant_get(VARIANT, VARCHAR) → VARCHAR` — extracts a value from a VARIANT column using a JSONPath expression (e.g. `$.a.b`, `$[0]`).

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.

Breaking Changes

  • No


@oarap oarap force-pushed the oarap_variant_spark branch 2 times, most recently from 84ece88 to d90040e on March 18, 2026 at 17:46
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request Mar 20, 2026
Cherry-picked from PR bytedance#355 (commits d90040e, 8e131c8) as baseline
before applying review fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request Mar 20, 2026
…tation

Addresses multiple issues identified during code review:

- Fix bytedance#5: Replace "null" string sentinel in decode() with format
  discriminator (detectFormat) that determines encoding type upfront
  instead of trial-and-error fallback chain
- Fix bytedance#6: Remove fragile single-row swap heuristic in
  VariantColumnReader; rely on ScanSpec subscript ordering instead
- Fix bytedance#7: Document byte-level compare/hash semantic limitations
  (same logical value can have different binary representations)
- Fix bytedance#8: Add requestedType guard to Parquet VARIANT auto-detection
  to reduce false positives on legitimate structs
- Fix bytedance#10: Add missing DECIMAL4/DECIMAL8/DECIMAL16 decoding cases
  that previously caused silent data loss
- Fix bytedance#11: Replace O(n^2) compact container end-offset computation
  with O(n log n) sorted approach via computeCompactEndOffsets()
- Fix bytedance#12: Change StringDictionary::add() return type from uint32_t
  (always returning 0) to void
- Fix bytedance#13: Document StringView inline storage safety assumption in
  serde deserialization
- Fix bytedance#14: Replace heap-allocated unique_ptr<VectorReader> with
  std::optional to avoid per-batch allocations
- Fix bytedance#15-16: Extract duplicated VARIANT-to-JSON logic into
  appendVariantAsJson() helper
- Fix bytedance#17: Add #undef for pre-C++20 concept macro fallbacks in
  JsonUtil.h to prevent global namespace pollution
- Fix bytedance#29: Add VariantVector::kValueChildName/kMetadataChildName
  constants and use them in VariantColumnReader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix ParquetReader build error (requestedType->isVariant)

The requestedType parameter is already a TypePtr, not a TypeWithId,
so ->type()->isVariant() should be ->isVariant().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix VariantFunctionTest.basic to handle VariantVector

The test was using evaluateOnce<VariantValue> which internally tries to
dynamic_cast the result to SimpleVector<VariantValue>. But parse_json
returns a VariantVector (a composite vector), not a FlatVector. Fixed
by evaluating directly and extracting from the VariantVector.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): apply harsh review fixes to VARIANT implementation

Critical fixes:
- Fix bytedance#5/N6: Replace "null" string sentinel with std::optional<std::string>
  in decode chain. JSON null values are no longer confused with decode failures.
- Fix bytedance#4: Split 1,374-line header-only VariantEncoding.h into .h (declarations)
  + .cpp (implementations) to reduce compile-time tax on every includer.
- Fix bytedance#7/N10: Mark VariantType as non-orderable/non-comparable to prevent
  query planner from generating GROUP BY/JOIN/DISTINCT/ORDER BY plans that
  would produce silently wrong results due to byte-level comparison.
- Fix bytedance#19: Change isPrimitiveType to false since VARIANT has composite storage
  (VariantVector with 2 children). Add VARIANT to SimpleFunctionMetadata
  static_assert whitelist.
- Fix bytedance#8/N8: Remove unconditional VARIANT auto-promotion from DWRF ProtoUtils.
  Add warning comments to Parquet and Arrow heuristics about false positives.

Algorithmic fixes:
- Fix N1: Replace O(n^2) inner loops in getSubVariantCompact with
  computeCompactEndOffsets (O(n log n)).
- Fix N5: Reorder detectFormat checks to avoid ambiguous format classification.
- Fix bytedance#21/N9: Rewrite variant_get with 3 clear strategies (binary nav, compact
  nav, full decode+JSON extract) instead of 6+ redundant fallback paths.

Safety fixes:
- Fix N2: Delete copy/move on VectorWriter<Variant> to prevent dangling refs.
- Fix N3: Add BOLT_CHECK encoding guard in PrestoSerializer readRowVector.
- Fix N7: Resize parent before children in VariantVector for exception safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): address remaining review issues

Arrow auto-promotion (bytedance#8):
- Remove heuristic STRUCT<value,metadata> → VARIANT promotion from Arrow
  import. Now requires explicit "ARROW:extension:name" = "spark.variant"
  metadata annotation. Prevents false positives on legitimate structs.

VariantVector::valueVector override confusion (#valueVector):
- Remove override of BaseVector::valueVector() which has different
  semantics (wrapper inner vector vs child[0]). Rename to
  valueChildVector()/metadataChildVector() to avoid confusion.
- Update all callers (VectorReaders, VectorWriters, ContainerRowSerde,
  RowBasedSerde, ArrowSerializer, tests).

Serde duplication (bytedance#15):
- Extract shared VARIANT serialize/deserialize/compare/hash logic into
  bolt/exec/VariantSerdeDetail.h. Both ContainerRowSerde and RowBasedSerde
  now delegate to the shared implementations, eliminating ~230 lines of
  duplicated code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@oarap oarap changed the title from "feat(variant): add Spark 4.0 VARIANT type and Parquet read support" to "[feature]: Add Spark 4.0 VARIANT type and Parquet read support" on Mar 20, 2026
Introduce TypeKind::VARIANT as Spark-compatible STRUCT<value VARBINARY, metadata VARBINARY> and wire it through vectors, serde, and Spark SQL functions.

- Add VARIANT type and VariantVector/VariantValue plumbing.
- Implement Spark VARIANT dictionary parsing and decoding for both bit-coded and compact encodings (compact string length and unordered start-offset tables).
- Support variant_get / parse_json over Spark VARIANT payloads.
- Improve Parquet reader integration, including ScanSpec child ordering mismatch correction for (value, metadata).
- Add Spark-generated VARIANT Parquet fixtures and Parquet reader/unit test coverage.
@oarap oarap force-pushed the oarap_variant_spark branch from bee93cf to 2734188 on March 20, 2026 at 17:21
@oarap oarap requested a review from frankobe March 20, 2026 17:22
@oarap oarap marked this pull request as ready for review March 20, 2026 17:22
@oarap oarap requested review from ZacBlanco and markjin1990 March 20, 2026 17:22

Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark 4.0 VARIANT Type
