
[feature]: Add Spark 4.0 VARIANT type and Parquet read support #355

Open
oarap wants to merge 1 commit into bytedance:main from oarap:oarap_variant_spark

Conversation


@oarap oarap commented Mar 6, 2026

Summary

Implements end-to-end support for the Spark 4.0 VARIANT type in Bolt — from the type system and vector representation through Parquet I/O, serialization, expression evaluation, and Arrow interop.

The VARIANT type stores semi-structured JSON-like data as two VARBINARY fields: value (binary payload) and metadata (string dictionary).

What problem does this PR solve?

Issue Number: close #420

Type of Change

  • ✨ New feature (non-breaking change which adds functionality)

Description

Changes

Type System

  • Added TypeKind::VARIANT, VariantType, VariantValue (view) and OwnedVariantValue (owning) with full folly::hasher / std::hash support.
  • VARIANT is marked non-orderable and non-comparable — byte-level comparison does not imply semantic equality, so GROUP BY, JOIN, and ORDER BY on VARIANT are blocked at the planner level.

Vector Layer

  • New VariantVector (VectorEncoding::Simple::VARIANT): a BaseVector subclass backed by two child FlatVector<StringView> columns for value and metadata.
  • Integrated into DecodedVector, BaseVector::createInternal, VectorEncoding, and related utilities.

Parquet Reader

  • New VariantColumnReader (subclass of StructColumnReader) promotes STRUCT<value BINARY, metadata BINARY> Parquet columns into VariantVector outputs.
  • Auto-detection heuristic in getParquetColumnInfo; explicit requestedType = VARIANT always takes precedence. DWRF footer parsing deliberately does not auto-promote to avoid false positives on legitimate structs.

Spark VARIANT Encoding/Decoding

  • SparkVariantEncoder: two-pass JSON → VARIANT binary encoder.
  • SparkVariantReader: decoder supporting Spark bit-coded, compact, and raw JSON formats with auto-detection. decode() returns std::optional<std::string>: std::nullopt = decode failure, "null" = valid JSON null.
  • JSONPath navigation (.key, [index], quoted keys) on binary payloads without a full decode.

SQL Functions

  • parse_json(VARCHAR) → VARIANT: encodes a JSON string into Spark binary VARIANT.
  • variant_get(VARIANT, VARCHAR) → VARCHAR: extracts a value at a JSONPath via three strategies: binary nav → compact nav → full decode + simdjson fallback.

Tests

  • VariantTypeTest, VariantVectorTest, VariantFunctionTest — type, vector, and function correctness.
  • 5 new ParquetReaderTest cases covering end-to-end reads, ScanSpec order mismatches, all Spark primitive types, nested containers (ARRAY<VARIANT>, MAP<VARCHAR, VARIANT>, STRUCT<VARIANT, VARIANT>), and raw JSON payloads. Real Parquet fixture files included.

Performance Impact

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Before: 10.5s
    After:   8.2s  (~22% faster)

Release Note

Release Note:
- Read `VARIANT` columns from Parquet files written by Spark 4.0. Bolt automatically detects the physical `STRUCT<value BINARY, metadata BINARY>` encoding and promotes it to a first-class `VARIANT` type.

- New SQL functions:
  - `parse_json(VARCHAR) → VARIANT` — parses a JSON string into Spark binary VARIANT format.
  - `variant_get(VARIANT, VARCHAR) → VARCHAR` — extracts a value from a VARIANT column using a JSONPath expression (e.g. `$.a.b`, `$[0]`).

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.

Breaking Changes

  • No


@oarap oarap force-pushed the oarap_variant_spark branch 2 times, most recently from 84ece88 to d90040e on March 18, 2026 at 17:46
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request Mar 20, 2026
Cherry-picked from PR bytedance#355 (commits d90040e, 8e131c8) as baseline
before applying review fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request Mar 20, 2026
…tation

Addresses multiple issues identified during code review:

- Fix bytedance#5: Replace "null" string sentinel in decode() with format
  discriminator (detectFormat) that determines encoding type upfront
  instead of trial-and-error fallback chain
- Fix bytedance#6: Remove fragile single-row swap heuristic in
  VariantColumnReader; rely on ScanSpec subscript ordering instead
- Fix bytedance#7: Document byte-level compare/hash semantic limitations
  (same logical value can have different binary representations)
- Fix bytedance#8: Add requestedType guard to Parquet VARIANT auto-detection
  to reduce false positives on legitimate structs
- Fix bytedance#10: Add missing DECIMAL4/DECIMAL8/DECIMAL16 decoding cases
  that previously caused silent data loss
- Fix bytedance#11: Replace O(n^2) compact container end-offset computation
  with O(n log n) sorted approach via computeCompactEndOffsets()
- Fix bytedance#12: Change StringDictionary::add() return type from uint32_t
  (always returning 0) to void
- Fix bytedance#13: Document StringView inline storage safety assumption in
  serde deserialization
- Fix bytedance#14: Replace heap-allocated unique_ptr<VectorReader> with
  std::optional to avoid per-batch allocations
- Fix bytedance#15-16: Extract duplicated VARIANT-to-JSON logic into
  appendVariantAsJson() helper
- Fix bytedance#17: Add #undef for pre-C++20 concept macro fallbacks in
  JsonUtil.h to prevent global namespace pollution
- Fix bytedance#29: Add VariantVector::kValueChildName/kMetadataChildName
  constants and use them in VariantColumnReader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix ParquetReader build error (requestedType->isVariant)

The requestedType parameter is already a TypePtr, not a TypeWithId,
so ->type()->isVariant() should be ->isVariant().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix VariantFunctionTest.basic to handle VariantVector

The test was using evaluateOnce<VariantValue> which internally tries to
dynamic_cast the result to SimpleVector<VariantValue>. But parse_json
returns a VariantVector (a composite vector), not a FlatVector. Fixed
by evaluating directly and extracting from the VariantVector.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): apply harsh review fixes to VARIANT implementation

Critical fixes:
- Fix bytedance#5/N6: Replace "null" string sentinel with std::optional<std::string>
  in decode chain. JSON null values are no longer confused with decode failures.
- Fix bytedance#4: Split 1,374-line header-only VariantEncoding.h into .h (declarations)
  + .cpp (implementations) to reduce compile-time tax on every includer.
- Fix bytedance#7/N10: Mark VariantType as non-orderable/non-comparable to prevent
  query planner from generating GROUP BY/JOIN/DISTINCT/ORDER BY plans that
  would produce silently wrong results due to byte-level comparison.
- Fix bytedance#19: Change isPrimitiveType to false since VARIANT has composite storage
  (VariantVector with 2 children). Add VARIANT to SimpleFunctionMetadata
  static_assert whitelist.
- Fix bytedance#8/N8: Remove unconditional VARIANT auto-promotion from DWRF ProtoUtils.
  Add warning comments to Parquet and Arrow heuristics about false positives.

Algorithmic fixes:
- Fix N1: Replace O(n^2) inner loops in getSubVariantCompact with
  computeCompactEndOffsets (O(n log n)).
- Fix N5: Reorder detectFormat checks to avoid ambiguous format classification.
- Fix bytedance#21/N9: Rewrite variant_get with 3 clear strategies (binary nav, compact
  nav, full decode+JSON extract) instead of 6+ redundant fallback paths.

Safety fixes:
- Fix N2: Delete copy/move on VectorWriter<Variant> to prevent dangling refs.
- Fix N3: Add BOLT_CHECK encoding guard in PrestoSerializer readRowVector.
- Fix N7: Resize parent before children in VariantVector for exception safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): address remaining review issues

Arrow auto-promotion (bytedance#8):
- Remove heuristic STRUCT<value,metadata> → VARIANT promotion from Arrow
  import. Now requires explicit "ARROW:extension:name" = "spark.variant"
  metadata annotation. Prevents false positives on legitimate structs.

VariantVector::valueVector override confusion (#valueVector):
- Remove override of BaseVector::valueVector() which has different
  semantics (wrapper inner vector vs child[0]). Rename to
  valueChildVector()/metadataChildVector() to avoid confusion.
- Update all callers (VectorReaders, VectorWriters, ContainerRowSerde,
  RowBasedSerde, ArrowSerializer, tests).

Serde duplication (bytedance#15):
- Extract shared VARIANT serialize/deserialize/compare/hash logic into
  bolt/exec/VariantSerdeDetail.h. Both ContainerRowSerde and RowBasedSerde
  now delegate to the shared implementations, eliminating ~230 lines of
  duplicated code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@oarap oarap changed the title from "feat(variant): add Spark 4.0 VARIANT type and Parquet read support" to "[feature]: Add Spark 4.0 VARIANT type and Parquet read support" on Mar 20, 2026
Introduce TypeKind::VARIANT as Spark-compatible STRUCT<value VARBINARY, metadata VARBINARY> and wire it through vectors, serde, and Spark SQL functions.

- Add VARIANT type and VariantVector/VariantValue plumbing.
- Implement Spark VARIANT dictionary parsing and decoding for both bit-coded and compact encodings (compact string length and unordered start-offset tables).
- Support variant_get / parse_json over Spark VARIANT payloads.
- Improve Parquet reader integration, including ScanSpec child ordering mismatch correction for (value, metadata).
- Add Spark-generated VARIANT Parquet fixtures and Parquet reader/unit test coverage.
@oarap oarap force-pushed the oarap_variant_spark branch from bee93cf to 2734188 on March 20, 2026 at 17:21
@oarap oarap requested a review from frankobe March 20, 2026 17:22
@oarap oarap marked this pull request as ready for review March 20, 2026 17:22
@oarap oarap requested review from ZacBlanco and markjin1990 March 20, 2026 17:22

Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark 4.0 VARIANT Type
