[feature]: Add Spark 4.0 VARIANT type and Parquet read support #355
Open
oarap wants to merge 1 commit into bytedance:main
Force-pushed from 84ece88 to d90040e
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request on Mar 20, 2026
Cherry-picked from PR bytedance#355 (commits d90040e, 8e131c8) as baseline before applying review fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omerarapdev pushed a commit to omerarapdev/bolt that referenced this pull request on Mar 20, 2026
…tation

Addresses multiple issues identified during code review:

- Fix bytedance#5: Replace "null" string sentinel in decode() with format discriminator (detectFormat) that determines encoding type upfront instead of trial-and-error fallback chain
- Fix bytedance#6: Remove fragile single-row swap heuristic in VariantColumnReader; rely on ScanSpec subscript ordering instead
- Fix bytedance#7: Document byte-level compare/hash semantic limitations (same logical value can have different binary representations)
- Fix bytedance#8: Add requestedType guard to Parquet VARIANT auto-detection to reduce false positives on legitimate structs
- Fix bytedance#10: Add missing DECIMAL4/DECIMAL8/DECIMAL16 decoding cases that previously caused silent data loss
- Fix bytedance#11: Replace O(n^2) compact container end-offset computation with O(n log n) sorted approach via computeCompactEndOffsets()
- Fix bytedance#12: Change StringDictionary::add() return type from uint32_t (always returning 0) to void
- Fix bytedance#13: Document StringView inline storage safety assumption in serde deserialization
- Fix bytedance#14: Replace heap-allocated unique_ptr<VectorReader> with std::optional to avoid per-batch allocations
- Fix bytedance#15-16: Extract duplicated VARIANT-to-JSON logic into appendVariantAsJson() helper
- Fix bytedance#17: Add #undef for pre-C++20 concept macro fallbacks in JsonUtil.h to prevent global namespace pollution
- Fix bytedance#29: Add VariantVector::kValueChildName/kMetadataChildName constants and use them in VariantColumnReader

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix ParquetReader build error (requestedType->isVariant)

The requestedType parameter is already a TypePtr, not a TypeWithId, so ->type()->isVariant() should be ->isVariant().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): fix VariantFunctionTest.basic to handle VariantVector

The test was using evaluateOnce<VariantValue>, which internally tries to dynamic_cast the result to SimpleVector<VariantValue>. But parse_json returns a VariantVector (a composite vector), not a FlatVector. Fixed by evaluating directly and extracting from the VariantVector.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): apply harsh review fixes to VARIANT implementation

Critical fixes:

- Fix bytedance#5/N6: Replace "null" string sentinel with std::optional<std::string> in decode chain. JSON null values are no longer confused with decode failures.
- Fix bytedance#4: Split 1,374-line header-only VariantEncoding.h into .h (declarations) + .cpp (implementations) to reduce compile-time tax on every includer.
- Fix bytedance#7/N10: Mark VariantType as non-orderable/non-comparable to prevent query planner from generating GROUP BY/JOIN/DISTINCT/ORDER BY plans that would produce silently wrong results due to byte-level comparison.
- Fix bytedance#19: Change isPrimitiveType to false since VARIANT has composite storage (VariantVector with 2 children). Add VARIANT to SimpleFunctionMetadata static_assert whitelist.
- Fix bytedance#8/N8: Remove unconditional VARIANT auto-promotion from DWRF ProtoUtils. Add warning comments to Parquet and Arrow heuristics about false positives.

Algorithmic fixes:

- Fix N1: Replace O(n^2) inner loops in getSubVariantCompact with computeCompactEndOffsets (O(n log n)).
- Fix N5: Reorder detectFormat checks to avoid ambiguous format classification.
- Fix bytedance#21/N9: Rewrite variant_get with 3 clear strategies (binary nav, compact nav, full decode+JSON extract) instead of 6+ redundant fallback paths.

Safety fixes:

- Fix N2: Delete copy/move on VectorWriter<Variant> to prevent dangling refs.
- Fix N3: Add BOLT_CHECK encoding guard in PrestoSerializer readRowVector.
- Fix N7: Resize parent before children in VariantVector for exception safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(variant): address remaining review issues

Arrow auto-promotion (bytedance#8):

- Remove heuristic STRUCT<value,metadata> → VARIANT promotion from Arrow import. Now requires explicit "ARROW:extension:name" = "spark.variant" metadata annotation. Prevents false positives on legitimate structs.

VariantVector::valueVector override confusion (#valueVector):

- Remove override of BaseVector::valueVector() which has different semantics (wrapper inner vector vs child[0]). Rename to valueChildVector()/metadataChildVector() to avoid confusion.
- Update all callers (VectorReaders, VectorWriters, ContainerRowSerde, RowBasedSerde, ArrowSerializer, tests).

Serde duplication (bytedance#15):

- Extract shared VARIANT serialize/deserialize/compare/hash logic into bolt/exec/VariantSerdeDetail.h. Both ContainerRowSerde and RowBasedSerde now delegate to the shared implementations, eliminating ~230 lines of duplicated code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
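The "null" sentinel fix described in the commit messages above can be illustrated with a minimal sketch. The function `decodeSketch` and its one-byte tags are hypothetical and invented for this example; they are not the Spark VARIANT wire format and not the PR's `decode()` implementation. The sketch only shows why `std::optional<std::string>` lets a caller tell a valid JSON null apart from a decode failure.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a decoder entry point. Returns the JSON text for
// a payload, or std::nullopt when the payload cannot be decoded.
// The tags 'N' (null) and 'S' (string) are made up for this sketch.
std::optional<std::string> decodeSketch(std::string_view payload) {
  if (payload.empty()) {
    return std::nullopt; // nothing to decode: failure
  }
  switch (payload.front()) {
    case 'N':
      return std::string("null"); // a valid JSON null, not an error
    case 'S':
      // Quote the remaining bytes. No JSON escaping is done in this sketch.
      return "\"" + std::string(payload.substr(1)) + "\"";
    default:
      return std::nullopt; // unknown tag: failure
  }
}

// Callers branch on has_value() instead of comparing against a magic string.
std::string describe(const std::optional<std::string>& decoded) {
  return decoded.has_value() ? "value: " + *decoded : "decode failed";
}
```

With a string sentinel, `decodeSketch("N")` and `decodeSketch("?")` would both have had to return `"null"`; with the optional, the first yields a value and the second yields `std::nullopt`.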
Introduce TypeKind::VARIANT as Spark-compatible STRUCT<value VARBINARY, metadata VARBINARY> and wire it through vectors, serde, and Spark SQL functions.

- Add VARIANT type and VariantVector/VariantValue plumbing.
- Implement Spark VARIANT dictionary parsing and decoding for both bit-coded and compact encodings (compact string length and unordered start-offset tables).
- Support variant_get / parse_json over Spark VARIANT payloads.
- Improve Parquet reader integration, including ScanSpec child ordering mismatch correction for (value, metadata).
- Add Spark-generated VARIANT Parquet fixtures and Parquet reader/unit test coverage.
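The compact encoding mentioned above uses unordered start-offset tables, and the review fixes replace an O(n^2) end-offset computation with a sorted O(n log n) one. The source does not show `computeCompactEndOffsets()`, so the following is only a sketch of the general technique under stated assumptions: elements are stored back to back with no gaps, start offsets are distinct, and the total data size is known. The name `endOffsetsSketch` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Given each element's start offset (in any order) and the size of the data
// region, return each element's end offset, indexed the same way as `starts`.
// Assumption: elements are contiguous, so an element ends where the element
// with the next-larger start offset begins.
std::vector<uint32_t> endOffsetsSketch(
    const std::vector<uint32_t>& starts,
    uint32_t dataSize) {
  // Sort element indices by start offset: O(n log n).
  std::vector<size_t> order(starts.size());
  std::iota(order.begin(), order.end(), size_t{0});
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return starts[a] < starts[b];
  });

  // One linear pass: the successor in sorted order gives the end offset; the
  // last element in sorted order ends at dataSize.
  std::vector<uint32_t> ends(starts.size());
  for (size_t i = 0; i < order.size(); ++i) {
    ends[order[i]] = (i + 1 < order.size()) ? starts[order[i + 1]] : dataSize;
  }
  return ends;
}
```

For start offsets `{10, 0, 4}` and a data size of 15, this returns `{15, 4, 10}`. A quadratic version would instead scan all other elements for every element to find the smallest start offset greater than its own.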
Force-pushed from bee93cf to 2734188
Summary
Implements end-to-end support for the Spark 4.0 `VARIANT` type in Bolt, from the type system and vector representation through Parquet I/O, serialization, expression evaluation, and Arrow interop.

The `VARIANT` type stores semi-structured JSON-like data as two `VARBINARY` fields: `value` (binary payload) and `metadata` (string dictionary).

What problem does this PR solve?
Issue Number: close #420
Type of Change
Description
Changes
Type System
- `TypeKind::VARIANT`, `VariantType`, `VariantValue` (view) and `OwnedVariantValue` (owning) with full `folly::hasher`/`std::hash` support.
- `VARIANT` is marked non-orderable and non-comparable: byte-level comparison does not imply semantic equality, so `GROUP BY`, `JOIN`, and `ORDER BY` on VARIANT are blocked at the planner level.

Vector Layer
- `VariantVector` (`VectorEncoding::Simple::VARIANT`): a `BaseVector` subclass backed by two child `FlatVector<StringView>` columns for value and metadata.
- Integration with `DecodedVector`, `BaseVector::createInternal`, `VectorEncoding`, and related utilities.

Parquet Reader
- `VariantColumnReader` (subclass of `StructColumnReader`) promotes `STRUCT<value BINARY, metadata BINARY>` Parquet columns into `VariantVector` outputs.
- Auto-detection in `getParquetColumnInfo`; explicit `requestedType = VARIANT` always takes precedence. DWRF footer parsing deliberately does not auto-promote, to avoid false positives on legitimate structs.

Spark VARIANT Encoding/Decoding
- `SparkVariantEncoder`: two-pass JSON → VARIANT binary encoder.
- `SparkVariantReader`: decoder supporting Spark bit-coded, compact, and raw JSON formats with auto-detection. `decode()` returns `std::optional<std::string>`: `std::nullopt` = failure, `"null"` = valid JSON null.
- Path navigation (`.key`, `[index]`, quoted keys) on binary payloads without a full decode.

SQL Functions
- `parse_json(VARCHAR) → VARIANT`: encodes a JSON string into Spark binary VARIANT.
- `variant_get(VARIANT, VARCHAR) → VARCHAR`: extracts a value at a JSONPath via three strategies, tried in order: binary nav → compact nav → full decode + simdjson fallback.

Tests
- `VariantTypeTest`, `VariantVectorTest`, `VariantFunctionTest`: type, vector, and function correctness.
- `ParquetReaderTest` cases covering end-to-end reads, ScanSpec order mismatches, all Spark primitive types, nested containers (`ARRAY<VARIANT>`, `MAP<VARCHAR, VARIANT>`, `STRUCT<VARIANT, VARIANT>`), and raw JSON payloads. Real Parquet fixture files included.

Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes