[GH-2240] Fix write and read nested geometry array using vectorized parquet reader #2359
Conversation
Pull Request Overview
This PR addresses SPARK-48942, fixing a bug where reading nested geometry arrays from Parquet files fails when using Spark's vectorized reader. The solution implements compatibility checks for UserDefinedTypes (UDTs) in nested structures and adds workaround utilities for schema transformation.
- Adds UDT compatibility checking in Parquet column vector operations to handle type mismatches between logical and physical schemas
- Implements schema transformation utilities to convert nested GeometryUDT to BinaryType for Parquet compatibility
- Provides comprehensive test coverage for various nested geometry scenarios including arrays of structs and deeply nested structures
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
geoparquetIOTests.scala | Adds test cases for nested geometry array scenarios and GeoParquet format validation |
TransformNestedUDTForParquet.scala | Expression class for transforming nested GeometryUDT schemas to BinaryType |
TransformNestedUDTParquet.scala | Catalyst rule for automatic schema transformation in Parquet reading operations |
SedonaContext.scala | Registers the new transformation rule in the optimization pipeline |
ParquetColumnVector.java | Enhanced type compatibility checking with caching for UDT and nested type comparisons |
```scala
/**
 * Transform a schema to handle nested UDT by processing each top-level field. This preserves
 * top-level GeometryUDT fields while transforming nested ones to BinaryType.
 */
private def transformSchemaForNestedUDT(schema: StructType): StructType = {
  StructType(
    schema.fields.map(field => field.copy(dataType = transformTopLevelUDT(field.dataType))))
}

/**
 * Transform a top-level field's data type, preserving GeometryUDT at the top level but
 * converting nested GeometryUDT to BinaryType.
 */
private def transformTopLevelUDT(dataType: DataType): DataType = {
  dataType match {
    case ArrayType(elementType, containsNull) =>
      ArrayType(transformNestedUDTToBinary(elementType), containsNull)
    case MapType(keyType, valueType, valueContainsNull) =>
      MapType(
        transformNestedUDTToBinary(keyType),
        transformNestedUDTToBinary(valueType),
        valueContainsNull)
    case StructType(fields) =>
      StructType(
        fields.map(field => field.copy(dataType = transformNestedUDTToBinary(field.dataType))))
    case _: GeometryUDT => dataType // Preserve top-level GeometryUDT
    case other => other
  }
}

/**
 * Recursively transform nested data types, converting ALL GeometryUDT to BinaryType. This is
 * used for nested structures where GeometryUDT must be converted.
 */
private def transformNestedUDTToBinary(dataType: DataType): DataType = {
  dataType match {
    case _: GeometryUDT => BinaryType
    case ArrayType(elementType, containsNull) =>
      ArrayType(transformNestedUDTToBinary(elementType), containsNull)
    case MapType(keyType, valueType, valueContainsNull) =>
      MapType(
        transformNestedUDTToBinary(keyType),
        transformNestedUDTToBinary(valueType),
        valueContainsNull)
    case StructType(fields) =>
      StructType(
        fields.map(field => field.copy(dataType = transformNestedUDTToBinary(field.dataType))))
    case udt: UserDefinedType[_] => transformNestedUDTToBinary(udt.sqlType)
    case other => other
  }
}
```
I'm not sure if we can merge these 3 functions into one.
Yes, I think we can combine transformTopLevelUDT() and transformNestedUDTToBinary(), since they differ only in a boolean flag. I'd keep transformSchemaForNestedUDT separate as the entry point for the schema.
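The suggested merge could look roughly like the sketch below. The function name `transformUDT` and the `preserveTopLevel` flag are invented for illustration; this is not the final implementation:

```scala
// Sketch: merging transformTopLevelUDT and transformNestedUDTToBinary into a
// single recursive function driven by a flag. All recursive calls pass
// preserveTopLevel = false, so only a directly top-level GeometryUDT survives.
private def transformUDT(dataType: DataType, preserveTopLevel: Boolean): DataType =
  dataType match {
    case _: GeometryUDT if preserveTopLevel => dataType // keep top-level GeometryUDT
    case _: GeometryUDT => BinaryType                   // nested: convert to binary
    case ArrayType(elementType, containsNull) =>
      ArrayType(transformUDT(elementType, preserveTopLevel = false), containsNull)
    case MapType(keyType, valueType, valueContainsNull) =>
      MapType(
        transformUDT(keyType, preserveTopLevel = false),
        transformUDT(valueType, preserveTopLevel = false),
        valueContainsNull)
    case StructType(fields) =>
      StructType(fields.map(f =>
        f.copy(dataType = transformUDT(f.dataType, preserveTopLevel = false))))
    case udt: UserDefinedType[_] if !preserveTopLevel =>
      transformUDT(udt.sqlType, preserveTopLevel = false)
    case other => other
  }

// The entry point would then become:
// StructType(schema.fields.map(f =>
//   f.copy(dataType = transformUDT(f.dataType, preserveTopLevel = true))))
```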
```scala
val result = readDf.collect()
assert(result.length == 1)
val nestedArray = result(0).getSeq[Any](0)
assert(nestedArray.length == 1)
```
What is the expected type of the nested UDT values (binary or a geometry object)? According to TransformNestedUDTParquet.scala, I guess it is binary.
The type read back is actually geometry. The Parquet file stores the GeometryUDT information in the Spark schema metadata, and when the file is read back, Spark automatically restores it from the SPARK_METADATA_KEY.
The TransformNestedUDTParquet rule just ensures that nested GeometryUDT gets handled properly regardless of which metadata source is used.
I have added some tests after read-back to verify that regular geometry operations work.
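A round-trip check along those lines might look like the sketch below. It assumes a Sedona-enabled SparkSession named `sedona`; the output path and column names are invented for illustration:

```scala
// Sketch: write a nested geometry array to Parquet, read it back, and run a
// regular geometry operation on the values to confirm they come back as
// geometries rather than raw binary.
import org.apache.spark.sql.functions._

val df = sedona.sql(
  "SELECT array(ST_Point(1.0, 2.0), ST_Point(3.0, 4.0)) AS geoms")
df.write.mode("overwrite").parquet("/tmp/nested_geoms.parquet")

val readBack = sedona.read.parquet("/tmp/nested_geoms.parquet")
// Spark restores GeometryUDT from the schema metadata stored under
// SPARK_METADATA_KEY, so ST functions work on the read-back values.
readBack
  .select(explode(col("geoms")).as("g"))
  .selectExpr("ST_AsText(g)")
  .show()
```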
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?
This PR addresses SPARK-48942, a bug that occurs when reading nested geometry arrays from Parquet files using Spark's vectorized reader. The fix implements compatibility checks for UserDefinedTypes (UDTs) in nested structures and adds workaround utilities for schema transformation.
How was this patch tested?
New tests were added to geoparquetIOTests.scala.
Did this PR include necessary documentation updates?