[BUG] AzureSearchWriter sends GeographyPoint field as JSON string instead of GeoJSON object, causing Azure AI Search request failure. #2420

@evmin

Description

SynapseML version

synapseml_2.12:1.0.13

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.12
  • Spark Version (e.g. 3.2.3): 3.4.4, 3.5.1
  • Spark Platform (e.g. Synapse, Databricks): Standalone ("naked") Spark, Databricks

Describe the problem

When using AzureSearchWriter in SynapseML to upload an Edm.GeographyPoint field to Azure AI Search, the writer serializes the location value as a JSON-escaped string rather than as a nested GeoJSON object. Azure AI Search requires a valid spatial object for Edm.GeographyPoint fields, so the request fails with HTTP 400 ("The value specified for the spatial property was not valid").
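The payload difference can be reproduced without Spark or the service at all. A minimal sketch in plain Python (field names copied from the repro; nothing here is SynapseML API):

```python
import json

# Hypothetical reconstruction of the two payload shapes, to show why the
# service rejects the first one.
geojson_str = '{"type":"Point","coordinates":[-0.1276, 51.5072]}'

# What the writer currently emits: location is a *string*, so its quotes
# get escaped when the batch is serialized.
sent = json.dumps({"value": [{"id": "1", "location": geojson_str}]})

# What Azure AI Search expects: location as a nested *object*.
expected = json.dumps({"value": [{"id": "1", "location": json.loads(geojson_str)}]})

assert '\\"type\\"' in sent  # escaped quotes: a string, not a spatial value
assert json.loads(expected)["value"][0]["location"]["type"] == "Point"
```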

Code to reproduce issue

from pyspark.sql import SparkSession, functions as F, types as T
import synapse.ml.services.search 

AZURE_SEARCH_ADMIN_KEY = "api_key"
AZURE_SEARCH_SERVICE   = "ai-search"
INDEX_NAME             = "demo-geoindex"

spark = (
    SparkSession.builder
    .appName("synapseml-to-azure-search")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.13")
    .getOrCreate()
)

indexJson = r'''
{
  "name": "demo-geoindex",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": true, "retrievable": true },
    { "name": "location", "type": "Edm.GeographyPoint", "searchable": false, "filterable": true, "retrievable": true, "sortable": true }
  ]
}
'''.strip()

# location as STRING containing GeoJSON (lon, lat)
doc_schema = T.StructType([
    T.StructField("id", T.StringType(), False),
    T.StructField("location", T.StringType(), False),
])

geojson = '{"type":"Point","coordinates":[-0.1276, 51.5072]}'

df = (spark.createDataFrame([("1", geojson)], schema=doc_schema)
          .withColumn("searchAction", F.lit("upload")))

df.show(1)

df.writeToAzureSearch(
    subscriptionKey=AZURE_SEARCH_ADMIN_KEY,
    serviceName=AZURE_SEARCH_SERVICE,
    indexJson=indexJson,
    actionCol="searchAction",
    batchSize="1"
)

UTILITY - REPRO without Spark (fails with HTTP 400):

curl -s \
 "https://<ai-search>.search.windows.net/indexes/demo-geoindex/docs/index?api-version=2023-07-01-Preview" \
 -H "api-key: <api_key>" \
    -H "Content-Type: application/json" \
    -d '{"value":[{"id":"1","location":"{\"type\":\"Point\",\"coordinates\":[-0.1276, 51.5072]}","@search.action":"upload"}]}'

UTILITY - EXPECTED request (succeeds):

curl -s \
 "https://<ai-search>.search.windows.net/indexes/demo-geoindex/docs/index?api-version=2023-07-01-Preview" \
 -H "api-key: <api_key>" \
    -H "Content-Type: application/json" \
    -d '{"value":[{"id":"3","location":{"type":"Point","coordinates":[-0.1276, 51.5072]},"@search.action":"upload"}]}'
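Until the writer emits a nested object for Edm.GeographyPoint columns, the required transformation can be sketched client-side. `fix_geo_fields` below is a hypothetical helper, not part of SynapseML; it only illustrates decoding the string-valued field back into an object before the batch is sent:

```python
import json

def fix_geo_fields(batch_json: str, geo_fields=("location",)) -> str:
    """Decode GeoJSON carried as strings back into nested objects.

    Hypothetical helper: turns the payload the writer currently emits into
    the shape Azure AI Search accepts for Edm.GeographyPoint fields.
    """
    batch = json.loads(batch_json)
    for doc in batch.get("value", []):
        for field in geo_fields:
            if isinstance(doc.get(field), str):
                doc[field] = json.loads(doc[field])
    return json.dumps(batch)

# The failing payload from the repro, with the double-encoded location:
bad = ('{"value":[{"id":"1","location":'
       '"{\\"type\\":\\"Point\\",\\"coordinates\\":[-0.1276, 51.5072]}",'
       '"@search.action":"upload"}]}')
fixed = fix_geo_fields(bad)
assert json.loads(fixed)["value"][0]["location"]["type"] == "Point"
```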

Other info / logs

25/09/16 19:15:04 WARN HandlingUtils: got error  400: Bad Request on https://ai-search-m-se.search.windows.net/indexes/demo-geoindex/docs/index?api-version=2023-07-01-Preview   {"value":[{"id":"1","location":"{\"type\":\"Point\",\"coordinates\":[-0.1276, 51.5072]}","@search.action":"upload"}]}
25/09/16 19:15:04 INFO SynapseMLLogging: finished sending to https://ai-search-m-se.search.windows.net/indexes/demo-geoindex/docs/index?api-version=2023-07-01-Preview took (248ms)
25/09/16 19:15:04 INFO CodeGenerator: Code generated in 13.930916 ms
25/09/16 19:15:04 INFO CodeGenerator: Code generated in 11.4245 ms
25/09/16 19:15:04 INFO CodeGenerator: Code generated in 20.387 ms
25/09/16 19:15:04 INFO CodeGenerator: Code generated in 6.13 ms
25/09/16 19:15:04 ERROR Executor: Exception in task 10.0 in stage 3.0 (TID 21)
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (functions$$$Lambda$3203/0x000000700207c000: (struct<response:string,status:struct<protocolVersion:struct<protocol:string,major:int,minor:int>,statusCode:int,reasonPhrase:string>>, struct<col1:string>) => struct<response:string,status:struct<protocolVersion:struct<protocol:string,major:int,minor:int>,statusCode:int,reasonPhrase:string>>).
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:217)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ScalaUDF_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1004)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1004)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2298)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.RuntimeException: Service Exception:
	 [{"error":{"code":"","message":"The request is invalid. Details: The value specified for the spatial property was not valid. You must specify a valid spatial value."}},[[HTTP,1,1],400,Bad Request]]
 for input:
	 [{"value":[{"id":"1","location":"{\"type\":\"Point\",\"coordinates\":[-0.1276, 51.5072]}","@search.action":"upload"}]}]
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.$anonfun$checkForErrors$1(AzureSearch.scala:156)
	at scala.Option.map(Option.scala:230)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.checkForErrors(AzureSearch.scala:153)
	at com.microsoft.azure.synapse.ml.services.search.AzureSearchWriter$.$anonfun$prepareDF$23(AzureSearch.scala:356)
	at org.apache.spark.injections.UDFUtils$$anon$2.call(UDFUtils.scala:29)
	at org.apache.spark.sql.functions$.$anonfun$udf$93(functions.scala:5438)
	... 22 more

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
