-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
Is your feature request related to a problem or challenge?
When string scalar functions (e.g. regexp_replace, concat, lower, upper, replace, substr, etc.) receive a Dictionary(K, Utf8) input, the current implementation unwraps the Dictionary to plain Utf8:
- The type coercion layer in
datafusion/expr/src/type_coercion/functions.rsunwrapsDictionary(K, V)toVbefore the function sees it - Each function's
return_typealso mapsDictionary(_, Utf8)→Utf8 - The output is a plain
Utf8array
This causes two problems:
Arrow IPC/Flight message size inflation: Dictionary-encoded columns use a fixed size per row based on the key type (e.g. 4 bytes for Int32) with unique values stored once. Plain Utf8 stores the full string for every row. For columns with repeated string values, this can significantly inflate individual Arrow IPC messages, potentially exceeding gRPC message size limits (4MB default in tonic).
Performance: String operations are applied to every row instead of just the unique dictionary values.
Describe the solution you'd like
Make string scalar functions handle Dictionary-typed inputs natively, preserving the encoding:
return_type: ReturnDictionary(K, V')when input isDictionary(K, V), whereV'is the appropriate result value type.invoke/invoke_with_args: When the input is aDictionaryArray, apply the string operation only to the dictionary's unique values array, then construct a newDictionaryArraywith the transformed values and the original keys (index) array.
Ideally this would be a reusable pattern (perhaps a helper or wrapper) that individual string functions can opt into, rather than duplicating the logic in every function.
Describe alternatives you've considered
- Workaround: Users can wrap results in
arrow_cast(some_string_fn(...), 'Dictionary(Int32, Utf8)')to re-dictionary-encode the output. However, this is non-obvious and still applies the operation to every row rather than just the unique values. - Coerce Dictionary to Utf8View instead of Utf8: The type coercion layer could coerce
Dictionary(K, Utf8)toUtf8Viewrather thanUtf8. Since string functions already handleUtf8Viewinputs and returnUtf8Viewoutputs, this would work without per-function changes.Utf8Viewis more compact thanUtf8for repeated values in IPC serialization, though not as compact as Dictionary encoding.