feat: Add `approx_top_k` aggregate function for approximate heavy hitter / top-k frequency estimation

## Is your feature request related to a problem or challenge?

DataFusion currently lacks a built-in approximate top-k / heavy hitter aggregate function. Users who need to find the most frequently occurring values in a column must resort to `GROUP BY value ORDER BY COUNT(*) DESC LIMIT k`, which requires materializing all distinct groups, sorting, and truncating — a process that is both memory-intensive and slow on high-cardinality columns.

This is a common analytical query pattern with well-known streaming approximation algorithms that provide bounded-error results in constant memory. Other query engines already support this:

- **ClickHouse**: [`topK` / `topKWeighted`](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk) — uses the Filtered Space-Saving algorithm
- **PostgreSQL**: Available via extensions (e.g., `datasketches`)
- **Druid**: Built-in approximate top-k via Data Sketches

## Describe the solution you'd like

Add an `approx_top_k(expression, k)` aggregate function that returns an approximate list of the top `k` most frequent values along with their estimated counts, using a streaming algorithm that operates in bounded memory.

### Algorithm: Filtered Space-Saving

The implementation should use the **Space-Saving** algorithm (Metwally et al., 2005), which maintains a fixed-size summary of counters. When a new item arrives and the summary is full, the item with the minimum count is evicted and replaced. This guarantees that all items with frequency above `N/capacity` will be tracked.

The accuracy is further improved with an **alpha map** (as described in the Filtered Space-Saving variant), which tracks recently evicted items and filters out low-frequency noise before it enters the main summary. This is the same approach used in **ClickHouse's `topK` implementation**.

Key design points:

- **`CAPACITY_MULTIPLIER = 3`**: The internal summary tracks `k * 3` counters (matching ClickHouse's default). Over-provisioning relative to `k` significantly improves accuracy for skewed distributions.
- **Mergeable state**: Summaries from different partitions can be merged, enabling parallel / distributed execution.
- **Serializable**: The summary can be serialized to/from bytes for intermediate state transfer (e.g., proto roundtrip).
- **Type support**: Works with any hashable scalar type (strings, integers, floats, etc.).

### SQL Syntax

```sql
-- Returns a list of structs {value, count} for the top k most frequent values
SELECT approx_top_k(column_name, 5) FROM table;

-- With GROUP BY
SELECT group_col, approx_top_k(value_col, 3) FROM table GROUP BY group_col;

-- k defaults to 5 if omitted
SELECT approx_top_k(column_name) FROM table;
```

### Return Type

Returns a `List(Struct({value: T, count: UInt64}))` ordered by count descending, where `T` matches the input column type.

## Describe alternatives you've considered

1. **Exact `GROUP BY ... ORDER BY COUNT(*) DESC LIMIT k`** — works but requires full materialization of all groups, which is prohibitive on high-cardinality columns (millions of distinct values).
2. **External sketching libraries** — users can implement this as a UDAF, but having it built-in lowers the barrier and ensures it's well-tested and optimized within DataFusion's execution model.
3. **Count-Min Sketch + Heap** — another approximate approach, but Space-Saving provides deterministic error bounds and is simpler to implement correctly for top-k specifically.

## Additional context

- The Space-Saving algorithm provides the guarantee that any item with true frequency ≥ `N/capacity` will appear in the summary.
- The Filtered Space-Saving extension (with alpha map) reduces over-counting of infrequent items, improving result quality in practice.
- This is particularly useful for observability (top error codes, top URLs), analytics (most popular products, most active users), and data profiling (frequent value detection).
- ClickHouse has validated this algorithm at scale in production for years.

### References

- Metwally, A., Agrawal, D., & El Abbadi, A. (2005). *Efficient Computation of Frequent and Top-k Elements in Data Streams.*
- [ClickHouse `topK` documentation](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk)
- [Filtered Space-Saving description](https://doi.org/10.1016/j.ins.2015.11.002)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add `approx_top_k` aggregate function for approximate heavy hitter / top-k frequency estimation #20967

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Algorithm: Filtered Space-Saving

SQL Syntax

Return Type

Describe alternatives you've considered

Additional context

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

feat: Add approx_top_k aggregate function for approximate heavy hitter / top-k frequency estimation #20967

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Algorithm: Filtered Space-Saving

SQL Syntax

Return Type

Describe alternatives you've considered

Additional context

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

feat: Add `approx_top_k` aggregate function for approximate heavy hitter / top-k frequency estimation #20967