While working on the pandas vs DataFrames.jl comparison doc (#2378), I encountered the use case of applying many aggregation functions over many columns.
Consider the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                  index=list('abcdef'))
>>> df[['x', 'y']].agg([max, min])
     x  y
max  6  9
min  1  4
With DataFrames.jl, we can achieve something similar, as previously suggested by @bkamins and @nalimilan.
julia> df = DataFrame(id = 'a':'f', grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing])
6×5 DataFrame
│ Row │ id   │ grp   │ x     │ y     │ z       │
│     │ Char │ Int64 │ Int64 │ Int64 │ Int64?  │
├─────┼──────┼───────┼───────┼───────┼─────────┤
│ 1   │ 'a'  │ 1     │ 6     │ 4     │ 3       │
│ 2   │ 'b'  │ 2     │ 5     │ 5     │ 4       │
│ 3   │ 'c'  │ 1     │ 4     │ 6     │ 5       │
│ 4   │ 'd'  │ 2     │ 3     │ 7     │ 6       │
│ 5   │ 'e'  │ 1     │ 2     │ 8     │ 7       │
│ 6   │ 'f'  │ 2     │ 1     │ 9     │ missing │
julia> combine(df, vec([:x, :y] .=> [maximum minimum]))
1×4 DataFrame
│ Row │ x_maximum │ y_maximum │ x_minimum │ y_minimum │
│     │ Int64     │ Int64     │ Int64     │ Int64     │
├─────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 6         │ 9         │ 1         │ 4         │
As you can see, the results are stored in a single row with many columns. Essentially, if you have N functions and M columns, you end up with N × M columns. IMHO, pandas' output, with one row per function and one column per input column, is nicer. So I'm wondering if DataFrames.jl should be enhanced to support that layout when multiple functions are applied to multiple columns.
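To make the difference between the two layouts concrete, here is a minimal pure-Python sketch (no pandas required). The helper names `wide_layout` and `tall_layout` are made up for illustration and are not part of pandas or DataFrames.jl; the data is just the `x` and `y` columns from the example.

```python
# Hypothetical helpers for illustration only; neither exists in pandas
# or DataFrames.jl.

def wide_layout(columns, funcs):
    """One row holding a value for every (function, column) pair."""
    return {
        f"{col}_{f.__name__}": f(values)
        for f in funcs
        for col, values in columns.items()
    }


def tall_layout(columns, funcs):
    """One row per function, with one entry per input column."""
    return [
        {"function": f.__name__,
         **{col: f(values) for col, values in columns.items()}}
        for f in funcs
    ]


columns = {"x": [6, 5, 4, 3, 2, 1], "y": [4, 5, 6, 7, 8, 9]}
funcs = [max, min]

wide = wide_layout(columns, funcs)  # a single record with N * M keys
tall = tall_layout(columns, funcs)  # N records with M value keys each

print(wide)
print(tall)
```

With N = 2 functions and M = 2 columns, the wide form has 4 keys in one record, while the tall form has 2 records, which is the shape pandas returns from `agg`.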
Here's a small function that works:
julia> function agg(df, cols, funcs)
           result = DataFrame()
           result.function = string.(funcs)
           for c in cols
               result[!, c] = [f(df[!, c]) for f in funcs]
           end
           return result
       end
agg (generic function with 1 method)
julia> agg(df, [:x, :y], [maximum, minimum])
2×3 DataFrame
│ Row │ function │ x     │ y     │
│     │ String   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ maximum  │ 6     │ 9     │
│ 2   │ minimum  │ 1     │ 4     │
Maybe this little agg function can be rolled into combine with a signature like this?
combine(df, ::Vector{Function}, ::Vector{StringOrSymbol})
Thoughts?