Product of multiple aggregation functions and columns #2419

@tk3369

While working on the pandas vs DataFrames.jl comparison doc (#2378), I encountered the use case of applying many aggregation functions over many columns.

Consider the following example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                   index = list('abcdef'))

>>> df[['x', 'y']].agg([max, min])
     x  y
max  6  9
min  1  4

With DataFrames.jl, we can achieve something similar, as previously suggested by @bkamins and @nalimilan.

julia> df = DataFrame(id = 'a':'f', grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing])
6×5 DataFrame
│ Row │ id   │ grp   │ x     │ y     │ z       │
│     │ Char │ Int64 │ Int64 │ Int64 │ Int64?  │
├─────┼──────┼───────┼───────┼───────┼─────────┤
│ 1   │ 'a'  │ 1     │ 6     │ 4     │ 3       │
│ 2   │ 'b'  │ 2     │ 5     │ 5     │ 4       │
│ 3   │ 'c'  │ 1     │ 4     │ 6     │ 5       │
│ 4   │ 'd'  │ 2     │ 3     │ 7     │ 6       │
│ 5   │ 'e'  │ 1     │ 2     │ 8     │ 7       │
│ 6   │ 'f'  │ 2     │ 1     │ 9     │ missing │

julia> combine(df, vec([:x, :y] .=> [maximum minimum]))
1×4 DataFrame
│ Row │ x_maximum │ y_maximum │ x_minimum │ y_minimum │
│     │ Int64     │ Int64     │ Int64     │ Int64     │
├─────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 6         │ 9         │ 1         │ 4         │

As you can see, the results are stored in a single row with many columns. Essentially, if you have N functions and M columns, you end up with N x M result columns. IMHO, pandas' output (one row per function, one column per input column) is nicer. So, I'm wondering if DataFrames.jl should be enhanced to allow multiple functions to be applied to multiple columns.
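To make the difference between the two layouts concrete, here is a minimal plain-Python sketch (no pandas or DataFrames.jl involved) that builds both shapes from the `x` and `y` values used above. The names `wide` and `tall` and the `"<column>_<function>"` key format are only illustrative assumptions that mirror the output shown earlier; this is not how either library computes its result.

```python
# Illustrative sketch only: plain dicts and lists stand in for data frames.
data = {"x": [6, 5, 4, 3, 2, 1], "y": [4, 5, 6, 7, 8, 9]}
funcs = {"maximum": max, "minimum": min}

# Wide layout: a single row with N x M entries keyed "<column>_<function>",
# ordered like the combine output above (all maxima first, then all minima).
wide = {
    f"{col}_{name}": f(values)
    for name, f in funcs.items()
    for col, values in data.items()
}

# Tall layout: one row per function and one column per input column,
# which is the shape pandas' agg returns.
tall = [
    {"function": name, **{col: f(values) for col, values in data.items()}}
    for name, f in funcs.items()
]

print(wide)
print(tall)
```

Both structures hold the same four numbers; only the arrangement differs, which is what this request is about.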

Here's a little code that works:

julia> function agg(df, cols, funcs) 
           result = DataFrame()
           result.function = string.(funcs)
           for c in cols
               result[!, c] = [f(df[!, c]) for f in funcs]
           end
           return result
       end
agg (generic function with 1 method)

julia> agg(df, [:x, :y], [maximum, minimum])
2×3 DataFrame
│ Row │ function │ x     │ y     │
│     │ String   │ Int64 │ Int64 │
├─────┼──────────┼───────┼───────┤
│ 1   │ maximum  │ 6     │ 9     │
│ 2   │ minimum  │ 1     │ 4     │

Maybe this little agg function can be rolled into combine with a signature like this?

combine(df, ::Vector{Function}, ::Vector{StringOrSymbol})

Thoughts?
