One of the recently revised behaviors in dplyr (as of 0.8.0) is that grouping by a categorical variable can produce zero-length groups for unrepresented category levels. I haven't seen this behavior touched on in other issues and wanted to raise it as a topic for consideration. dplyr added an argument called `.drop` which, when set to `FALSE`, retains groups for unrepresented categorical levels.
## existing behavior
x = categorical(["a", "a", "b", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 2, 3, 4])
by(df, :x, length = :y => length)
# 2×2 DataFrame
# │ Row │ x │ length │
# │ │ Categorical… │ Int64 │
# ├─────┼──────────────┼────────┤
# │ 1 │ a │ 2 │
# │ 2 │ b │ 2 │
## alternative output
by(df, :x, length = :y => length)
# 3×2 DataFrame
# │ Row │ x │ length │
# │ │ Categorical… │ Int64 │
# ├─────┼──────────────┼────────┤
# │ 1 │ a │ 2 │
# │ 2 │ b │ 2 │
# │ 3 │ c │ 0 │
Just a couple of ideas: this could possibly reuse `skipmissing=false`, since an unrepresented level could be interpreted colloquially as "missing", although I understand this conflates the concept with the value `missing`. Alternatively, it might be nice to introduce something analogous to `.drop` that specifically controls the behavior of zero-length groups.
There are certainly times when you want to retain the fact that a dataset contains no values for a specific level, and in those cases this would be very handy.
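In the meantime, the requested output can be approximated by hand. The following is only a minimal sketch, assuming CategoricalArrays.jl is loaded and the column contains no `missing` values; `group_lengths` is a hypothetical name used for illustration, not a proposed API. It iterates over every declared level rather than only the levels present in the data, so an unrepresented level yields an empty group.

```julia
using CategoricalArrays

x = categorical(["a", "a", "b", "b"])
levels!(x, ["a", "b", "c"])   # "c" is declared but never observed
y = [1, 2, 3, 4]

# Loop over all declared levels, not just those that occur in x,
# and apply `length` to the (possibly empty) slice of y for each one.
group_lengths = [string(l) => length(y[x .== l]) for l in levels(x)]
```

Here the level `c` is paired with a length of 0 instead of being dropped, which is the behavior a `.drop`-style option would make available directly in `by`.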