by drops zero-length categorical groups #2106

@dgkf

Related to #2104 and #1256

One of the recently revised behaviors in dplyr (as of 0.8.0) is that grouping by categorical variables now produces zero-length groups for unrepresented category levels. I haven't seen this behavior touched on in other issues and wanted to raise it as a topic for consideration. They've added an argument called .drop which, when set to FALSE, retains groups for unrepresented categorical levels.

## existing behavior
using DataFrames, CategoricalArrays
x = categorical(["a", "a", "b", "b"])
levels!(x, ["a", "b", "c"])
df = DataFrame(x = x, y = [1, 2, 3, 4])
by(df, :x, length = :y => length)
# 2×2 DataFrame
# │ Row │ x            │ length │
# │     │ Categorical… │ Int64  │
# ├─────┼──────────────┼────────┤
# │ 1   │ a            │ 2      │
# │ 2   │ b            │ 2      │

## alternative output
by(df, :x, length = :y => length)
# 3×2 DataFrame
# │ Row │ x            │ length │
# │     │ Categorical… │ Int64  │
# ├─────┼──────────────┼────────┤
# │ 1   │ a            │ 2      │
# │ 2   │ b            │ 2      │
# │ 3   │ c            │ 0      │

Just a couple of ideas - this could possibly reuse skipmissing=false, treating unrepresented levels colloquially as "missing" groups, although I understand this conflates the idea with the value missing. Alternatively, it might be nice to introduce something analogous to .drop that specifically controls the behavior of zero-length groups.

There are certainly times when you want to retain the fact that a dataset contains no values of a specific level, and this would be very handy in those cases.
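
To make the requested semantics concrete, here is a minimal sketch in plain Julia, with no DataFrames involved. The helper count_by_level and its drop keyword are hypothetical names used only for illustration; they are not part of DataFrames.jl or CategoricalArrays.jl, and drop simply mirrors the spirit of dplyr's .drop.

```julia
# Hypothetical helper, not an existing DataFrames.jl function:
# count occurrences per level, optionally keeping levels with no rows.
function count_by_level(xs, lvls; drop::Bool = true)
    # Start every declared level at zero so unrepresented levels are known
    counts = Dict(l => 0 for l in lvls)
    for v in xs
        counts[v] += 1
    end
    # Keep the declared level order; skip empty groups only when drop is true
    return [l => counts[l] for l in lvls if !drop || counts[l] > 0]
end

xs = ["a", "a", "b", "b"]
lvls = ["a", "b", "c"]   # "c" is a declared level with no observations

dropped = count_by_level(xs, lvls)                 # current behavior: "c" is omitted
retained = count_by_level(xs, lvls; drop = false)  # proposed: "c" kept with count 0
```

With drop = false the result keeps one entry per declared level, which is the shape of the alternative output above.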
