Skip to content

Commit 718af55

Browse files
authored
remove local definitions of fulldummy and nesting operator (#855)
* remove local definitions of fulldummy and nesting operator These are now implemented as part of RegressionFormulae.jl. At some point, we may need to provide more specialized methods for complex random effects structures (i.e. interaction with correlation suppression), but these can and should be implemented as methods of RegressionFormuale.jl's functions. * docs tweak * NEWS * add documentation on formula syntax * fix math formatting
1 parent b61a60a commit 718af55

File tree

8 files changed

+80
-76
lines changed

8 files changed

+80
-76
lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ MixedModels v5.0.0 Release Notes
99
- Internal code around optimization in profiling has been restructuring so that fitting done during calls to `profile` respect the `backend` and `optimizer` settings. [#853]
1010
- The `prfit!` convenience function has been removed. [#853]
1111
- The `dataset` and `datasets` functions have been removed. They are now housed in `MixedModelsDatasets`.[#854]
12+
- The local implementation of `fulldummy` and the nesting syntax has been removed and a dependency on RegressionFormulae.jl for their implementation has been added. [#855]
1213
- One argument `predict(::GeneralizedLinearMixedModel)`, i.e. prediction on the original data, now supports the `type` keyword argument. [#856]
1314
- `isnested(A::ReMat, B::ReMat)` is now a method of `StatsModels.isnested`.[#858]
1415
- [BREAKING ]`likelihoodratiotest` has been reworked to be a thin wrapper around `StatsModels.lrtest`. The historical difference in behavior in terms of nesting checks created some confusion. Users advanced enough to create models with non-obvious nesting are assumed to be advanced enough to manually compute the likelihood ratio test. The function `likelihoodratiotest` and associated `LikelihoodRatioTest` type (now with a type parameter for number of models) has been kept to enable printing of test results with model formulae. Most users should not notice a difference in behavior, but the display has been slightly changed and the internal field structure has changed.[#858]
@@ -677,5 +678,6 @@ Package dependencies
677678
[#850]: https://github.com/JuliaStats/MixedModels.jl/issues/850
678679
[#853]: https://github.com/JuliaStats/MixedModels.jl/issues/853
679680
[#854]: https://github.com/JuliaStats/MixedModels.jl/issues/854
681+
[#855]: https://github.com/JuliaStats/MixedModels.jl/issues/855
680682
[#856]: https://github.com/JuliaStats/MixedModels.jl/issues/856
681683
[#858]: https://github.com/JuliaStats/MixedModels.jl/issues/858

Project.toml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
2020
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
2121
ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
2222
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
23+
RegressionFormulae = "545c379f-4ec2-4339-9aea-38f2fb6a8ba2"
2324
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
2425
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
2526
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
@@ -95,4 +96,4 @@ Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
9596
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
9697

9798
[targets]
98-
test = ["Aqua", "DataFrames", "ExplicitImports", "FiniteDiff", "ForwardDiff", "InteractiveUtils", "PRIMA", "RegressionFormulae", "StableRNGs", "Suppressor", "Test"]
99+
test = ["Aqua", "DataFrames", "ExplicitImports", "FiniteDiff", "ForwardDiff", "InteractiveUtils", "PRIMA", "StableRNGs", "Suppressor", "Test"]

docs/make.jl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ makedocs(;
2222
"rankdeficiency.md",
2323
"mime.md",
2424
"derivatives.md",
25+
"formula_syntax.md",
2526
"api.md",
2627
],
2728
)

docs/src/constructors.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -188,9 +188,6 @@ DisplayAs.Text(ans) # hide
188188
(Notice that the variance component for `days: 1` is estimated as zero, so the correlations for this component are undefined and expressed as `NaN`, not a number.)
189189

190190
An alternative is to force all the levels of `days` as indicators using `fulldummy` encoding.
191-
```@docs
192-
fulldummy
193-
```
194191
```@example Main
195192
fit(MixedModel, @formula(reaction ~ 1 + days + (1 + fulldummy(days)|subj)), sleepstudy,
196193
contrasts = Dict(:days => DummyCoding()))

docs/src/formula_syntax.md

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
2+
# Formula syntax
3+
4+
MixedModels.jl uses the variant of the Wilkinson-Rogers (1973) notation for models of (co)variance implemented by [StatsModels.jl](https://juliastats.org/StatsModels.jl/stable/formula/#Modeling-tabular-data).
5+
Additionally, MixedModels.jl extends this syntax to use the pipe `|` as the grouping operator.
6+
Further extensions are provided by [RegressionFormulae.jl](https://github.com/kleinschmidt/RegressionFormulae.jl?tab=readme-ov-file), in particular the use of the slash `/` as the nesting operator and the use of the caret `^` to indicate main effects and interactions up to a specified order.
7+
Currently, MixedModels.jl loads RegressionFormulae.jl by default, though this may change in a future release.
8+
If you require specific functionality from RegressionFormulae.jl, it is best to load it directly so that you can control the version used.
9+
10+
## General rules
11+
12+
- "Addition" (`+`) indicates additive, i.e., main effects: `a + b` indicates main effects of `a` and `b`.
13+
- "Multiplication" (`*`) indicates crossing: main effects and interactions between two terms: `a * b` indicates main effects of `a` and `b` as well as their interaction.
14+
- Usual algebraic rules apply (associativity and distributivity):
15+
- `(a + b) * c` is equivalent to `a * c + b * c`
16+
- `a * b * c` corresponds to main effects of `a`, `b`, and `c`, as well as all three two-way interactions and the three-way interaction.
17+
- Categorical terms are expanded into the associated indicators/contrast variables. See the [StatsModels.jl documentation on contrasts](https://juliastats.org/StatsModels.jl/stable/contrasts/) for more information.
18+
- Interactions are expressed with the ampersand (`&`). (This is contrast to R, which uses the colon `:` for this operation.). `a&b` is the interaction of `a` and `b`. For categorical terms, appropriate combinations of indicators/contrast variables are generated.
19+
- Tilde (`~`) is used to separate response from predictors.
20+
- The intercept is indicated by `1`.
21+
- `y ~ 1 + (a + b) * c` is read as:
22+
- The response variable is `y`.
23+
- The model contains an intercept.
24+
- The model contains main effects of `a`, `b`, and `c`.
25+
- The model contains interactions between `a` and `c` and between `b` and `c` but not `a` and `b`.
26+
- An intercept is included by default, i.e. there is an implicit `1 + ` in every formula. The intercept may be suppressed by including a `0 + ` in the formula. (In contrast to R, the use of `-1` is **not** supported.)
27+
28+
### MixedModels.jl provided extensions
29+
30+
- The pipe operator (`|`) indicates grouping or blocking.
31+
- `(1 + a | subject)` indicates "by-subject random effects for the intercept and main effect `a`".
32+
- This is in line with the usual statistical reading of `|` as "conditional on".
33+
34+
### RegressionFormulae.jl provided extensions
35+
36+
- "Exponentiation" (`^`) works like repeated multiplication and generates all multiplicative and additive terms up to the given order.
37+
- `(a + b + c)^2` generates `a + b + c + a&b + a&c + b&c`, but not `a&b&c`.
38+
- The presence of interaction terms within the base will result in redundant terms and is currently unsupported.
39+
- `fulldummy(a)` assigns "contrasts" to `a` that include all indicator columns (dummy variables) and an intercept column. The resulting overparameterization is generally useful in the fixed effects only as part of nesting.
40+
- The slash operator (`/`) indicates nesting:
41+
- `a / b` is read as "`b` is nested within `a`".
42+
- `a / b` expands to `a + fulldummy(a) & b`.
43+
- It is generally not necessary to specify nesting in the blocking variables, when the inner levels are unique across outer levels. In other words, in a study with children (`C1`, `C2`, etc. ) nested within schools (`S1`, `S2`, etc.),
44+
- it is not necessary to specify the nesting when `C1` identifies a unique child across schools. In other words, intercept-only random effects terms can be written as `(1|C) + `(1|S)`.
45+
- it is necessary to specify the nesting when chid identifiers are re-used across schools, e.g. `C1` refers to a child in `S1` and a different child in `S2`. In this case, the nested syntax `(1|S/C)` expands to `(1|S) + (1|S&C)`. The interaction term in the second blocking variable generates unique labels for each child across schools.
46+
47+
48+
49+
## Mixed models in Wilkinson-Rogers and mathematical notation
50+
51+
Models fit with MixedModels.jl are generally linear mixed-effects models with unconstrained random effects covariance matrices and homoskedastic, normally distributed residuals.
52+
Under these assumptions, the model specification
53+
54+
`response ~ 1 + (age + sex) * education * n_children + (1 | subject)`
55+
56+
corresponds to the statistical model
57+
58+
```math
59+
\begin{align*}
60+
\left(Y |\mathcal{B}=b\right) &\sim N\left(X\beta + Zb, \sigma^2 I \right) \\
61+
\mathcal{B} &\sim N\left(0, G\right)
62+
\end{align*}
63+
```
64+
65+
for which we wish to obtain the maximum-likelihood estimates for ``G`` and thus the fixed-effects ``\beta``.
66+
67+
- The model contains no restrictions on ``G``, except that it is positive semidefinite.
68+
- The response ``Y`` is the value of a given response.
69+
- The fixed-effects design matrix ``X`` consists of columns for
70+
- the intercept, age, sex, education, and number of children (contrast coded as appropriate)
71+
- the interaction of all lower order terms, excluding interactions between age and sex
72+
- The random-effects design matrix ``Z`` includes a column for
73+
- the intercept for each subject

src/MixedModels.jl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ using PrecompileTools: PrecompileTools, @setup_workload, @compile_workload
2626
using Printf: @sprintf
2727
using ProgressMeter: ProgressMeter, Progress, finish!, next!
2828
using Random: Random, AbstractRNG, randn!
29+
using RegressionFormulae: fulldummy
2930
using SparseArrays: SparseArrays, SparseMatrixCSC, SparseVector, dropzeros!, nnz
3031
using SparseArrays: nonzeros, nzrange, rowvals, sparse
3132
using StaticArrays: StaticArrays, SVector

src/randomeffectsterm.jl

Lines changed: 0 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -164,70 +164,6 @@ function _ranef_refs(
164164
return refs, uniques
165165
end
166166

167-
# TODO: remove all of this and either
168-
# - require users to use RegressionFormulae.jl
169-
# - add a dependency on RegressionFormulae.jl
170-
171-
function StatsModels.apply_schema(
172-
t::FunctionTerm{typeof(/)}, sch::StatsModels.FullRank, Mod::Type{<:MixedModel}
173-
)
174-
if length(t.args) 2
175-
throw(ArgumentError("malformed nesting term: $t (Exactly two arguments required"))
176-
end
177-
178-
first, second = apply_schema.(t.args, Ref(sch), Mod)
179-
180-
if !(typeof(first) <: CategoricalTerm)
181-
throw(
182-
ArgumentError(
183-
"nesting terms requires categorical grouping term, got $first. Manually specify $first as `CategoricalTerm` in hints/contrasts"
184-
),
185-
)
186-
end
187-
188-
return first + fulldummy(first) & second
189-
end
190-
191-
# add some syntax to manually promote to full dummy coding
192-
function fulldummy(t::AbstractTerm)
193-
return throw(
194-
ArgumentError(
195-
"can't promote $t (of type $(typeof(t))) to full dummy " *
196-
"coding (only CategoricalTerms)",
197-
),
198-
)
199-
end
200-
201-
"""
202-
fulldummy(term::CategoricalTerm)
203-
204-
Assign "contrasts" that include all indicator columns (dummy variables) and an intercept column.
205-
206-
This will result in an under-determined set of contrasts, which is not a problem in the random
207-
effects because of the regularization, or "shrinkage", of the conditional modes.
208-
209-
The interaction of `fulldummy` with complex random effects is subtle and complex with numerous
210-
potential edge cases. As we discover these edge cases, we will document and determine their
211-
behavior. Until such time, please check the model summary to verify that the expansion is
212-
working as you expected. If it is not, please report a use case on GitHub.
213-
"""
214-
function fulldummy(t::CategoricalTerm)
215-
new_contrasts = StatsModels.ContrastsMatrix(
216-
StatsModels.FullDummyCoding(), t.contrasts.levels
217-
)
218-
return t = CategoricalTerm(t.sym, new_contrasts)
219-
end
220-
221-
function fulldummy(x)
222-
return throw(ArgumentError("fulldummy isn't supported outside of a MixedModel formula"))
223-
end
224-
225-
function StatsModels.apply_schema(
226-
t::FunctionTerm{typeof(fulldummy)}, sch::StatsModels.FullRank, Mod::Type{<:MixedModel}
227-
)
228-
return fulldummy(apply_schema.(t.args, Ref(sch), Mod)...)
229-
end
230-
231167
# specify zero correlation
232168
struct ZeroCorr <: AbstractReTerm
233169
term::RandomEffectsTerm

test/FactorReTerm.jl

Lines changed: 1 addition & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -172,12 +172,10 @@ end
172172
f=string.(repeat('A':'C'; outer=6)))
173173

174174
@testset "fulldummy" begin
175-
@test_throws ArgumentError fulldummy(1)
176-
177175
f = @formula(y ~ 1 + fulldummy(f))
178176
f1 = apply_schema(f, schema(dat))
179177
@test typeof(last(f1.rhs.terms)) <: FunctionTerm{typeof(fulldummy)}
180-
@test_throws ArgumentError modelcols(f1, dat)
178+
@test_throws MethodError modelcols(f1, dat)
181179

182180
f2 = apply_schema(f, schema(dat), MixedModel)
183181
@test typeof(last(f2.rhs.terms)) <: CategoricalTerm{<:StatsModels.FullDummyCoding}
@@ -240,11 +238,6 @@ end
240238
@formula(0 ~ 1 + a / b), schema(d2), MixedModel
241239
)
242240

243-
# errors for too much nesting
244-
@test_throws ArgumentError apply_schema(
245-
@formula(0 ~ 1 + b / c / a), schema(d2), MixedModel
246-
)
247-
248241
# fitted model to test amalgamate and fnames, and equivalence with other formulations
249242
psts = dataset("pastes")
250243
m = fit(

0 commit comments

Comments
 (0)