remove local definitions of fulldummy and nesting operator (#855)

palday · web-flow · commit 718af5523637 · 2025-08-26T15:03:29.000Z
* remove local definitions of fulldummy and nesting operator

These are now implemented as part of RegressionFormulae.jl. At some point, we may need to provide more specialized methods for complex random effects structures (i.e. interaction with correlation suppression), but these can and should be implemented as methods of RegressionFormuale.jl's functions.

* docs tweak

* NEWS

* add documentation on formula syntax

* fix math formatting
diff --git a/NEWS.md b/NEWS.md
@@ -9,6 +9,7 @@ MixedModels v5.0.0 Release Notes
 - Internal code around optimization in profiling has been restructuring so that fitting done during calls to `profile` respect the `backend` and `optimizer` settings. [#853]
 - The `prfit!` convenience function has been removed. [#853]
 - The `dataset` and `datasets` functions have been removed. They are now housed in `MixedModelsDatasets`.[#854]
+- The local implementation of `fulldummy` and the nesting syntax has been removed and a dependency on RegressionFormulae.jl for their implementation has been added. [#855]
 - One argument `predict(::GeneralizedLinearMixedModel)`, i.e. prediction on the original data, now supports the `type` keyword argument. [#856]
 - `isnested(A::ReMat, B::ReMat)` is now a method of `StatsModels.isnested`.[#858]
 - [BREAKING ]`likelihoodratiotest` has been reworked to be a thin wrapper around `StatsModels.lrtest`. The historical difference in behavior in terms of nesting checks created some confusion. Users advanced enough to create models with non-obvious nesting are assumed to be advanced enough to manually compute the likelihood ratio test. The function `likelihoodratiotest` and associated `LikelihoodRatioTest` type (now with a type parameter for number of models) has been kept to enable printing of test results with model formulae. Most users should not notice a difference in behavior, but the display has been slightly changed and the internal field structure has changed.[#858]
@@ -677,5 +678,6 @@ Package dependencies
 [#850]: https://github.com/JuliaStats/MixedModels.jl/issues/850
 [#853]: https://github.com/JuliaStats/MixedModels.jl/issues/853
 [#854]: https://github.com/JuliaStats/MixedModels.jl/issues/854
+[#855]: https://github.com/JuliaStats/MixedModels.jl/issues/855
 [#856]: https://github.com/JuliaStats/MixedModels.jl/issues/856
 [#858]: https://github.com/JuliaStats/MixedModels.jl/issues/858
diff --git a/Project.toml b/Project.toml
@@ -20,6 +20,7 @@ PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
 ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
+RegressionFormulae = "545c379f-4ec2-4339-9aea-38f2fb6a8ba2"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
@@ -95,4 +96,4 @@ Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Aqua", "DataFrames", "ExplicitImports", "FiniteDiff", "ForwardDiff", "InteractiveUtils", "PRIMA", "RegressionFormulae", "StableRNGs", "Suppressor", "Test"]
+test = ["Aqua", "DataFrames", "ExplicitImports", "FiniteDiff", "ForwardDiff", "InteractiveUtils", "PRIMA", "StableRNGs", "Suppressor", "Test"]
diff --git a/docs/make.jl b/docs/make.jl
@@ -22,6 +22,7 @@ makedocs(;
         "rankdeficiency.md",
         "mime.md",
         "derivatives.md",
+        "formula_syntax.md",
         "api.md",
     ],
 )
diff --git a/docs/src/constructors.md b/docs/src/constructors.md
@@ -188,9 +188,6 @@ DisplayAs.Text(ans) # hide
 (Notice that the variance component for `days: 1` is estimated as zero, so the correlations for this component are undefined and expressed as `NaN`, not a number.)
 
 An alternative is to force all the levels of `days` as indicators using `fulldummy` encoding.
-```@docs
-fulldummy
-```
 ```@example Main
 fit(MixedModel, @formula(reaction ~ 1 + days + (1 + fulldummy(days)|subj)), sleepstudy,
     contrasts = Dict(:days => DummyCoding()))
diff --git a/docs/src/formula_syntax.md b/docs/src/formula_syntax.md
@@ -0,0 +1,73 @@
+
+# Formula syntax
+
+MixedModels.jl uses the variant of the Wilkinson-Rogers (1973) notation for models of (co)variance implemented by [StatsModels.jl](https://juliastats.org/StatsModels.jl/stable/formula/#Modeling-tabular-data).
+Additionally, MixedModels.jl extends this syntax to use the pipe `|` as the grouping operator.
+Further extensions are provided by [RegressionFormulae.jl](https://github.com/kleinschmidt/RegressionFormulae.jl?tab=readme-ov-file), in particular the use of the slash `/` as the nesting operator and the use of the caret `^` to indicate main effects and interactions up to a specified order.
+Currently, MixedModels.jl loads RegressionFormulae.jl by default, though this may change in a future release.
+If you require specific functionality from RegressionFormulae.jl, it is best to load it directly so that you can control the version used.
+
+## General rules
+
+- "Addition" (`+`) indicates additive, i.e., main effects: `a + b` indicates main effects of `a` and `b`.
+- "Multiplication" (`*`) indicates crossing: main effects and interactions between two terms: `a * b` indicates main effects of `a` and `b` as well as their interaction.
+- Usual algebraic rules apply (associativity and distributivity):
+  - `(a + b) * c` is equivalent to `a * c + b * c`
+  - `a * b * c` corresponds to main effects of `a`, `b`, and `c`, as well as all three two-way interactions and the three-way interaction.
+- Categorical terms are expanded into the associated indicators/contrast variables. See the [StatsModels.jl documentation on contrasts](https://juliastats.org/StatsModels.jl/stable/contrasts/) for more information.
+- Interactions are expressed with the ampersand (`&`). (This is contrast to R, which uses the colon `:` for this operation.). `a&b` is the interaction of `a` and `b`. For categorical terms, appropriate combinations of indicators/contrast variables are generated.
+- Tilde (`~`) is used to separate response from predictors.
+- The intercept is indicated by `1`.
+- `y ~ 1 + (a + b) * c` is read as:
+  - The response variable is `y`.
+  - The model contains an intercept.
+  - The model contains main effects of `a`, `b`, and `c`.
+  - The model contains interactions between `a` and `c` and between `b` and `c` but not `a` and `b`.
+- An intercept is included by default, i.e. there is an implicit `1 + ` in every formula. The intercept may be suppressed by including a `0 + ` in the formula. (In contrast to R, the use of `-1` is **not** supported.)
+
+### MixedModels.jl provided extensions
+
+- The pipe operator (`|`) indicates grouping or blocking.
+- `(1 + a | subject)` indicates "by-subject random effects for the intercept and main effect `a`".
+- This is in line with the usual statistical reading of `|` as "conditional on".
+
+### RegressionFormulae.jl provided extensions
+
+- "Exponentiation" (`^`) works like repeated multiplication and generates all multiplicative and additive terms up to the given order.
+  - `(a + b + c)^2` generates `a + b + c + a&b + a&c + b&c`, but not `a&b&c`.
+  - The presence of interaction terms within the base will result in redundant terms and is currently unsupported.
+- `fulldummy(a)` assigns "contrasts" to `a` that include all indicator columns (dummy variables) and an intercept column. The resulting overparameterization is generally useful in the fixed effects only as part of nesting.
+- The slash operator (`/`) indicates nesting:
+  - `a / b` is read as "`b` is nested within `a`".
+  - `a / b` expands to `a + fulldummy(a) & b`.
+- It is generally not necessary to specify nesting in the blocking variables, when the inner levels are unique across outer levels. In other words, in a study with children (`C1`, `C2`, etc. ) nested within schools (`S1`, `S2`, etc.),
+  - it is not necessary to specify the nesting when `C1` identifies a unique child across schools. In other words, intercept-only random effects terms can be written as `(1|C) + `(1|S)`.
+  - it is necessary to specify the nesting when chid identifiers are re-used across schools, e.g. `C1` refers to a child in `S1` and a different child in `S2`. In this case, the nested syntax `(1|S/C)` expands to `(1|S) + (1|S&C)`. The interaction term in the second blocking variable generates unique labels for each child across schools.
+
+
+
+## Mixed models in Wilkinson-Rogers and mathematical notation
+
+Models fit with MixedModels.jl are generally linear mixed-effects models with unconstrained random effects covariance matrices and homoskedastic, normally distributed residuals.
+Under these assumptions, the model specification
+
+`response ~ 1 + (age + sex) * education * n_children  + (1 | subject)`
+
+corresponds to the statistical model
+
+```math
+\begin{align*}
+\left(Y |\mathcal{B}=b\right) &\sim N\left(X\beta + Zb, \sigma^2 I \right) \\
+\mathcal{B} &\sim N\left(0, G\right)
+\end{align*}
+```
+
+for which we wish to obtain the maximum-likelihood estimates for ``G`` and thus the fixed-effects ``\beta``.
+
+- The model contains no restrictions on ``G``, except that it is positive semidefinite.
+- The response ``Y`` is the value of a given response.
+- The fixed-effects design matrix ``X`` consists of columns for
+  - the intercept, age, sex, education, and number of children (contrast coded as appropriate)
+  - the interaction of all lower order terms, excluding interactions between age and sex
+- The random-effects design matrix ``Z`` includes a column for
+  - the intercept for each subject
diff --git a/src/MixedModels.jl b/src/MixedModels.jl
@@ -26,6 +26,7 @@ using PrecompileTools: PrecompileTools, @setup_workload, @compile_workload
 using Printf: @sprintf
 using ProgressMeter: ProgressMeter, Progress, finish!, next!
 using Random: Random, AbstractRNG, randn!
+using RegressionFormulae: fulldummy
 using SparseArrays: SparseArrays, SparseMatrixCSC, SparseVector, dropzeros!, nnz
 using SparseArrays: nonzeros, nzrange, rowvals, sparse
 using StaticArrays: StaticArrays, SVector
diff --git a/src/randomeffectsterm.jl b/src/randomeffectsterm.jl
@@ -164,70 +164,6 @@ function _ranef_refs(
     return refs, uniques
 end
 
-# TODO: remove all of this and either
-# - require users to use RegressionFormulae.jl
-# - add a dependency on RegressionFormulae.jl
-
-function StatsModels.apply_schema(
-    t::FunctionTerm{typeof(/)}, sch::StatsModels.FullRank, Mod::Type{<:MixedModel}
-)
-    if length(t.args) ≠ 2
-        throw(ArgumentError("malformed nesting term: $t (Exactly two arguments required"))
-    end
-
-    first, second = apply_schema.(t.args, Ref(sch), Mod)
-
-    if !(typeof(first) <: CategoricalTerm)
-        throw(
-            ArgumentError(
-                "nesting terms requires categorical grouping term, got $first.  Manually specify $first as `CategoricalTerm` in hints/contrasts"
-            ),
-        )
-    end
-
-    return first + fulldummy(first) & second
-end
-
-# add some syntax to manually promote to full dummy coding
-function fulldummy(t::AbstractTerm)
-    return throw(
-        ArgumentError(
-            "can't promote $t (of type $(typeof(t))) to full dummy " *
-            "coding (only CategoricalTerms)",
-        ),
-    )
-end
-
-"""
-    fulldummy(term::CategoricalTerm)
-
-Assign "contrasts" that include all indicator columns (dummy variables) and an intercept column.
-
-This will result in an under-determined set of contrasts, which is not a problem in the random
-effects because of the regularization, or "shrinkage", of the conditional modes.
-
-The interaction of `fulldummy` with complex random effects is subtle and complex with numerous
-potential edge cases. As we discover these edge cases, we will document and determine their
-behavior. Until such time, please check the model summary to verify that the expansion is
-working as you expected. If it is not, please report a use case on GitHub.
-"""
-function fulldummy(t::CategoricalTerm)
-    new_contrasts = StatsModels.ContrastsMatrix(
-        StatsModels.FullDummyCoding(), t.contrasts.levels
-    )
-    return t = CategoricalTerm(t.sym, new_contrasts)
-end
-
-function fulldummy(x)
-    return throw(ArgumentError("fulldummy isn't supported outside of a MixedModel formula"))
-end
-
-function StatsModels.apply_schema(
-    t::FunctionTerm{typeof(fulldummy)}, sch::StatsModels.FullRank, Mod::Type{<:MixedModel}
-)
-    return fulldummy(apply_schema.(t.args, Ref(sch), Mod)...)
-end
-
 # specify zero correlation
 struct ZeroCorr <: AbstractReTerm
     term::RandomEffectsTerm
diff --git a/test/FactorReTerm.jl b/test/FactorReTerm.jl
@@ -172,12 +172,10 @@ end
         f=string.(repeat('A':'C'; outer=6)))
 
     @testset "fulldummy" begin
-        @test_throws ArgumentError fulldummy(1)
-
         f = @formula(y ~ 1 + fulldummy(f))
         f1 = apply_schema(f, schema(dat))
         @test typeof(last(f1.rhs.terms)) <: FunctionTerm{typeof(fulldummy)}
-        @test_throws ArgumentError modelcols(f1, dat)
+        @test_throws MethodError modelcols(f1, dat)
 
         f2 = apply_schema(f, schema(dat), MixedModel)
         @test typeof(last(f2.rhs.terms)) <: CategoricalTerm{<:StatsModels.FullDummyCoding}
@@ -240,11 +238,6 @@ end
             @formula(0 ~ 1 + a / b), schema(d2), MixedModel
         )
 
-        # errors for too much nesting
-        @test_throws ArgumentError apply_schema(
-            @formula(0 ~ 1 + b / c / a), schema(d2), MixedModel
-        )
-
         # fitted model to test amalgamate and fnames, and equivalence with other formulations
         psts = dataset("pastes")
         m = fit(

Original file line number	Diff line number	Diff line change
`@@ -22,6 +22,7 @@ makedocs(;`
`22`	`22`	`"rankdeficiency.md",`
`23`	`23`	`"mime.md",`
`24`	`24`	`"derivatives.md",`
	`25`	`+ "formula_syntax.md",`
`25`	`26`	`"api.md",`
`26`	`27`	`],`
`27`	`28`	`)`