
Commit ca2b3e4

Make IdentityOperator default for KLMinRepGradDescent (#201)
* fix move `q_init` to front of arguments to algorithms' `init`
* fix docs missing `prob` argument in `init`
* fix add missing argument in docs link for `ParamSpaceSGD`
* fix add missing argument in `init(SubsampledObjective)`
* fix test missing `q_init` argument for `SubsampledObjective`
* fix missing `q_init` argument in tests for `SubsampledObjective`
* change default operator for `KLMinRepGradDescent`, add warning
* run formatter
* update history
* fix nowarn tests for `MvLocationScale` with `IdentityOperator`
* fix remove catch-all warning tests
* fix enable `ClipScale` operator in benchmarks
* add missing operator in tutorials
* fix wrong keyword argument for subsampling tutorial
* fix remove typedef calls in `IdentityOperator` warning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent: cb11ac4

17 files changed (+115, -24 lines)

HISTORY.md (3 additions, 0 deletions)

@@ -5,6 +5,9 @@
  The default parameters for the parameter-free optimizers `DoG` and `DoWG` have been changed.
  Now, the choice of parameters should be more invariant to dimension, so that convergence becomes faster than before on high-dimensional problems.

+ The default value of the `operator` keyword argument of `KLMinRepGradDescent` has been changed from `ClipScale` to `IdentityOperator`. This means that for variational families `<:MvLocationScale`, optimization may fail since nothing enforces the scale matrix to be positive definite.
+ Therefore, when a variational family `<:MvLocationScale` is used in combination with `IdentityOperator`, a warning message instructing the user to use `ClipScale` is displayed.
+
  ## Interface Changes

  An additional layer of indirection, `AbstractAlgorithms`, has been added.
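For users upgrading, a minimal migration sketch implied by this entry (the constructor calls mirror those appearing in the docs changes below; `ClipScale` and `IdentityOperator` are assumed to be exported as in the docstrings):

```julia
using ADTypes, ReverseDiff
using AdvancedVI

# Before this commit, the projection was applied implicitly:
#   KLMinRepGradDescent(ADTypes.AutoReverseDiff())   # operator defaulted to ClipScale()
# After this commit, the same call uses IdentityOperator(), so a <:MvLocationScale
# family must request the projection explicitly to keep its scale matrix positive definite:
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```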

README.md (12 additions, 4 deletions)

@@ -113,15 +113,23 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
  using ADTypes, ReverseDiff
  using AdvancedVI

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff());
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
  ```

  This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
  This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.

- `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
- If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
- For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:
+ Also, projection or proximal operators can be used through the keyword argument `operator`.
+ For this example, we will use a Gaussian variational family, which is part of the broader location-scale family.
+ These families require the scale matrix to have strictly positive eigenvalues at all times.
+ Here, the projection operator `ClipScale` ensures this.
+
+ `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` has gradients.
+ For this, it is straightforward to use `LogDensityProblemsAD`:
+
+ ```julia
+ using DifferentiationInterface: DifferentiationInterface
+ using LogDensityProblemsAD: LogDensityProblemsAD

  ```julia
  using DifferentiationInterface: DifferentiationInterface
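As a sketch of the `LogDensityProblemsAD` step referenced in the README change above (assuming `prob` is the user-defined `LogDensityProblem` from the README, and that `ADgradient` accepts an `ADTypes` backend as in recent `LogDensityProblemsAD` releases):

```julia
using ADTypes, ReverseDiff
using LogDensityProblemsAD: LogDensityProblemsAD

# Equip the target with a first-order differentiation capability so the
# reparameterization gradient can be computed through it.
prob_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), prob)
```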

bench/benchmarks.jl (1 addition, 1 deletion)

@@ -62,7 +62,7 @@ begin
      b = Bijectors.bijector(prob)
      binv = inverse(b)
      q = Bijectors.TransformedDistribution(family, binv)
-     alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy)
+     alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy, operator=ClipScale())

      SUITES[probname][objname][familyname][adname] = begin
          @benchmarkable AdvancedVI.optimize(

docs/src/families.md (1 addition, 1 deletion)

@@ -184,7 +184,7 @@ D = ones(n_dims)
  U = zeros(n_dims, 3)
  q0_lr = LowRankGaussian(μ, D, U)

- alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01))
+ alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())

  max_iter = 10^4

docs/src/paramspacesgd/repgradelbo.md (9 additions, 3 deletions)

@@ -191,7 +191,10 @@ binv = inverse(b)
  q0_trans = Bijectors.TransformedDistribution(q0, binv)

  cfe = KLMinRepGradDescent(
-     AutoReverseDiff(); entropy=ClosedFormEntropy(), optimizer=Adam(1e-2)
+     AutoReverseDiff();
+     entropy=ClosedFormEntropy(),
+     optimizer=Adam(1e-2),
+     operator=ClipScale(),
  )
  nothing
  ```

@@ -200,7 +203,10 @@ The repgradelbo estimator can instead be created as follows:

  ```@example repgradelbo
  stl = KLMinRepGradDescent(
-     AutoReverseDiff(); entropy=StickingTheLandingEntropy(), optimizer=Adam(1e-2)
+     AutoReverseDiff();
+     entropy=StickingTheLandingEntropy(),
+     optimizer=Adam(1e-2),
+     operator=ClipScale(),
  )
  nothing
  ```

@@ -317,7 +323,7 @@ nothing

  ```@setup repgradelbo
  _, info_qmc, _ = AdvancedVI.optimize(
-     KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2)),
+     KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
      max_iter,
      model,
      q0_trans;

docs/src/tutorials/basic.md (10 additions, 2 deletions)

@@ -111,15 +111,20 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
  using ADTypes, ReverseDiff
  using AdvancedVI

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale());
  nothing
  ```

  This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
  This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.
+
+ For certain algorithms such as `KLMinRepGradDescent`, projection or proximal operators can be used through the keyword argument `operator`.
+ For this example, we will use a Gaussian variational family, which is part of the broader [location-scale family](@ref locscale).
+ Location-scale family distributions require the scale matrix to have strictly positive eigenvalues at all times.
+ Here, the projection operator `ClipScale` ensures this.
+
  `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
  If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
-
  For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:

  [^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014, June). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*. PMLR.

@@ -143,6 +148,9 @@ q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))
  nothing
  ```

+ Now, `KLMinRepGradDescent` requires the variational approximation and the target log-density to have the same support.
+ Since `y` follows a log-normal prior, its support is restricted to the positive half-space ``\mathbb{R}_+``.
+ Thus, we will use [Bijectors](https://github.com/TuringLang/Bijectors.jl) to match the support of our target posterior and the variational approximation.
  The bijector can now be applied to `q` to match the support of the target problem.

  ```@example basic
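The support-matching step described in the tutorial change above follows the same pattern used elsewhere in this commit (for example in `bench/benchmarks.jl`). A minimal sketch, assuming `prob` is the tutorial's target `LogDensityProblem` (with a `Bijectors.bijector` method, as the benchmark problems have) and `q` is the Gaussian family constructed above:

```julia
using Bijectors

# Map between the constrained support of `prob` and the unconstrained space.
b    = Bijectors.bijector(prob)   # constrained -> unconstrained
binv = Bijectors.inverse(b)       # unconstrained -> constrained

# Wrap the Gaussian `q` (defined on the unconstrained space) so that its
# samples land on the support of the target.
q_trans = Bijectors.TransformedDistribution(q, binv)
```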

docs/src/tutorials/stan.md (1 addition, 1 deletion)

@@ -81,7 +81,7 @@ using LinearAlgebra
  using LogDensityProblems
  using Plots

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())

  d = LogDensityProblems.dimension(model)
  q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))

ext/AdvancedVIBijectorsExt.jl (22 additions, 0 deletions)

@@ -1,11 +1,33 @@
  module AdvancedVIBijectorsExt

  using AdvancedVI
+ using DiffResults: DiffResults
  using Bijectors
  using LinearAlgebra
  using Optimisers
  using Random

+ function AdvancedVI.init(
+     rng::Random.AbstractRNG,
+     alg::AdvancedVI.ParamSpaceSGD,
+     q_init::Bijectors.TransformedDistribution,
+     prob,
+ )
+     (; adtype, optimizer, averager, objective, operator) = alg
+     if q_init.dist isa AdvancedVI.MvLocationScale &&
+         operator isa AdvancedVI.IdentityOperator
+         @warn(
+             "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` in the algorithm instead.",
+         )
+     end
+     params, re = Optimisers.destructure(q_init)
+     opt_st = Optimisers.setup(optimizer, params)
+     obj_st = AdvancedVI.init(rng, objective, adtype, q_init, prob, params, re)
+     avg_st = AdvancedVI.init(averager, params)
+     grad_buf = DiffResults.DiffResult(zero(eltype(params)), similar(params))
+     return AdvancedVI.ParamSpaceSGDState(prob, q_init, 0, grad_buf, opt_st, obj_st, avg_st)
+ end
+
  function AdvancedVI.apply(
      op::ClipScale,
      ::Type{<:Bijectors.TransformedDistribution{<:AdvancedVI.MvLocationScale}},

src/algorithms/paramspacesgd/constructors.jl (9 additions, 3 deletions)

@@ -4,6 +4,9 @@

  KL divergence minimization by running stochastic gradient descent with the reparameterization gradient in the Euclidean space of variational parameters.

+ !!! note
+     For a `<:MvLocationScale` variational family, `IdentityOperator` should be avoided as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
  # Arguments
  - `adtype::ADTypes.AbstractADType`: Automatic differentiation backend.

@@ -12,7 +15,7 @@ KL divergence minimization by running stochastic gradient descent with the repar
  - `optimizer::Optimisers.AbstractRule`: Optimization algorithm to be used. (default: `DoWG()`)
  - `n_samples::Int`: Number of Monte Carlo samples to be used for estimating each gradient. (default: `1`)
  - `averager::AbstractAverager`: Parameter averaging strategy.
- - `operator::Union{<:IdentityOperator, <:ClipScale}`: Operator to be applied after each gradient descent step. (default: `ClipScale()`)
+ - `operator::AbstractOperator`: Operator to be applied after each gradient descent step. (default: `IdentityOperator()`)
  - `subsampling::Union{<:Nothing,<:AbstractSubsampling}`: Data point subsampling strategy. If `nothing`, subsampling is not used. (default: `nothing`)

  # Requirements

@@ -28,7 +31,7 @@ function KLMinRepGradDescent(
      optimizer::Optimisers.AbstractRule=DoWG(),
      n_samples::Int=1,
      averager::AbstractAverager=PolynomialAveraging(),
-     operator::Union{<:IdentityOperator,<:ClipScale}=ClipScale(),
+     operator::AbstractOperator=IdentityOperator(),
      subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
  )
      objective = if isnothing(subsampling)

@@ -90,6 +93,9 @@

  KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.

+ !!! note
+     If a `<:MvLocationScale` variational family is used, `IdentityOperator` should be avoided as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
  # Arguments
  - `adtype`: Automatic differentiation backend.

@@ -111,7 +117,7 @@ function KLMinScoreGradDescent(
      optimizer::Union{<:Descent,<:DoG,<:DoWG}=DoWG(),
      n_samples::Int=1,
      averager::AbstractAverager=PolynomialAveraging(),
-     operator::Union{<:IdentityOperator,<:ClipScale}=IdentityOperator(),
+     operator::AbstractOperator=IdentityOperator(),
      subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
  )
      objective = if isnothing(subsampling)
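A short sketch of what the relaxed `operator` keyword in these docstrings looks like at the call site. This assumes, as the `# Arguments` lists above suggest, that both constructors take the AD backend as their first positional argument and that `ClipScale` and `IdentityOperator` are exported:

```julia
using ADTypes, ReverseDiff
using AdvancedVI

adtype = ADTypes.AutoReverseDiff()

# Both constructors now default to operator=IdentityOperator().
alg_rep   = KLMinRepGradDescent(adtype)
alg_score = KLMinScoreGradDescent(adtype)

# For a <:MvLocationScale family, pass the projection explicitly so the
# scale matrix stays positive definite during optimization.
alg_rep_ls   = KLMinRepGradDescent(adtype; operator=ClipScale())
alg_score_ls = KLMinScoreGradDescent(adtype; operator=ClipScale())
```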

src/algorithms/paramspacesgd/paramspacesgd.jl (6 additions, 1 deletion)

@@ -65,7 +65,12 @@ struct ParamSpaceSGDState{P,Q,GradBuf,OptSt,ObjSt,AvgSt}
  end

  function init(rng::Random.AbstractRNG, alg::ParamSpaceSGD, q_init, prob)
-     (; adtype, optimizer, averager, objective) = alg
+     (; adtype, optimizer, averager, objective, operator) = alg
+     if q_init isa AdvancedVI.MvLocationScale && operator isa AdvancedVI.IdentityyOperator
+         @warn(
+             "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` in the algorithm instead.",
+         )
+     end
      params, re = Optimisers.destructure(q_init)
      opt_st = Optimisers.setup(optimizer, params)
      obj_st = init(rng, objective, adtype, q_init, prob, params, re)
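A hypothetical, self-contained sketch of the setup under which the warning added above would fire. `ToyTarget` is an illustrative stand-in and not part of the commit; the `optimize` call is left commented out because, as the warning says, running without `ClipScale` may eventually fail with a singular scale matrix:

```julia
using ADTypes, ReverseDiff
using AdvancedVI
using LinearAlgebra
using LogDensityProblems
using LogDensityProblemsAD: LogDensityProblemsAD

# A toy standard-normal target, wrapped to provide first-order capability.
struct ToyTarget end
LogDensityProblems.logdensity(::ToyTarget, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyTarget) = 2
LogDensityProblems.capabilities(::Type{ToyTarget}) = LogDensityProblems.LogDensityOrder{0}()

prob = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), ToyTarget())

d = LogDensityProblems.dimension(prob)
q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37 * I, d, d)))

# `operator` now defaults to IdentityOperator(), and `q` is <:MvLocationScale,
# so initialization is expected to emit the @warn added in this commit:
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
# q_avg, info, state = AdvancedVI.optimize(alg, 10^3, prob, q)
```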
