Merged
22 commits
0d30400
fix move `q_init` to front of arguments to algorithms `init`
Red-Portal Aug 26, 2025
a5d87c4
fix docs missing `prob` argument in `init`
Red-Portal Aug 26, 2025
2813724
fix add missing argument in docs link for `ParamSpaceSGD`
Red-Portal Aug 26, 2025
679d2fb
fix add missing argument in `init(SubsampledObjective)`
Red-Portal Aug 26, 2025
17cf6a1
fix test missing `q_init` argument for `SubsampledObjective`
Red-Portal Aug 26, 2025
37031db
fix missing `q_init` argument in tests for `SubsampledObjective`
Red-Portal Aug 26, 2025
164f0b1
Merge branch 'main' of github.com:TuringLang/AdvancedVI.jl into chang…
Red-Portal Aug 26, 2025
8f81a43
change default operator for `KLMinRepGradDescent`, add warning
Red-Portal Aug 26, 2025
11f16c9
run formatter
Red-Portal Aug 26, 2025
e75b319
run formatter
Red-Portal Aug 26, 2025
a630152
run formatter
Red-Portal Aug 26, 2025
049a783
run formatter
Red-Portal Aug 26, 2025
34b08a0
run formatter
Red-Portal Aug 26, 2025
ae95b22
update history
Red-Portal Aug 26, 2025
96da115
Merge branch 'klminrepgradescent_default_identity' of github.com:Turi…
Red-Portal Aug 26, 2025
0a7e26f
fix nowarn tests for `MvLocationScale` with `IdentityOperator`
Red-Portal Aug 26, 2025
2a0c08e
fix remove catch-all warning tests
Red-Portal Aug 26, 2025
59fcbfe
fix enable `ClipScale` operator in benchmarks
Red-Portal Aug 26, 2025
5808d7a
Merge branch 'main' of github.com:TuringLang/AdvancedVI.jl into klmin…
Red-Portal Sep 13, 2025
95c6c2f
add missing operator in tutorials
Red-Portal Sep 13, 2025
89792f5
fix wrong keyword argument for subsampling tutorial
Red-Portal Sep 13, 2025
b5f30b6
fix remove typedef calls in IdentityOperator warning
Red-Portal Sep 14, 2025
3 changes: 3 additions & 0 deletions HISTORY.md
@@ -5,6 +5,9 @@
The default parameters for the parameter-free optimizers `DoG` and `DoWG` have been changed.
The defaults should now be less sensitive to the problem dimension, so convergence should be faster than before on high-dimensional problems.

The default value of the `operator` keyword argument of `KLMinRepGradDescent` has been changed from `ClipScale` to `IdentityOperator`. This means that for variational families `<:MvLocationScale`, optimization may fail since nothing enforces the scale matrix to remain positive definite.
Therefore, whenever a `<:MvLocationScale` variational family is used in combination with `IdentityOperator`, a warning message instructing the user to use `ClipScale` will be displayed.

## Interface Changes

An additional layer of indirection, `AbstractAlgorithms`, has been added.
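As a hedged illustration of the change described above (not part of the diff): with the new default, the projection must be requested explicitly when a location-scale family is used. All names below appear elsewhere in this PR; the snippet is only a sketch.

```julia
using ADTypes, ReverseDiff
using AdvancedVI

# New default: no operator is applied after each gradient step.
alg_default = KLMinRepGradDescent(ADTypes.AutoReverseDiff())

# For `<:MvLocationScale` families (e.g. `MeanFieldGaussian`), pass `ClipScale`
# explicitly to keep the scale matrix positive definite, as before this change.
alg_clipped = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```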
13 changes: 10 additions & 3 deletions README.md
@@ -106,16 +106,23 @@ For the VI algorithm, we will use the following:
using ADTypes, ReverseDiff
using AdvancedVI

alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```

This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.

Also, projection or proximal operators can be used through the keyword argument `operator`.
For this example, we will use a Gaussian variational family, which belongs to the broader location-scale family.
Location-scale distributions require the scale matrix to have strictly positive eigenvalues at all times.
Here, the projection operator `ClipScale` ensures this.

`KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` provides gradients.
For this, it is straightforward to use `LogDensityProblemsAD`:

```julia
import DifferentiationInterface
import LogDensityProblemsAD
using DifferentiationInterface: DifferentiationInterface
using LogDensityProblemsAD: LogDensityProblemsAD

model_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), model)
```
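For context, a hedged end-to-end sketch of the README snippets above; the `ToyGaussian` target and its dimension are illustrative stand-ins for the README's `model`, not part of the diff.

```julia
using ADTypes, ReverseDiff
using AdvancedVI
using LinearAlgebra
using LogDensityProblems
using LogDensityProblemsAD: LogDensityProblemsAD

# Illustrative target: an isotropic Gaussian in 5 dimensions.
struct ToyGaussian end
LogDensityProblems.logdensity(::ToyGaussian, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian) = 5
LogDensityProblems.capabilities(::Type{ToyGaussian}) = LogDensityProblems.LogDensityOrder{0}()

model    = ToyGaussian()
model_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), model)

# Mean-field Gaussian family; `ClipScale` keeps its scale strictly positive.
q0  = MeanFieldGaussian(zeros(5), Diagonal(ones(5)))
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())

_, info, _ = AdvancedVI.optimize(alg, 10^3, model_ad, q0; show_progress=false)
last(info).elbo  # expected to be finite
```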
4 changes: 1 addition & 3 deletions docs/make.jl
@@ -16,9 +16,7 @@ makedocs(;
pages=[
"AdvancedVI" => "index.md",
"General Usage" => "general.md",
"Tutorials" => [
"tutorials/basic.md",
],
"Tutorials" => ["tutorials/basic.md"],
"Algorithms" => [
"KLMinRepGradDescent" => "paramspacesgd/klminrepgraddescent.md",
"KLMinRepGradProxDescent" => "paramspacesgd/klminrepgradproxdescent.md",
2 changes: 1 addition & 1 deletion docs/src/families.md
@@ -184,7 +184,7 @@ D = ones(n_dims)
U = zeros(n_dims, 3)
q0_lr = LowRankGaussian(μ, D, U)

alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01))
alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())

max_iter = 10^4

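A hedged continuation of the low-rank example above, made self-contained so it can run on its own; `ToyGaussian10` is a hypothetical stand-in for the target `model` defined earlier on that docs page.

```julia
using ADTypes, ReverseDiff
using AdvancedVI, Optimisers
using LogDensityProblems

# Hypothetical 10-dimensional target standing in for the page's `model`.
struct ToyGaussian10 end
LogDensityProblems.logdensity(::ToyGaussian10, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian10) = 10
LogDensityProblems.capabilities(::Type{ToyGaussian10}) = LogDensityProblems.LogDensityOrder{0}()

n_dims = 10
μ = zeros(n_dims)
D = ones(n_dims)
U = zeros(n_dims, 3)
q0_lr = LowRankGaussian(μ, D, U)

alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())
_, info, _ = AdvancedVI.optimize(alg, 10^4, ToyGaussian10(), q0_lr; show_progress=false)
last(info).elbo
```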
12 changes: 9 additions & 3 deletions docs/src/paramspacesgd/repgradelbo.md
@@ -192,7 +192,10 @@ binv = inverse(b)
q0_trans = Bijectors.TransformedDistribution(q0, binv)

cfe = KLMinRepGradDescent(
AutoReverseDiff(); entropy=ClosedFormEntropy(), optimizer=Adam(1e-2)
AutoReverseDiff();
entropy=ClosedFormEntropy(),
optimizer=Adam(1e-2),
operator=ClipScale(),
)
nothing
```
Expand All @@ -201,7 +204,10 @@ The repgradelbo estimator can instead be created as follows:

```@example repgradelbo
stl = KLMinRepGradDescent(
AutoReverseDiff(); entropy=StickingTheLandingEntropy(), optimizer=Adam(1e-2)
AutoReverseDiff();
entropy=StickingTheLandingEntropy(),
optimizer=Adam(1e-2),
operator=ClipScale(),
)
nothing
```
@@ -318,7 +324,7 @@ nothing

```@setup repgradelbo
_, info_qmc, _ = AdvancedVI.optimize(
KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2)),
KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
max_iter,
model,
q0_trans;
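Aside (a sketch, not part of the diff): the two constructions above differ only in the `entropy` estimator; the optimizer and the newly explicit `ClipScale` projection are shared. Assuming the same packages as that docs page:

```julia
using ADTypes, ReverseDiff
using AdvancedVI, Optimisers

# Only the entropy gradient estimator differs between the two algorithms;
# the optimizer and the `ClipScale` projection are shared.
make_alg(entropy) = KLMinRepGradDescent(
    AutoReverseDiff(); entropy=entropy, optimizer=Adam(1e-2), operator=ClipScale()
)

cfe = make_alg(ClosedFormEntropy())          # closed-form entropy
stl = make_alg(StickingTheLandingEntropy())  # "sticking the landing" estimator
```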
7 changes: 6 additions & 1 deletion docs/src/tutorials/basic.md
@@ -74,10 +74,15 @@ Here, we will use `ReverseDiff`, which can be selected by later passing `ADTypes
using ADTypes, ReverseDiff
using AdvancedVI

alg = KLMinRepGradDescent(AutoReverseDiff());
alg = KLMinRepGradDescent(AutoReverseDiff(); operator=ClipScale());
nothing
```

Projection or proximal operators can be used through the keyword argument `operator`.
For this example, we will use a Gaussian variational family, which belongs to the broader [location-scale family](@ref locscale).
Location-scale family distributions require the scale matrix to have strictly positive eigenvalues at all times.
Here, the projection operator `ClipScale` ensures this.

Now, `KLMinRepGradDescent` requires the variational approximation and the target log-density to have the same support.
Since `y` follows a log-normal prior, its support is restricted to the positive half-space ``\mathbb{R}_+``.
Thus, we will use [Bijectors](https://github.com/TuringLang/Bijectors.jl) to match the support of our target posterior and the variational approximation.
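A minimal, hedged sketch of the support-matching step referenced above. To stay self-contained it uses the `identity` transformation (as this PR's tests do); in the tutorial itself the transformation would be `inverse(Bijectors.bijector(model))`, which maps the unconstrained Gaussian onto the target's support.

```julia
using AdvancedVI, Bijectors, LinearAlgebra

# Wrap a location-scale family in a change of variables. `identity` keeps this
# snippet runnable on its own; the tutorial would use the inverse bijector of the
# model so that samples land in the target's (positive) support.
q0       = MeanFieldGaussian(zeros(2), Diagonal(ones(2)))
q0_trans = Bijectors.transformed(q0, identity)

# `optimize` then receives `q0_trans` in place of `q0`.
```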
25 changes: 25 additions & 0 deletions ext/AdvancedVIBijectorsExt.jl
@@ -1,11 +1,36 @@
module AdvancedVIBijectorsExt

using AdvancedVI
using DiffResults: DiffResults
using Bijectors
using LinearAlgebra
using Optimisers
using Random

function AdvancedVI.init(
rng::Random.AbstractRNG,
alg::AdvancedVI.ParamSpaceSGD,
q_init::Bijectors.TransformedDistribution,
prob,
)
(; adtype, optimizer, averager, objective, operator) = alg
if q_init.dist isa AdvancedVI.MvLocationScale &&
operator isa AdvancedVI.IdentityOperator
@warn(
"IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` instead.",
typeof(q_init),
typeof(q_init.dist),
typeof(operator)
)
end
params, re = Optimisers.destructure(q_init)
opt_st = Optimisers.setup(optimizer, params)
obj_st = AdvancedVI.init(rng, objective, adtype, q_init, prob, params, re)
avg_st = AdvancedVI.init(averager, params)
grad_buf = DiffResults.DiffResult(zero(eltype(params)), similar(params))
return AdvancedVI.ParamSpaceSGDState(prob, q_init, 0, grad_buf, opt_st, obj_st, avg_st)
end

function AdvancedVI.apply(
op::ClipScale,
::Type{<:Bijectors.TransformedDistribution{<:AdvancedVI.MvLocationScale}},
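To illustrate the branch added above (a sketch mirroring the PR's tests; `ToyGaussian2` is a hypothetical target so the snippet runs on its own): combining a transformed location-scale family with the default `IdentityOperator` triggers the warning, while passing `ClipScale` does not.

```julia
using ADTypes, ReverseDiff, Random
using AdvancedVI, Bijectors, LinearAlgebra
using LogDensityProblems

# Hypothetical 2-dimensional target.
struct ToyGaussian2 end
LogDensityProblems.logdensity(::ToyGaussian2, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian2) = 2
LogDensityProblems.capabilities(::Type{ToyGaussian2}) = LogDensityProblems.LogDensityOrder{0}()

q0       = MeanFieldGaussian(zeros(2), Diagonal(ones(2)))
q0_trans = Bijectors.transformed(q0, identity)

alg_warn  = KLMinRepGradDescent(AutoReverseDiff())                        # warns
alg_quiet = KLMinRepGradDescent(AutoReverseDiff(); operator=ClipScale())  # silent

rng = Random.default_rng()
AdvancedVI.optimize(rng, alg_warn, 1, ToyGaussian2(), q0_trans; show_progress=false)
AdvancedVI.optimize(rng, alg_quiet, 1, ToyGaussian2(), q0_trans; show_progress=false)
```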
12 changes: 9 additions & 3 deletions src/algorithms/paramspacesgd/constructors.jl
@@ -4,6 +4,9 @@

KL divergence minimization by running stochastic gradient descent with the reparameterization gradient in the Euclidean space of variational parameters.

!!! note
For a `<:MvLocationScale` variational family, `IdentityOperator` should be avoided for `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).

# Arguments
- `adtype::ADTypes.AbstractADType`: Automatic differentiation backend.

@@ -12,7 +15,7 @@ KL divergence minimization by running stochastic gradient descent with the repar
- `optimizer::Optimisers.AbstractRule`: Optimization algorithm to be used. (default: `DoWG()`)
- `n_samples::Int`: Number of Monte Carlo samples to be used for estimating each gradient. (default: `1`)
- `averager::AbstractAverager`: Parameter averaging strategy.
- `operator::Union{<:IdentityOperator, <:ClipScale}`: Operator to be applied after each gradient descent step. (default: `ClipScale()`)
- `operator::AbstractOperator`: Operator to be applied after each gradient descent step. (default: `IdentityOperator()`)
- `subsampling::Union{<:Nothing,<:AbstractSubsampling}`: Data point subsampling strategy. If `nothing`, subsampling is not used. (default: `nothing`)

# Requirements
Expand All @@ -28,7 +31,7 @@ function KLMinRepGradDescent(
optimizer::Optimisers.AbstractRule=DoWG(),
n_samples::Int=1,
averager::AbstractAverager=PolynomialAveraging(),
operator::Union{<:IdentityOperator,<:ClipScale}=ClipScale(),
operator::AbstractOperator=IdentityOperator(),
subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
)
objective = if isnothing(subsampling)
@@ -90,6 +93,9 @@ end

KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.

!!! note
If a `<:MvLocationScale` variational family is used, `IdentityOperator` should be avoided for `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).

# Arguments
- `adtype`: Automatic differentiation backend.

@@ -111,7 +117,7 @@ function KLMinScoreGradDescent(
optimizer::Union{<:Descent,<:DoG,<:DoWG}=DoWG(),
n_samples::Int=1,
averager::AbstractAverager=PolynomialAveraging(),
operator::Union{<:IdentityOperator,<:ClipScale}=IdentityOperator(),
operator::AbstractOperator=IdentityOperator(),
subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
)
objective = if isnothing(subsampling)
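For reference, a hedged sketch spelling out the documented keyword arguments with explicit values. The values are the defaults listed in the docstrings above, except that `ClipScale` is substituted for the new `IdentityOperator` default, as the docstring notes recommend for location-scale families.

```julia
using ADTypes, ReverseDiff
using AdvancedVI

alg_rep = KLMinRepGradDescent(
    AutoReverseDiff();
    optimizer=DoWG(),                 # default optimizer
    n_samples=1,                      # Monte Carlo samples per gradient estimate
    averager=PolynomialAveraging(),   # parameter averaging strategy
    operator=ClipScale(),             # recommended for <:MvLocationScale families
    subsampling=nothing,              # no data point subsampling
)

alg_score = KLMinScoreGradDescent(
    AutoReverseDiff();
    optimizer=DoWG(),
    n_samples=10,
    averager=PolynomialAveraging(),
    operator=ClipScale(),
)
```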
9 changes: 8 additions & 1 deletion src/algorithms/paramspacesgd/paramspacesgd.jl
@@ -65,7 +65,14 @@ struct ParamSpaceSGDState{P,Q,GradBuf,OptSt,ObjSt,AvgSt}
end

function init(rng::Random.AbstractRNG, alg::ParamSpaceSGD, q_init, prob)
(; adtype, optimizer, averager, objective) = alg
(; adtype, optimizer, averager, objective, operator) = alg
if q_init isa AdvancedVI.MvLocationScale && operator isa AdvancedVI.IdentityOperator
@warn(
"IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` instead.",
typeof(q_init),
typeof(operator)
)
end
params, re = Optimisers.destructure(q_init)
opt_st = Optimisers.setup(optimizer, params)
obj_st = init(rng, objective, adtype, q_init, prob, params, re)
3 changes: 1 addition & 2 deletions src/algorithms/paramspacesgd/repgradelbo.jl
@@ -47,7 +47,6 @@ function init(
params,
restructure,
)
q_stop = q
capability = LogDensityProblems.capabilities(typeof(prob))
ad_prob = if capability < LogDensityProblems.LogDensityOrder{1}()
@info "The capability of the supplied `LogDensityProblem` $(capability) is less than $(LogDensityProblems.LogDensityOrder{1}()). `AdvancedVI` will attempt to directly differentiate through `LogDensityProblems.logdensity`. If this is not intended, please supply a log-density problem with capability at least $(LogDensityProblems.LogDensityOrder{1}())"
@@ -67,7 +66,7 @@
obj=obj,
problem=ad_prob,
restructure=restructure,
q_stop=q_stop,
q_stop=q,
)
obj_ad_prep = AdvancedVI._prepare_gradient(
estimate_repgradelbo_ad_forward, adtype, params, aux
18 changes: 17 additions & 1 deletion test/algorithms/paramspacesgd/repgradelbo.jl
@@ -8,17 +8,33 @@
(; model, μ_true, L_true, n_dims, is_meanfield) = modelstats

q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
q0_trans = Bijectors.transformed(q0, identity)

@testset "basic" begin
@testset for n_montecarlo in [1, 10]
alg = KLMinRepGradDescent(
AD;
n_samples=n_montecarlo,
operator=IdentityOperator(),
operator=ClipScale(),
averager=PolynomialAveraging(),
)

_, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
@test isfinite(last(info).elbo)

_, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
@test isfinite(last(info).elbo)
end
end

@testset "warn MvLocationScale with IdentityOperator" begin
@test_warn "IdentityOperator" begin
alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0; show_progress=false)
end
@test_warn "IdentityOperator" begin
alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0_trans; show_progress=false)
end
end

2 changes: 1 addition & 1 deletion test/algorithms/paramspacesgd/repgradelbo_locationscale.jl
@@ -16,7 +16,7 @@

T = 1000
η = 1e-3
alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())

q0 = if is_meanfield
MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
@@ -16,7 +16,7 @@

T = 1000
η = 1e-3
alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())

b = Bijectors.bijector(model)
b⁻¹ = inverse(b)
20 changes: 19 additions & 1 deletion test/algorithms/paramspacesgd/scoregradelbo.jl
@@ -7,12 +7,30 @@
(; model, μ_true, L_true, n_dims, is_meanfield) = modelstats

q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
q0_trans = Bijectors.transformed(q0, identity)

@testset "basic" begin
@testset for n_montecarlo in [1, 10]
alg = KLMinScoreGradDescent(AD; n_samples=n_montecarlo, optimizer=Descent(1e-5))
alg = KLMinScoreGradDescent(
AD; n_samples=n_montecarlo, operator=ClipScale(), optimizer=Descent(1e-5)
)

_, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
@assert isfinite(last(info).elbo)

_, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
@assert isfinite(last(info).elbo)
end
end

@testset "warn MvLocationScale with IdentityOperator" begin
@test_warn "IdentityOperator" begin
alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0; show_progress=false)
end
@test_warn "IdentityOperator" begin
alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0_trans; show_progress=false)
end
end

@@ -12,7 +12,7 @@
T = 1000
η = 1e-4
opt = Optimisers.Descent(η)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())

q0 = if is_meanfield
MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
@@ -12,7 +12,7 @@
T = 1000
η = 1e-4
opt = Optimisers.Descent(η)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())

b = Bijectors.bijector(model)
b⁻¹ = inverse(b)