diff --git a/HISTORY.md b/HISTORY.md
index 46a2f13ff..c1a9bb010 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -5,6 +5,9 @@
 The default parameters for the parameter-free optimizers `DoG` and `DoWG` has been changed.
 Now, the choice of parameter should be more invariant to dimension such that convergence will become faster than before on high dimensional problems.
 
+The default value of the `operator` keyword argument of `KLMinRepGradDescent` has been changed from `ClipScale` to `IdentityOperator`. This means that for variational families `<:MvLocationScale`, optimization may fail since there is nothing enforcing the scale matrix to be positive definite.
+Therefore, if a variational family `<:MvLocationScale` is used in combination with `IdentityOperator`, a warning message instructing the user to use `ClipScale` will be displayed.
+
 ## Interface Changes
 
 An additional layer of indirection, `AbstractAlgorithms` has been added.
diff --git a/README.md b/README.md
index c53de30f2..7ca99adaa 100644
--- a/README.md
+++ b/README.md
@@ -113,15 +113,23 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
 using ADTypes, ReverseDiff
 using AdvancedVI
 
-alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff());
+alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
 ```
 
 This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
 This is also commonly referred as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.
 
-`KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
-If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
-For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:
+Also, projection or proximal operators can be used through the keyword argument `operator`.
+For this example, we will use a Gaussian variational family, which is part of the broader location-scale family.
+These distributions require the scale matrix to have strictly positive eigenvalues at all times.
+Here, the projection operator `ClipScale` ensures this.
+
+`KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` has gradients.
+For this, it is straightforward to use `LogDensityProblemsAD`:
+
+```julia
+using DifferentiationInterface: DifferentiationInterface
+using LogDensityProblemsAD: LogDensityProblemsAD
 
 ```julia
 using DifferentiationInterface: DifferentiationInterface
diff --git a/bench/benchmarks.jl b/bench/benchmarks.jl
index 6808a4742..ad40b3534 100644
--- a/bench/benchmarks.jl
+++ b/bench/benchmarks.jl
@@ -62,7 +62,7 @@ begin
         b = Bijectors.bijector(prob)
         binv = inverse(b)
         q = Bijectors.TransformedDistribution(family, binv)
-        alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy)
+        alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy, operator=ClipScale())
 
         SUITES[probname][objname][familyname][adname] = begin
             @benchmarkable AdvancedVI.optimize(
diff --git a/docs/src/families.md b/docs/src/families.md
index 541744a00..761769f3a 100644
--- a/docs/src/families.md
+++ b/docs/src/families.md
@@ -184,7 +184,7 @@ D = ones(n_dims)
 U = zeros(n_dims, 3)
 q0_lr = LowRankGaussian(μ, D, U)
 
-alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01))
+alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())
 
 max_iter = 10^4
diff --git a/docs/src/paramspacesgd/repgradelbo.md b/docs/src/paramspacesgd/repgradelbo.md
index 04db2ff52..f61940be1 100644
--- a/docs/src/paramspacesgd/repgradelbo.md
+++ b/docs/src/paramspacesgd/repgradelbo.md
@@ -191,7 +191,10 @@ binv = inverse(b)
 q0_trans = Bijectors.TransformedDistribution(q0, binv)
 
 cfe = KLMinRepGradDescent(
-    AutoReverseDiff(); entropy=ClosedFormEntropy(), optimizer=Adam(1e-2)
+    AutoReverseDiff();
+    entropy=ClosedFormEntropy(),
+    optimizer=Adam(1e-2),
+    operator=ClipScale(),
 )
 nothing
 ```
@@ -200,7 +203,10 @@ The repgradelbo estimator can instead be created as follows:
 
 ```@example repgradelbo
 stl = KLMinRepGradDescent(
-    AutoReverseDiff(); entropy=StickingTheLandingEntropy(), optimizer=Adam(1e-2)
+    AutoReverseDiff();
+    entropy=StickingTheLandingEntropy(),
+    optimizer=Adam(1e-2),
+    operator=ClipScale(),
 )
 nothing
 ```
@@ -317,7 +323,7 @@ nothing
 ```@setup repgradelbo
 _, info_qmc, _ = AdvancedVI.optimize(
-    KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2)),
+    KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
     max_iter,
     model,
     q0_trans;
diff --git a/docs/src/tutorials/basic.md b/docs/src/tutorials/basic.md
index 1c0dfa138..1b61528b5 100644
--- a/docs/src/tutorials/basic.md
+++ b/docs/src/tutorials/basic.md
@@ -111,15 +111,20 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
 using ADTypes, ReverseDiff
 using AdvancedVI
 
-alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale());
 nothing
 ```
 
 This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
 This is also commonly referred as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.
+
+For certain algorithms such as `KLMinRepGradDescent`, projection or proximal operators can be used through the keyword argument `operator`.
+For this example, we will use a Gaussian variational family, which is part of the broader [location-scale family](@ref locscale).
+Location-scale family distributions require the scale matrix to have strictly positive eigenvalues at all times.
+Here, the projection operator `ClipScale` ensures this.
+
 `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
 If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
-For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:
 
 [^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014, June). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*. PMLR.
@@ -143,6 +148,9 @@ q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))
 nothing
 ```
 
+Now, `KLMinRepGradDescent` requires the variational approximation and the target log-density to have the same support.
+Since `y` follows a log-normal prior, its support is restricted to the positive half-space ``\mathbb{R}_+``.
+Thus, we will use [Bijectors](https://github.com/TuringLang/Bijectors.jl) to match the supports of the target posterior and the variational approximation.
 The bijector can now be applied to `q` to match the support of the target problem.
 
 ```@example basic
diff --git a/docs/src/tutorials/stan.md b/docs/src/tutorials/stan.md
index 5c32eac83..1890675d6 100644
--- a/docs/src/tutorials/stan.md
+++ b/docs/src/tutorials/stan.md
@@ -81,7 +81,7 @@ using LinearAlgebra
 using LogDensityProblems
 using Plots
 
-alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
 
 d = LogDensityProblems.dimension(model)
 q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))
diff --git a/ext/AdvancedVIBijectorsExt.jl b/ext/AdvancedVIBijectorsExt.jl
index 67818e94e..a28dc9d0f 100644
--- a/ext/AdvancedVIBijectorsExt.jl
+++ b/ext/AdvancedVIBijectorsExt.jl
@@ -1,11 +1,33 @@
 module AdvancedVIBijectorsExt
 
 using AdvancedVI
+using DiffResults: DiffResults
 using Bijectors
 using LinearAlgebra
 using Optimisers
 using Random
 
+function AdvancedVI.init(
+    rng::Random.AbstractRNG,
+    alg::AdvancedVI.ParamSpaceSGD,
+    q_init::Bijectors.TransformedDistribution,
+    prob,
+)
+    (; adtype, optimizer, averager, objective, operator) = alg
+    if q_init.dist isa AdvancedVI.MvLocationScale &&
+        operator isa AdvancedVI.IdentityOperator
+        @warn(
+            "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the `ClipScale` operator instead.",
+        )
+    end
+    params, re = Optimisers.destructure(q_init)
+    opt_st = Optimisers.setup(optimizer, params)
+    obj_st = AdvancedVI.init(rng, objective, adtype, q_init, prob, params, re)
+    avg_st = AdvancedVI.init(averager, params)
+    grad_buf = DiffResults.DiffResult(zero(eltype(params)), similar(params))
+    return AdvancedVI.ParamSpaceSGDState(prob, q_init, 0, grad_buf, opt_st, obj_st, avg_st)
+end
+
 function AdvancedVI.apply(
     op::ClipScale,
     ::Type{<:Bijectors.TransformedDistribution{<:AdvancedVI.MvLocationScale}},
diff --git a/src/algorithms/paramspacesgd/constructors.jl b/src/algorithms/paramspacesgd/constructors.jl
index ddfb564f0..2ec0ae41b 100644
--- a/src/algorithms/paramspacesgd/constructors.jl
+++ b/src/algorithms/paramspacesgd/constructors.jl
@@ -4,6 +4,9 @@
 KL divergence minimization by running stochastic gradient descent with the reparameterization gradient in the Euclidean space of variational parameters.
 
+!!! note
+    For a `<:MvLocationScale` variational family, avoid using `IdentityOperator` as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
 # Arguments
 - `adtype::ADTypes.AbstractADType`: Automatic differentiation backend.
@@ -12,7 +15,7 @@ KL divergence minimization by running stochastic gradient descent with the repar
 - `optimizer::Optimisers.AbstractRule`: Optimization algorithm to be used. (default: `DoWG()`)
 - `n_samples::Int`: Number of Monte Carlo samples to be used for estimating each gradient. (default: `1`)
 - `averager::AbstractAverager`: Parameter averaging strategy.
-- `operator::Union{<:IdentityOperator, <:ClipScale}`: Operator to be applied after each gradient descent step. (default: `ClipScale()`)
+- `operator::AbstractOperator`: Operator to be applied after each gradient descent step. (default: `IdentityOperator()`)
 - `subsampling::Union{<:Nothing,<:AbstractSubsampling}`: Data point subsampling strategy. If `nothing`, subsampling is not used. (default: `nothing`)
 
 # Requirements
@@ -28,7 +31,7 @@ function KLMinRepGradDescent(
     adtype::ADTypes.AbstractADType;
     optimizer::Optimisers.AbstractRule=DoWG(),
     n_samples::Int=1,
     averager::AbstractAverager=PolynomialAveraging(),
-    operator::Union{<:IdentityOperator,<:ClipScale}=ClipScale(),
+    operator::AbstractOperator=IdentityOperator(),
     subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
 )
     objective = if isnothing(subsampling)
@@ -90,6 +93,9 @@ end
 
 KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.
 
+!!! note
+    If a `<:MvLocationScale` variational family is used, avoid using `IdentityOperator` as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
 # Arguments
 
 - `adtype`: Automatic differentiation backend.
@@ -111,7 +117,7 @@ function KLMinScoreGradDescent(
     optimizer::Union{<:Descent,<:DoG,<:DoWG}=DoWG(),
     n_samples::Int=1,
     averager::AbstractAverager=PolynomialAveraging(),
-    operator::Union{<:IdentityOperator,<:ClipScale}=IdentityOperator(),
+    operator::AbstractOperator=IdentityOperator(),
     subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
 )
     objective = if isnothing(subsampling)
diff --git a/src/algorithms/paramspacesgd/paramspacesgd.jl b/src/algorithms/paramspacesgd/paramspacesgd.jl
index 8e2144b21..bc2dd3195 100644
--- a/src/algorithms/paramspacesgd/paramspacesgd.jl
+++ b/src/algorithms/paramspacesgd/paramspacesgd.jl
@@ -65,7 +65,12 @@ struct ParamSpaceSGDState{P,Q,GradBuf,OptSt,ObjSt,AvgSt}
 end
 
 function init(rng::Random.AbstractRNG, alg::ParamSpaceSGD, q_init, prob)
-    (; adtype, optimizer, averager, objective) = alg
+    (; adtype, optimizer, averager, objective, operator) = alg
+    if q_init isa AdvancedVI.MvLocationScale && operator isa AdvancedVI.IdentityOperator
+        @warn(
+            "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the `ClipScale` operator instead.",
+        )
+    end
     params, re = Optimisers.destructure(q_init)
     opt_st = Optimisers.setup(optimizer, params)
     obj_st = init(rng, objective, adtype, q_init, prob, params, re)
diff --git a/src/algorithms/paramspacesgd/repgradelbo.jl b/src/algorithms/paramspacesgd/repgradelbo.jl
index 1b1184b20..a1cdcbd92 100644
--- a/src/algorithms/paramspacesgd/repgradelbo.jl
+++ b/src/algorithms/paramspacesgd/repgradelbo.jl
@@ -47,7 +47,6 @@ function init(
     params,
     restructure,
 )
-    q_stop = q
     capability = LogDensityProblems.capabilities(typeof(prob))
     ad_prob = if capability < LogDensityProblems.LogDensityOrder{1}()
         @info "The capability of the supplied `LogDensityProblem` $(capability) is less than $(LogDensityProblems.LogDensityOrder{1}()). `AdvancedVI` will attempt to directly differentiate through `LogDensityProblems.logdensity`. If this is not intended, please supply a log-density problem with capability at least $(LogDensityProblems.LogDensityOrder{1}())"
@@ -67,7 +66,7 @@ function init(
         obj=obj,
         problem=ad_prob,
         restructure=restructure,
-        q_stop=q_stop,
+        q_stop=q,
     )
     obj_ad_prep = AdvancedVI._prepare_gradient(
         estimate_repgradelbo_ad_forward, adtype, params, aux
diff --git a/test/algorithms/paramspacesgd/repgradelbo.jl b/test/algorithms/paramspacesgd/repgradelbo.jl
index 45f22c0ad..29c9b7f73 100644
--- a/test/algorithms/paramspacesgd/repgradelbo.jl
+++ b/test/algorithms/paramspacesgd/repgradelbo.jl
@@ -8,17 +8,33 @@
     (; model, μ_true, L_true, n_dims, is_meanfield) = modelstats
 
     q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
+    q0_trans = Bijectors.transformed(q0, identity)
 
     @testset "basic" begin
         @testset for n_montecarlo in [1, 10]
             alg = KLMinRepGradDescent(
                 AD;
                 n_samples=n_montecarlo,
-                operator=IdentityOperator(),
+                operator=ClipScale(),
                 averager=PolynomialAveraging(),
             )
+
             _, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
             @test isfinite(last(info).elbo)
+
+            _, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
+            @test isfinite(last(info).elbo)
+        end
+    end
+
+    @testset "warn MvLocationScale with IdentityOperator" begin
+        @test_warn "IdentityOperator" begin
+            alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
+            optimize(rng, alg, 1, model, q0; show_progress=false)
+        end
+        @test_warn "IdentityOperator" begin
+            alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
+            optimize(rng, alg, 1, model, q0_trans; show_progress=false)
+        end
     end
 end
diff --git a/test/algorithms/paramspacesgd/repgradelbo_locationscale.jl b/test/algorithms/paramspacesgd/repgradelbo_locationscale.jl
index 9ca41a23e..eb6d26fb9 100644
--- a/test/algorithms/paramspacesgd/repgradelbo_locationscale.jl
+++ b/test/algorithms/paramspacesgd/repgradelbo_locationscale.jl
@@ -16,7 +16,7 @@
 
     T = 1000
    η = 1e-3
-    alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
+    alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())
 
     q0 = if is_meanfield
         MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
diff --git a/test/algorithms/paramspacesgd/repgradelbo_locationscale_bijectors.jl b/test/algorithms/paramspacesgd/repgradelbo_locationscale_bijectors.jl
index b96f520c4..0a398993d 100644
--- a/test/algorithms/paramspacesgd/repgradelbo_locationscale_bijectors.jl
+++ b/test/algorithms/paramspacesgd/repgradelbo_locationscale_bijectors.jl
@@ -16,7 +16,7 @@
 
     T = 1000
    η = 1e-3
-    alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
+    alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())
 
     b = Bijectors.bijector(model)
     b⁻¹ = inverse(b)
diff --git a/test/algorithms/paramspacesgd/scoregradelbo.jl b/test/algorithms/paramspacesgd/scoregradelbo.jl
index cd21de53c..c8940eed3 100644
--- a/test/algorithms/paramspacesgd/scoregradelbo.jl
+++ b/test/algorithms/paramspacesgd/scoregradelbo.jl
@@ -7,12 +7,30 @@
    (; model, μ_true, L_true, n_dims, is_meanfield) = modelstats
 
    q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
+   q0_trans = Bijectors.transformed(q0, identity)
 
    @testset "basic" begin
        @testset for n_montecarlo in [1, 10]
-           alg = KLMinScoreGradDescent(AD; n_samples=n_montecarlo, optimizer=Descent(1e-5))
+           alg = KLMinScoreGradDescent(
+               AD; n_samples=n_montecarlo, operator=ClipScale(), optimizer=Descent(1e-5)
+           )
+
           _, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
           @assert isfinite(last(info).elbo)
+
+           _, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
+           @assert isfinite(last(info).elbo)
+       end
+   end
+
+   @testset "warn MvLocationScale with IdentityOperator" begin
+       @test_warn "IdentityOperator" begin
+           alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
+           optimize(rng, alg, 1, model, q0; show_progress=false)
+       end
+       @test_warn "IdentityOperator" begin
+           alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
+           optimize(rng, alg, 1, model, q0_trans; show_progress=false)
+       end
    end
 end
diff --git a/test/algorithms/paramspacesgd/scoregradelbo_locationscale.jl b/test/algorithms/paramspacesgd/scoregradelbo_locationscale.jl
index 109a2b044..fc567b4c9 100644
--- a/test/algorithms/paramspacesgd/scoregradelbo_locationscale.jl
+++ b/test/algorithms/paramspacesgd/scoregradelbo_locationscale.jl
@@ -12,7 +12,7 @@
     T = 1000
     η = 1e-4
     opt = Optimisers.Descent(η)
-    alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
+    alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())
 
     q0 = if is_meanfield
         MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
diff --git a/test/algorithms/paramspacesgd/scoregradelbo_locationscale_bijectors.jl b/test/algorithms/paramspacesgd/scoregradelbo_locationscale_bijectors.jl
index 84d570f2a..bfc0da73c 100644
--- a/test/algorithms/paramspacesgd/scoregradelbo_locationscale_bijectors.jl
+++ b/test/algorithms/paramspacesgd/scoregradelbo_locationscale_bijectors.jl
@@ -12,7 +12,7 @@
     T = 1000
     η = 1e-4
     opt = Optimisers.Descent(η)
-    alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
+    alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())
 
     b = Bijectors.bijector(model)
     b⁻¹ = inverse(b)
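
For reference, the snippet below is a minimal sketch (not part of the patch) of the usage the documentation hunks above describe: because the new default is `operator=IdentityOperator()`, a `<:MvLocationScale` family such as `FullRankGaussian` should be paired with `ClipScale()` explicitly to keep the scale matrix positive definite. The toy target `ToyNormal` is hypothetical and only makes the example self-contained; everything else follows the calls shown in the diff.

```julia
using ADTypes, ReverseDiff
using AdvancedVI
using LinearAlgebra
using LogDensityProblems

# Hypothetical standard-normal target, defined only so the example runs end to end.
struct ToyNormal end
LogDensityProblems.logdensity(::ToyNormal, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyNormal) = 2
LogDensityProblems.capabilities(::Type{ToyNormal}) = LogDensityProblems.LogDensityOrder{0}()

# `FullRankGaussian` is `<:MvLocationScale`, so pass `ClipScale()` explicitly;
# relying on the new default `IdentityOperator()` would emit the warning added above
# and leave the scale matrix free to become singular during optimization.
d = LogDensityProblems.dimension(ToyNormal())
q0 = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37 * I, d, d)))
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())

q, info, _ = AdvancedVI.optimize(alg, 10^3, ToyNormal(), q0; show_progress=false)
@assert isfinite(last(info).elbo)
```

The same consideration applies when the family is wrapped in a `Bijectors.TransformedDistribution`, which is what the new `AdvancedVI.init` method in `AdvancedVIBijectorsExt` checks.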