Merged
22 commits
0d30400
fix move `q_init` to front of arguments to algorithms `init`
Red-Portal Aug 26, 2025
a5d87c4
fix docs missing `prob` argument in `init`
Red-Portal Aug 26, 2025
2813724
fix add missing argument in docs link for `ParamSpaceSGD`
Red-Portal Aug 26, 2025
679d2fb
fix add missing argument in `init(SubsampledObjective)`
Red-Portal Aug 26, 2025
17cf6a1
fix test missing `q_init` argument for `SubsampledObjective`
Red-Portal Aug 26, 2025
37031db
fix missing `q_init` argument in tests for `SubsampledObjective`
Red-Portal Aug 26, 2025
164f0b1
Merge branch 'main' of github.com:TuringLang/AdvancedVI.jl into chang…
Red-Portal Aug 26, 2025
8f81a43
change default operator for `KLMinRepGradDescent`, add warning
Red-Portal Aug 26, 2025
11f16c9
run formatter
Red-Portal Aug 26, 2025
e75b319
run formatter
Red-Portal Aug 26, 2025
a630152
run formatter
Red-Portal Aug 26, 2025
049a783
run formatter
Red-Portal Aug 26, 2025
34b08a0
run formatter
Red-Portal Aug 26, 2025
ae95b22
update history
Red-Portal Aug 26, 2025
96da115
Merge branch 'klminrepgradescent_default_identity' of github.com:Turi…
Red-Portal Aug 26, 2025
0a7e26f
fix nowarn tests for `MvLocationScale` with `IdentityOperator`
Red-Portal Aug 26, 2025
2a0c08e
fix remove catch-all warning tests
Red-Portal Aug 26, 2025
59fcbfe
fix enable `ClipScale` operator in benchmarks
Red-Portal Aug 26, 2025
5808d7a
Merge branch 'main' of github.com:TuringLang/AdvancedVI.jl into klmin…
Red-Portal Sep 13, 2025
95c6c2f
add missing operator in tutorials
Red-Portal Sep 13, 2025
89792f5
fix wrong keyword argument for subsampling tutorial
Red-Portal Sep 13, 2025
b5f30b6
fix remove typedef calls in IdentityOperator warning
Red-Portal Sep 14, 2025
3 changes: 3 additions & 0 deletions HISTORY.md
@@ -5,6 +5,9 @@
The default parameters for the parameter-free optimizers `DoG` and `DoWG` have been changed.
The defaults should now be less sensitive to the problem dimension, so convergence should be faster than before on high-dimensional problems.

The default value of the `operator` keyword argument of `KLMinRepGradDescent` has been changed from `ClipScale` to `IdentityOperator`. This means that for variational families `<:MvLocationScale`, optimization may fail since nothing enforces the scale matrix to remain positive definite.
Therefore, whenever a `<:MvLocationScale` variational family is used in combination with `IdentityOperator`, a warning message instructing the user to use `ClipScale` will be displayed.

## Interface Changes

An additional layer of indirection, `AbstractAlgorithms`, has been added.
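As a hedged illustration of the change described above (not part of the diff): with the new default, the projection must be requested explicitly when a location-scale family is used. All names below appear elsewhere in this PR; the snippet is only a sketch.

```julia
using ADTypes, ReverseDiff
using AdvancedVI

# New default: no operator is applied after each gradient step.
alg_default = KLMinRepGradDescent(ADTypes.AutoReverseDiff())

# For `<:MvLocationScale` families (e.g. `MeanFieldGaussian`), pass `ClipScale`
# explicitly to keep the scale matrix positive definite, as before this change.
alg_clipped = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```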
13 changes: 10 additions & 3 deletions README.md
@@ -106,16 +106,23 @@ For the VI algorithm, we will use the following:
using ADTypes, ReverseDiff
using AdvancedVI

alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```

This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.

Also, projection or proximal operators can be used through the keyword argument `operator`.
For this example, we will use a Gaussian variational family, which belongs to the broader location-scale family.
Location-scale distributions require the scale matrix to have strictly positive eigenvalues at all times.
Here, the projection operator `ClipScale` ensures this.

`KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` provides gradients.
For this, it is straightforward to use `LogDensityProblemsAD`:

```julia
import DifferentiationInterface
import LogDensityProblemsAD
using DifferentiationInterface: DifferentiationInterface
using LogDensityProblemsAD: LogDensityProblemsAD

model_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), model)
```
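For context, a hedged end-to-end sketch of the README snippets above; the `ToyGaussian` target and its dimension are illustrative stand-ins for the README's `model`, not part of the diff.

```julia
using ADTypes, ReverseDiff
using AdvancedVI
using LinearAlgebra
using LogDensityProblems
using LogDensityProblemsAD: LogDensityProblemsAD

# Illustrative target: an isotropic Gaussian in 5 dimensions.
struct ToyGaussian end
LogDensityProblems.logdensity(::ToyGaussian, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian) = 5
LogDensityProblems.capabilities(::Type{ToyGaussian}) = LogDensityProblems.LogDensityOrder{0}()

model    = ToyGaussian()
model_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), model)

# Mean-field Gaussian family; `ClipScale` keeps its scale strictly positive.
q0  = MeanFieldGaussian(zeros(5), Diagonal(ones(5)))
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())

_, info, _ = AdvancedVI.optimize(alg, 10^3, model_ad, q0; show_progress=false)
last(info).elbo  # expected to be finite
```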
4 changes: 1 addition & 3 deletions docs/make.jl
@@ -16,9 +16,7 @@ makedocs(;
pages=[
"AdvancedVI" => "index.md",
"General Usage" => "general.md",
"Tutorials" => [
"tutorials/basic.md",
],
"Tutorials" => ["tutorials/basic.md"],
"Algorithms" => [
"KLMinRepGradDescent" => "paramspacesgd/klminrepgraddescent.md",
"KLMinRepGradProxDescent" => "paramspacesgd/klminrepgradproxdescent.md",
2 changes: 1 addition & 1 deletion docs/src/families.md
@@ -184,7 +184,7 @@ D = ones(n_dims)
U = zeros(n_dims, 3)
q0_lr = LowRankGaussian(μ, D, U)

alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01))
alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())

max_iter = 10^4

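A hedged continuation of the low-rank example above, made self-contained so it can run on its own; `ToyGaussian10` is a hypothetical stand-in for the target `model` defined earlier on that docs page.

```julia
using ADTypes, ReverseDiff
using AdvancedVI, Optimisers
using LogDensityProblems

# Hypothetical 10-dimensional target standing in for the page's `model`.
struct ToyGaussian10 end
LogDensityProblems.logdensity(::ToyGaussian10, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian10) = 10
LogDensityProblems.capabilities(::Type{ToyGaussian10}) = LogDensityProblems.LogDensityOrder{0}()

n_dims = 10
μ = zeros(n_dims)
D = ones(n_dims)
U = zeros(n_dims, 3)
q0_lr = LowRankGaussian(μ, D, U)

alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())
_, info, _ = AdvancedVI.optimize(alg, 10^4, ToyGaussian10(), q0_lr; show_progress=false)
last(info).elbo
```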
12 changes: 9 additions & 3 deletions docs/src/paramspacesgd/repgradelbo.md
@@ -192,7 +192,10 @@ binv = inverse(b)
q0_trans = Bijectors.TransformedDistribution(q0, binv)

cfe = KLMinRepGradDescent(
AutoReverseDiff(); entropy=ClosedFormEntropy(), optimizer=Adam(1e-2)
AutoReverseDiff();
entropy=ClosedFormEntropy(),
optimizer=Adam(1e-2),
operator=ClipScale(),
)
nothing
```
Expand All @@ -201,7 +204,10 @@ The repgradelbo estimator can instead be created as follows:

```@example repgradelbo
stl = KLMinRepGradDescent(
AutoReverseDiff(); entropy=StickingTheLandingEntropy(), optimizer=Adam(1e-2)
AutoReverseDiff();
entropy=StickingTheLandingEntropy(),
optimizer=Adam(1e-2),
operator=ClipScale(),
)
nothing
```
@@ -318,7 +324,7 @@ nothing

```@setup repgradelbo
_, info_qmc, _ = AdvancedVI.optimize(
KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2)),
KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
max_iter,
model,
q0_trans;
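Aside (a sketch, not part of the diff): the two constructions above differ only in the `entropy` estimator; the optimizer and the newly explicit `ClipScale` projection are shared. Assuming the same packages as that docs page:

```julia
using ADTypes, ReverseDiff
using AdvancedVI, Optimisers

# Only the entropy gradient estimator differs between the two algorithms;
# the optimizer and the `ClipScale` projection are shared.
make_alg(entropy) = KLMinRepGradDescent(
    AutoReverseDiff(); entropy=entropy, optimizer=Adam(1e-2), operator=ClipScale()
)

cfe = make_alg(ClosedFormEntropy())          # closed-form entropy
stl = make_alg(StickingTheLandingEntropy())  # "sticking the landing" estimator
```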
7 changes: 6 additions & 1 deletion docs/src/tutorials/basic.md
@@ -74,10 +74,15 @@ Here, we will use `ReverseDiff`, which can be selected by later passing `ADTypes
using ADTypes, ReverseDiff
using AdvancedVI

alg = KLMinRepGradDescent(AutoReverseDiff());
alg = KLMinRepGradDescent(AutoReverseDiff(); operator=ClipScale());
nothing
```

Projection or proximal operators can be used through the keyword argument `operator`.
For this example, we will use a Gaussian variational family, which belongs to the broader [location-scale family](@ref locscale).
Location-scale family distributions require the scale matrix to have strictly positive eigenvalues at all times.
Here, the projection operator `ClipScale` ensures this.

Now, `KLMinRepGradDescent` requires the variational approximation and the target log-density to have the same support.
Since `y` follows a log-normal prior, its support is restricted to the positive half-space ``\mathbb{R}_+``.
Thus, we will use [Bijectors](https://github.com/TuringLang/Bijectors.jl) to match the support of our target posterior and the variational approximation.
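A minimal, hedged sketch of the support-matching step referenced above. To stay self-contained it uses the `identity` transformation (as this PR's tests do); in the tutorial itself the transformation would be `inverse(Bijectors.bijector(model))`, which maps the unconstrained Gaussian onto the target's support.

```julia
using AdvancedVI, Bijectors, LinearAlgebra

# Wrap a location-scale family in a change of variables. `identity` keeps this
# snippet runnable on its own; the tutorial would use the inverse bijector of the
# model so that samples land in the target's (positive) support.
q0       = MeanFieldGaussian(zeros(2), Diagonal(ones(2)))
q0_trans = Bijectors.transformed(q0, identity)

# `optimize` then receives `q0_trans` in place of `q0`.
```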
25 changes: 25 additions & 0 deletions ext/AdvancedVIBijectorsExt.jl
@@ -1,11 +1,36 @@
module AdvancedVIBijectorsExt

using AdvancedVI
using DiffResults: DiffResults
using Bijectors
using LinearAlgebra
using Optimisers
using Random

function AdvancedVI.init(
rng::Random.AbstractRNG,
alg::AdvancedVI.ParamSpaceSGD,
q_init::Bijectors.TransformedDistribution,
prob,
)
(; adtype, optimizer, averager, objective, operator) = alg
if q_init.dist isa AdvancedVI.MvLocationScale &&
operator isa AdvancedVI.IdentityOperator
@warn(
"IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` instead.",
typeof(q_init),
typeof(q_init.dist),
typeof(operator)
)
end
params, re = Optimisers.destructure(q_init)
opt_st = Optimisers.setup(optimizer, params)
obj_st = AdvancedVI.init(rng, objective, adtype, q_init, prob, params, re)
avg_st = AdvancedVI.init(averager, params)
grad_buf = DiffResults.DiffResult(zero(eltype(params)), similar(params))
return AdvancedVI.ParamSpaceSGDState(prob, q_init, 0, grad_buf, opt_st, obj_st, avg_st)
end

function AdvancedVI.apply(
op::ClipScale,
::Type{<:Bijectors.TransformedDistribution{<:AdvancedVI.MvLocationScale}},
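To illustrate the branch added above (a sketch mirroring the PR's tests; `ToyGaussian2` is a hypothetical target so the snippet runs on its own): combining a transformed location-scale family with the default `IdentityOperator` triggers the warning, while passing `ClipScale` does not.

```julia
using ADTypes, ReverseDiff, Random
using AdvancedVI, Bijectors, LinearAlgebra
using LogDensityProblems

# Hypothetical 2-dimensional target.
struct ToyGaussian2 end
LogDensityProblems.logdensity(::ToyGaussian2, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyGaussian2) = 2
LogDensityProblems.capabilities(::Type{ToyGaussian2}) = LogDensityProblems.LogDensityOrder{0}()

q0       = MeanFieldGaussian(zeros(2), Diagonal(ones(2)))
q0_trans = Bijectors.transformed(q0, identity)

alg_warn  = KLMinRepGradDescent(AutoReverseDiff())                        # warns
alg_quiet = KLMinRepGradDescent(AutoReverseDiff(); operator=ClipScale())  # silent

rng = Random.default_rng()
AdvancedVI.optimize(rng, alg_warn, 1, ToyGaussian2(), q0_trans; show_progress=false)
AdvancedVI.optimize(rng, alg_quiet, 1, ToyGaussian2(), q0_trans; show_progress=false)
```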
12 changes: 9 additions & 3 deletions src/algorithms/paramspacesgd/constructors.jl
@@ -4,6 +4,9 @@

KL divergence minimization by running stochastic gradient descent with the reparameterization gradient in the Euclidean space of variational parameters.

!!! note
For a `<:MvLocationScale` variational family, `IdentityOperator` should be avoided for `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).

# Arguments
- `adtype::ADTypes.AbstractADType`: Automatic differentiation backend.

@@ -12,7 +15,7 @@ KL divergence minimization by running stochastic gradient descent with the repar
- `optimizer::Optimisers.AbstractRule`: Optimization algorithm to be used. (default: `DoWG()`)
- `n_samples::Int`: Number of Monte Carlo samples to be used for estimating each gradient. (default: `1`)
- `averager::AbstractAverager`: Parameter averaging strategy.
- `operator::Union{<:IdentityOperator, <:ClipScale}`: Operator to be applied after each gradient descent step. (default: `ClipScale()`)
- `operator::AbstractOperator`: Operator to be applied after each gradient descent step. (default: `IdentityOperator()`)
- `subsampling::Union{<:Nothing,<:AbstractSubsampling}`: Data point subsampling strategy. If `nothing`, subsampling is not used. (default: `nothing`)

# Requirements
Expand All @@ -28,7 +31,7 @@ function KLMinRepGradDescent(
optimizer::Optimisers.AbstractRule=DoWG(),
n_samples::Int=1,
averager::AbstractAverager=PolynomialAveraging(),
operator::Union{<:IdentityOperator,<:ClipScale}=ClipScale(),
operator::AbstractOperator=IdentityOperator(),
subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
)
objective = if isnothing(subsampling)
@@ -90,6 +93,9 @@ end

KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.

!!! note
If a `<:MvLocationScale` variational family is used, `IdentityOperator` should be avoided for `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).

# Arguments
- `adtype`: Automatic differentiation backend.

@@ -111,7 +117,7 @@ function KLMinScoreGradDescent(
optimizer::Union{<:Descent,<:DoG,<:DoWG}=DoWG(),
n_samples::Int=1,
averager::AbstractAverager=PolynomialAveraging(),
operator::Union{<:IdentityOperator,<:ClipScale}=IdentityOperator(),
operator::AbstractOperator=IdentityOperator(),
subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
)
objective = if isnothing(subsampling)
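For reference, a hedged sketch spelling out the documented keyword arguments with explicit values. The values are the defaults listed in the docstrings above, except that `ClipScale` is substituted for the new `IdentityOperator` default, as the docstring notes recommend for location-scale families.

```julia
using ADTypes, ReverseDiff
using AdvancedVI

alg_rep = KLMinRepGradDescent(
    AutoReverseDiff();
    optimizer=DoWG(),                 # default optimizer
    n_samples=1,                      # Monte Carlo samples per gradient estimate
    averager=PolynomialAveraging(),   # parameter averaging strategy
    operator=ClipScale(),             # recommended for <:MvLocationScale families
    subsampling=nothing,              # no data point subsampling
)

alg_score = KLMinScoreGradDescent(
    AutoReverseDiff();
    optimizer=DoWG(),
    n_samples=10,
    averager=PolynomialAveraging(),
    operator=ClipScale(),
)
```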
9 changes: 8 additions & 1 deletion src/algorithms/paramspacesgd/paramspacesgd.jl
@@ -65,7 +65,14 @@ struct ParamSpaceSGDState{P,Q,GradBuf,OptSt,ObjSt,AvgSt}
end

function init(rng::Random.AbstractRNG, alg::ParamSpaceSGD, q_init, prob)
(; adtype, optimizer, averager, objective) = alg
(; adtype, optimizer, averager, objective, operator) = alg
if q_init isa AdvancedVI.MvLocationScale && operator isa AdvancedVI.IdentityOperator
@warn(
"IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` instead.",
typeof(q_init),
typeof(operator)
)
end
params, re = Optimisers.destructure(q_init)
opt_st = Optimisers.setup(optimizer, params)
obj_st = init(rng, objective, adtype, q_init, prob, params, re)
3 changes: 1 addition & 2 deletions src/algorithms/paramspacesgd/repgradelbo.jl
@@ -47,7 +47,6 @@ function init(
params,
restructure,
)
q_stop = q
capability = LogDensityProblems.capabilities(typeof(prob))
ad_prob = if capability < LogDensityProblems.LogDensityOrder{1}()
@info "The capability of the supplied `LogDensityProblem` $(capability) is less than $(LogDensityProblems.LogDensityOrder{1}()). `AdvancedVI` will attempt to directly differentiate through `LogDensityProblems.logdensity`. If this is not intended, please supply a log-density problem with capability at least $(LogDensityProblems.LogDensityOrder{1}())"
@@ -67,7 +66,7 @@
obj=obj,
problem=ad_prob,
restructure=restructure,
q_stop=q_stop,
q_stop=q,
)
obj_ad_prep = AdvancedVI._prepare_gradient(
estimate_repgradelbo_ad_forward, adtype, params, aux
18 changes: 17 additions & 1 deletion test/algorithms/paramspacesgd/repgradelbo.jl
@@ -8,17 +8,33 @@
(; model, μ_true, L_true, n_dims, is_meanfield) = modelstats

q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
q0_trans = Bijectors.transformed(q0, identity)

@testset "basic" begin
@testset for n_montecarlo in [1, 10]
alg = KLMinRepGradDescent(
AD;
n_samples=n_montecarlo,
operator=IdentityOperator(),
operator=ClipScale(),
averager=PolynomialAveraging(),
)

_, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
@test isfinite(last(info).elbo)

_, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
@test isfinite(last(info).elbo)
end
end

@testset "warn MvLocationScale with IdentityOperator" begin
@test_warn "IdentityOperator" begin
alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0; show_progress=false)
end
@test_warn "IdentityOperator" begin
alg = KLMinRepGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0_trans; show_progress=false)
end
end

2 changes: 1 addition & 1 deletion test/algorithms/paramspacesgd/repgradelbo_locationscale.jl
@@ -16,7 +16,7 @@

T = 1000
η = 1e-3
alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())

q0 = if is_meanfield
MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
@@ -16,7 +16,7 @@

T = 1000
η = 1e-3
alg = KLMinRepGradDescent(AD; optimizer=Descent(η))
alg = KLMinRepGradDescent(AD; optimizer=Descent(η), operator=ClipScale())

b = Bijectors.bijector(model)
b⁻¹ = inverse(b)
20 changes: 19 additions & 1 deletion test/algorithms/paramspacesgd/scoregradelbo.jl
@@ -7,12 +7,30 @@
(; model, μ_true, L_true, n_dims, is_meanfield) = modelstats

q0 = MeanFieldGaussian(zeros(n_dims), Diagonal(ones(n_dims)))
q0_trans = Bijectors.transformed(q0, identity)

@testset "basic" begin
@testset for n_montecarlo in [1, 10]
alg = KLMinScoreGradDescent(AD; n_samples=n_montecarlo, optimizer=Descent(1e-5))
alg = KLMinScoreGradDescent(
AD; n_samples=n_montecarlo, operator=ClipScale(), optimizer=Descent(1e-5)
)

_, info, _ = optimize(rng, alg, 10, model, q0; show_progress=false)
@assert isfinite(last(info).elbo)

_, info, _ = optimize(rng, alg, 10, model, q0_trans; show_progress=false)
@assert isfinite(last(info).elbo)
end
end

@testset "warn MvLocationScale with IdentityOperator" begin
@test_warn "IdentityOperator" begin
alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0; show_progress=false)
end
@test_warn "IdentityOperator" begin
alg = KLMinScoreGradDescent(AD; operator=IdentityOperator())
optimize(rng, alg, 1, model, q0_trans; show_progress=false)
end
end

@@ -12,7 +12,7 @@
T = 1000
η = 1e-4
opt = Optimisers.Descent(η)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())

q0 = if is_meanfield
MeanFieldGaussian(zeros(realtype, n_dims), Diagonal(ones(realtype, n_dims)))
@@ -12,7 +12,7 @@
T = 1000
η = 1e-4
opt = Optimisers.Descent(η)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt)
alg = KLMinScoreGradDescent(AD; n_samples=10, optimizer=opt, operator=ClipScale())

b = Bijectors.bijector(model)
b⁻¹ = inverse(b)