
Commit ca2b3e4

Make IdentityOperator default for KLMinRepGradDescent (#201)
* fix move `q_init` to front of arguments to algorithms' `init`
* fix docs missing `prob` argument in `init`
* fix add missing argument in docs link for `ParamSpaceSGD`
* fix add missing argument in `init(SubsampledObjective)`
* fix test missing `q_init` argument for `SubsampledObjective`
* fix missing `q_init` argument in tests for `SubsampledObjective`
* change default operator for `KLMinRepGradDescent`, add warning
* run formatter
* update history
* fix nowarn tests for `MvLocationScale` with `IdentityOperator`
* fix remove catch-all warning tests
* fix enable `ClipScale` operator in benchmarks
* add missing operator in tutorials
* fix wrong keyword argument for subsampling tutorial
* fix remove typedef calls in `IdentityOperator` warning

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent: cb11ac4

17 files changed (+115, -24 lines)

HISTORY.md (3 additions, 0 deletions)

@@ -5,6 +5,9 @@
  The default parameters for the parameter-free optimizers `DoG` and `DoWG` have been changed.
  Now, the choice of parameters should be more invariant to dimension, so that convergence becomes faster than before on high-dimensional problems.

+ The default value of the `operator` keyword argument of `KLMinRepGradDescent` has been changed from `ClipScale` to `IdentityOperator`. This means that for variational families `<:MvLocationScale`, optimization may fail since nothing enforces the scale matrix to be positive definite.
+ Therefore, when a variational family `<:MvLocationScale` is used in combination with `IdentityOperator`, a warning message instructing the user to use `ClipScale` is displayed.
+
  ## Interface Changes

  An additional layer of indirection, `AbstractAlgorithms`, has been added.
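For users upgrading, a minimal migration sketch implied by this entry (the constructor calls mirror those appearing in the docs changes below; `ClipScale` and `IdentityOperator` are assumed to be exported as in the docstrings):

```julia
using ADTypes, ReverseDiff
using AdvancedVI

# Before this commit, the projection was applied implicitly:
#   KLMinRepGradDescent(ADTypes.AutoReverseDiff())   # operator defaulted to ClipScale()
# After this commit, the same call uses IdentityOperator(), so a <:MvLocationScale
# family must request the projection explicitly to keep its scale matrix positive definite:
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
```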

README.md (12 additions, 4 deletions)

@@ -113,15 +113,23 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
  using ADTypes, ReverseDiff
  using AdvancedVI

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff());
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())
  ```

  This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
  This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.

- `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
- If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
- For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:
+ Also, projection or proximal operators can be used through the keyword argument `operator`.
+ For this example, we will use a Gaussian variational family, which is part of the broader location-scale family.
+ These families require the scale matrix to have strictly positive eigenvalues at all times.
+ Here, the projection operator `ClipScale` ensures this.
+
+ `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` has gradients.
+ For this, it is straightforward to use `LogDensityProblemsAD`:
+
+ ```julia
+ using DifferentiationInterface: DifferentiationInterface
+ using LogDensityProblemsAD: LogDensityProblemsAD

  ```julia
  using DifferentiationInterface: DifferentiationInterface
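As a sketch of the `LogDensityProblemsAD` step referenced in the README change above (assuming `prob` is the user-defined `LogDensityProblem` from the README, and that `ADgradient` accepts an `ADTypes` backend as in recent `LogDensityProblemsAD` releases):

```julia
using ADTypes, ReverseDiff
using LogDensityProblemsAD: LogDensityProblemsAD

# Equip the target with a first-order differentiation capability so the
# reparameterization gradient can be computed through it.
prob_ad = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), prob)
```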

bench/benchmarks.jl (1 addition, 1 deletion)

@@ -62,7 +62,7 @@ begin
      b = Bijectors.bijector(prob)
      binv = inverse(b)
      q = Bijectors.TransformedDistribution(family, binv)
-     alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy)
+     alg = KLMinRepGradDescent(adtype; optimizer=opt, entropy, operator=ClipScale())

      SUITES[probname][objname][familyname][adname] = begin
          @benchmarkable AdvancedVI.optimize(

docs/src/families.md (1 addition, 1 deletion)

@@ -184,7 +184,7 @@ D = ones(n_dims)
  U = zeros(n_dims, 3)
  q0_lr = LowRankGaussian(μ, D, U)

- alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01))
+ alg = KLMinRepGradDescent(AutoReverseDiff(); optimizer=Adam(0.01), operator=ClipScale())

  max_iter = 10^4

docs/src/paramspacesgd/repgradelbo.md (9 additions, 3 deletions)

@@ -191,7 +191,10 @@ binv = inverse(b)
  q0_trans = Bijectors.TransformedDistribution(q0, binv)

  cfe = KLMinRepGradDescent(
-     AutoReverseDiff(); entropy=ClosedFormEntropy(), optimizer=Adam(1e-2)
+     AutoReverseDiff();
+     entropy=ClosedFormEntropy(),
+     optimizer=Adam(1e-2),
+     operator=ClipScale(),
  )
  nothing
  ```

@@ -200,7 +203,10 @@ The repgradelbo estimator can instead be created as follows:

  ```@example repgradelbo
  stl = KLMinRepGradDescent(
-     AutoReverseDiff(); entropy=StickingTheLandingEntropy(), optimizer=Adam(1e-2)
+     AutoReverseDiff();
+     entropy=StickingTheLandingEntropy(),
+     optimizer=Adam(1e-2),
+     operator=ClipScale(),
  )
  nothing
  ```

@@ -317,7 +323,7 @@ nothing

  ```@setup repgradelbo
  _, info_qmc, _ = AdvancedVI.optimize(
-     KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2)),
+     KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
      max_iter,
      model,
      q0_trans;

docs/src/tutorials/basic.md (10 additions, 2 deletions)

@@ -111,15 +111,20 @@ For the VI algorithm, we will use `KLMinRepGradDescent`:
  using ADTypes, ReverseDiff
  using AdvancedVI

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale());
  nothing
  ```

  This algorithm minimizes the exclusive/reverse KL divergence via stochastic gradient descent in the (Euclidean) space of the parameters of the variational approximation with the reparametrization gradient[^TL2014][^RMW2014][^KW2014].
  This is also commonly referred to as automatic differentiation VI, black-box VI, stochastic gradient VI, and so on.
+
+ For certain algorithms such as `KLMinRepGradDescent`, projection or proximal operators can be used through the keyword argument `operator`.
+ For this example, we will use a Gaussian variational family, which is part of the broader [location-scale family](@ref locscale).
+ Location-scale family distributions require the scale matrix to have strictly positive eigenvalues at all times.
+ Here, the projection operator `ClipScale` ensures this.
+
  `KLMinRepGradDescent`, in particular, assumes that the target `LogDensityProblem` is differentiable.
  If the `LogDensityProblem` has a differentiation [capability](https://www.tamaspapp.eu/LogDensityProblems.jl/dev/#LogDensityProblems.capabilities) of at least first-order, we can take advantage of this.
-
  For this example, we will use `LogDensityProblemsAD` to equip our problem with a first-order capability:

  [^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014, June). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*. PMLR.

@@ -143,6 +148,9 @@ q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))
  nothing
  ```

+ Now, `KLMinRepGradDescent` requires the variational approximation and the target log-density to have the same support.
+ Since `y` follows a log-normal prior, its support is restricted to the positive half-space ``\mathbb{R}_+``.
+ Thus, we will use [Bijectors](https://github.com/TuringLang/Bijectors.jl) to match the support of our target posterior and the variational approximation.
  The bijector can now be applied to `q` to match the support of the target problem.

  ```@example basic
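The support-matching step described in the tutorial change above follows the same pattern used elsewhere in this commit (for example in `bench/benchmarks.jl`). A minimal sketch, assuming `prob` is the tutorial's target `LogDensityProblem` (with a `Bijectors.bijector` method, as the benchmark problems have) and `q` is the Gaussian family constructed above:

```julia
using Bijectors

# Map between the constrained support of `prob` and the unconstrained space.
b    = Bijectors.bijector(prob)   # constrained -> unconstrained
binv = Bijectors.inverse(b)       # unconstrained -> constrained

# Wrap the Gaussian `q` (defined on the unconstrained space) so that its
# samples land on the support of the target.
q_trans = Bijectors.TransformedDistribution(q, binv)
```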

docs/src/tutorials/stan.md (1 addition, 1 deletion)

@@ -81,7 +81,7 @@ using LinearAlgebra
  using LogDensityProblems
  using Plots

- alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
+ alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff(); operator=ClipScale())

  d = LogDensityProblems.dimension(model)
  q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))

ext/AdvancedVIBijectorsExt.jl (22 additions, 0 deletions)

@@ -1,11 +1,33 @@
  module AdvancedVIBijectorsExt

  using AdvancedVI
+ using DiffResults: DiffResults
  using Bijectors
  using LinearAlgebra
  using Optimisers
  using Random

+ function AdvancedVI.init(
+     rng::Random.AbstractRNG,
+     alg::AdvancedVI.ParamSpaceSGD,
+     q_init::Bijectors.TransformedDistribution,
+     prob,
+ )
+     (; adtype, optimizer, averager, objective, operator) = alg
+     if q_init.dist isa AdvancedVI.MvLocationScale &&
+         operator isa AdvancedVI.IdentityOperator
+         @warn(
+             "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` in the algorithm instead.",
+         )
+     end
+     params, re = Optimisers.destructure(q_init)
+     opt_st = Optimisers.setup(optimizer, params)
+     obj_st = AdvancedVI.init(rng, objective, adtype, q_init, prob, params, re)
+     avg_st = AdvancedVI.init(averager, params)
+     grad_buf = DiffResults.DiffResult(zero(eltype(params)), similar(params))
+     return AdvancedVI.ParamSpaceSGDState(prob, q_init, 0, grad_buf, opt_st, obj_st, avg_st)
+ end
+
  function AdvancedVI.apply(
      op::ClipScale,
      ::Type{<:Bijectors.TransformedDistribution{<:AdvancedVI.MvLocationScale}},

src/algorithms/paramspacesgd/constructors.jl (9 additions, 3 deletions)

@@ -4,6 +4,9 @@

  KL divergence minimization by running stochastic gradient descent with the reparameterization gradient in the Euclidean space of variational parameters.

+ !!! note
+     For a `<:MvLocationScale` variational family, `IdentityOperator` should be avoided as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
  # Arguments
  - `adtype::ADTypes.AbstractADType`: Automatic differentiation backend.

@@ -12,7 +15,7 @@ KL divergence minimization by running stochastic gradient descent with the repar
  - `optimizer::Optimisers.AbstractRule`: Optimization algorithm to be used. (default: `DoWG()`)
  - `n_samples::Int`: Number of Monte Carlo samples to be used for estimating each gradient. (default: `1`)
  - `averager::AbstractAverager`: Parameter averaging strategy.
- - `operator::Union{<:IdentityOperator, <:ClipScale}`: Operator to be applied after each gradient descent step. (default: `ClipScale()`)
+ - `operator::AbstractOperator`: Operator to be applied after each gradient descent step. (default: `IdentityOperator()`)
  - `subsampling::Union{<:Nothing,<:AbstractSubsampling}`: Data point subsampling strategy. If `nothing`, subsampling is not used. (default: `nothing`)

  # Requirements

@@ -28,7 +31,7 @@ function KLMinRepGradDescent(
      optimizer::Optimisers.AbstractRule=DoWG(),
      n_samples::Int=1,
      averager::AbstractAverager=PolynomialAveraging(),
-     operator::Union{<:IdentityOperator,<:ClipScale}=ClipScale(),
+     operator::AbstractOperator=IdentityOperator(),
      subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
  )
      objective = if isnothing(subsampling)

@@ -90,6 +93,9 @@

  KL divergence minimization by running stochastic gradient descent with the score gradient in the Euclidean space of variational parameters.

+ !!! note
+     If a `<:MvLocationScale` variational family is used, `IdentityOperator` should be avoided as the `operator` since optimization can result in a singular scale matrix. Instead, consider using [`ClipScale`](@ref).
+
  # Arguments
  - `adtype`: Automatic differentiation backend.

@@ -111,7 +117,7 @@ function KLMinScoreGradDescent(
      optimizer::Union{<:Descent,<:DoG,<:DoWG}=DoWG(),
      n_samples::Int=1,
      averager::AbstractAverager=PolynomialAveraging(),
-     operator::Union{<:IdentityOperator,<:ClipScale}=IdentityOperator(),
+     operator::AbstractOperator=IdentityOperator(),
      subsampling::Union{<:Nothing,<:AbstractSubsampling}=nothing,
  )
      objective = if isnothing(subsampling)
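A short sketch of what the relaxed `operator` keyword in these docstrings looks like at the call site. This assumes, as the `# Arguments` lists above suggest, that both constructors take the AD backend as their first positional argument and that `ClipScale` and `IdentityOperator` are exported:

```julia
using ADTypes, ReverseDiff
using AdvancedVI

adtype = ADTypes.AutoReverseDiff()

# Both constructors now default to operator=IdentityOperator().
alg_rep   = KLMinRepGradDescent(adtype)
alg_score = KLMinScoreGradDescent(adtype)

# For a <:MvLocationScale family, pass the projection explicitly so the
# scale matrix stays positive definite during optimization.
alg_rep_ls   = KLMinRepGradDescent(adtype; operator=ClipScale())
alg_score_ls = KLMinScoreGradDescent(adtype; operator=ClipScale())
```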

src/algorithms/paramspacesgd/paramspacesgd.jl (6 additions, 1 deletion)

@@ -65,7 +65,12 @@ struct ParamSpaceSGDState{P,Q,GradBuf,OptSt,ObjSt,AvgSt}
  end

  function init(rng::Random.AbstractRNG, alg::ParamSpaceSGD, q_init, prob)
-     (; adtype, optimizer, averager, objective) = alg
+     (; adtype, optimizer, averager, objective, operator) = alg
+     if q_init isa AdvancedVI.MvLocationScale && operator isa AdvancedVI.IdentityyOperator
+         @warn(
+             "IdentityOperator is used with a variational family <:MvLocationScale. Optimization can easily fail under this combination due to singular scale matrices. Consider using the operator `ClipScale` in the algorithm instead.",
+         )
+     end
      params, re = Optimisers.destructure(q_init)
      opt_st = Optimisers.setup(optimizer, params)
      obj_st = init(rng, objective, adtype, q_init, prob, params, re)
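A hypothetical, self-contained sketch of the setup under which the warning added above would fire. `ToyTarget` is an illustrative stand-in and not part of the commit; the `optimize` call is left commented out because, as the warning says, running without `ClipScale` may eventually fail with a singular scale matrix:

```julia
using ADTypes, ReverseDiff
using AdvancedVI
using LinearAlgebra
using LogDensityProblems
using LogDensityProblemsAD: LogDensityProblemsAD

# A toy standard-normal target, wrapped to provide first-order capability.
struct ToyTarget end
LogDensityProblems.logdensity(::ToyTarget, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::ToyTarget) = 2
LogDensityProblems.capabilities(::Type{ToyTarget}) = LogDensityProblems.LogDensityOrder{0}()

prob = LogDensityProblemsAD.ADgradient(ADTypes.AutoReverseDiff(), ToyTarget())

d = LogDensityProblems.dimension(prob)
q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37 * I, d, d)))

# `operator` now defaults to IdentityOperator(), and `q` is <:MvLocationScale,
# so initialization is expected to emit the @warn added in this commit:
alg = KLMinRepGradDescent(ADTypes.AutoReverseDiff())
# q_avg, info, state = AdvancedVI.optimize(alg, 10^3, prob, q)
```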
