Commit 6f2e564

Remove the type ParamSpaceSGD (#205)

* remove the type `ParamSpaceSGD`
* run formatter
* fix rename file paramspacesgd.jl to interface.jl
* throw invalid state for unknown paramspacesgd type
* add docstring for union type of paramspacesgd algorithms
* fix remove custom state types for paramspacesgd algorithms
* fix remove custom state types for paramspacesgd
* fix file path
* fix bug in BijectorsExt
* fix include `SubSampleObjective` as part of `ParamSpaceSGD`
* fix revert adding SubsampledObjective into ParamSpaceSGD
* refactor flatten algorithms
* fix error update paths in main file
* refactor flatten the tests to reflect new structure
* fix file include path in tests
* fix missing operator in subsampledobj tests
* fix formatting
* update docs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

1 parent 3f1bc2f commit 6f2e564

33 files changed: +427 −422 lines

docs/make.jl

Lines changed: 3 additions & 11 deletions

@@ -23,17 +23,9 @@ makedocs(;
         "Normalizing Flows" => "tutorials/flows.md",
     ],
     "Algorithms" => [
-        "KLMinRepGradDescent" => "paramspacesgd/klminrepgraddescent.md",
-        "KLMinRepGradProxDescent" => "paramspacesgd/klminrepgradproxdescent.md",
-        "KLMinScoreGradDescent" => "paramspacesgd/klminscoregraddescent.md",
-        "Parameter Space SGD" => [
-            "General" => "paramspacesgd/general.md",
-            "Objectives" => [
-                "Overview" => "paramspacesgd/objectives.md",
-                "RepGradELBO" => "paramspacesgd/repgradelbo.md",
-                "ScoreGradELBO" => "paramspacesgd/scoregradelbo.md",
-            ],
-        ],
+        "KLMinRepGradDescent" => "klminrepgraddescent.md",
+        "KLMinRepGradProxDescent" => "klminrepgradproxdescent.md",
+        "KLMinScoreGradDescent" => "klminscoregraddescent.md",
     ],
     "Variational Families" => "families.md",
     "Optimization" => "optimization.md",

docs/src/families.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # [Reparameterizable Variational Families](@id families)

-The [RepGradELBO](@ref repgradelbo) objective assumes that the members of the variational family have a differentiable sampling path.
+Algorithms such as [`KLMinRepGradDescent`](@ref klminrepgraddescent) assume that the members of the variational family have a differentiable sampling path.
 We provide multiple pre-packaged variational families that can be readily used.

 ## [The `LocationScale` Family](@id locscale)

docs/src/index.md

Lines changed: 0 additions & 1 deletion

@@ -10,7 +10,6 @@ VI algorithms perform scalable and computationally efficient Bayesian inference

 # List of Algorithms

-  - [ParamSpaceSGD](@ref paramspacesgd)
   - [KLMinRepGradDescent](@ref klminrepgraddescent) (alias of `ADVI`)
   - [KLMinRepGradProxDescent](@ref klminrepgradproxdescent)
   - [KLMinScoreGradDescent](@ref klminscoregraddescent) (alias of `BBVI`)

docs/src/paramspacesgd/repgradelbo.md renamed to docs/src/klminrepgraddescent.md

Lines changed: 62 additions & 37 deletions
@@ -1,49 +1,75 @@
-# [Reparameterization Gradient Estimator](@id repgradelbo)
+# [`KLMinRepGradDescent`](@id klminrepgraddescent)

-## Overview
+## Description

-The `RepGradELBO` objective implements the reparameterization gradient estimator[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient.
-The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
-For the variational family $\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda\}$, suppose the process of sampling from $q_{\lambda}$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that
+This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
+Specifically, it uses the *reparameterization gradient estimator*.
+As a result, this algorithm is best suited to problems where the target log-density is differentiable and the sampling process of the variational family is differentiable.
+(See the [methodology section](@ref klminrepgraddescent_method) for more details.)
+This algorithm is also commonly referred to as automatic differentiation variational inference, black-box variational inference with the reparameterization gradient, and stochastic gradient variational inference.
+`ADVI` is provided as an alias of `KLMinRepGradDescent`.
+
+```@docs
+KLMinRepGradDescent
+```
+
+## [Methodology](@id klminrepgraddescent_method)
+
+This algorithm aims to solve the problem

-[^HC1983]: Ho, Y. C., & Cao, X. (1983). Perturbation analysis and optimization of queueing networks. Journal of Optimization Theory and Applications, 40(4), 559-582.
-[^G1991]: Glasserman, P. (1991). Gradient estimation via perturbation analysis (Vol. 116). Springer Science & Business Media.
-[^R1992]: Rubinstein, R. Y. (1992). Sensitivity analysis of discrete event systems by the “push out” method. Annals of Operations Research, 39(1), 229-250.
-[^P1996]: Pflug, G. C. (1996). Optimization of stochastic models: the interface between simulation and optimization (Vol. 373). Springer Science & Business Media.
-[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
-[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
-[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.
 ```math
-z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
-z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi \; .
+\mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right)
 ```

-In these cases, denoting the target log denstiy as $\log \pi$, we can effectively estimate the gradient of the ELBO by directly differentiating the stochastic estimate of the ELBO objective
+where $\mathcal{Q}$ is some family of distributions, often called the variational family.
+That is, we assume that each $$q_{\lambda} \in \mathcal{Q}$$ corresponds to a vector of parameters $$\lambda \in \Lambda$$, where the space of parameters is Euclidean such that $$\Lambda \subseteq \mathbb{R}^p$$, and run stochastic gradient descent on $$\lambda$$.
+
+Since we usually only have access to the unnormalized densities of the target distribution $\pi$, we don't have direct access to the KL divergence.
+Instead, the ELBO maximization strategy maximizes a surrogate objective, the *evidence lower bound* (ELBO; [^JGJS1999])

 ```math
-\widehat{\mathrm{ELBO}}\left(\lambda\right) = \frac{1}{M}\sum^M_{m=1} \log \pi\left(T_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),
+\mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),
 ```

-where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
-The resulting gradient estimate is called the reparameterization gradient estimator.
+which equals the negative KL divergence up to an additive constant (the evidence).

-In addition to the reparameterization gradient, `AdvancedVI` provides the following features:
+Algorithmically, `KLMinRepGradDescent` iterates the step

-1. **Posteriors with constrained supports** are handled through [`Bijectors`](https://github.com/TuringLang/Bijectors.jl), which is known as the automatic differentiation VI (ADVI; [^KTRGB2017]) formulation. (See [this section](@ref bijectors).)
-2. **The gradient of the entropy** can be estimated through various strategies depending on the capabilities of the variational family. (See [this section](@ref entropygrad).)
+```math
+\lambda_{t+1} = \mathrm{operator}\big(
+    \lambda_{t} + \gamma_t \widehat{\nabla_{\lambda} \mathrm{ELBO}} (q_{\lambda_t})
+\big) ,
+```

-## `RepGradELBO`
+where $\widehat{\nabla_{\lambda} \mathrm{ELBO}}(q_{\lambda})$ is the reparameterization gradient estimate[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient and $$\mathrm{operator}$$ is an optional operator (*e.g.* a projection or the identity mapping).

-To use the reparameterization gradient, `AdvancedVI` provides the following variational objective:
+The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
+For the variational family $$\mathcal{Q}$$, suppose the process of sampling from $$q_{\lambda} \in \mathcal{Q}$$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that

-```@docs
-RepGradELBO
+```math
+z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
+z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi \; .
 ```

-## [Handling Constraints with `Bijectors`](@id bijectors)
+In these cases, denoting the target log density as $\log \pi$, we can effectively estimate the gradient of the ELBO by directly differentiating the stochastic estimate of the ELBO objective

-As mentioned in the docstring, the `RepGradELBO` objective assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
+```math
+\widehat{\mathrm{ELBO}}\left(q_{\lambda}\right) = \frac{1}{M}\sum^M_{m=1} \log \pi\left(T_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),
+```
+
+where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
+
+[^JGJS1999]: Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183-233.
+[^HC1983]: Ho, Y. C., & Cao, X. (1983). Perturbation analysis and optimization of queueing networks. Journal of Optimization Theory and Applications, 40(4), 559-582.
+[^G1991]: Glasserman, P. (1991). Gradient estimation via perturbation analysis (Vol. 116). Springer Science & Business Media.
+[^R1992]: Rubinstein, R. Y. (1992). Sensitivity analysis of discrete event systems by the “push out” method. Annals of Operations Research, 39(1), 229-250.
+[^P1996]: Pflug, G. C. (1996). Optimization of stochastic models: the interface between simulation and optimization (Vol. 373). Springer Science & Business Media.
+[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
+[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
+[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.
+
+## [Handling Constraints with `Bijectors`](@id bijectors)

+As mentioned in the docstring, `KLMinRepGradDescent` assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
 However, in general, it is most convenient to use variational families that have the whole Euclidean space $$\mathbb{R}^d$$ as their support.
 This is the case for the [location-scale distributions](@ref locscale) provided by `AdvancedVI`.
 For target distributions whose support is not the full $$\mathbb{R}^d$$, we can apply some transformation $$b$$ to $$q_{\lambda}$$ to match its support such that
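The estimator $\widehat{\mathrm{ELBO}}$ and the SGD step described in the renamed page can be made concrete with a small numerical sketch. The following NumPy code is an illustration only (it is not part of this commit or of `AdvancedVI`'s API): the Gaussian target, the sampling path `z = m + s*eps`, the step size, and all names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: unnormalized Gaussian log-density log pi(z)
mu_star, sigma_star = 2.0, 1.5

def logpi_grad(z):
    # d/dz log pi(z) for pi = N(mu_star, sigma_star^2)
    return -(z - mu_star) / sigma_star**2

def elbo_grad_estimate(m, s, M):
    """Reparameterization gradient of the ELBO for q_lambda = N(m, s^2).

    Sampling path: z = T_lambda(eps) = m + s*eps with eps ~ N(0, 1).
    The entropy H(q) = 0.5*log(2*pi*e*s^2) is closed-form, so only the
    energy term is estimated by Monte Carlo."""
    eps = rng.standard_normal(M)
    z = m + s * eps                      # z ~ q_lambda via the reparameterization
    g = logpi_grad(z)                    # pathwise gradient through T_lambda
    grad_m = np.mean(g)                  # dT/dm = 1
    grad_s = np.mean(g * eps) + 1.0 / s  # dT/ds = eps, plus dH/ds = 1/s
    return grad_m, grad_s

# Plain SGD, mirroring lambda_{t+1} = lambda_t + gamma_t * grad estimate
# (the operator is the identity here, except for keeping s positive)
m, s = 0.0, 1.0
for t in range(2000):
    gm, gs = elbo_grad_estimate(m, s, M=64)
    m, s = m + 0.05 * gm, max(s + 0.05 * gs, 1e-3)

print(m, s)  # approaches (mu_star, sigma_star) = (2.0, 1.5)
```

Because the target is Gaussian and lies inside the variational family, the iterates settle near the exact posterior parameters, which is the "perfect variational family specification" setting discussed later in the page.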
@@ -57,9 +83,11 @@ where $$b$$ is often called a *bijector*, since it is often chosen among bijecti
 This idea is known as automatic differentiation VI[^KTRGB2017] and has subsequently been improved by Tensorflow Probability[^DLTBV2017].
 In Julia, [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl)[^FXTYG2020] provides a comprehensive collection of bijections.

-One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
-That is, the objective is now
-
+[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14), 1-45.
+[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
+[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
+One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
+That is, the objective is now
 ```math
 \mathrm{ADVI}\left(\lambda\right)
 \triangleq
@@ -84,13 +112,10 @@ q_transformed = Bijectors.TransformedDistribution(q, binv)
 By passing `q_transformed` to `optimize`, the Jacobian adjustment for the bijector `b` is automatically applied.
 (See the [Basic Example](@ref basic) for a fully working example.)

-[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*.
-[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
-[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
-## [Entropy Estimators](@id entropygrad)
+## [Entropy Gradient Estimators](@id entropygrad)

 For the gradient of the entropy term, we provide three choices with varying requirements.
-The user can select the entropy estimator by passing it as a keyword argument when constructing the `RepGradELBO` objective.
+The user can select the entropy estimator by passing it as a keyword argument when constructing the algorithm object.

 | Estimator | `entropy(q)` | `logpdf(q)` | Type |
 |:--------------------------- |:------------:|:-----------:|:-------------------------------- |

@@ -179,7 +204,7 @@ end

 In this example, the true posterior is contained within the variational family.
 This setting is known as "perfect variational family specification."
-In this case, the `RepGradELBO` estimator with `StickingTheLandingEntropy` is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.
+In this case, `KLMinRepGradDescent` with `StickingTheLandingEntropy` is the only method known to converge exponentially fast ("linear convergence") to the true solution.

 Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:

@@ -281,7 +306,7 @@ Furthermore, in a lot of cases, a low-accuracy solution may be sufficient.
 [^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear convergence of black-box variational inference: Should we stick the landing? In *International Conference on Artificial Intelligence and Statistics* (pp. 235-243). PMLR.
 ## Advanced Usage

-There are two major ways to customize the behavior of `RepGradELBO`
+There are two major ways to customize the behavior of `KLMinRepGradDescent`:

   - Customize the `Distributions` functions: `rand(q)`, `entropy(q)`, `logpdf(q)`.
   - Customize `AdvancedVI.reparam_with_entropy`.
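The Jacobian adjustment described in the bijectors section of this page is easy to sketch numerically. The following NumPy code is an illustration only, not this commit's code or `AdvancedVI`'s implementation: the Exponential(1) target, the `exp` bijector, and the function names are assumptions chosen so that the adjustment `log|b'(z)| = z` has a simple form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative target with constrained support: Exponential(1) on (0, inf)
def logpi(theta):
    return -theta  # unnormalized log-density; only valid for theta > 0

def advi_objective(m, s, M=10_000):
    """Monte Carlo estimate of the ADVI objective for q = N(m, s^2) on R,
    pushed through the bijector b = exp to match the target's support.

    The Jacobian adjustment for b = exp is log|b'(z)| = z."""
    z = m + s * rng.standard_normal(M)  # sample in unconstrained space
    jacobian_adj = z                    # log|d/dz exp(z)| = z
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return np.mean(logpi(np.exp(z)) + jacobian_adj) + entropy

# For this target, the objective is maximized near (m, s) = (-0.5, 1.0);
# a badly placed location parameter scores far lower.
print(advi_objective(-0.5, 1.0) > advi_objective(5.0, 1.0))  # True
```

Without the `jacobian_adj` term, the estimate would target the density of `z` rather than `exp(z)` and the comparison above would be meaningless; this is exactly the adjustment that passing a `TransformedDistribution` to `optimize` applies automatically.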
Lines changed: 61 additions & 0 deletions

@@ -0,0 +1,61 @@
+# [`KLMinRepGradProxDescent`](@id klminrepgradproxdescent)
+
+## Description
+
+This algorithm is a slight variation of [`KLMinRepGradDescent`](@ref klminrepgraddescent) specialized to [location-scale families](@ref locscale).
+Therefore, it also aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence over the space of parameters.
+Instead of plain stochastic gradient descent, however, it uses stochastic proximal gradient descent with the [proximal operator](@ref proximalocationscaleentropy) of the entropy of location-scale variational families, as discussed in [^D2020][^KMG2024][^DGG2023].
+The remainder of this section only discusses details specific to `KLMinRepGradProxDescent`.
+Thus, for general usage and additional details, please refer to the docs of `KLMinRepGradDescent`.
+
+```@docs
+KLMinRepGradProxDescent
+```
+
+## Methodology
+
+Recall that [`KLMinRepGradDescent`](@ref klminrepgraddescent) maximizes the ELBO.
+Now, the ELBO can be re-written as follows:
+
+```math
+\mathrm{ELBO}\left(q\right) \triangleq \mathcal{E}\left(q\right) + \mathbb{H}\left(q\right),
+```
+
+where
+
+```math
+\mathcal{E}\left(q\right) = \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right)
+```
+
+is often referred to as the *negative energy functional*.
+`KLMinRepGradProxDescent` addresses the fact that maximizing the full ELBO by gradient steps can be unstable due to non-smoothness of $$\mathbb{H}\left(q\right)$$[^D2020].
+For this, `KLMinRepGradProxDescent` relies on proximal stochastic gradient descent, where the problematic term $$\mathbb{H}\left(q\right)$$ is handled separately via a *proximal operator*.
+Specifically, `KLMinRepGradProxDescent` first estimates the gradient of only the energy term $$\mathcal{E}\left(q\right)$$ via the reparameterization gradient estimator.
+Let us denote this estimate as $$\widehat{\nabla_{\lambda} \mathcal{E}}\left(q_{\lambda}\right)$$.
+Then `KLMinRepGradProxDescent` iterates the step
+
+```math
+\lambda_{t+1} = \mathrm{prox}_{-\gamma_t \mathbb{H}}\big(
+    \lambda_{t} + \gamma_t \widehat{\nabla_{\lambda} \mathcal{E}}(q_{\lambda_t})
+\big) ,
+```
+
+where
+
+```math
+\mathrm{prox}_{h}(\lambda_t)
+= \argmin_{\lambda \in \Lambda}\left\{
+    h(\lambda) + \frac{1}{2} {\lVert \lambda - \lambda_t \rVert}_2^2
+\right\}
+```
+
+is the proximal operator of a function $$h$$; here $$h = -\gamma_t \mathbb{H}$$ is the negative entropy scaled by the step size.
+As long as $$\mathrm{prox}_{-\gamma_t \mathbb{H}}$$ can be evaluated efficiently, this scheme side-steps the fact that $$\mathbb{H}\left(q_{\lambda}\right)$$ is difficult to handle via gradient descent.
+For location-scale families, it turns out the proximal operator of the entropy can be evaluated efficiently[^D2020], which is implemented as [`ProximalLocationScaleEntropy`](@ref proximalocationscaleentropy).
+This scheme has been empirically shown to be more robust[^D2020][^KMG2024].
+
+[^D2020]: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In *International Conference on Machine Learning*.
+[^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear convergence of black-box variational inference: Should we stick the landing? In *International Conference on Artificial Intelligence and Statistics* (pp. 235-243). PMLR.
+[^DGG2023]: Domke, J., Gower, R., & Garrigos, G. (2023). Provable convergence guarantees for black-box variational inference. In *Advances in Neural Information Processing Systems*, 36, 66289-66327.