* remove the type `ParamSpaceSGD`
* run formatter
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* run formatter
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* run formatter
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix rename file paramspacesgd.jl to interface.jl
* throw invalid state for unknown paramspacesgd type
* add docstring for union type of paramspacesgd algorithms
* fix remove custom state types for paramspacesgd algorithms
* fix remove custom state types for paramspacesgd
* fix file path
* fix bug in BijectorsExt
* fix include `SubSampleObjective` as part of `ParamSpaceSGD`
* fix formatting
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix revert adding SubsampledObjective into ParamSpaceSGD
* refactor flatten algorithms
* fix error update paths in main file
* refactor flatten the tests to reflect new structure
* fix file include path in tests
* fix missing operator in subsampledobj tests
* fix formatting
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* update docs
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Algorithms such as [`KLMinRepGradDescent`](@ref klminrepgraddescent) assume that the members of the variational family have a differentiable sampling path.
We provide multiple pre-packaged variational families that can be readily used.

# [`KLMinRepGradDescent`](@id klminrepgraddescent)

This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
Specifically, it uses the *reparameterization gradient estimator*.
As a result, this algorithm is best applicable when the target log-density is differentiable and the sampling process of the variational family is differentiable.
(See the [methodology section](@ref klminrepgraddescent_method) for more details.)
This algorithm is also commonly referred to as automatic differentiation variational inference, black-box variational inference with the reparameterization gradient, and stochastic gradient variational inference.
`KLMinRepGradDescent` is also available under the alias `ADVI`.

```@docs
KLMinRepGradDescent
```

## [Methodology](@id klminrepgraddescent_method)

This algorithm aims to solve the problem

```math
\mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right),
```

where $\mathcal{Q}$ is some family of distributions, often called the variational family, by running stochastic gradient descent in the (Euclidean) space of parameters.
That is, for all $$q_{\lambda} \in \mathcal{Q}$$, we assume there is a corresponding vector of parameters $$\lambda \in \Lambda$$, where the space of parameters is Euclidean such that $$\Lambda \subset \mathbb{R}^p$$.

Since we usually only have access to the unnormalized densities of the target distribution $\pi$, we don't have direct access to the KL divergence.
Instead, the ELBO maximization strategy maximizes a surrogate objective, the *evidence lower bound* (ELBO; [^JGJS1999])

```math
\mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{z \sim q} \log \pi\left(z\right) + \mathbb{H}\left(q\right),
```

which is equivalent to the KL up to an additive constant (the evidence).
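This equivalence follows from a one-line expansion: writing the normalized target as $$\pi / Z$$, where $$\pi$$ denotes the unnormalized density and $$Z$$ the evidence, and using $$\mathrm{ELBO}\left(q\right) = \mathbb{E}_{z \sim q} \log \pi\left(z\right) + \mathbb{H}\left(q\right)$$, we have

```math
\mathrm{KL}\left(q, \pi/Z\right)
= \mathbb{E}_{z \sim q} \left[ \log q\left(z\right) - \log \pi\left(z\right) \right] + \log Z
= - \mathbb{H}\left(q\right) - \mathbb{E}_{z \sim q} \log \pi\left(z\right) + \log Z
= - \mathrm{ELBO}\left(q\right) + \log Z \; .
```

Since $$\log Z$$ does not depend on $$q$$, maximizing the ELBO minimizes the KL divergence.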

Algorithmically, `KLMinRepGradDescent` iterates the step

```math
\lambda_{t+1} = \mathrm{operator}\left(\lambda_t + \gamma_t \widehat{\nabla \mathrm{ELBO}}\left(q_{\lambda_t}\right)\right),
```

where $\widehat{\nabla \mathrm{ELBO}}(q_{\lambda})$ is the reparameterization gradient estimate[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient and $$\mathrm{operator}$$ is an optional operator (*e.g.* projections, identity mapping).
The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
For the variational family $$\mathcal{Q}$$, suppose the process of sampling from $$q_{\lambda} \in \mathcal{Q}$$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that

```math
z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi \; .
```

In these cases, denoting the target log density as $\log \pi$, we can effectively estimate the gradient of the ELBO by directly differentiating the stochastic estimate of the ELBO objective

```math
\widehat{\mathrm{ELBO}}\left(q_{\lambda}\right) = \frac{1}{M} \sum^{M}_{m=1} \log \pi\left(T_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),
```

where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
The resulting gradient estimate is called the reparameterization gradient estimator.

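As a library-agnostic illustration of this estimator and the resulting gradient-ascent loop (this is a NumPy sketch, not AdvancedVI's Julia implementation; all names are illustrative), consider a mean-field Gaussian family $$q_{\lambda} = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$$ with $$\sigma = \exp(\rho)$$ and a standard normal target, for which the reparameterization gradient is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

d, M = 2, 16           # dimension and Monte Carlo samples per step
gamma = 0.05           # fixed step size
mu = np.full(d, 3.0)   # variational location (deliberately far from the optimum)
rho = np.full(d, 1.0)  # unconstrained log-scale, so sigma = exp(rho) > 0

def grad_elbo_hat(mu, rho):
    """Reparameterization gradient of the ELBO estimate for q = N(mu, diag(sigma^2))."""
    sigma = np.exp(rho)
    eps = rng.standard_normal((M, d))  # eps ~ base distribution N(0, I)
    z = mu + sigma * eps               # z = T_lambda(eps)
    score = -z                         # gradient of log pi(z) for the N(0, I) target
    g_mu = score.mean(axis=0)                         # chain rule: dz/dmu = I
    g_rho = (score * sigma * eps).mean(axis=0) + 1.0  # dz/drho = sigma*eps; dH/drho = 1
    return g_mu, g_rho

for t in range(2000):
    g_mu, g_rho = grad_elbo_hat(mu, rho)
    # the "operator" is the identity mapping here
    mu, rho = mu + gamma * g_mu, rho + gamma * g_rho

# The optimum for this target is mu = 0, sigma = 1
assert np.all(np.abs(mu) < 0.2) and np.all(np.abs(np.exp(rho) - 1.0) < 0.25)
```

The entropy gradient contributes the `+ 1.0` term, since for this family $$\mathbb{H}(q_{\lambda}) = \sum_i \rho_i$$ plus a constant.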
[^JGJS1999]: Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
[^HC1983]: Ho, Y. C., & Cao, X. (1983). Perturbation analysis and optimization of queueing networks. *Journal of Optimization Theory and Applications*, 40(4), 559-582.
[^G1991]: Glasserman, P. (1991). Gradient estimation via perturbation analysis (Vol. 116). Springer Science & Business Media.
[^R1992]: Rubinstein, R. Y. (1992). Sensitivity analysis of discrete event systems by the “push out” method. *Annals of Operations Research*, 39(1), 229-250.
[^P1996]: Pflug, G. C. (1996). Optimization of stochastic models: The interface between simulation and optimization (Vol. 373). Springer Science & Business Media.
[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.

## [Handling Constraints with `Bijectors`](@id bijectors)
As mentioned in the docstring, `KLMinRepGradDescent` assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
However, in general, it is most convenient to use variational families that have the whole Euclidean space $$\mathbb{R}^d$$ as their support.
This is the case for the [location-scale distributions](@ref locscale) provided by `AdvancedVI`.
For target distributions whose support is not the full $$\mathbb{R}^d$$, we can apply some transformation $$b$$ to $$q_{\lambda}$$ to match its support such that

```math
z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} b^{-1}\left(\eta\right);\quad \eta \sim q_{\lambda} \; ,
```

where $$b$$ is often called a *bijector*, since it is often chosen among bijections.
This idea is known as automatic differentiation VI[^KTRGB2017] and has subsequently been improved by Tensorflow Probability[^DLTBV2017].
In Julia, [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl)[^FXTYG2020] provides a comprehensive collection of bijections.
[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14), 1-45.
[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.

One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
By passing `q_transformed` to `optimize`, the Jacobian adjustment for the bijector `b` is automatically applied.
(See the [Basic Example](@ref basic) for a fully working example.)
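The Jacobian adjustment itself is just the change-of-variables formula. As a library-agnostic sketch (NumPy/SciPy, not AdvancedVI's or Bijectors.jl's implementation), take a Gaussian $$q$$ on $$\mathbb{R}$$ and the bijection $$b = \log$$ for a positive-support target, so that $$z = \exp(\eta)$$ with $$\eta \sim q$$; the adjusted density then matches the log-normal density exactly:

```python
import numpy as np
from scipy import stats

mu, sigma = 0.3, 0.8
q = stats.norm(mu, sigma)  # unconstrained variational approximation on R

# The bijection b = log maps the positive reals to R; its inverse exp pushes
# samples of q onto the constrained support: z = exp(eta), eta ~ q.
z = np.linspace(0.1, 5.0, 50)

# Jacobian-adjusted log density: log q_b(z) = log q(b(z)) + log |db/dz| = log q(log z) - log z
log_qb = q.logpdf(np.log(z)) - np.log(z)

# This is exactly the log-normal density with the same parameters
ref = stats.lognorm(s=sigma, scale=np.exp(mu)).logpdf(z)
assert np.allclose(log_qb, ref)
```

Without the `- np.log(z)` correction, the density would be wrong wherever the bijection is nonlinear, which is why the adjustment must enter the objective.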
## [Entropy Gradient Estimators](@id entropygrad)
For the gradient of the entropy term, we provide three choices with varying requirements.
The user can select the entropy estimator by passing it as a keyword argument when constructing the algorithm object.
In this example, the true posterior is contained within the variational family.
This setting is known as "perfect variational family specification."
In this case, `KLMinRepGradDescent` with `StickingTheLandingEntropy` is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.
Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:

Furthermore, in a lot of cases, a low-accuracy solution may be sufficient.

[^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?. In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
## Advanced Usage
There are two major ways to customize the behavior of `KLMinRepGradDescent`:

- Customize the `Distributions` functions: `rand(q)`, `entropy(q)`, `logpdf(q)`.
# `KLMinRepGradProxDescent`

This algorithm is a slight variation of [`KLMinRepGradDescent`](@ref klminrepgraddescent) specialized to [location-scale families](@ref locscale).
Therefore, it also aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence over the space of parameters.
However, instead of plain stochastic gradient descent, it uses stochastic proximal gradient descent with the [proximal operator](@ref proximalocationscaleentropy) of the entropy of location-scale variational families, as discussed in [^D2020][^KMG2024][^DGG2023].
The remainder of the section will only discuss details specific to `KLMinRepGradProxDescent`.
Thus, for general usage and additional details, please refer to the docs of `KLMinRepGradDescent` instead.

```@docs
KLMinRepGradProxDescent
```

It implements the stochastic proximal gradient descent-based algorithm described in the references above.
## Methodology
Recall that [`KLMinRepGradDescent`](@ref klminrepgraddescent) maximizes the ELBO

```math
\mathrm{ELBO}\left(q\right) = \mathbb{E}_{z \sim q} \log \pi\left(z\right) + \mathbb{H}\left(q\right),
```

where the first term, $$-\mathcal{E}\left(q\right) = \mathbb{E}_{z \sim q} \log \pi\left(z\right)$$,
is often referred to as the *negative energy functional*.
`KLMinRepGradProxDescent` attempts to address the fact that optimizing the whole ELBO by stochastic gradient descent can be unstable due to the non-smoothness of $$\mathbb{H}\left(q\right)$$[^D2020].
For this, `KLMinRepGradProxDescent` relies on proximal stochastic gradient descent, where the problematic term $$\mathbb{H}\left(q\right)$$ is separately handled via a *proximal operator*.
Specifically, `KLMinRepGradProxDescent` first estimates the gradient of only the energy term $$\mathcal{E}\left(q\right)$$ via the reparameterization gradient estimator.
Let us denote this as $$\widehat{\nabla_{\lambda} \mathcal{E}}\left(q_{\lambda}\right)$$.
It then iterates the proximal gradient step

```math
\lambda_{t+1} = \mathrm{prox}_{-\gamma_t \mathbb{H}}\left(\lambda_t - \gamma_t \widehat{\nabla_{\lambda} \mathcal{E}}\left(q_{\lambda_t}\right)\right),
\qquad
\mathrm{prox}_{-\gamma_t \mathbb{H}}\left(\lambda\right) = \operatorname*{argmin}_{\lambda'}\left\{ -\gamma_t \mathbb{H}\left(\lambda'\right) + \frac{1}{2} {\lVert \lambda' - \lambda \rVert}^2 \right\} .
```

As long as $$\mathrm{prox}_{-\gamma_t \mathbb{H}}$$ can be evaluated efficiently, this scheme can side-step the fact that $$\mathbb{H}(\lambda)$$ is difficult to deal with via gradient descent.
For location-scale families, it turns out that the proximal operator of the entropy can be evaluated efficiently[^D2020], which is implemented as [`ProximalLocationScaleEntropy`](@ref proximalocationscaleentropy).
This has been empirically shown to be more robust[^D2020][^KMG2024].
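To make the proximal operator concrete, here is a small NumPy sketch of its scale component, under the assumption that the entropy of a location-scale family with diagonal scale $$\sigma$$ is $$\sum_i \log \sigma_i$$ plus a constant (as for the Gaussian); the actual `ProximalLocationScaleEntropy` implementation may differ in parameterization:

```python
import numpy as np

def prox_entropy_scale(sigma, gamma):
    """Closed-form argmin over s of { -gamma*log(s) + 0.5*(s - sigma)**2 }.

    Setting the derivative -gamma/s + (s - sigma) to zero gives the
    positive root of s**2 - sigma*s - gamma = 0.
    """
    return (sigma + np.sqrt(sigma**2 + 4.0 * gamma)) / 2.0

# Numerical sanity check against a dense grid search
sigma, gamma = 0.7, 0.1
s_grid = np.linspace(1e-3, 5.0, 200001)
objective = -gamma * np.log(s_grid) + 0.5 * (s_grid - sigma) ** 2
s_star = prox_entropy_scale(sigma, gamma)
assert abs(s_star - s_grid[np.argmin(objective)]) < 1e-3
```

Note that the operator always returns a strictly positive scale, which is one way to see how it stabilizes the scale updates compared with a plain gradient step.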
[^D2020]: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In *International Conference on Machine Learning*.
[^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?. In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
[^DGG2023]: Domke, J., Gower, R., & Garrigos, G. (2023). Provable convergence guarantees for black-box variational inference. In *Advances in Neural Information Processing Systems*, 36, 66289-66327.