
Commit d12bd01

Update the default parameter value for DoG and DoWG. (#199)
* update default of optimizers; make them more invariant to dimension
* update history
* fix: make notation more consistent in docs for parameter-free rules
* run formatter

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent c5c2090

2 files changed: 25 additions & 13 deletions


HISTORY.md

Lines changed: 5 additions & 0 deletions

@@ -1,5 +1,10 @@
 # Release 0.5
 
+## Default Configuration Changes
+
+The default parameters for the parameter-free optimizers `DoG` and `DoWG` have been changed.
+Now, the choice of parameter should be more invariant to dimension, so that convergence is faster than before on high-dimensional problems.
+
 ## Interface Changes
 
 An additional layer of indirection, `AbstractAlgorithms` has been added.
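To illustrate the intent of this change: under the new defaults (see the rules.jl diff below), the initial distance estimate `repsilon` is computed as `alpha*(1 + norm(lambda0))` rather than being a fixed `1e-6`, so it scales with the norm of the initialization. A minimal Julia sketch of the difference; the dimensions and the `randn` initializations here are made up purely for illustration:

using LinearAlgebra  # for norm

alpha = 1e-6                    # new default value of the `alpha` parameter
lambda0_small = randn(10)       # hypothetical low-dimensional initialization
lambda0_large = randn(100_000)  # hypothetical high-dimensional initialization

# Old default: the initial distance guess was a fixed 1e-6, regardless of dimension.
# New default: the guess scales with the norm of the initialization, which grows
# roughly like sqrt(d) for standardized initializations, so the warm-up phase does
# not become disproportionately long on high-dimensional problems.
repsilon_small = alpha * (1 + norm(lambda0_small))
repsilon_large = alpha * (1 + norm(lambda0_large))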

src/optimization/rules.jl

Lines changed: 20 additions & 13 deletions

@@ -1,19 +1,24 @@
 
 """
-    DoWG(repsilon)
+    DoWG(alpha)
 
 Distance over weighted gradient (DoWG[^KMJ2024]) optimizer.
-It's only parameter is the initial guess of the Euclidean distance to the optimum repsilon.
+Its only parameter is the guess for the distance between the optimum and the initialization `alpha`, which shouldn't need much tuning.
+
+DoWG is a minor modification of DoG such that its step sizes are provably always larger than those of DoG.
+Similarly to DoG, it works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible.
+If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long.
+Depending on the problem, DoWG can be too aggressive and result in unstable behavior.
+If this is suspected, try using DoG instead.
 
 # Parameters
-- `repsilon`: Initial guess of the Euclidean distance between the initial point and
-  the optimum. (default value: `1e-6`)
+- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
 """
 Optimisers.@def struct DoWG <: Optimisers.AbstractRule
-    repsilon = 1e-6
+    alpha = 1e-6
 end
 
-Optimisers.init(o::DoWG, x::AbstractArray{T}) where {T} = (copy(x), zero(T), T(o.repsilon))
+Optimisers.init(o::DoWG, x::AbstractArray{T}) where {T} = (copy(x), zero(T), T(o.alpha)*(1 + norm(x)))
 
 function Optimisers.apply!(::DoWG, state, x::AbstractArray{T}, dx) where {T}
     x0, v, r = state
@@ -27,20 +32,22 @@ function Optimisers.apply!(::DoWG, state, x::AbstractArray{T}, dx) where {T}
 end
 
 """
-    DoG(repsilon)
+    DoG(alpha)
 
 Distance over gradient (DoG[^IHC2023]) optimizer.
-It's only parameter is the initial guess of the Euclidean distance to the optimum repsilon.
-The original paper recommends \$ 10^{-4} ( 1 + \\lVert \\lambda_0 \\rVert ) \$, but the default value is \$ 10^{-6} \$.
+Its only parameter is the guess for the distance between the optimum and the initialization `alpha`, which shouldn't need much tuning.
+
+DoG works by starting from an AdaGrad-like update rule with a small step size, but then automatically increases the step size ("warming up") to be as large as possible.
+If `alpha` is too large, the optimizer can initially diverge, while if it is too small, the warm-up period can be too long.
 
 # Parameters
-- `repsilon`: Initial guess of the Euclidean distance between the initial point and the optimum. (default value: `1e-6`)
+- `alpha`: Scaling factor for the initial guess (`repsilon` in the original paper) of the Euclidean distance between the initial point and the optimum. For the initial parameter `lambda0`, `repsilon` is calculated as `repsilon = alpha*(1 + norm(lambda0))`. (default value: `1e-6`)
 """
 Optimisers.@def struct DoG <: Optimisers.AbstractRule
-    repsilon = 1e-6
+    alpha = 1e-6
 end
 
-Optimisers.init(o::DoG, x::AbstractArray{T}) where {T} = (copy(x), zero(T), T(o.repsilon))
+Optimisers.init(o::DoG, x::AbstractArray{T}) where {T} = (copy(x), zero(T), T(o.alpha)*(1 + norm(x)))
 
 function Optimisers.apply!(::DoG, state, x::AbstractArray{T}, dx) where {T}
     x0, v, r = state
@@ -57,7 +64,7 @@ end
 
 Continuous Coin Betting (COCOB[^OT2017]) optimizer.
 We use the "COCOB-Backprop" variant, which is closer to the Adam optimizer.
-Its only parameter is the maximum change per parameter α, which shouldn't need much tuning.
+Its only parameter is the maximum change per parameter `alpha`, which shouldn't need much tuning.
 
 # Parameters
 - `alpha`: Scaling parameter. (default value: `100`)
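Since both rules subtype `Optimisers.AbstractRule`, they can be driven with the standard Optimisers.jl `setup`/`update` loop. A rough usage sketch under that assumption; the toy quadratic objective, its gradient `grad`, the helper `minimize`, and the step count are invented for illustration, and the surrounding package may wrap this loop differently:

using LinearAlgebra
import Optimisers

# Toy objective f(x) = sum(abs2, x) / 2, whose gradient is simply x.
grad(x) = x

# Minimal driver; assumes DoWG/DoG (defined in rules.jl above) are in scope.
function minimize(rule, x0; nsteps = 100)
    x = copy(x0)
    # Under the new defaults, the rule's state is initialized with
    # r = alpha * (1 + norm(x0)) instead of a fixed repsilon.
    state = Optimisers.setup(rule, x)
    for _ in 1:nsteps
        state, x = Optimisers.update(state, x, grad(x))
    end
    return x
end

x0 = randn(1_000)
x_dowg = minimize(DoWG(), x0)  # alpha defaults to 1e-6
x_dog  = minimize(DoG(), x0)   # fall back to DoG if DoWG looks unstable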
