I have been profiling some of our unit tests to see if we can reduce the runtime of our CI pipeline (and learn more about the bottlenecks of our implementation). I was surprised to find that for the CSMC test, a large fraction of the time is spent deepcopying the states over to the callback container.
This currently needs to be done because we make no promises that the state you save in the callback container will not later be modified by `predict`/`update`.
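As a minimal sketch of the pattern (the `StateHistoryCallback` name and fields below are hypothetical, not our actual API): since later `predict`/`update` calls are free to mutate `state` in place, the callback cannot just store a reference and has to take a full defensive copy at every step.

```julia
# Hypothetical callback container; the real type/fields may differ.
struct StateHistoryCallback{T}
    states::Vector{T}
end

function (cb::StateHistoryCallback)(state, step)
    # Defensive deepcopy: `state` may be mutated in place by later
    # predict/update calls, so storing a reference is not safe.
    push!(cb.states, deepcopy(state))
    return nothing
end
```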
This is probably something we should address, since it will make smoothing/PG much slower than it needs to be.
There are two options:
- We start to provide some guarantees that we won't modify states (or offer a safe non-in-place version alongside an in-place one, if we want both; see the sketch after this list)
- We provide a batched CPU interface
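A minimal sketch of what the first option could look like (the `predict`/`predict!` pair below is only an assumption about naming, following the usual Julia `!` convention, and `KFState` is a stand-in for whatever state type the filter actually uses): the non-mutating method copies once internally, so callbacks could safely keep references to returned states.

```julia
# Hypothetical state type; fields stand in for the real filter's state.
struct KFState
    μ::Vector{Float64}
    Σ::Matrix{Float64}
end

# In-place version: fast, but invalidates previously saved references.
function predict!(state::KFState, A::Matrix{Float64}, Q::Matrix{Float64})
    state.μ .= A * state.μ
    state.Σ .= A * state.Σ * A' .+ Q
    return state
end

# Safe version: allocates a fresh state, so callers can keep the old one.
predict(state::KFState, A, Q) = predict!(KFState(copy(state.μ), copy(state.Σ)), A, Q)
```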
To elaborate on the latter point, the deepcopies are so slow because we're copying a load of tiny vectors/matrices. This causes two problems:
- If the vectors are small (fewer than 16 `Float64`s), they will be padded in memory and so are not adjacent, making reads slower
- Many read operations are needed rather than one.
If these were instead stored compactly in a single matrix, copies would be 100–1000x faster:
```julia
julia> x = [rand(2) for _ in 1:10000];

julia> y = rand(2, 10000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 2095 samples with 1 evaluation per sample.
 Range (min … max):  1.910 ms …   8.215 ms  ┊ GC (min … max): 0.00% … 67.91%
 Time  (median):     2.185 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.385 ms ± 807.511 μs  ┊ GC (mean ± σ):  7.41% ± 13.06%

   ▂▇██▇▇▆▅▃▂                                                ▁
  ██████████▇▆▄▅▅▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▆▇▆█▇▇█▇▆█▇ █
  1.91 ms      Histogram: log(frequency) by time      5.99 ms <

 Memory estimate: 3.38 MiB, allocs estimate: 69001.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min … max):  6.404 μs … 255.764 μs  ┊ GC (min … max):  0.00% … 96.48%
 Time  (median):     7.296 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   9.726 μs ±  14.747 μs  ┊ GC (mean ± σ):  21.57% ± 13.47%

  █▅                                                          ▁
  ██▇▄▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▄▁▁▁▁▃▁▃▃▁▃▄▃▃▆▆▃▁▁▁▁▁▁▁▁▁▁▁▃▁▃▄▅▅▆▇▇▇▇ █
  6.4 μs       Histogram: log(frequency) by time       96.8 μs <

 Memory estimate: 156.67 KiB, allocs estimate: 6.

julia> x = [rand(32) for _ in 1:1000];

julia> y = rand(32, 1000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  139.110 μs …  10.880 ms  ┊ GC (min … max):  0.00% … 97.60%
 Time  (median):     156.897 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   207.138 μs ± 374.546 μs  ┊ GC (mean ± σ):  18.86% ± 10.30%

  █
  █▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
  139 μs        Histogram: frequency by time         2.97 ms <

 Memory estimate: 517.55 KiB, allocs estimate: 5997.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):   9.720 μs … 430.233 μs  ┊ GC (min … max):  0.00% … 96.79%
 Time  (median):     11.256 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   15.797 μs ±  28.136 μs  ┊ GC (mean ± σ):  21.13% ± 11.88%

  █▃                                                           ▁
  ██▄▃▁▄▃▃▃▅▁▁▁▁▁▄▄▁▁▄▄▁▄▃▁▆▆▇▇▆▄▁▃▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▄▅▆▇▇ █
  9.72 μs      Histogram: log(frequency) by time       188 μs <

 Memory estimate: 250.42 KiB, allocs estimate: 6.
```
This is similar to the GPU interface, so it will not be much work to create similar wrappers describing how to translate a regular KF into this batched form (I can provide more details at a later point).
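As a rough illustration of what the batched CPU interface could look like (all names here, e.g. `BatchedGaussians` and `batched_predict!`, are hypothetical, and only the means are shown for brevity): particle states live as columns of one dense matrix, so the predict step becomes a single matrix multiply and saving states to a callback is one contiguous copy.

```julia
# Hypothetical batched container: particle means as columns of one matrix.
struct BatchedGaussians
    μs::Matrix{Float64}   # D × N: one column per particle
end

# Batched predict: one GEMM instead of N small matrix-vector products.
function batched_predict!(states::BatchedGaussians, A::Matrix{Float64})
    states.μs .= A * states.μs
    return states
end

# Saving to a callback is now a single contiguous copy.
save(states::BatchedGaussians) = copy(states.μs)
```

Per-particle access can still be exposed cheaply via `view(states.μs, :, i)` without any extra allocation of small vectors.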