Skip to content

Reducing Impact of deepcopies in callbacks/batched CPU interface #76

@THargreaves

Description

@THargreaves

I have been profiling some of our unit tests to see if we can reduce the runtime of our CI pipeline (and learn more about the bottlenecks of our implementation.

I was surprised to find that for the CSMC test, a surprising amount of time is taken deepcopying the states over to the callback container.

Image

This currently needs to be done because we make no promises that the state you save in the callback container will not later be modified by predict/update.

This is probably something we should address since this will make smoothing/PG much slower than it needs to be.

There are two options:

  1. We start to provide some guarantees that we won't modify states (or have a safe non in-place, plus an in-place version if we want)
  2. We provide a batched CPU interface

To elaborate on the latter point, the reason why the deepcopies are so slow is because we're copying a load of tiny vectors/matrices. This has two problems:

  1. If the vectors are small (less than 16 Float64s), they will be padded in memory so are not adjacent, making reading slower
  2. Many read operations are needed rather than one.

If instead, these were stored compactly in a matrix, copies are 100–1000x faster

julia> x = [rand(2) for _ in 1:10000];

julia> y = rand(2, 10000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 2095 samples with 1 evaluation per sample.
 Range (min  max):  1.910 ms    8.215 ms  ┊ GC (min  max): 0.00%  67.91%
 Time  (median):     2.185 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.385 ms ± 807.511 μs  ┊ GC (mean ± σ):  7.41% ± 13.06%

  ▂▇██▇▇▆▅▃▂                                                  ▁
  ██████████▇▆▄▅▅▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▆▇▆█▇▇█▇▆█▇ █
  1.91 ms      Histogram: log(frequency) by time      5.99 ms <

 Memory estimate: 3.38 MiB, allocs estimate: 69001.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min  max):  6.404 μs  255.764 μs  ┊ GC (min  max):  0.00%  96.48%
 Time  (median):     7.296 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   9.726 μs ±  14.747 μs  ┊ GC (mean ± σ):  21.57% ± 13.47%

  █▅                                                          ▁
  ██▇▄▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▄▁▁▁▁▃▁▃▃▁▃▄▃▃▆▆▃▁▁▁▁▁▁▁▁▁▁▁▃▁▃▄▅▅▆▇▇▇▇ █
  6.4 μs       Histogram: log(frequency) by time      96.8 μs <

 Memory estimate: 156.67 KiB, allocs estimate: 6.

julia> x = [rand(32) for _ in 1:1000];

julia> y = rand(32, 1000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  139.110 μs   10.880 ms  ┊ GC (min  max):  0.00%  97.60%
 Time  (median):     156.897 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   207.138 μs ± 374.546 μs  ┊ GC (mean ± σ):  18.86% ± 10.30%

  █                                                              
  █▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
  139 μs           Histogram: frequency by time         2.97 ms <

 Memory estimate: 517.55 KiB, allocs estimate: 5997.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min  max):   9.720 μs  430.233 μs  ┊ GC (min  max):  0.00%  96.79%
 Time  (median):     11.256 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   15.797 μs ±  28.136 μs  ┊ GC (mean ± σ):  21.13% ± 11.88%

  █▃                                                           ▁
  ██▄▃▁▄▃▃▃▅▁▁▁▁▁▄▄▁▁▄▄▁▄▃▁▆▆▇▇▆▄▁▃▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▄▅▆▇▇ █
  9.72 μs       Histogram: log(frequency) by time       188 μs <

 Memory estimate: 250.42 KiB, allocs estimate: 6.

This is similar to the GPU interface so it will be not be much work to create similar wrappers to describe how to translate a regular KF into this batched form (I can provide more details at a later point).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions