I have been profiling some of our unit tests to see if we can reduce the runtime of our CI pipeline (and learn more about the bottlenecks of our implementation). I was surprised to find that for the CSMC test, a large fraction of the time is spent deepcopying the states over to the callback container.
This currently needs to be done because we make no promises that the state you save in the callback container will not later be modified by `predict`/`update`.
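As a minimal sketch of the pattern (the `StateHistoryCallback` name and fields below are hypothetical, not our actual API): since later `predict`/`update` calls are free to mutate `state` in place, the callback cannot just store a reference and has to take a full defensive copy at every step.

```julia
# Hypothetical callback container; the real type/fields may differ.
struct StateHistoryCallback{T}
    states::Vector{T}
end

function (cb::StateHistoryCallback)(state, step)
    # Defensive deepcopy: `state` may be mutated in place by later
    # predict/update calls, so storing a reference is not safe.
    push!(cb.states, deepcopy(state))
    return nothing
end
```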
This is probably something we should address, since it will make smoothing/PG much slower than it needs to be.
There are two options:
- We start to provide some guarantees that we won't modify states (or offer a safe non-in-place version alongside an in-place one, if we want both; see the sketch after this list)
- We provide a batched CPU interface
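A minimal sketch of what the first option could look like (the `predict`/`predict!` pair below is only an assumption about naming, following the usual Julia `!` convention, and `KFState` is a stand-in for whatever state type the filter actually uses): the non-mutating method copies once internally, so callbacks could safely keep references to returned states.

```julia
# Hypothetical state type; fields stand in for the real filter's state.
struct KFState
    μ::Vector{Float64}
    Σ::Matrix{Float64}
end

# In-place version: fast, but invalidates previously saved references.
function predict!(state::KFState, A::Matrix{Float64}, Q::Matrix{Float64})
    state.μ .= A * state.μ
    state.Σ .= A * state.Σ * A' .+ Q
    return state
end

# Safe version: allocates a fresh state, so callers can keep the old one.
predict(state::KFState, A, Q) = predict!(KFState(copy(state.μ), copy(state.Σ)), A, Q)
```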
To elaborate on the latter point, the deepcopies are so slow because we're copying a load of tiny vectors/matrices. This causes two problems:
- If the vectors are small (fewer than 16 `Float64`s), they will be padded in memory and so are not adjacent, making reads slower
- Many read operations are needed rather than one.
If these were instead stored compactly in a single matrix, copies would be 100–1000x faster:
```julia
julia> x = [rand(2) for _ in 1:10000];

julia> y = rand(2, 10000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 2095 samples with 1 evaluation per sample.
 Range (min … max):  1.910 ms …   8.215 ms  ┊ GC (min … max): 0.00% … 67.91%
 Time  (median):     2.185 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.385 ms ± 807.511 μs  ┊ GC (mean ± σ):  7.41% ± 13.06%

   ▂▇██▇▇▆▅▃▂                                                ▁
  ██████████▇▆▄▅▅▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▆▇▆█▇▇█▇▆█▇ █
  1.91 ms      Histogram: log(frequency) by time      5.99 ms <

 Memory estimate: 3.38 MiB, allocs estimate: 69001.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min … max):  6.404 μs … 255.764 μs  ┊ GC (min … max):  0.00% … 96.48%
 Time  (median):     7.296 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   9.726 μs ±  14.747 μs  ┊ GC (mean ± σ):  21.57% ± 13.47%

  █▅                                                          ▁
  ██▇▄▁▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▄▁▁▁▁▃▁▃▃▁▃▄▃▃▆▆▃▁▁▁▁▁▁▁▁▁▁▁▃▁▃▄▅▅▆▇▇▇▇ █
  6.4 μs       Histogram: log(frequency) by time       96.8 μs <

 Memory estimate: 156.67 KiB, allocs estimate: 6.

julia> x = [rand(32) for _ in 1:1000];

julia> y = rand(32, 1000);

julia> @benchmark deepcopy($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  139.110 μs …  10.880 ms  ┊ GC (min … max):  0.00% … 97.60%
 Time  (median):     156.897 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   207.138 μs ± 374.546 μs  ┊ GC (mean ± σ):  18.86% ± 10.30%

  █
  █▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
  139 μs        Histogram: frequency by time         2.97 ms <

 Memory estimate: 517.55 KiB, allocs estimate: 5997.

julia> @benchmark deepcopy($y)
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):   9.720 μs … 430.233 μs  ┊ GC (min … max):  0.00% … 96.79%
 Time  (median):     11.256 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   15.797 μs ±  28.136 μs  ┊ GC (mean ± σ):  21.13% ± 11.88%

  █▃                                                           ▁
  ██▄▃▁▄▃▃▃▅▁▁▁▁▁▄▄▁▁▄▄▁▄▃▁▆▆▇▇▆▄▁▃▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▄▅▆▇▇ █
  9.72 μs      Histogram: log(frequency) by time       188 μs <

 Memory estimate: 250.42 KiB, allocs estimate: 6.
```
This is similar to the GPU interface, so it will not be much work to create similar wrappers describing how to translate a regular KF into this batched form (I can provide more details at a later point).
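As a rough illustration of what the batched CPU interface could look like (all names here, e.g. `BatchedGaussians` and `batched_predict!`, are hypothetical, and only the means are shown for brevity): particle states live as columns of one dense matrix, so the predict step becomes a single matrix multiply and saving states to a callback is one contiguous copy.

```julia
# Hypothetical batched container: particle means as columns of one matrix.
struct BatchedGaussians
    μs::Matrix{Float64}   # D × N: one column per particle
end

# Batched predict: one GEMM instead of N small matrix-vector products.
function batched_predict!(states::BatchedGaussians, A::Matrix{Float64})
    states.μs .= A * states.μs
    return states
end

# Saving to a callback is now a single contiguous copy.
save(states::BatchedGaussians) = copy(states.μs)
```

Per-particle access can still be exposed cheaply via `view(states.μs, :, i)` without any extra allocation of small vectors.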