Conversation
cbourjau
left a comment
There was a problem hiding this comment.
Thanks for boiling this PR down a bit. I'd suggest boiling it down still a bit further, though, to sort out the following questions without getting lost in the weeds:
- How do we cleanly install the Rust part without resorting to copying
.sofiles around - Can we efficiently leverage ndarray here?
- How do we set up the benchmarks in a simple and efficient manner (the current "backend"/"dispatch" made it very hard for me to figure out what is going on)
- Can we identify some common operations (e.g., some iterator version of
numpy.take) which we could reuse to keep the code DRY?
| @@ -0,0 +1,403 @@ | |||
| """Backend selection for categorical and sparse matrix operations. | |||
There was a problem hiding this comment.
Aren't the modules ducktype-compatible? This seems like quite some overkill on first sight, but I'm not sure if I'm missing context. The same comment applies to the dispatch files.
| let mut result = vec![0.0; out_size]; | ||
| let offset = if drop_first { 1 } else { 0 }; | ||
|
|
||
| for (i, &idx) in indices.iter().enumerate() { | ||
| let col_idx = idx - offset; | ||
| if col_idx >= 0 { | ||
| result[col_idx as usize] += other[i]; | ||
| } | ||
| } |
There was a problem hiding this comment.
Isn't this identical to the implementation below? More abstractly, this is analogous to numpy.take with an additional drop_first, isn't it?
| indices: &[i32], | ||
| other: &[f64], |
There was a problem hiding this comment.
Do you need this function to be generic over the dtype of indices and other?
| /// | ||
| /// For categorical matrices, the result is always diagonal. | ||
| /// Result[j] = sum of d[k] for all rows k in `rows` where indices[k] == j. | ||
| pub fn sandwich_diagonal( |
There was a problem hiding this comment.
Just for my understanding, does this boil down to something akin to (not tested, using NumPy's ufunc):
out = np.zeros_like(d)
np.add.at(
out,
np.take(indices, rows) - offset,
np.take(d, rows)
)
(I'm trying to understand if we can find some more abstract functions which we can nicely reuse)
| fn get_element(data: &[f64], row: usize, col: usize, n_rows: usize, n_cols: usize, is_c_contiguous: bool) -> f64 { | ||
| if is_c_contiguous { | ||
| data[row * n_cols + col] | ||
| } else { | ||
| data[col * n_rows + row] | ||
| } | ||
| } |
There was a problem hiding this comment.
This would be abstracted away if we were to use ndarray.
| Note: The Rust backend has simpler function signatures without row/column | ||
| restrictions. When restrictions are present, we fall back to the C++ backend. |
There was a problem hiding this comment.
Which functions have different signatures?
| ] | ||
|
|
||
| # Re-export categorical functions | ||
| transpose_matvec = tabmat_rust_ext.transpose_matvec |
There was a problem hiding this comment.
Why the assignment? Doesn't the following work?
from .tabmat_rust_ext import matvec
| maturin build --release -m rust_ext/Cargo.toml \ | ||
| && pip install --force-reinstall rust_ext/target/wheels/*.whl \ | ||
| && cp $(python -c "import tabmat_rust_ext; print(tabmat_rust_ext.__file__.replace('__init__.py', 'tabmat_rust_ext.abi3.so'))") src/tabmat/tabmat_rust_ext/ | ||
| """ |
There was a problem hiding this comment.
This looks a bit odd. Why is there a need to install the wheel and to copy the .so file?Doesn't maturing develop Just Work for now?
| # Strip symbols for smaller binary | ||
| strip = true |
There was a problem hiding this comment.
| # Strip symbols for smaller binary | |
| strip = true |
This is likely not what we want while developing this.
| [package] | ||
| name = "tabmat_rust_ext" | ||
| version = "0.1.0" | ||
| edition = "2021" |
There was a problem hiding this comment.
No need to start outdated.
| edition = "2021" | |
| edition = "2024" |
Retrying the Rust implementation with a more step-by-step approach, still with some help from Claude. I will keep this PR in a draft state for a while.
Context: the goal of this PR is to transfer the C++/cython functions (hard to maintain) to a Rust backend. The build system for the C++ backend has a lot of finicky dependencies (e.g. jemalloc, xsimd) and is hard to maintain. While we do not want to reduce the performance of tabmat, I think moving to rust will have many long-term upside in terms of the maintainability of the library.
The plan is:
Checklist
CHANGELOG.rstentry