Skip to content

Conversation

@non-Jedi
Copy link

@non-Jedi non-Jedi commented Aug 28, 2020

I saw some of the discussion in
JuliaData/DataFrames.jl#2340 and got
curious about what was possible.

This avoids allocating a Vector for the case where l does not have
multiple indices with the same value. For the smoke-test benchmark in
JuliaData/DataFrames.jl#2340 (comment),
this reduces allocations by half and overall runtime by 10%.

Most of the allocations still come from this
line

which it's much less clear how to reduce allocations in. I'm not sure
how much JuliaLang/julia#24909 affects
performance in this case. One option would be to heuristically
estimate the size of out based on the size of l and r and call
sizehint! on it; this didn't seem to help in my testing.

I realize I'm only optimizing a single method of innerjoin, but I'm
not super familiar with this field nor with the inner workings of this
package, so I leave it to you to decide if this is a worthwhile
complication and if it's relevant elsewhere in the package.

This avoids allocating a Vector for the case where l does not have
multiple indices with the same value. For the smoke-test benchmark in
<JuliaData/DataFrames.jl#2340 (comment)>,
this reduces allocations by half and overall runtime by 10%.
@andyferris
Copy link
Member

Ah thanks.

Another strategy I tried a while back was to first assume the join keys are distinct and then bail to a more general implementation when that's not the case. It would be awesome to compare the performance vesus this.

(Sorry @non-Jedi I haven't been keeping track of PRs...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants