Skip to content

Conversation

shawjef3
Copy link

No description provided.

@oschwald
Copy link
Member

oschwald commented Jan 17, 2025

Thanks for the pull request! In my testing of it, we take a pretty significant performance hit (~20-30%) with these changes. I am also not entirely sure about pulling in a separate dependency or reimplementing so much of ByteBuffer.

At a high level, my hope for solving #154 was a bit different. I think we could still use a single ByteBuffer for databases under 2 GB. For database larger than 2 GB, we would provide a thin abstraction over N ByteBuffers with each covering part of the database. I think this would result in less code, eliminate the need for the dependency, and keep any performance impact minimal.

@shawjef3
Copy link
Author

You're welcome.

It's possible there's room for optimization. It's also possible to make an interface that would let you switch between ByteBuffer and BigByteBuffer as you wish.

For database larger than 2 GB, we would provide a thin abstraction over N ByteBuffers with each covering part of the database. I think this would result in less code, eliminate the need for the dependency, and keep any performance impact minimal.

This is what ByteMappedBigList does. Its implementation uses an array of ByteBuffer. It doesn't expose a constructor to do it explicitly, but it could be worth modifying it to see if using its ByteBuffer[] implementation instead of ByteBigArrayBigList's byte[][] is faster. Either way, I think it's worth keeping fastutil as a dependency for now, since while exploring solutions to this issue, it provides a bunch of useful functionality. There certainly might be better solutions.

@shawjef3
Copy link
Author

Also, can you point to some specific things that need optimization? I can look at them.

@shawjef3
Copy link
Author

Hello @oschwald . I made some benchmarks to see if I could find what is slow. I tried ByteBuffer#getInt() and Reader#getRecord(). Neither showed a regression. Can you give me a hint about what you saw?

@oschwald
Copy link
Member

I was running the Benchmark code on both the old and new version. Putting aside the other concerns, the performance would need to be within a couple of percent of the current version for this to be considered.

As mentioned above, I think this could be done different, without the third party dependency, so that there is almost no change in performance for databases under 2 GB. We would just have an extremely lightweight ByteBuffer wrapper than wraps a single ByteBuffer but uses longs rather than ints. We would then have a separate wrapper for the > 2 GB case that handles all the complexity that multiple underlying ByteBuffers introduces.

The tests are also failing.

@shawjef3
Copy link
Author

shawjef3 commented Feb 4, 2025

@oschwald what do you think of making Reader more flexible, so it can read from things other than ByteBuffers? I made a patch to allow someone to implement an interface and pass it to Reader. This way someone could use something other than a ByteBuffer, and it doesn't add any dependencies. For instance, I could use fastutil, or someone else else might want to use MemorySegment for off-heap storage. Without using the new constructor, it would continue to use ByteBuffer.

7a751d8

@shawjef3 shawjef3 closed this Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants