Madvise Transparent Huge Pages for large allocations #59858
Conversation
What kernel are you on? THP is usually automatic.
IIUC it requires alignment, so you don't get it unless you ask for it. The transparent part is that they aren't specially segmented memory.
Huge pages always need to be aligned.
@Keno, @oscardssmith thanks for looking into this! I am testing on a dedicated server running Ubuntu 25.04, kernel 6.14. From my experiments the performance jump happens only with an explicit `madvise`:

```julia
import Mmap: MADV_HUGEPAGE

function memory_from_mmap(n)
    capacity = n * 8
    # PROT_READ|PROT_WRITE = 3, MAP_PRIVATE|MAP_ANONYMOUS = 34
    ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, capacity::Csize_t, 3::Cint, 34::Cint,
                      (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
    retcode = @ccall madvise(ptr::Ptr{Cvoid}, capacity::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    iszero(retcode) || @warn "Madvise HUGEPAGE failed"
    ptr_int = convert(Ptr{Int}, ptr)
    unsafe_wrap(Memory{Int}, ptr_int, n; own=false)
end

function f(N; with_mmap=false)
    if with_mmap
        mem = memory_from_mmap(N)
    else
        mem = Memory{Int}(undef, N)
    end
    mem .= 0
    mem[end]
end

f(1; with_mmap=true)   # warm up
f(1; with_mmap=false)

N = 10_000_000
GC.enable(false)
@time f(N; with_mmap=true)  # 0.015535 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false) # 0.043966 seconds (2 allocations: 76.297 MiB)
```

I've also tried commenting out the `madvise` call. As for whether this is done in Python: numpy does something similar for large allocations (see the link in the PR description).
Seems like it's almost a bug that this doesn't just work by default, but 2x perf is 2x perf, so I say we merge this with a note that when Linux starts doing the not-dumb thing by default we can delete it.
I'm confused why glibc isn't doing this, but then again their allocator is middling at best |
It could be working starting from glibc 2.35, which added the `glibc.malloc.hugetlb` tunable.
Although I still think we need to take action on Julia's side, as even on a quite modern Ubuntu 25.04 the effect shows up only with an explicit `madvise`.
Current state of the PR:
Here is a snippet I used to test for the perf bump:

```julia
import Mmap: MADV_HUGEPAGE

function memory_from_mmap(n)
    # PROT_READ|PROT_WRITE = 3, MAP_PRIVATE|MAP_ANONYMOUS = 34
    ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, n::Csize_t, 3::Cint, 34::Cint,
                      (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
    retcode = @ccall madvise(ptr::Ptr{Cvoid}, n::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    iszero(retcode) || @warn "Madvise HUGEPAGE for arena memory failed"
    unsafe_wrap(Memory{UInt8}, convert(Ptr{UInt8}, ptr), n; own=false)
end

function f(N; with_mmap=false)
    if with_mmap
        mem = memory_from_mmap(N)
    else
        mem = Memory{UInt8}(undef, N)
    end
    mem .= 0
    0
end

f(1; with_mmap=true)   # warm up
f(1; with_mmap=false)

N = 2^22 - 2^12 + 1 # minimal value to see madvise having an effect
# experimentally, when transparent_hugepage/enabled is set to "always",
# explicit madvise kicks in THP in ~50% of the cases
# for some allocations smaller than N above
GC.enable(false)
@time f(N; with_mmap=true)    # 0.000833 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false)   # 0.000850 seconds (2 allocations: 4.000 MiB)
@time f(N-1; with_mmap=true)  # 0.001634 seconds (2 allocations: 48 bytes)
@time f(N-1; with_mmap=false) # 0.001641 seconds (3 allocations: 4.000 MiB)
```
I wouldn't add an option. I would look at mimalloc's or another allocator's source and just default to whatever they do.
Specifically, we probably want to use hugepages starting at 3/4 to 1 of the hugepage size, and again at 1.5+ hugepages, or something like that. We generally try to cap fragmentation, and 1.05 pages rounding up to 2 is a bit more fragmentation than we want.
I think it would be preferable if the allocator did the job. Maybe we can use the glibc tunable for that.

We should think about other libcs as well: musl and possibly bionic (does Julia work on bionic?). And we should think about users who LD_PRELOAD other allocators like jemalloc, tcmalloc or mimalloc instead of sticking to good old glibc ptmalloc. There is a certain combinatorial explosion here :( I think it would be enough to fix glibc performance via the tunable, if possible, and then document in the performance tips how to deal with alternative configurations.

The approach here probably incurs quite the overhead: I would guess that glibc grabs an mmap, and since we requested page-aligned memory, the first 4k page is effectively discarded, containing only the allocator metadata in its last 16 bytes (needed for `free`).

Using libc for large allocations sucks; it would be more convenient to talk to the kernel directly. But then users can't swap in their own allocator via LD_PRELOAD.
Interesting, didn't know about the LD_PRELOAD trick. I've just tried it with mimalloc and it didn't help: the timing without an explicit `madvise` stayed the same.
Not sure where the overhead would come from: only large allocations are explicitly page aligned in this PR. I have not seen a recommendation to align to a higher value; correct me if I am wrong. I was hoping for the kernel to be doing the right thing here. On a more general note: I think there is a real performance opportunity. If the cost is just calling `madvise` on large allocations, it seems well worth it.
I've removed the startup option. Current rule: if the allocation is bigger than 256kB it's page aligned and, furthermore, if it's big enough to benefit from huge pages, `madvise` is called on it.

Overall I am satisfied with this. My biggest concern is that the performance of some workloads becomes non-monotonic: making a larger allocation can trigger the hugepages path and run faster. Although that's probably better than being consistently slow.
To get things going with #56521 I've made a minimal implementation that mirrors the one from numpy (https://github.com/numpy/numpy/blob/7c0e2e4224c6feb04a2ac4aa851f49a2c2f6189f/numpy/_core/src/multiarray/alloc.c#L113).
What this does: it changes `jl_gc_managed_malloc(size_t sz)` to check whether the requested allocation is big enough to benefit from huge pages. If so, we ensure the allocation is page aligned and then the appropriate `madvise` is called on the memory pointer. For a simple "fill memory" test I see around a 2x timing improvement.
I would appreciate help with this PR, as I have no experience writing C code and little knowledge of Julia internals. In particular I think it would make sense to have a startup option controlling the minimal eligible allocation size, which should default to the system's hugepage size; for this initial implementation the same constant as in numpy is hardcoded.