
Conversation

@artemsolod

To get things going with #56521 I've made a minimal implementation that mirrors one from numpy (https://github.com/numpy/numpy/blob/7c0e2e4224c6feb04a2ac4aa851f49a2c2f6189f/numpy/_core/src/multiarray/alloc.c#L113).

What this does: jl_gc_managed_malloc(size_t sz) is changed to check whether the requested allocation is big enough to benefit from huge pages. If so, we ensure the allocation is page-aligned and then call the appropriate madvise on the memory pointer.
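Roughly, the logic looks like this (a hedged sketch: the cutoff constant and function names are illustrative, not the exact code in this PR or in numpy):

#include <stdlib.h>
#include <sys/mman.h>

/* Illustrative cutoff of a few MiB; numpy hardcodes a similar constant. */
#define HUGEPAGE_CUTOFF ((size_t)1 << 22)

static void *managed_malloc_sketch(size_t sz)
{
    if (sz >= HUGEPAGE_CUTOFF) {
        void *p = NULL;
        /* page-align so the kernel can back the range with huge pages */
        if (posix_memalign(&p, 4096, sz) != 0)
            return NULL;
        /* hint the kernel that this range is a THP candidate */
        madvise(p, sz, MADV_HUGEPAGE);
        return p;
    }
    return malloc(sz);
}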

For a simple "fill memory" test I see around a 2x timing improvement:

function f(N)
    mem = Memory{Int}(undef, N)
    mem .= 0
    mem[end]
end

f(1)
@time f(1_000_000)
# this branch: 0.001464 seconds (2 allocations: 7.633 MiB)
# master:      0.003431 seconds (2 allocations: 7.633 MiB)

I would appreciate help with this PR as I have no experience writing C code and little knowledge of Julia internals. In particular, I think it would make sense to have a startup option controlling the minimal eligible allocation size, defaulting to the system's hugepage size; for this initial implementation the same constant as in numpy is hardcoded.

@oscardssmith added the performance and arrays labels on Oct 15, 2025
@Keno
Member

Keno commented Oct 15, 2025

What kernel are you on? THP is usually automatic.

@oscardssmith
Member

IIUC it requires alignment, so you don't get it unless you ask for it. The transparent part is that they aren't specially segmented memory.

@Keno
Member

Keno commented Oct 16, 2025

Huge pages always need to be aligned. Transparent means that they're ordinary pages, rather than being mmap'd from hugetlb, which is the (very) old way to get hugepages. But regardless, modern kernels should automatically assign huge pages to sufficiently large mappings that they think are used. My suspicion here is that the reported perf difference isn't actually due to hugepages, but rather that for the initial allocation the hugepage advice overrides the fault granularity. We might see even better performance by prefaulting the pages. However, if that's the case, then that's a more general concern and in particular is workload dependent. Does Python actually do this madvise by default?
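(For illustration, prefaulting can be requested at mmap time via MAP_POPULATE; a minimal sketch assuming Linux, not something this PR does:)

#include <stddef.h>
#include <sys/mman.h>

/* Hedged sketch: MAP_POPULATE prefaults the mapping so the first write
 * doesn't pay a per-page fault cost. Not code from this PR. */
static void *alloc_prefaulted(size_t sz)
{
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}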

@artemsolod
Author

@Keno, @oscardssmith thanks for looking into this!

I am testing on a dedicated server running Ubuntu 25.04, kernel 6.14

uname -a
Linux ubuntu-c-8-intel-ams3-01 6.14.0-32-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 29 14:21:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

From my experiments the performance jump happens only when either an explicit madvise is called or /sys/kernel/mm/transparent_hugepage/enabled is set to always (by default it's set to madvise). I first suspected that using mmap to allocate could be sufficient, but this does not seem to work. Here is a test script comparing manual madvise with Julia's usual memory allocation; it can be run on 1.12 or master.

import Mmap: MADV_HUGEPAGE

function memory_from_mmap(n)
    capacity = n*8
    # PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS  
    ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, capacity::Csize_t, 3::Cint, 34::Cint, (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
    retcode = @ccall madvise(ptr::Ptr{Cvoid}, capacity::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    iszero(retcode) || @warn "Madvise HUGEPAGE failed"

    ptr_int = convert(Ptr{Int}, ptr)
    mem = unsafe_wrap(Memory{Int}, ptr_int, n; own=false)
end

function f(N; with_mmap=false)
    if with_mmap
        mem = memory_from_mmap(N)
    else
        mem = Memory{Int}(undef, N)
    end
    mem .= 0
    mem[end]
end

f(1; with_mmap=true)
f(1; with_mmap=false)
N = 10_000_000
GC.enable(false)
@time f(N; with_mmap=true)  # 0.015535 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false) # 0.043966 seconds (2 allocations: 76.297 MiB)

With echo always > /sys/kernel/mm/transparent_hugepage/enabled both versions are fast; with echo never > /sys/kernel/mm/transparent_hugepage/enabled both are slow. For the default setting, echo madvise > /sys/kernel/mm/transparent_hugepage/enabled, performance differs greatly depending on with_mmap.

I've also tried commenting out the madvise call in this PR branch; that shows the same performance as master, i.e. it makes it slower.

As for whether this is done in python:

  • numpy definitely does it and relies on madvise being enabled in the system; the threshold is hardcoded - source and documentation
  • CPython also has explicit madvise (or rather, I see it in their mimalloc code - source). However, the mechanism is more sophisticated and they mention in comments that they expect it to not be necessary:
      // Many Linux systems don't allow MAP_HUGETLB but they support instead
      // transparent huge pages (THP). Generally, it is not required to call `madvise` with MADV_HUGE
      // though since properly aligned allocations will already use large pages if available
      // in that case -- in particular for our large regions (in `memory.c`).
      // However, some systems only allow THP if called with explicit `madvise`, so
      // when large OS pages are enabled for mimalloc, we call `madvise` anyways.

@oscardssmith
Member

Seems like it's almost a bug that this doesn't just work by default, but 2x perf is 2x perf, so I say we merge this with a note that when Linux starts doing the not-dumb thing by default we can delete it.

@gbaraldi
Member

I'm confused why glibc isn't doing this, but then again their allocator is middling at best

@artemsolod
Author

I'm confused why glibc isn't doing this, but then again their allocator is middling at best

It could be working starting from glibc 2.35 (reference):

Most of the high performance memory allocators (jemalloc, tcmalloc, mimalloc, etc.) support allocating 2MiB pages backed by either THP, hugetlbfs, or both. ... One glaring exception used to be the glibc allocator, which is, for most, the default allocator on Linux. Fortunately, glibc 2.35 introduced native support for allocating hugetlbfs pages controlled by the glibc.malloc.hugetlb tunable.
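(For reference, glibc tunables are set through the environment; assuming the behavior documented in the glibc manual, GLIBC_TUNABLES=glibc.malloc.hugetlb=1 makes malloc call madvise(MADV_HUGEPAGE) on its mmap'd memory, while glibc.malloc.hugetlb=2 requests hugetlbfs pages.)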

Although I still think we need to take action on Julia's side, as even on a quite modern Ubuntu 25.04 the effect of madvise is very visible.

@artemsolod
Author

Current state of the PR:

  • Only gc-stock.c is touched: jl_gc_managed_malloc does the following check: ((jl_options.hugepage_threshold >= 0) && (allocsz >= jl_options.hugepage_threshold)). If that succeeds, we ensure the allocation gets page-aligned and then madvise with MADV_HUGEPAGE is called.
  • A command-line option --hugepage-threshold={auto|no|<size>[<unit>]} was added. Parsing of <size>[<unit>] is mostly a copy-paste from parse_heap_size_option; not sure if this warrants a separate function.
  • auto tries to get the hugepage size from the system (defaulting to 2 MB if that fails, which is the standard size). For that, jl_gethugepagesize is implemented in sys.c (a hedged sketch of what such a lookup could look like follows this list). I had to resort to an LLM for this, so it might need a more thorough review; I did my best to review it myself though.
  • The default I chose for hugepage_threshold is 3/4 of the hugepage size. My intent was to balance potential memory waste while still bringing some benefit for arrays a bit under 2 MB.
  • However, from experiments it appears that allocations smaller than N = 2^22 - 2^12 + 1 bytes (almost 4 MB) are not backed by THP despite the madvise call. So I am not sure whether all these tricks to get the system hugepage size even make sense; maybe a hardcoded constant/option would suffice.
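As referenced above, a hedged sketch of what a jl_gethugepagesize-style lookup could look like (illustrative, not the actual sys.c code; it reads the kernel's THP PMD size from sysfs and falls back to the standard 2 MB):

#include <stdio.h>
#include <stdlib.h>

/* Hedged sketch: query the kernel's transparent-hugepage size from sysfs,
 * defaulting to the standard 2 MB when the file is unavailable. */
static size_t gethugepagesize_sketch(void)
{
    size_t sz = 2 * 1024 * 1024;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
    if (f) {
        unsigned long long v = 0;
        if (fscanf(f, "%llu", &v) == 1 && v > 0)
            sz = (size_t)v;
        fclose(f);
    }
    return sz;
}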

Here is the snippet I used to test for the perf bump. With the +1 in N the timing is 2x faster:

import Mmap: MADV_HUGEPAGE

function memory_from_mmap(n)
    # PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS
    ptr = @ccall mmap(C_NULL::Ptr{Cvoid}, n::Csize_t, 3::Cint, 34::Cint, (-1)::Cint, 0::Csize_t)::Ptr{Cvoid}
    retcode = @ccall madvise(ptr::Ptr{Cvoid}, n::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    iszero(retcode) || @warn "Madvise HUGEPAGE for arena memory failed"
    mem = unsafe_wrap(Memory{UInt8}, convert(Ptr{UInt8}, ptr), n; own=false)
end

function f(N; with_mmap=false)
    if with_mmap
        mem = memory_from_mmap(N)
    else
        mem = Memory{UInt8}(undef, N)
    end
    mem .= 0
    0
end

f(1; with_mmap=true)
f(1; with_mmap=false)
N = 2^22-2^12 + 1 # minimal value to see madvise having an effect
# experimentally, when transparent_hugepage/enabled is set to always, THP also
# kicks in (in about 50% of cases) for some allocations smaller than the N above
GC.enable(false)
@time f(N; with_mmap=true)    # 0.000833 seconds (1 allocation: 32 bytes)
@time f(N; with_mmap=false)   # 0.000850 seconds (2 allocations: 4.000 MiB)
@time f(N-1; with_mmap=true)  # 0.001634 seconds (2 allocations: 48 bytes)
@time f(N-1; with_mmap=false) # 0.001641 seconds (3 allocations: 4.000 MiB)

@gbaraldi
Member

I wouldn't add an option. I would look at mimalloc's or another allocator's source and just default to whatever they do.

@oscardssmith
Member

Specifically, we probably want hugepages at 3/4-1 of a hugepage size, and at 1.5+ hugepages, or something like that. We generally try to cap fragmentation, and 1.05 pages rounding up to 2 is a bit more fragmentation than we want (e.g. a 2.1 MiB request backed by two 2 MiB hugepages wastes almost half the memory).

@bbrehm

bbrehm commented Oct 25, 2025

I think it would be preferable if the allocator did the job. Maybe we can use the glibc.malloc.hugetlb tunable?

We should think about other libcs as well: musl and possibly bionic (does julia work on bionic?). And we should think about users who LD_PRELOAD other allocators like jemalloc, tcmalloc or mimalloc instead of sticking to good old glibc ptmalloc. There is a certain combinatorial explosion here :(

I think it would be enough to fix glibc performance via the tunable, if possible, and then document in the performance tips how to deal with alternative configurations (echo "always" > /sys/... or whatever is needed to configure musl or other common allocators). But it would be cool if Julia on Linux + glibc could get the fast variant without the user needing root access to reconfigure their system.

The approach here probably incurs quite an overhead: I would guess that glibc grabs an mmap, and since we requested page alignment, the first 4k page is effectively discarded, containing only the metadata in the last 16 bytes of the page (needed for free!). Now for large allocs, the kernel is not unlikely to give us something with naturally good alignment (e.g. 2M), and then we madvise a small subset of that range to enable THP, and I have no idea what the kernel will make of that nonsense. (Madvising THP with alignment less than 2M makes no sense, right?)
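(For illustration, one way to sidestep that concern would be to align the request to the hugepage size itself; a hedged sketch, not what the PR currently does:)

#include <stdlib.h>
#include <sys/mman.h>

/* Hedged sketch: align large allocations to the 2 MB hugepage boundary so
 * the madvised range can actually be backed by huge pages. Illustrative. */
static void *alloc_hugepage_aligned(size_t sz)
{
    void *p = NULL;
    if (posix_memalign(&p, 2 * 1024 * 1024, sz) != 0)
        return NULL;
    madvise(p, sz, MADV_HUGEPAGE);
    return p;
}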

Using libc for large allocations sucks; it would be more convenient to talk to the kernel directly. But then users couldn't free the memory anymore, they'd need to use some kind of jl_free_large_alloc, and that would be a giant API break and Julia 2.0.

@artemsolod
Author

artemsolod commented Oct 25, 2025

We should think about other libcs as well: musl and possibly bionic (does julia work on bionic?). And we should think about users who LD_PRELOAD other allocators like jemalloc, tcmalloc or mimalloc instead of sticking to good old glibc ptmalloc. There is a certain combinatorial explosion here :(

Interesting, I didn't know about the LD_PRELOAD trick. I've just tried it with mimalloc and it didn't help; the timing without explicit madvise was the same as what I saw in normal runs. I also saw a segfault at the end. The command I used: env LD_PRELOAD=/root/mimalloc/out/release/libmimalloc.so julia test_copy.jl

The approach here probably incurs quite the overhead: I would guess that glibc grabs an mmap, and since we requested page-aligned, the first 4k page is effectively discarded, containing only the metadata in the last 16 bytes of the page (needed for free!). Now for large allocs, the kernel is not unlikely to give us something with natural good alignment (e.g. 2M), and then we madvise a small subset of that range to enable THP, and I have no idea what the kernel will make of that nonsense. (madvising thp with alignment less than 2M makes no sense, right?)

Not sure where the overhead would come from; only large allocations are explicitly page-aligned in this PR. I have not seen a recommendation to align to a larger boundary, correct me if I am wrong. I was hoping for the kernel to do the right thing here.

On a more general note: I think there is a real performance opportunity here. If the cost is calling madvise hugepage on memory that was already madvised, that seems quite cheap. It's a fairly universal solution; I can't see that anything specific really needs to be done for each particular allocator.

Update:
LD_PRELOAD'ing mimalloc actually does help, but only when arrays are quite big: 64 MB for me.

@artemsolod
Author

artemsolod commented Oct 26, 2025

I've removed the startup option. The current rule: if the allocation is bigger than 256 kB it is page-aligned and, furthermore, if allocsz % jl_hugepage_size is less than a quarter of the allocation we do a madvise hugepage. jl_hugepage_size is set in init.c using jl_gethugepagesize(). Note that both mimalloc and numpy hardcode either the hugepage size or the cutoff for madvise, e.g. mimalloc:

#define MI_UNIX_LARGE_PAGE_SIZE (2*MI_MiB) // TODO: can we query the OS for this?
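A hedged sketch of the rule described above (illustrative names, not the exact gc-stock.c diff):

#include <stdlib.h>
#include <sys/mman.h>

/* Hedged sketch; jl_hugepage_size is assumed to be set at startup from
 * jl_gethugepagesize(). */
static size_t jl_hugepage_size = 2 * 1024 * 1024;

static void *large_alloc_sketch(size_t allocsz)
{
    if (allocsz > 256 * 1024) {
        void *b = NULL;
        if (posix_memalign(&b, 4096, allocsz) != 0)  /* page-align */
            return NULL;
        /* madvise only when the tail waste is small relative to the size */
        if (allocsz % jl_hugepage_size < allocsz / 4)
            madvise(b, allocsz, MADV_HUGEPAGE);
        return b;
    }
    return malloc(allocsz);
}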

Overall I am satisfied with this. My biggest concern is that the performance of some workloads becomes non-monotonic: making a larger allocation can trigger the hugepage path and run faster. Although that is probably better than being consistently slow.
