Conversation

xal-0 (Member) commented Nov 4, 2025

Overview

This PR overhauls the way linking works in Julia, both in the JIT and AOT. The
point is to enable us to generate LLVM IR that depends only on the source IR,
eliminating both nondeterminism and the effect of redefining methods in the same
session. This serves two purposes. First, if the IR is predictable, we can
cache compilations keyed on the bitcode hash, much as the ThinLTO cache
does. #58592 was an early experiment along these lines. Second, we can
reuse work that was done in a previous session, like pkgimages, but for the JIT.

We accomplish this by generating names that are unique only within the current
LLVM module, removing most uses of the globalUniqueGeneratedNames counter.
The replacement for jl_codegen_params_t, jl_codegen_output_t, represents a
Julia "translation unit", and tracks the information we'll need to link the
compiled module into the running session. When linking, we manipulate the
JITLink LinkGraph (after compilation) instead of renaming
functions in the LLVM IR (before).
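
As a rough illustration of the caching idea (a hypothetical C++ sketch, not
code from this PR): once the emitted IR is deterministic, hashing the
serialized bitcode gives a stable cache key, much like ThinLTO's cache.

// Hypothetical sketch: key a compilation cache on the bitcode hash.
// Only useful because the IR no longer contains session-dependent names.
#include <llvm/ADT/SmallString.h>
#include <llvm/ADT/StringExtras.h>
#include <llvm/Bitcode/BitcodeWriter.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/SHA1.h>
#include <llvm/Support/raw_ostream.h>

static std::string bitcodeCacheKey(const llvm::Module &M)
{
    llvm::SmallString<0> Buf;
    llvm::raw_svector_ostream OS(Buf);
    llvm::WriteBitcodeToFile(M, OS); // stable bytes iff codegen is deterministic
    llvm::SHA1 H;
    H.update(Buf.str());
    return llvm::toHex(H.final());   // look up a cached object file by this key
}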

Example

julia> @noinline foo(x) = x + 2.0
       baz(x) = foo(foo(x))

       code_llvm(baz, (Int64,); dump_module=true, optimize=false)

Nightly:

[...]
@"+Core.Float64#774" = private unnamed_addr constant ptr @"+Core.Float64#774.jit"
@"+Core.Float64#774.jit" = private alias ptr, inttoptr (i64 4797624416 to ptr)

; Function Signature: baz(Int64)
;  @ REPL[1]:2 within `baz`
define double @julia_baz_772(i64 signext %"x::Int64") #0 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %0 = call double @j_foo_775(i64 signext %"x::Int64")
  %1 = call double @j_foo_776(double %0)
  ret double %1
}

; Function Attrs: noinline optnone
define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %0 = getelementptr inbounds i8, ptr %"args::Any[]", i32 0
  %1 = load ptr, ptr %0, align 8
  %.unbox = load i64, ptr %1, align 8
  %2 = call double @julia_baz_772(i64 signext %.unbox)
  %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
  %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
  %3 = inttoptr i64 %Float64 to ptr
  %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
  %"box::Float64" = call noalias nonnull align 8 dereferenceable(8) ptr @julia.gc_alloc_obj(ptr %current_task, i64 8, ptr %3) #5
  store double %2, ptr %"box::Float64", align 8
  ret ptr %"box::Float64"
}
[...]

Diff after this PR. Notice how each symbol gets the lowest integer suffix
that makes it unique within the module, and how the two specializations of
foo get different names (a sketch of the naming scheme follows the diff):

@@ -4,18 +4,18 @@
 target triple = "arm64-apple-darwin24.6.0"
 
-@"+Core.Float64#774" = external global ptr
+@"+Core.Float64#_0" = external global ptr
 
 ; Function Signature: baz(Int64)
 ;  @ REPL[1]:2 within `baz`
-define double @julia_baz_772(i64 signext %"x::Int64") #0 {
+define double @julia_baz_0(i64 signext %"x::Int64") #0 {
 top:
   %pgcstack = call ptr @julia.get_pgcstack()
-  %0 = call double @j_foo_775(i64 signext %"x::Int64")
-  %1 = call double @j_foo_776(double %0)
+  %0 = call double @j_foo_0(i64 signext %"x::Int64")
+  %1 = call double @j_foo_1(double %0)
   ret double %1
 }
 
 ; Function Attrs: noinline optnone
-define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
+define nonnull ptr @jfptr_baz_0(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
 top:
   %pgcstack = call ptr @julia.get_pgcstack()
@@ -23,7 +23,7 @@
   %1 = load ptr, ptr %0, align 8
   %.unbox = load i64, ptr %1, align 8
-  %2 = call double @julia_baz_772(i64 signext %.unbox)
-  %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
-  %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
+  %2 = call double @julia_baz_0(i64 signext %.unbox)
+  %"+Core.Float64#_0" = load ptr, ptr @"+Core.Float64#_0", align 8
+  %Float64 = ptrtoint ptr %"+Core.Float64#_0" to i64
   %3 = inttoptr i64 %Float64 to ptr
   %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
@@ -39,8 +39,8 @@
 
 ; Function Signature: foo(Int64)
-declare double @j_foo_775(i64 signext) #3
+declare double @j_foo_0(i64 signext) #3
 
 ; Function Signature: foo(Float64)
-declare double @j_foo_776(double) #4
+declare double @j_foo_1(double) #4
 
 attributes #0 = { "frame-pointer"="all" "julia.fsig"="baz(Int64)" "probe-stack"="inline-asm" }
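
The "_0"/"_1" suffixes come from counters that are local to the output module
rather than from a global counter. A minimal sketch of the idea (hypothetical
helper, not the PR's actual code):

// Hypothetical sketch: per-module counters give each base name the lowest
// unused suffix, replacing the global globalUniqueGeneratedNames counter.
#include <llvm/ADT/StringMap.h>
#include <llvm/ADT/Twine.h>
#include <string>

struct ModuleLocalNamer {
    llvm::StringMap<unsigned> Counters; // lives only as long as the output

    std::string next(llvm::StringRef Base)
    {
        unsigned &N = Counters[Base]; // default-initialized to 0
        return (Base + "_" + llvm::Twine(N++)).str();
    }
};
// next("julia_baz") -> "julia_baz_0"; calling next("j_foo") twice yields
// "j_foo_0" then "j_foo_1", matching the diff above.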

List of changes

  • Many sources of statefulness and nondeterminism in the emitted LLVM IR have
    been eliminated, namely:

    • Function symbols defined for CodeInstances
    • Global symbols referring to data on the Julia heap
    • Undefined function symbols referring to invoked external CodeInstances
  • jl_codegen_params_t has become jl_codegen_output_t. It now represents
    one Julia "translation unit". More than one CodeInstance can be emitted to
    the same jl_codegen_output_t if desired, though in the JIT every CI
    currently gets its own. One motivation is to let us emit code on multiple
    threads and avoid the bitcode serialize/deserialize step we currently do,
    if that proves worthwhile.

    When we are done emitting to a jl_codegen_output_t, we call .finish(),
    which discards the intermediate state and returns only the LLVM module and
    the info needed for linking (jl_linker_info_t); see the sketch after this
    list.

  • The new JLMaterializationUnit wraps compiled Julia object files and the
    associated jl_linker_info_t. It informs ORC that we can materialize symbols
    for the CIs defined by that output, and picks globally unique names for them.
    When it is materialized, it resolves all the call targets and generates
    trampolines for CodeInstances that are invoked but have the wrong calling
    convention, or are not yet compiled.

  • We now postpone linking decisions to after codegen whenever possible. For
    example, emit_invoke no longer tries to find a compiled version of the
    CodeInstance, and it no longer generates trampolines to adapt calling
    conventions. jl_analyze_workqueue's job has been absorbed into
    JuliaOJIT::linkOutput.

  • Some image_codegen differences have been removed:

    • Globals for Julia heap addresses no longer get initialized, so the resulting
      IR won't have the addresses embedded. I expect the impact of this to be
      small on RISC-y platforms, where it is typical to load address-sized values
      out of a constant pool.
    • Codegen no longer cares if a compiled CodeInstance came from an image.
      During ahead-of-time linking, we generate thunk functions that load the
      address from the fvars table.
  • jl_emit_native_impl now emits every CodeInstance into a single
    jl_codegen_output_t. We also defer creating the llvm::Linker for
    llvmcalls, whose construction cost grows with the size of the destination
    module, until the very end.
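
To make the output/finish lifecycle concrete, here is a hypothetical sketch
of the shape (the PR's actual declarations differ):

// Hypothetical sketch: emit one or more CodeInstances into an output,
// then finish() drops the intermediate emission state and returns only
// the module plus the information the linker needs.
#include <llvm/IR/Module.h>
#include <memory>
#include <utility>

struct jl_linker_info_t {
    // call targets, defined CIs, trampolines still needed, ... (placeholder)
};

struct jl_codegen_output_t {
    std::unique_ptr<llvm::Module> M;
    jl_linker_info_t LinkInfo;
    // ... per-emission caches and worklists, discarded by finish() ...

    std::pair<std::unique_ptr<llvm::Module>, jl_linker_info_t> finish()
    {
        return {std::move(M), std::move(LinkInfo)};
    }
};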

General refactoring

  • Adapt the jl_callingconv_t enum from staticdata.c into jl_invoke_api_t
    and use it in more places. There is one enumerator for each special
    jl_callptr_t function that can go in a CodeInstance's invoke field, as
    well as one that indicates an invoke wrapper should be there (a sketch of
    the idea follows this list). There are convenience functions for reading
    an invoke pointer and getting the API type, and vice versa.
  • Avoid using magic string values, and try to directly pass pointers to LLVM
    Function * or ORC string pool entries when possible.
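
A hypothetical sketch of what such an enum looks like; JL_INVOKE_SPECSIG is a
name from this PR (see the commit list below), while the other enumerators
here are illustrative:

// Hypothetical sketch of jl_invoke_api_t: one enumerator per special
// jl_callptr_t that can sit in a CodeInstance's invoke field, plus one
// that says "a specsig invoke wrapper belongs here".
typedef enum {
    JL_INVOKE_NOTHING, // invoke not set yet
    JL_INVOKE_BOXED,   // jl_fptr_args: generic boxed calling convention
    JL_INVOKE_CONST,   // jl_fptr_const_return: returns a stored constant
    JL_INVOKE_SPARAM,  // jl_fptr_sparam: generic cc plus static parameters
    JL_INVOKE_SPECSIG, // specialized signature, reached via an invoke wrapper
} jl_invoke_api_t;

// Convenience helpers (hypothetical signatures):
//   jl_invoke_api_t jl_invoke_api_from_ptr(jl_callptr_t invoke);
//   jl_callptr_t    jl_invoke_ptr_from_api(jl_invoke_api_t api);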

Remaining TODO items

  • RTDyld: on this branch, it is removed completely. I will pursue one of
    these two options:
    - [ ] Use the ahead-of-time linking to get it working again.
    - [ ] Port over the memory management to JITLink and use that on all
    platforms.

  • DLSymOptimizer is unused. It will be replaced with an ORC
    MaterializationUnit that, when materialized, defines the symbols as
    absolute addresses (with a fallback that generates a jlplt function).

  • Since tojlinvoke and other trampolines don't take long to compile, we
    just compile them while holding the JuliaOJIT::LinkerMutex. Because we
    most often generate tojlinvoke wrappers when an invoked CodeInstance is
    not yet compiled, I intend to eventually replace this with a GOT/PLT
    mechanism that will also let us start running code before all of the
    edges are compiled (see the sketch after this list).

  • I have yet to measure the impact of global addresses not being visible to
    the LLVM optimizer or code generation. If it turns out to be important to
    have immediate addresses, I'd like to try using the addresses of external
    LLVM globals directly, since that can generate code with immediate
    relocations, and LLVM can assume the address won't alias.

  • We should support ahead-of-time linking multiple jl_codegen_output_ts
    together.

  • We still pass strings to emit_call_specfun_other, even though the
    prototype for the function is now created by
    jl_codegen_output_t::get_call_target. We should hold on to the calling
    convention info so it doesn't have to be recomputed.
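
A minimal sketch of the GOT/PLT idea from the trampoline item above
(hypothetical, not this PR's implementation): each uncompiled callee is
reached through a patchable slot, so callers can start running before every
edge is compiled.

// Hypothetical sketch: calls to uncompiled CodeInstances go through a
// GOT-style slot that initially points at a resolver stub; the resolver
// produces the real entry point, patches the slot, and returns it so the
// stub can tail-call. Subsequent calls bypass the resolver entirely.
#include <atomic>

using Entry = void (*)();

struct GotSlot {
    std::atomic<Entry> target; // starts as the resolver trampoline
    Entry (*compile_callee)(); // compiles or looks up the real entry point
};

extern "C" Entry jit_resolve(GotSlot *slot)
{
    Entry real = slot->compile_callee();                 // may block/compile
    slot->target.store(real, std::memory_order_release); // patch once
    return real;
}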

xal-0 added commits (November 3, 2025, 15:28 and later):

  • Use JITLink everywhere
  • Rename jlcall_type, add jl_funcs_invoke_ptr
  • Move JLLinkingLayer into JuliaOJIT
  • Use jl_invoke_api_t elsewhere
  • Rename JL_INVOKE_JFPTR -> JL_INVOKE_SPECSIG
  • Put all special symbol names in one place
  • Add helper for specsig -> tojlinvoke (fptr1) and use it
  • Fix invariants for code_outputs
  • Document JIT invariants better; remove invalid assertions
  • Replace workqueue, partially support OpaqueClosure
  • Add JIT tests
  • Stop using strings so much
  • Don't create an LLVM::Linker unless necessary
  • Generate trampolines in aot_link_output
  • GCChecker annotations, misc changes
  • Re-add emit_always_inline
  • Get JLDebuginfoPlugin and eh_frame working again
  • Re-add OpaqueClosure MethodInstance global root
  • Fix GCChecker annotations
  • Clean up TODOs
  • Read dump compile
  • Use multiple threads in the JIT
  • Add PLT/GOT for external fns
  • Name Julia PLT GOT entries
  • Do emit_llvmcall_modules at the end
  • Suppress clang-tidy, static analyzer warnings
  • Keep temporary_roots alive during emit_always_inline
  • Mark pkg PLT thunks noinline
  • Don't attempt to emit inline codeinsts when IR is too large or missing
  • Improve thunk generation on x86
  • Fix infinite loop in emit_always_inline if inlining not possible
  • Use local names for global targets
  • Fix jl_get_llvmf_defn_impl cfunction hacks
xal-0 added the compiler:codegen (Generation of LLVM IR and native code) and compiler:llvm (For issues that relate to LLVM) labels on Nov 4, 2025.
Comment on lines +872 to +877
class JLMaterializationUnit : public orc::MaterializationUnit {
public:
static JLMaterializationUnit Create(JuliaOJIT &JIT, ObjectLinkingLayer &OL,
std::unique_ptr<jl_linker_info_t> Info,
std::unique_ptr<MemoryBuffer> Obj) JL_NOTSAFEPOINT
{
Member commented:

Nice! I have been wanting this for a long time!

Would it make sense to have a C-API for creating these? So that LLVM.jl could create them?

xal-0 (Member, Author) replied:


Possibly, though I would not want to expose it in a way that would lock in some of the design choices, like how JLMaterializationUnit owns the object buffer.

I'm undecided on how much work should be deferred to materialization. Right now jl_compile_codeinst_now blocks all threads waiting on compilation until everything is compiled to object files, like on master. I'd like to leave the door open to letting ORC decide when to compile.

Member replied:


Yeah, I have been wanting to try an ORC-based setup for GPUCompiler.

xal-0 added a commit to xal-0/julia that referenced this pull request Nov 11, 2025

Ports our RTDyLD memory manager to JITLink in order to avoid memory use
regressions after switching to JITLink everywhere (JuliaLang#60031).  This is a
direct port: finalization must happen all at once, because it
invalidates all allocation `wr_ptr`s.  I decided it wasn't worth it to
associate `OnFinalizedFunction` callbacks with each block, since they
are large enough to make it extremely likely that all in-flight
allocations land in the same block; everything must be relocated before
finalization can happen.

I plan to add support for DualMapAllocator on ARM64 macOS, as well as an
alternative for executable memory later.  For now, we fall back to the
old MapperJITLinkMemoryManager.

Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks