Local names linking #60031
- Renumber jl_invoke_api_t
- Use JITLink everywhere
- Rename jlcall_type, add jl_funcs_invoke_ptr
- Move JLLinkingLayer into JuliaOJIT
- Use jl_invoke_api_t elsewhere
- Rename JL_INVOKE_JFPTR -> JL_INVOKE_SPECSIG
- Put all special symbol names in one place
- Add helper for specsig -> tojlinvoke (fptr1) and use it
- Fix invariants for code_outputs
- Document JIT invariants better; remove invalid assertions
- Replace workqueue, partially support OpaqueClosure
- Add JIT tests
- Stop using strings so much
- Don't create an LLVM::Linker unless necessary
- Generate trampolines in aot_link_output
- GCChecker annotations, misc changes
- Re-add emit_always_inline
- Get JLDebuginfoPlugin and eh_frame working again
- Re-add OpaqueClosure MethodInstance global root
- Fix GCChecker annotations
- Clean up TODOs
- Read dump compile
- Use multiple threads in the JIT
- Add PLT/GOT for external fns
- Name Julia PLT GOT entries
- Do emit_llvmcall_modules at the end
- Suppress clang-tidy, static analyzer warnings
- Keep temporary_roots alive during emit_always_inline
- Mark pkg PLT thunks noinline
- Don't attempt to emit inline codeinsts when IR is too large or missing
- Improve thunk generation on x86
- Fix infinite loop in emit_always_inline if inlining not possible
- Use local names for global targets
- Fix jl_get_llvmf_defn_impl cfunction hacks
class JLMaterializationUnit : public orc::MaterializationUnit {
public:
    static JLMaterializationUnit Create(JuliaOJIT &JIT, ObjectLinkingLayer &OL,
                                        std::unique_ptr<jl_linker_info_t> Info,
                                        std::unique_ptr<MemoryBuffer> Obj) JL_NOTSAFEPOINT
    {
Nice! I have been wanting this for a long time!
Would it make sense to have a C-API for creating these? So that LLVM.jl could create them?
Possibly, though I would not want to expose it in a way that would lock in some of the design choices, like how JLMaterializationUnit owns the object buffer.
I'm undecided on how much work should be deferred to materialization. Right now jl_compile_codeinst_now blocks all threads waiting on compilation until everything is compiled to object files, like on master. I'd like to leave the door open to letting ORC decide when to compile.
Yeah, I have been wanting to try an ORC-based setup for GPUCompiler.
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
Overview
This PR overhauls the way linking works in Julia, both in the JIT and AOT. The
point is to enable us to generate LLVM IR that depends only on the source IR,
eliminating both nondeterminism and the effect of redefining methods in the same
session. This serves two purposes. First, if the IR is predictable, we can
cache the compilation by using the bitcode hash as a key, like how the ThinLTO
cache works. #58592 was an early experiment along these lines. Second, we can
reuse work that was done in a previous session, like pkgimages, but for the JIT.
We accomplish this by generating names that are unique only within the current
LLVM module, removing most uses of the `globalUniqueGeneratedNames` counter.
The replacement for `jl_codegen_params_t`, `jl_codegen_output_t`, represents a
Julia "translation unit", and tracks the information we'll need to link the
compiled module into the running session. When linking, we manipulate the
JITLink LinkGraph (after compilation) instead of renaming
functions in the LLVM IR (before).
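To illustrate the naming idea, here is a minimal sketch of module-local name generation. The class name `ModuleLocalNames`, the method `make_unique`, and the `base_N` suffix format are assumptions made for illustration, not the PR's actual implementation. The point is that names depend only on what has been emitted into the same module, so two independent modules built from the same input get identical names.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper: hands out names that are unique within one module only.
// Unlike a process-wide counter, its state starts empty for every module.
class ModuleLocalNames {
    std::unordered_set<std::string> used;

public:
    // Returns `base_N` for the lowest N >= 0 that is not yet taken in this module.
    std::string make_unique(const std::string &base) {
        for (unsigned n = 0;; ++n) {
            std::string candidate = base + "_" + std::to_string(n);
            if (used.insert(candidate).second)
                return candidate; // insertion succeeded, so the name was free
        }
    }
};
```

Because each module owns its own `ModuleLocalNames`, emitting the same functions in a later session (or after redefining unrelated methods) yields the same names, which is what makes hashing the resulting bitcode meaningful as a cache key.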
Example
Nightly:
Diff after this PR. Notice how each symbol gets the lowest possible integer
suffix that will make it unique to the module, and how the two specializations
for `foo` get different names:
List of changes
Many sources of statefulness and nondeterminism in the emitted LLVM IR have
been eliminated, namely:
`jl_codegen_params_t` has become `jl_codegen_output_t`. It now represents
one Julia "translation unit". More than one CodeInstance can be emitted to
the same `jl_codegen_output_t`, if desired, though in the JIT every CI gets
its own right now. One motivation behind this is to allow us to emit code on
multiple threads and avoid the bitcode serialize/deserialize step we currently
do, if that proves worthwhile.
When we are done emitting to a `jl_codegen_output_t`, we call `.finish()`,
which discards the intermediate state and returns only the LLVM module and the
info needed for linking (`jl_linker_info_t`).
The new `JLMaterializationUnit` wraps compiled Julia object files and the
associated `jl_linker_info_t`. It informs ORC that we can materialize symbols
for the CIs defined by that output, and picks globally unique names for them.
When it is materialized, it resolves all the call targets and generates
trampolines for CodeInstances that are invoked but have the wrong calling
convention, or are not yet compiled.
We now postpone linking decisions to after codegen whenever possible. For
example, `emit_invoke` no longer tries to find a compiled version of the
CodeInstance, and it no longer generates trampolines to adapt calling
conventions. `jl_analyze_workqueue`'s job has been absorbed into
`JuliaOJIT::linkOutput`.
Some `image_codegen` differences have been removed:
IR won't have the addresses embedded. I expect the impact of this to be
small on RISC-y platforms, where it is typical to load address-sized values
out of a constant pool.
During ahead-of-time linking, we generate thunk functions that load the
address from the fvars table.
In `jl_emit_native_impl`, emit every CodeInstance into one
`jl_codegen_output_t`. We now defer the creation of the `llvm::Linker` for
llvmcalls, which has construction cost that grows with the size of the
destination module, until the very end.
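The `.finish()` hand-off described in the list above can be sketched roughly as follows. Everything here is a simplified stand-in: `Module`, `LinkerInfo`, `CodegenOutput`, and `emit` are hypothetical names, and the real `jl_codegen_output_t` and `jl_linker_info_t` carry far more state. The sketch only shows the shape of the idea: accumulate per-unit state while emitting, then return just the module and the linking info.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for the real types; all names in this sketch are hypothetical.
struct Module {
    std::vector<std::string> functions; // functions emitted into this unit
};
struct LinkerInfo {
    std::vector<std::string> defined;      // symbols this unit provides
    std::vector<std::string> call_targets; // symbols to resolve at link time
};

class CodegenOutput {
    std::unique_ptr<Module> mod = std::make_unique<Module>();
    LinkerInfo info;
    std::map<std::string, int> seen_targets; // intermediate state, not returned by finish()

public:
    // Emit one function and record the callees it needs linked later.
    void emit(const std::string &name, const std::vector<std::string> &callees) {
        mod->functions.push_back(name);
        info.defined.push_back(name);
        for (const auto &c : callees) {
            if (seen_targets[c]++ == 0) // record each call target only once
                info.call_targets.push_back(c);
        }
    }

    // Rvalue-qualified: callers must write std::move(out).finish(), which
    // signals that the output is consumed and should not be reused.
    std::pair<std::unique_ptr<Module>, LinkerInfo> finish() && {
        return {std::move(mod), std::move(info)};
    }
};
```

A caller would write `auto result = std::move(out).finish();` and pass `result.first` to compilation and `result.second` to the linker. No call-target resolution happens during `emit`, mirroring the decision to postpone linking until after codegen.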
General refactoring
`jl_callingconv_t` enum from `staticdata.c` into `jl_invoke_api_t` and use
it in more places. There is one enumerator for each special `jl_callptr_t`
function that can go in a CodeInstance's `invoke` field, as well as one that
indicates an invoke wrapper should be there. There is a
convenience function for reading an invoke pointer and getting the API type,
and vice versa.
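As a rough illustration of the pointer/enum round trip described here, the sketch below classifies an invoke pointer and maps an API kind back to a pointer. The enumerator names, the `callptr_t` signature, and the three stand-in entry points are all invented for this example; the real `jl_invoke_api_t` enumerators and `jl_callptr_t` signature differ.

```cpp
#include <array>

// Hypothetical stand-in for jl_callptr_t; the real signature is different.
using callptr_t = int (*)(int);

// Stand-ins for the special entry points that can sit in an invoke field.
int invoke_args(int x) { return x + 1; }
int invoke_const(int x) { return x; }
int invoke_sparam(int x) { return x + 2; }

// One enumerator per special function, plus one meaning "an invoke wrapper".
enum class InvokeApi { None, Args, Const, Sparam, Wrapper };

struct Entry {
    InvokeApi api;
    callptr_t ptr;
};

static const std::array<Entry, 3> special_entries = {{
    {InvokeApi::Args, invoke_args},
    {InvokeApi::Const, invoke_const},
    {InvokeApi::Sparam, invoke_sparam},
}};

// Classify an invoke pointer: null, one of the special functions, or a wrapper.
InvokeApi api_from_ptr(callptr_t p) {
    if (p == nullptr)
        return InvokeApi::None;
    for (const Entry &e : special_entries)
        if (e.ptr == p)
            return e.api;
    return InvokeApi::Wrapper; // any other non-null pointer is a wrapper
}

// Inverse mapping; returns nullptr for None and Wrapper, which have no fixed pointer.
callptr_t ptr_from_api(InvokeApi api) {
    for (const Entry &e : special_entries)
        if (e.api == api)
            return e.ptr;
    return nullptr;
}
```

Keeping both directions driven by one table means that adding a special entry point only requires a new enumerator and a new row.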
`Function *` or ORC string pool entries when possible.
Remaining TODO items
RTDyld: on this branch, it is removed completely. I will pursue one of
these two options:
- [ ] Use the ahead-of-time linking to get it working again.
- [ ] Port over the memory management to JITLink and use that on all
platforms.
`DLSymOptimizer` is unused. It will be replaced with an ORC
MaterializationUnit that, when materialized, defines the symbols as
absolute addresses (with a fallback that generates a `jlplt` function).
Since `tojlinvoke` and other trampolines don't take long to compile, we
just compile them while holding the `JuliaOJIT::LinkerMutex`. Since we
most often generate `tojlinvoke` wrappers when an invoked CodeInstance is
not yet compiled, it is my intention to eventually replace this with a
GOT/PLT mechanism that will also allow us to start running code before all
of the edges are compiled.
I have yet to measure the impact of global addresses not being visible to
the LLVM optimizer or code generation. If it turns out to be important to
have immediate addresses, I'd like to try using external LLVM globals
address values directly, since that can generate code with immediate
relocations, and LLVM can assume the address won't alias.
We should support ahead-of-time linking multiple `jl_codegen_output_t`s
together.
We still pass strings to `emit_call_specfun_other`, even though the
prototype for the function is now created by
`jl_codegen_output_t::get_call_target`. We should hold on to the calling
convention info so it doesn't have to be recomputed.
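For the GOT/PLT idea mentioned in the TODO items, a minimal sketch of the general technique looks like this. It is not the planned design: `GotSlot`, `slow_square`, and `fast_square` are hypothetical, and a real JIT would patch linker-managed memory rather than a C++ object. The sketch shows only why indirection through a patchable slot lets code run before its callees are compiled.

```cpp
#include <atomic>

using fn_t = int (*)(int);

// Hypothetical GOT-style slot: callers always go through `target`, which
// initially points at a fallback and is patched once the real code exists.
struct GotSlot {
    std::atomic<fn_t> target;

    explicit GotSlot(fn_t fallback) : target(fallback) {}

    // Indirect call through the slot, as a PLT stub would do.
    int call(int x) const { return target.load(std::memory_order_acquire)(x); }

    // Publish the compiled entry point; later calls bypass the fallback.
    void resolve(fn_t compiled) { target.store(compiled, std::memory_order_release); }
};

static int fallback_calls = 0;

// Stand-in for a trampoline used while the callee is not yet compiled.
int slow_square(int x) {
    ++fallback_calls;
    return x * x;
}

// Stand-in for the compiled callee.
int fast_square(int x) { return x * x; }
```

With a slot like this, the caller can be linked and executed immediately with `GotSlot slot(slow_square);`, and a later `slot.resolve(fast_square);` redirects subsequent calls without relinking the caller.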