Skip to content

Conversation

christiangnrd
Copy link
Member

@christiangnrd christiangnrd commented Jun 23, 2025

Update as of 07/29 (also added to todo):

  • I believe synchronization must be done using MTLSharedEvents (Synchronize using MTLSharedEvents #633).
  • Kernels return zeros, I don't know when I'll have the chance to return to this, but for anyone interested in contributing, my next attempt would be looking into residency sets. To make it easier, this could be done by first using residency sets in Metal 3 on macOS > 15, and then converting to Metal 4 could be easier.

To actually test the Metal 4 stuff, you need to be on the macOS 26 developer beta, and be using ObjectiveC.jl > v3.4.2

For 8-bit integer fills, it works for smaller vectors, but not for bigger ones. I think it's because I can't figure out how to do synchronization like we do with Metal3.

julia> using Metal; Metal.fill(UInt8(4), 10000)
[ Info: 4
10000-element MtlVector{UInt8, Metal.PrivateStorage}:
 0x04
 0x04
    ⋮
 0x04
 0x04

julia> using Metal; Metal.fill(UInt8(4), 100000)
[ Info: 4
100000-element MtlVector{UInt8, Metal.PrivateStorage}:
 0x00
 0x00
    ⋮
 0x00
 0x00

For custom kernels, I don't know if the only issue is synchronization or if there's something else that's broken.

julia> using Metal; a = mtl([1,2,3,4]); a .+ 1
4-element MtlVector{Int64, Metal.PrivateStorage}:
 0
 0
 0
 0

TODO:

I'm kind of leaving this here for feedback/suggestions/help.

Copy link

codecov bot commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 22.31405% with 188 lines in your changes missing coverage. Please review.

Project coverage is 75.78%. Comparing base (1e1e193) to head (34e5dea).

Files with missing lines Patch % Lines
lib/mtl/command_queue4.jl 0.00% 39 Missing ⚠️
lib/mtl/command_buf4.jl 0.00% 28 Missing ⚠️
lib/mtl/command_alloc4.jl 0.00% 27 Missing ⚠️
lib/mtl/command_enc/compute4.jl 0.00% 27 Missing ⚠️
src/state.jl 27.58% 21 Missing ⚠️
lib/mtl/arg_table.jl 0.00% 16 Missing ⚠️
src/compiler/execution.jl 61.76% 13 Missing ⚠️
lib/mtl/command_enc4.jl 0.00% 8 Missing ⚠️
lib/mtl/compute_pipeline4.jl 0.00% 5 Missing ⚠️
src/memory.jl 78.94% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #612      +/-   ##
==========================================
- Coverage   80.76%   75.78%   -4.99%     
==========================================
  Files          61       68       +7     
  Lines        2693     2907     +214     
==========================================
+ Hits         2175     2203      +28     
- Misses        518      704     +186     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

#default_storage = "private"

# when false, always use metal 3 even on a metal 4 platform
#force_metal3 = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about a api_version preference, determined automatically if unset, but overridable by setting "3" or "4"?

I'd also use string values; literals have given me issues in the past (if there's a typo in the user-provided TOML, stuff breaks).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted about the string values.

Metal 3 can be used alongside Metal 4, so maybe we should call it default_api_version to make it clear that all it's doing is selecting the default API?

I think maybe that this preference should only affect whether global_queue returns an MTLCommandQueue or an MTL4CommandQueue, and everything else that takes in a queue should dispatch on the correct implementation based on which type was provided?

# command allocator
#

export MTL4CommandAllocator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large is the Metal 4 API? Should we put it in a mtl4 top-level folder?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the Metal 4 API lives as new methods of Metal 3 Objects so maybe a mtl4 subfolder in lib/mtl would be more appropriate?

Comment on lines 46 to 69
# Copy Operations (Blit functionality integrated into compute encoder in Metal 4)
# function append_copy!(cce::MTL4ComputeCommandEncoder, dst::MTLBuffer, dstOffset::Integer,
# src::MTLBuffer, srcOffset::Integer, size::Integer)
# @objc [cce::id{MTL4ComputeCommandEncoder} copyFromBuffer:src::id{MTLBuffer}
# sourceOffset:srcOffset::NSUInteger
# toBuffer:dst::id{MTLBuffer}
# destinationOffset:dstOffset::NSUInteger
# size:size::NSUInteger]::Nothing
# end

# function append_copy!(cce::MTL4ComputeCommandEncoder, dst::MTLTexture, dstSlice::Integer, dstLevel::Integer, dstOrigin::MTLOrigin,
# src::MTLBuffer, srcOffset::Integer, srcBytesPerRow::Integer, srcBytesPerImage::Integer,
# size::MTLSize)
# @objc [cce::id{MTL4ComputeCommandEncoder} copyFromBuffer:src::id{MTLBuffer}
# sourceOffset:srcOffset::NSUInteger
# sourceBytesPerRow:srcBytesPerRow::NSUInteger
# sourceBytesPerImage:srcBytesPerImage::NSUInteger
# sourceSize:size::MTLSize
# toTexture:dst::id{MTLTexture}
# destinationSlice:dstSlice::NSUInteger
# destinationLevel:dstLevel::NSUInteger
# destinationOrigin:dstOrigin::MTLOrigin]::Nothing
# end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these can be removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they exist with the new MTL4ComputerCommandEncoder. I just wanted to lay the groundwork for custom kernels and then one piece of functionality using Metal 4 (unsafe_fill! is what I landed on). This code hasn't been tested or anything but once synchronization is figured out I'll work on the rest of the Metal 4 MtlArray functionality.


@autoreleasepool function (kernel::HostKernel)(args...; groups=1, threads=1,
queue=global_queue(device()))
queue=use_metal4() ? global_queue4(device()) : global_queue(device()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global_queue4 is weird, and I'd rather not have to do this dance everywhere. I'd rather we can use these regardless of the API version, so what about having global_queue(::MTLDevice) return an appropriate object given the selected API? Given that preferences are determined at compile time, that should also be type stable.

Comment on lines 324 to 347
if !use_mtl4
roots = [kernel.f, args]
MTL.on_completed(cmdbuf) do buf
empty!(roots)
foreach(free, argument_buffers)

# Check for errors
# XXX: we cannot do this nicely, e.g. throwing an `error` or reporting with `@error`
# because we're not allowed to switch tasks from this contexts.
if buf.status == MTL.MTLCommandBufferStatusError
Core.println("ERROR: Failed to submit command buffer: $(buf.error.localizedDescription)")
end
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need an equivalent for this, or Julia can early-free resources.

Related:

Unlike the default behavior of MTLCommandBuffer, you may need to consider a resource’s retain count because each MTL4CommandBuffer instance doesn’t create strong references to resources. This is similar to creating an MTLCommandBuffer with the makeCommandBufferWithUnretainedReferences() method of an MTLCommandQueue.

cce = MTLComputeCommandEncoder(cmdbuf)
if use_mtl4
allocator = MTL4CommandAllocator(device())
cmdbuf = MTL4CommandBuffer(device())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can reuse and repurpose each command buffer indefinitely by starting over, encoding new commands, and committing it again, instead of allocating a new buffer.

As a future TODO, we could probably use a task-local command buffer just like we have task-local queues now.

Comment on lines +12 to +16
const use_metal4 = OncePerProcess{Bool}() do
dev = device()
force_metal3 = @load_preference("force_metal3", false)
return functional() && !force_metal3 && supports_family(dev, MTL.MTLGPUFamilyMetal4)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this api_version::Int then, parsing the api_version String preferences here.

src/state.jl Outdated
device!(dev::MTLDevice) = task_local_storage(:MTLDevice, dev)

const global_queues = WeakKeyDict{MTLCommandQueue,Nothing}()
const global_queues4 = WeakKeyDict{MTL4CommandQueue,Nothing}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the preference being compile-time determined, we don't need multiple of these.

If necessary (e.g. for type definitions or macro expansions), you can use @static if. Otherwise a plain top-level if api_version == ... should work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants