Add mixed-attention Core ML mask support for stateful generation #331
Open
Skyline-23 wants to merge 3 commits into huggingface:main
Conversation
Commits:

- add support for fullAttentionMask and slidingAttentionMask model inputs in the stateful generation path
- derive sliding window masks from model metadata or config when needed
- add regression tests for additive full and sliding attention mask construction

- add fullAttentionMask and slidingAttentionMask handling to the stateful generation path
- resolve the sliding window size from model metadata or config for mixed-attention models
- add regression tests for additive full and sliding attention mask construction

- factor stateful generation input assembly into a reusable helper
- verify full and sliding attention mask keys, shapes, and additive values
- keep single-mask generation behavior unchanged while covering mixed-attention inputs
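The commits above refer to "additive" full and sliding attention mask construction. A minimal Swift sketch of that idea follows; the function names, the use of plain nested arrays instead of MLMultiArray, and -1e9 as the additive masked value are illustrative assumptions, not the PR's actual code:

```swift
import Foundation

/// Illustrative additive-mask convention: 0 keeps a position,
/// a large negative value effectively removes it after softmax.
let maskedValue: Float = -1e9

/// Full (causal) additive mask: query i attends to every key j
/// up to and including its own absolute position.
func fullAttentionMask(queryLength: Int, keyLength: Int) -> [[Float]] {
    let offset = keyLength - queryLength
    return (0..<queryLength).map { i in
        (0..<keyLength).map { j in
            j <= i + offset ? 0 : maskedValue
        }
    }
}

/// Sliding-window additive mask: query i attends only to the last
/// `windowSize` causal key positions.
func slidingAttentionMask(queryLength: Int, keyLength: Int, windowSize: Int) -> [[Float]] {
    let offset = keyLength - queryLength
    return (0..<queryLength).map { i in
        (0..<keyLength).map { j in
            let absolute = i + offset
            return (j <= absolute && j > absolute - windowSize) ? 0 : maskedValue
        }
    }
}
```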
pcuenca (Member) reviewed on Mar 9, 2026 and left a comment:
Very interesting and cool PR @Skyline-23! I won't be able to properly test and review it until the end of the week. Meanwhile, a couple of questions:
- The converted example model seems to be using float32 instead of float16 (because of this line, and because the repo takes ~16 GB). Did you try to convert to float16? Did you try any quantization options?
- Are you using or planning to use this Core ML model in a downstream app?
Thanks a lot for the contribution!
Skyline-23 (Contributor, Author):

@pcuenca Sorry for the late reply! It's fine, please take your time with the review.
What
Add support for stateful Core ML language models that require multiple attention masks during generation.
Why
The current runtime only handles attentionMask / causalMask, which is not sufficient for mixed-attention Core ML exports that need separate masks for different layer types.
This change allows the stateful generation path to populate:

- fullAttentionMask
- slidingAttentionMask

when those inputs are present in the Core ML model description.
Implementation

- Populate fullAttentionMask and slidingAttentionMask during stateful generation when the model declares those inputs, as sketched below.
- Resolve the sliding window size from model metadata, falling back to the config, for mixed-attention models.
- Factor stateful generation input assembly into a reusable helper; single-mask generation behavior is unchanged.
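A minimal sketch of the conditional input assembly described above. The helper name `assembleGenerationInputs`, the `"inputIds"` key, and the exact feature-provider wiring are assumptions for illustration; only the two mask input names come from the PR:

```swift
import CoreML

/// Sketch: build the feature dictionary for one stateful decoding step,
/// adding the full and sliding masks only when the compiled model's
/// description actually declares those inputs.
func assembleGenerationInputs(
    model: MLModel,
    inputIds: MLMultiArray,
    fullMask: @autoclosure () throws -> MLMultiArray,
    slidingMask: @autoclosure () throws -> MLMultiArray
) throws -> MLDictionaryFeatureProvider {
    var features: [String: Any] = ["inputIds": inputIds]
    let declared = model.modelDescription.inputDescriptionsByName

    if declared["fullAttentionMask"] != nil {
        features["fullAttentionMask"] = try fullMask()
    }
    if declared["slidingAttentionMask"] != nil {
        features["slidingAttentionMask"] = try slidingMask()
    }
    return try MLDictionaryFeatureProvider(dictionary: features)
}
```

Because the masks are passed as autoclosures, a single-mask model never pays the cost of building masks it does not declare.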
Tests
swift test --filter LanguageModelCoreMLMaskTests
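For context, a hedged sketch of what one of these regression tests could look like, reusing the illustrative mask helpers from earlier in this thread. Only the test class name comes from the PR; the test method and its constants are hypothetical:

```swift
import XCTest

final class LanguageModelCoreMLMaskTests: XCTestCase {
    /// With a window of 4 over 8 cached keys, a single-token query
    /// should attend to the last 4 positions and mask the rest.
    func testSlidingMaskLimitsAttentionWindow() {
        let mask = slidingAttentionMask(queryLength: 1, keyLength: 8, windowSize: 4)
        let row = mask[0]
        XCTAssertTrue(row.suffix(4).allSatisfy { $0 == 0 })
        XCTAssertTrue(row.prefix(4).allSatisfy { $0 == maskedValue })
    }
}
```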
Scope clarification
This PR is intended to support explicit multi-mask Core ML generation contracts in the runtime.
It does not attempt to fix exporter-side approaches that reconstruct multiple masks inside a Core ML graph from a single causalMask input.
Additional context
Closes #330
Example converted Core ML repo using the explicit multi-mask contract:
https://huggingface.co/Skyline23/translategemma-4b-it-coreml