Replies: 1 comment
-
Maybe the MaxText codebase can be useful for porting the model. Gemma 3n is still a feature request in the MaxText repository, but you can check the Gemma 3 vision encoder and text encoder fusion code, which may help: https://github.com/AI-Hypercomputer/maxtext/blob/7070e8eecbea8951c8e5281219ce797c8df1441f/MaxText/layers/gemma3.py#L15-L16
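At a high level, that fusion amounts to projecting the vision encoder's output to the text model's width and writing it into the positions of the image placeholder tokens. A minimal sketch of that idea (the placeholder id and function name are illustrative, not the MaxText implementation):

```python
# Illustrative sketch: scatter vision embeddings into the placeholder-token
# positions of the text embedding sequence. Not taken from MaxText.
import jax.numpy as jnp

IMAGE_PLACEHOLDER_ID = 262144  # hypothetical id for the image placeholder token

def fuse_vision_and_text(token_ids, text_embeds, vision_embeds):
    """token_ids: [T] int32, text_embeds: [T, D], vision_embeds: [N, D].

    Assumes the prompt contains exactly N placeholder tokens, in order.
    """
    is_image = token_ids == IMAGE_PLACEHOLDER_ID                    # [T] bool
    # Running index of each position within the image-token subsequence.
    image_slot = jnp.cumsum(is_image.astype(jnp.int32)) - 1         # [T]
    safe_slot = jnp.clip(image_slot, 0, vision_embeds.shape[0] - 1)
    gathered = vision_embeds[safe_slot]                              # [T, D]
    return jnp.where(is_image[:, None], gathered, text_embeds)      # [T, D]
```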
-
Hi all,
I’m porting Gemma 3n to Flax NNX and could use some guidance on the vision + text path.
What I’ve done so far:
• Adapted the text-only Linen version from the Gemma repo to NNX (a representative example of the translation pattern is sketched after this list).
• Implemented MobileNet v5 for vision, taking cues from Transformers and vlm-mlx.
• Loaded text checkpoints from the Gemma 3n repo and vision checkpoints from the Transformers version.
• I can run a forward pass through the model; text-only generation looks good.
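To give a sense of the translation pattern I used for the first bullet, here is a minimal, illustrative example (a Gemma-style RMSNorm, not the actual ported module):

```python
# Illustrative only: Linen params created via self.param(...) become explicit
# nnx.Param attributes created in __init__.
import jax
import jax.numpy as jnp
from flax import nnx

class RMSNorm(nnx.Module):
    def __init__(self, dim: int, *, eps: float = 1e-6):
        self.scale = nnx.Param(jnp.zeros((dim,)))  # zero-init scale, Gemma-style
        self.eps = eps

    def __call__(self, x):
        var = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
        normed = x * jax.lax.rsqrt(var + self.eps)
        return normed * (1.0 + self.scale.value)
```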
Where I’m stuck:
• Vision + text generation isn’t working. I’m likely mishandling the mask, specifically _create_sliding_mask_for_gemma_3n. I’m not confident I’ve reproduced the intended behavior.
Could anyone share high‑level hints or pointers to the intended masking/fusion behavior for the multimodal path (especially how the sliding mask should treat vision tokens vs. text)? Any insight from Gemma 3n devs or folks who have done a similar port would be super helpful. Happy to share code snippets if useful.
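For concreteness, here is roughly what my current mask does (a minimal sketch; treating tokens of the same image as bidirectional is my assumption, not something I have confirmed against the reference implementation):

```python
# Sketch of my current masking logic: text tokens are causal + sliding window,
# tokens belonging to the same image attend to each other bidirectionally.
import jax.numpy as jnp

def sliding_mask(positions, image_segment_ids, window: int):
    """positions: [T] token positions; image_segment_ids: [T] int32,
    0 for text tokens, k > 0 for tokens of the k-th image."""
    q = positions[:, None]                                 # [T, 1]
    k = positions[None, :]                                 # [1, T]
    text_mask = (k <= q) & ((q - k) < window)              # causal + window

    same_image = (image_segment_ids[:, None] == image_segment_ids[None, :]) \
                 & (image_segment_ids[:, None] > 0)
    return text_mask | same_image                          # [T, T], True = attend
```

Is the intended behavior closer to this, or should image tokens also be constrained by the sliding window?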
Thanks!
Jao