I have been trying to convert a single-processor ViT training script to data parallel on Kaggle TPUs with (8, 1) sharding.
I followed the approach from the "Train a MiniGPT" notebook in jax-ai-stack, but I'm running into issues when sharding the batches (and possibly also in how the batch sharding interacts with the model's sharding).
Here is a gist of the notebook and the errors (note that I skip the weight conversion from HF):
https://gist.github.com/heydaari/854e00c28f57806f0f7ac0818f013bbd
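For context, this is the batch-sharding pattern I'm trying to follow, reduced to a minimal sketch (shapes and axis names are illustrative, not from the gist; the CPU device-count flag is only there so the sketch runs without a TPU):

```python
import os
# Simulate 8 devices on CPU so this sketch runs anywhere (testing assumption only;
# on Kaggle TPUs jax.devices() already returns the 8 TPU cores).
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# (8, 1) mesh: 8-way data parallelism, no model parallelism.
devices = np.array(jax.devices()[:8]).reshape(8, 1)
mesh = Mesh(devices, axis_names=("batch", "model"))

# Shard the batch along its leading dimension across the "batch" axis;
# model parameters would be replicated (P() / the "model" axis of size 1).
batch_sharding = NamedSharding(mesh, P("batch"))
batch = jnp.ones((32, 196, 768))  # (batch, tokens, embed) - illustrative ViT shapes
sharded_batch = jax.device_put(batch, batch_sharding)

# Each of the 8 devices should now hold a (4, 196, 768) slice of the batch.
print([s.data.shape for s in sharded_batch.addressable_shards])
```

My understanding is that the global batch size must be divisible by the 8-way "batch" axis for this to work, which is one of the things I suspect is going wrong in my case.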