Description
I would like to request an expansion of the Croatian language dataset, specifically focusing on everyday communication and specialized domains such as IT, law, and medicine, with a strong emphasis on correct spelling and grammar.
Current Issues
Currently, the model often generates Croatian sentences that contain small but consequential errors in gender agreement, grammar, or orthography. These mistakes make the final output unprofessional or unusable, which is quite frustrating ("so close, yet so far").
Furthermore, I have noticed a recurring pattern in which models default to Serbian (Ekavian dialect, Cyrillic script) as the primary reference point. This forces me to specify manually in every prompt that I am using the Croatian language (Latin script) and to explicitly ask the model to avoid loanwords and non-Croatian expressions.
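To illustrate the manual workaround described above, here is a minimal Python sketch. The instruction wording and the helper names `with_croatian_instruction` and `contains_cyrillic` are hypothetical and not part of any Gemma API. The Cyrillic check only covers the basic Cyrillic Unicode block and cannot detect Ekavian forms written in Latin script.

```python
# Hypothetical instruction text; the exact wording is an assumption.
CROATIAN_INSTRUCTION = (
    "Respond in standard Croatian using the Latin script. "
    "Do not use Serbian Ekavian forms, Cyrillic script, "
    "or non-Croatian expressions."
)


def with_croatian_instruction(prompt: str) -> str:
    """Prepend the language instruction to a user prompt."""
    return f"{CROATIAN_INSTRUCTION}\n\n{prompt}"


def contains_cyrillic(text: str) -> bool:
    """Return True if any character falls in the basic Cyrillic block (U+0400 to U+04FF)."""
    return any("\u0400" <= ch <= "\u04FF" for ch in text)


if __name__ == "__main__":
    prompt = with_croatian_instruction("Napiši kratku poruku kolegi o sastanku.")
    print(prompt)
    # A model response could then be screened before use, for example:
    # if contains_cyrillic(response): retry or flag the output
```

Having to wrap every prompt like this and screen every response is exactly the overhead this request aims to remove.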
Requested Improvement
Please dedicate more attention to:
- Expanding the Croatian vocabulary and idiomatic expressions.
- Improving grammatical accuracy and orthographic precision.
- Better differentiation between South Slavic languages to avoid mixing dialects.
I want to be able to use the full potential of the Gemma4 model without constant linguistic corrections. I am at your disposal if there is any way I can assist in providing feedback or helping improve this area.