Feature Request: Enhanced Grammar and Vocabulary Support for Underrepresented Languages (Croatian) #557

@MacakPasko

Description

I would like to request an expansion of the Croatian language dataset, specifically focusing on everyday communication and specialized domains such as IT, law, and medicine, with a strong emphasis on correct spelling and grammar.

Current Issues

Currently, the model often generates Croatian sentences that contain small but critical errors in gender agreement, grammar, or orthography. These mistakes can make the final output unprofessional or unusable, which is quite frustrating ("so close, yet so far").

Furthermore, I have noticed a recurring pattern where models often default to Serbian (Ekavian dialect, Cyrillic script) as the primary reference point. This forces me to manually specify in every prompt that I am using the Croatian language (Latin script) and to explicitly ask the model to avoid loanwords or non-Croatian expressions.
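To illustrate the manual workaround described above, here is a minimal Python sketch. It assumes nothing about any particular model API: the instruction wording and both helper names (`with_croatian_instruction`, `contains_cyrillic`) are hypothetical, chosen only for this example. It prepends a language instruction to each prompt and flags output that contains Cyrillic characters.

```python
import unicodedata

# Hypothetical instruction text; the wording is an assumption, not an official prompt.
CROATIAN_INSTRUCTION = (
    "Respond in standard Croatian (Ijekavian, Latin script). "
    "Do not use Serbian (Ekavian) forms or Cyrillic script."
)


def with_croatian_instruction(prompt: str) -> str:
    """Prepend the language instruction to a user prompt."""
    return f"{CROATIAN_INSTRUCTION}\n\n{prompt}"


def contains_cyrillic(text: str) -> bool:
    """Return True if any character's Unicode name marks it as Cyrillic."""
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in text)


if __name__ == "__main__":
    print(with_croatian_instruction("Napiši kratku poruku klijentu."))
    print(contains_cyrillic("млеко"))    # Cyrillic input
    print(contains_cyrillic("mlijeko"))  # Latin input
```

A check like this only catches the script problem. It cannot detect Ekavian forms written in Latin script or subtle agreement errors, which is why I am asking for an improvement in the model itself.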

Requested Improvement

Please dedicate more attention to:

  1. Expanding the Croatian vocabulary and idiomatic expressions.
  2. Improving grammatical accuracy and orthographic precision.
  3. Better differentiation between South Slavic languages to avoid mixing dialects.

I want to be able to use the full potential of the Gemma4 model without constant linguistic corrections. I am at your disposal if there is any way I can assist in providing feedback or helping improve this area.
