@mosuka mosuka commented Oct 2, 2025

This commit introduces model training and dictionary export capabilities
to lindera-python, enabling users to train custom morphological analysis
models from annotated corpus data.

Features:

  • Add train() function to train CRF-based models from a corpus
    • Supports L1 regularization, a configurable iteration count, and multi-threading
    • Accepts a seed lexicon, a corpus, and character, unknown-word, and feature definitions
  • Add export() function to export trained models to dictionary files
    • Generates lex.csv, matrix.def, unk.def, char.def
    • Optional metadata.json update support
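
As a quick sanity check after an export, one could verify that the output directory contains the four files listed above. The sketch below is only an illustration: the helper name `missing_dictionary_files` is hypothetical and not part of lindera-python, and it does not call train() or export() itself.

```python
from pathlib import Path

# File names taken from the export() feature list above.
EXPECTED_FILES = ("lex.csv", "matrix.def", "unk.def", "char.def")


def missing_dictionary_files(dict_dir):
    """Return the expected dictionary files that are absent from dict_dir.

    Hypothetical helper for illustration; not part of lindera-python.
    """
    root = Path(dict_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all four files are present. metadata.json is deliberately left out of the check because the description marks it as optional.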

Implementation:

  • New src/trainer.rs module with PyO3 bindings for train/export
  • Add 'train' feature flag in Cargo.toml (requires lindera/train)
  • Use local lindera path (../lindera/lindera) for latest trainer API
  • Add num_cpus dependency for automatic thread detection

Documentation:

  • Update README.md with training/export usage examples
  • Add examples/train_and_export.py with complete workflow demonstration
  • Add tests/test_trainer.py with comprehensive test coverage
  • Corpus format follows lindera/resources/training conventions
    (tab-separated surface + features with EOS markers)
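
To make the corpus layout concrete, here is a minimal sketch of a reader for it, assuming one token per line as `surface<TAB>features` and a line containing only `EOS` closing each sentence. The sample tokens and their feature columns are made up for illustration and are not taken from lindera/resources/training; the function name `parse_corpus` is likewise hypothetical.

```python
# Illustrative sample only: tokens and feature columns are invented.
SAMPLE = (
    "東京\t名詞,固有名詞\n"
    "に\t助詞,格助詞\n"
    "行く\t動詞,自立\n"
    "EOS\n"
    "猫\t名詞,一般\n"
    "EOS\n"
)


def parse_corpus(text):
    """Split corpus text into sentences of (surface, features) pairs.

    The features field is kept as a raw string; how it is subdivided
    is left to the dictionary's own conventions.
    """
    sentences = []
    current = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # tolerate blank lines between sentences
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        surface, sep, features = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: expected a tab separator")
        current.append((surface, features))
    if current:
        raise ValueError("corpus ended without a closing EOS marker")
    return sentences
```

Calling `parse_corpus(SAMPLE)` yields two sentences, the first with three tokens and the second with one. Rejecting a missing tab or a missing final EOS early is a design choice of this sketch, intended to surface malformed training data before it reaches the trainer.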

Changes:

  • Modified: Cargo.toml, src/lib.rs, README.md
  • Added: src/trainer.rs, examples/train_and_export.py, tests/test_trainer.py
  • Updated: Cargo.lock, poetry.lock, pyproject.toml, Makefile

@mosuka mosuka merged commit f8a3652 into main Oct 2, 2025
5 checks passed
@mosuka mosuka deleted the training branch October 2, 2025 23:09