feat: add vowel length marks (ː) to IPA output by yocontra · Pull Request #14 · hans00/phonemize

yocontra · 2026-03-04T07:26:13Z

Summary

Standard English IPA transcription uses vowel length marks (ː) to distinguish long vowels:

Lexical set	Without (current)	With length mark (this PR)
FLEECE ("see")	si	siː
GOOSE ("food")	fud	fuːd
THOUGHT ("thought")	θɔt	θɔːt
PALM ("car")	kɑɹ	kɑːɹ
LOT ("hot")	hɑt	hɑːt

The library already correctly distinguishes short vs long vowels by using different symbols (ɪ vs i, ʊ vs u), but omits the length mark ː. This matters for TTS systems trained on length-marked IPA (like espeak-ng output), where the presence or absence of ː affects vowel duration in synthesized speech.

Rules implemented

FLEECE (i → iː) and GOOSE (u → uː): Lengthened only when they are the first vowel in a stressed syllable (after ˈ or ˌ). This preserves short unstressed final vowels — e.g. "happy" remains hæpi, not hæpiː.
THOUGHT (ɔ → ɔː): Always long in American English.
PALM/LOT (ɑ → ɑː): Always long (LOT-PALM merger in American English).
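
The three rules above can be sketched as two regular-expression passes. This is a minimal illustration, not the code in src/tokenizer.ts: the name addVowelLengthSketch and the VOWELS inventory are assumptions made for the example, inputs are assumed to carry stress marks, and diphthongs such as ɔɪ are not special-cased, which the real implementation may handle differently.

```typescript
// Hypothetical vowel inventory, used only to locate the first vowel after a stress mark.
const VOWELS = "aeiouɑɔæɛɪʊʌəɚɝɜɒ";

// A stress mark, then any run of non-vowel symbols, then i or u that is not
// already followed by the length mark.
const STRESSED_HIGH_VOWEL = new RegExp(
  `([ˈˌ][^${VOWELS}ˈˌ\\s]*)([iu])(?!ː)`,
  "g"
);

// Hypothetical stand-in for the PR's addVowelLength().
function addVowelLengthSketch(ipa: string): string {
  return ipa
    // FLEECE / GOOSE: lengthen only the first vowel of a stressed syllable.
    .replace(STRESSED_HIGH_VOWEL, "$1$2ː")
    // THOUGHT and PALM/LOT: always long, unless already marked.
    .replace(/([ɔɑ])(?!ː)/g, "$1ː");
}
```

Under these assumptions, addVowelLengthSketch("ˈsi") yields "ˈsiː", while "ˈhæpi" comes back unchanged because its first stressed vowel is æ and the final i is never reached by the stress-anchored pattern.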

Scope

Only applies to English words in IPA format output. IPA output for Chinese and other languages is not affected.
ARPABET output is unchanged (ARPABET doesn't use length marks).
The addVowelLength() function is idempotent — it uses negative lookahead (?!ː) to avoid double-marking.
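
To illustrate the scope rules, here is a small sketch of the gating and of why the negative lookahead makes the pass idempotent. Only the always-long rule is reproduced to keep it short. The names postProcessSketch, lengthen, and lengthenNaive, the Format type, and the "zh" language code are assumptions made for the example, not the library's API.

```typescript
type Format = "ipa" | "arpabet";

// With the (?!ː) lookahead, an already-marked vowel is skipped on later passes.
const lengthen = (ipa: string): string =>
  ipa.replace(/([ɔɑ])(?!ː)/g, "$1ː");

// Same rule without the lookahead, shown only for contrast: it double-marks.
const lengthenNaive = (ipa: string): string =>
  ipa.replace(/([ɔɑ])/g, "$1ː");

// Hypothetical gate: lengthen only English text in IPA format.
function postProcessSketch(text: string, language: string, format: Format): string {
  if (format !== "ipa" || language !== "en") {
    return text; // ARPABET and non-English output pass through untouched
  }
  return lengthen(text);
}
```

Applying lengthen twice to "θɔt" still gives "θɔːt", whereas applying lengthenNaive twice gives "θɔːːt". Calling postProcessSketch with language "zh" or format "arpabet" returns the input unchanged.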

Changes

src/tokenizer.ts: Added addVowelLength() post-processing function in the IPA output path, gated by language detection (en only). Added language parameter to _postProcess() and _predict().
__tests__/index.test.ts: Updated existing test expectations to include length marks.
__tests__/vowel-length.test.ts: New dedicated test suite covering all four vowel length rules, stress-sensitivity, language isolation, and idempotency.

When ipa-dict provides multiple pronunciation variants for a word (e.g. "caught" → /ˈkɑt/, /ˈkɔt/), prefer the variant containing ɔ (THOUGHT vowel) over ɑ (LOT vowel). This better represents standard American English pronunciation for THOUGHT-class words like caught, bought, law, fall, walk, want, etc.
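
The variant preference described above could look roughly like the following. This is a sketch under assumptions: preferThoughtVariantSketch is a hypothetical helper name, and the real lookup code may store or order variants differently.

```typescript
// Return the first variant containing the THOUGHT vowel ɔ; otherwise fall
// back to the first listed variant (or undefined for an empty list).
function preferThoughtVariantSketch(variants: string[]): string | undefined {
  for (let i = 0; i < variants.length; i++) {
    if (variants[i].indexOf("ɔ") !== -1) {
      return variants[i];
    }
  }
  return variants.length > 0 ? variants[0] : undefined;
}
```

For the "caught" example, preferThoughtVariantSketch(["ˈkɑt", "ˈkɔt"]) returns "ˈkɔt". Words with no ɔ variant keep their first listed pronunciation.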

Add standard IPA vowel length marks for American English:
- FLEECE (iː) and GOOSE (uː): lengthened in stressed syllables only, preserving short unstressed vowels (e.g. "happy" = hæpi)
- THOUGHT (ɔː): always long in American English
- PALM/LOT (ɑː): always long (LOT-PALM merger in AmE)

The library already uses the correct vowel symbols (ɪ vs i, ʊ vs u) but was missing the length mark ː that standard IPA transcription requires. This matters for TTS systems trained on length-marked IPA (like espeak-ng output).

Vowel lengthening is applied only to English words; Chinese and other language IPA output is not affected.

Merges the following PRs into a single combined branch:
- hans00#11: fix: prefer THOUGHT vowel (ɔ) variant in dictionary
- hans00#12: fix: use STRUT vowel (ʌ) instead of schwa (ə) in stressed syllables
- hans00#13: fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ
- hans00#14: feat: add vowel length marks (ː) to IPA output
- hans00#15: fix: British English pronunciation corrections

yocontra added 2 commits March 3, 2026 23:17

yocontra wants to merge 2 commits into hans00:main from yocontra:feat/vowel-length
