feat: add vowel length marks (ː) to IPA output #14

Open
yocontra wants to merge 2 commits into hans00:main from yocontra:feat/vowel-length

@yocontra yocontra commented Mar 4, 2026

Summary

Standard English IPA transcription uses vowel length marks (ː) to distinguish long vowels:

| Lexical set | Without (current) | With length mark (this PR) |
|---|---|---|
| FLEECE ("see") | si | siː |
| GOOSE ("food") | fud | fuːd |
| THOUGHT ("thought") | θɔt | θɔːt |
| PALM ("car") | kɑɹ | kɑːɹ |
| LOT ("hot") | hɑt | hɑːt |

The library already correctly distinguishes short vs long vowels by using different symbols (ɪ vs i, ʊ vs u), but omits the length mark ː. This matters for TTS systems trained on length-marked IPA (like espeak-ng output), where the presence or absence of ː affects vowel duration in synthesized speech.

Rules implemented

  • FLEECE (i → iː) and GOOSE (u → uː): Lengthened only when they are the first vowel in a stressed syllable (after ˈ or ˌ). This preserves short unstressed final vowels — e.g. "happy" remains hæpi, not hæpiː.
  • THOUGHT (ɔ → ɔː): Always long in American English.
  • PALM/LOT (ɑ → ɑː): Always long (LOT-PALM merger in American English).
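
The three rules above can be sketched as a chain of regex replacements. This is a minimal illustration, not the code in src/tokenizer.ts: the vowel inventory in `VOWELS`, the whitespace exclusion, and the guard that leaves the CHOICE diphthong ɔɪ untouched are all assumptions made for the sketch, and the input is assumed to still carry stress marks.

```typescript
// Illustrative sketch only; the real addVowelLength() in the PR may differ.
// Assumed vowel inventory, used to find the first vowel after a stress mark.
const VOWELS = "aeiouæɑɒɔəɚɛɜɝɪʊʌ";

// A stress mark, then a run of non-vowel onset characters, then i or u
// that is not already followed by a length mark.
const STRESSED_FLEECE_GOOSE = new RegExp(
  `([ˈˌ][^${VOWELS}ˈˌ\\s]*)([iu])(?!ː)`,
  "g"
);

function addVowelLength(ipa: string): string {
  return ipa
    // THOUGHT: always long. Skipping ɔɪ is an assumption of this sketch.
    .replace(/ɔ(?![ːɪ])/g, "ɔː")
    // PALM/LOT: always long.
    .replace(/ɑ(?!ː)/g, "ɑː")
    // FLEECE/GOOSE: long only as the first vowel of a stressed syllable.
    .replace(STRESSED_FLEECE_GOOSE, "$1$2ː");
}

console.log(addVowelLength("ˈsi"));   // ˈsiː
console.log(addVowelLength("ˈhæpi")); // ˈhæpi (final i is unstressed)
console.log(addVowelLength("ˈizi"));  // ˈiːzi (only the stressed i is lengthened)
```

Anchoring the i/u rule on the stress mark is what keeps the unstressed final vowel of "happy" short.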

Scope

  • Only applies to English words in IPA format output. IPA output for Chinese and other languages is not affected.
  • ARPABET output is unchanged (ARPABET doesn't use length marks).
  • The addVowelLength() function is idempotent — it uses negative lookahead (?!ː) to avoid double-marking.
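
To show why the negative lookahead matters and how the language gate might look, here is a small sketch that compares a naive replacement with a guarded one. The helper names `naiveLengthen`, `guardedLengthen`, and `postProcessIpa` are hypothetical, as is the `"zh"` language code; only the `(?!ː)` lookahead and the `en` gate come from the description above.

```typescript
// Hypothetical helpers for illustration; not the library's API.

// Without a guard, a second pass appends another length mark.
const naiveLengthen = (ipa: string): string => ipa.replace(/ɔ/g, "ɔː");

// With (?!ː), a vowel that already carries ː is left alone.
const guardedLengthen = (ipa: string): string => ipa.replace(/ɔ(?!ː)/g, "ɔː");

// Assumed shape of the language gate: only "en" output is post-processed.
function postProcessIpa(ipa: string, language: string): string {
  return language === "en" ? guardedLengthen(ipa) : ipa;
}

const once = guardedLengthen("ˈθɔt");
console.log(once);                                  // ˈθɔːt
console.log(guardedLengthen(once));                 // ˈθɔːt (unchanged)
console.log(naiveLengthen(naiveLengthen("ˈθɔt"))); // ˈθɔːːt (double-marked)
console.log(postProcessIpa("ˈθɔt", "zh"));          // ˈθɔt (non-English is untouched)
```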

Changes

  • src/tokenizer.ts: Added addVowelLength() post-processing function in the IPA output path, gated by language detection (en only). Added language parameter to _postProcess() and _predict().
  • __tests__/index.test.ts: Updated existing test expectations to include length marks.
  • __tests__/vowel-length.test.ts: New dedicated test suite covering all four vowel length rules, stress-sensitivity, language isolation, and idempotency.

yocontra added 2 commits March 3, 2026 23:17
When ipa-dict provides multiple pronunciation variants for a word
(e.g. "caught" → /ˈkɑt/, /ˈkɔt/), prefer the variant containing ɔ
(THOUGHT vowel) over ɑ (LOT vowel). This better represents standard
American English pronunciation for THOUGHT-class words like caught,
bought, law, fall, walk, want, etc.
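
The variant preference described in this commit message could be sketched as follows. The function name `preferThoughtVariant` and the plain string-array input are assumptions for illustration; how the library actually reads ipa-dict entries is not shown in this PR.

```typescript
// Hypothetical sketch: pick the first variant containing ɔ (THOUGHT),
// falling back to the first listed variant when none has it.
function preferThoughtVariant(variants: string[]): string | undefined {
  return variants.find((v) => v.includes("ɔ")) ?? variants[0];
}

console.log(preferThoughtVariant(["/ˈkɑt/", "/ˈkɔt/"])); // /ˈkɔt/
console.log(preferThoughtVariant(["/ˈhɑt/"]));           // /ˈhɑt/
```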

Add standard IPA vowel length marks for American English:

- FLEECE (iː) and GOOSE (uː): lengthened in stressed syllables only,
  preserving short unstressed vowels (e.g. "happy" = hæpi)
- THOUGHT (ɔː): always long in American English
- PALM/LOT (ɑː): always long (LOT-PALM merger in AmE)

The library already uses the correct vowel symbols (ɪ vs i, ʊ vs u)
but was missing the length mark ː that standard IPA transcription
requires. This matters for TTS systems trained on length-marked IPA
(like espeak-ng output).

Vowel lengthening is applied only to English words — Chinese and
other language IPA output is not affected.
yocontra added a commit to yocontra/phonemize that referenced this pull request Mar 15, 2026
Merges the following PRs into a single combined branch:
- hans00#11: fix: prefer THOUGHT vowel (ɔ) variant in dictionary
- hans00#12: fix: use STRUT vowel (ʌ) instead of schwa (ə) in stressed syllables
- hans00#13: fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ
- hans00#14: feat: add vowel length marks (ː) to IPA output
- hans00#15: fix: British English pronunciation corrections