fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ #13

Open
yocontra wants to merge 3 commits into hans00:main from yocontra:fix/nurse-vowel-linking-r

Conversation

@yocontra
Contributor

@yocontra yocontra commented Mar 4, 2026

Summary

The G2P dictionary uses a single symbol, ɝ, for all rhotacized vowels, but standard IPA transcription distinguishes two phonemes:

  • Stressed NURSE vowel ɜː: "bird" = bɜːd, "word" = wɜːd, "nurse" = nɜːs
  • Unstressed rhotacized schwa ɚ: "doctor" = dɑktɚ, "letter" = lɛtɚ, "teacher" = titʃɚ

Currently the library outputs ɝ for both, losing this distinction. For example:

  • world → ˈwɝɫd (should be ˈwɜːɫd)
  • doctor → ˈdɑktɝ (should be ˈdɑktɚ)

Additionally, when ɚ appears before a vowel, a linking ɹ consonant surfaces in connected speech:

  • "centuries" = sɛntʃɚɹiz (not sɛntʃɝiz)
  • "batteries" = bætɚɹiz (not bætɝiz)

Changes

Adds a fixRhotacizedVowels() post-processing function in src/tokenizer.ts that:

  1. Stressed ɝ → ɜː: Detects ɝ as the first vowel after a stress mark (ˈ or ˌ) with only consonants between them, converting it to the NURSE vowel
  2. Unstressed ɝ → ɚ: Converts all remaining ɝ to the standard unstressed rhotacized schwa
  3. Linking ɹ: Inserts ɹ between ɚ and a following vowel
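
For readers who want to see the three steps as code, here is a minimal sketch. It is not the actual implementation from src/tokenizer.ts: the `VOWELS` inventory and both regular expressions are assumptions made for illustration, and only the function name and the order of the steps come from the description above.

```typescript
// Assumed vowel inventory for illustration; the real dictionary may use more symbols.
const VOWELS = "aeiouɑæɐɒɔəɚɛɜɝɪʊʌ";

// A stress mark, then only non-vowel characters (consonants), then ɝ.
// Stress marks and whitespace are excluded so a match cannot cross into another syllable or word.
const STRESSED_NURSE = new RegExp(`([ˈˌ][^${VOWELS}ˈˌ\\s]*)ɝ`, "gu");

// ɚ immediately followed by a vowel.
const LINKING_R = new RegExp(`ɚ(?=[${VOWELS}])`, "gu");

function fixRhotacizedVowels(ipa: string): string {
  return ipa
    .replace(STRESSED_NURSE, "$1ɜː") // step 1: stressed ɝ becomes the NURSE vowel
    .replace(/ɝ/gu, "ɚ")             // step 2: every remaining ɝ becomes ɚ
    .replace(LINKING_R, "ɚɹ");       // step 3: insert linking ɹ before a vowel
}

console.log(fixRhotacizedVowels("ˈwɝɫd"));     // ˈwɜːɫd
console.log(fixRhotacizedVowels("ˈdɑktɝ"));    // ˈdɑktɚ
console.log(fixRhotacizedVowels("ˈsɛntʃɝiz")); // ˈsɛntʃɚɹiz
```

The order matters: step 2 must run after step 1 so that only the ɝ not claimed as stressed falls through to ɚ, and step 3 must run last so it sees the final ɚ symbols. In this sketch a transcription without any stress mark (for example "bɝd") falls through to ɚ.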

Motivation

This improves compatibility with TTS systems trained on espeak-ng output, which maintains the ɜː/ɚ distinction. Without it, TTS models may produce incorrect vowel quality for stressed NURSE words.

Test plan

  • Updated existing test expectations in __tests__/index.test.ts to reflect new output
  • Added dedicated test case for NURSE vowel distinction
  • Verified ARPABET output is not affected (conversion happens before ipaToArpabet)

yocontra added 3 commits March 3, 2026 23:17
When ipa-dict provides multiple pronunciation variants for a word
(e.g. "caught" → /ˈkɑt/, /ˈkɔt/), prefer the variant containing ɔ
(THOUGHT vowel) over ɑ (LOT vowel). This better represents standard
American English pronunciation for THOUGHT-class words like caught,
bought, law, fall, walk, want, etc.
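
A minimal sketch of the variant preference described in this commit message. The helper name `preferThoughtVariant` and its signature are hypothetical, not taken from the repository; the sketch only illustrates "pick the first variant containing ɔ, else fall back to the first variant".

```typescript
// Hypothetical helper: choose among pronunciation variants for one word.
function preferThoughtVariant(variants: readonly string[]): string | undefined {
  // Prefer the first variant containing ɔ; otherwise keep the first listed variant.
  return variants.find((v) => v.includes("ɔ")) ?? variants[0];
}

console.log(preferThoughtVariant(["/ˈkɑt/", "/ˈkɔt/"])); // /ˈkɔt/
```

A plain substring check like this also matches ɔ inside the diphthong ɔɪ, so a real implementation may need a stricter test.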
The ipa-dict source uses ə (schwa) for the STRUT vowel in words like
"but", "cut", "come", "love", "other", "mother", "nothing", etc.
In standard IPA for English, STRUT is /ʌ/ — a distinct phoneme from
schwa /ə/. This adds a post-processing step to the dictionary build
that detects ə in stressed closed syllables and converts it to ʌ.
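
A rough sketch of the STRUT step described in this commit message. It assumes, for illustration only, that "stressed closed syllable" can be approximated as "ə is the first vowel after a stress mark and is followed by a consonant"; the `CONSONANTS` inventory and the name `fixStrutVowel` are hypothetical, and the actual build step may syllabify differently.

```typescript
// Assumed consonant inventory for illustration (includes both ASCII g and IPA ɡ).
const CONSONANTS = "bdfghjklmnprstvwzðŋɡɫɹɾʃʒθ";

// Stress mark, optional onset consonants, ə, then a consonant directly after it.
const STRESSED_SCHWA = new RegExp(
  `([ˈˌ][${CONSONANTS}]*)ə(?=[${CONSONANTS}])`,
  "gu",
);

// Hypothetical helper: convert stressed ə to ʌ, leave unstressed ə alone.
function fixStrutVowel(ipa: string): string {
  return ipa.replace(STRESSED_SCHWA, "$1ʌ");
}

console.log(fixStrutVowel("ˈbət"));   // ˈbʌt
console.log(fixStrutVowel("əˈbaʊt")); // əˈbaʊt (schwa precedes the stress mark, so it is kept)
```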
The G2P dictionary uses ɝ for all rhotacized vowels, losing the
distinction between stressed NURSE vowels and unstressed rhotacized
schwa. This adds post-processing to:

1. Convert stressed ɝ → ɜː (NURSE set: bird, word, nurse)
2. Convert unstressed ɝ → ɚ (doctor, letter, teacher)
3. Insert linking ɹ when ɚ precedes a vowel (centuries, batteries)

This improves compatibility with TTS systems trained on espeak-ng
output, which maintains these distinctions.
yocontra added a commit to yocontra/phonemize that referenced this pull request Mar 15, 2026
Merges the following PRs into a single combined branch:
- hans00#11: fix: prefer THOUGHT vowel (ɔ) variant in dictionary
- hans00#12: fix: use STRUT vowel (ʌ) instead of schwa (ə) in stressed syllables
- hans00#13: fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ
- hans00#14: feat: add vowel length marks (ː) to IPA output
- hans00#15: fix: British English pronunciation corrections