fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ #13
Open
yocontra wants to merge 3 commits into hans00:main
When ipa-dict provides multiple pronunciation variants for a word (e.g. "caught" → /ˈkɑt/, /ˈkɔt/), prefer the variant containing ɔ (THOUGHT vowel) over ɑ (LOT vowel). This better represents standard American English pronunciation for THOUGHT-class words like caught, bought, law, fall, walk, want, etc.
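The variant preference described above could be sketched roughly as follows. This is only an illustration: `preferThoughtVariant` is a hypothetical name, and it assumes the variants for a word arrive as an array of plain IPA strings, which may not match how the dictionary build actually stores them. Note that a bare check for ɔ also matches the diphthong ɔɪ, so a real implementation may need a stricter test.

```typescript
// Hypothetical helper (not taken from the repository): pick one
// pronunciation from the variants listed for a single word.
function preferThoughtVariant(variants: string[]): string | undefined {
  // Prefer the first variant that contains the THOUGHT vowel ɔ.
  const withThought = variants.find((v) => v.includes("ɔ"));
  // Otherwise fall back to the first listed variant
  // (undefined when the list is empty).
  return withThought ?? variants[0];
}

console.log(preferThoughtVariant(["ˈkɑt", "ˈkɔt"])); // ˈkɔt
```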
The ipa-dict source uses ə (schwa) for the STRUT vowel in words like "but", "cut", "come", "love", "other", "mother", "nothing", etc. In standard IPA for English, STRUT is /ʌ/ — a distinct phoneme from schwa /ə/. This adds a post-processing step to the dictionary build that detects ə in stressed closed syllables and converts it to ʌ.
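One possible shape for that post-processing step is sketched below. It is a simplification under stated assumptions: `fixStrut` and the `VOWELS` inventory are hypothetical, and "stressed closed syllable" is approximated as "ə is the first vowel after a stress mark and is immediately followed by a consonant", without real syllabification. The actual build step may detect syllables differently.

```typescript
// Assumed vowel inventory for this sketch; the real build step may use
// a different set.
const VOWELS = "aeiouæɑɒɔəɛɜɝɚɪʊʌ";

// Matches a stress mark, then a run of consonants (the onset), then ə,
// provided the next character is a consonant: not a vowel, stress mark,
// length mark, or whitespace.
const STRESSED_SCHWA = new RegExp(
  `([ˈˌ][^${VOWELS}ˈˌ\\s]*)ə(?=[^${VOWELS}ˈˌː\\s])`,
  "g",
);

// Hypothetical helper: replace stressed ə with the STRUT vowel ʌ.
function fixStrut(ipa: string): string {
  return ipa.replace(STRESSED_SCHWA, "$1ʌ");
}

console.log(fixStrut("ˈbət"));   // ˈbʌt
console.log(fixStrut("əˈbaʊt")); // əˈbaʊt (unstressed schwa is left alone)
```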
The G2P dictionary uses ɝ for all rhotacized vowels, losing the distinction between stressed NURSE vowels and unstressed rhotacized schwa. This adds post-processing to:
1. Convert stressed ɝ → ɜː (NURSE set: bird, word, nurse)
2. Convert unstressed ɝ → ɚ (doctor, letter, teacher)
3. Insert linking ɹ when ɚ precedes a vowel (centuries, batteries)
This improves compatibility with TTS systems trained on espeak-ng output, which maintains these distinctions.
yocontra added a commit to yocontra/phonemize that referenced this pull request on Mar 15, 2026
Merges the following PRs into a single combined branch:
- hans00#11: fix: prefer THOUGHT vowel (ɔ) variant in dictionary
- hans00#12: fix: use STRUT vowel (ʌ) instead of schwa (ə) in stressed syllables
- hans00#13: fix: distinguish NURSE vowel (ɜː) from unstressed ɚ, add linking ɹ
- hans00#14: feat: add vowel length marks (ː) to IPA output
- hans00#15: fix: British English pronunciation corrections
Summary
The G2P dictionary uses ɝ (rhotacized vowel) for all rhotacized vowels, but standard IPA distinguishes between two phonemes:
- ɜː: "bird" = bɜːd, "word" = wɜːd, "nurse" = nɜːs
- ɚ: "doctor" = dɑktɚ, "letter" = lɛtɚ, "teacher" = titʃɚ

Currently the library outputs ɝ for both, losing this distinction. For example:
- world → ˈwɝɫd (should be ˈwɜːɫd)
- doctor → ˈdɑktɝ (should be ˈdɑktɚ)

Additionally, when ɚ appears before a vowel, a linking ɹ consonant surfaces in connected speech:
- sɛntʃɚɹiz (not sɛntʃɝiz)
- bætɚɹiz (not bætɝiz)

Changes
Adds a `fixRhotacizedVowels()` post-processing function in `src/tokenizer.ts` that:
- ɝ → ɜː: Detects ɝ as the first vowel after a stress mark (ˈ or ˌ) with only consonants between them, converting to the NURSE vowel
- ɝ → ɚ: Converts all remaining ɝ to the standard unstressed rhotacized schwa
- ɹ: Inserts ɹ between ɚ and a following vowel

Motivation
This improves compatibility with TTS systems trained on espeak-ng output, which maintains the ɜː/ɚ distinction. Without it, TTS models may produce incorrect vowel quality for stressed NURSE words.

Test plan
- Update `__tests__/index.test.ts` to reflect new output
- ... (`ipaToArpabet`)
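For reviewers, the three steps listed under Changes can be sketched as below. This is not the code from `src/tokenizer.ts`: the function name `fixRhotacizedVowelsSketch` and the `VOWEL_CHARS` inventory are assumptions made for illustration, and the real implementation may classify vowels and consonants differently.

```typescript
// Assumed vowel inventory; the set used in src/tokenizer.ts may differ.
const VOWEL_CHARS = "aeiouæɑɒɔəɛɜɝɚɪʊʌ";

// Illustrative sketch of the three post-processing steps.
function fixRhotacizedVowelsSketch(ipa: string): string {
  // Step 1: ɝ that is the first vowel after a stress mark, with only
  // consonants in between, becomes the NURSE vowel ɜː.
  const stressed = new RegExp(`([ˈˌ][^${VOWEL_CHARS}ˈˌ\\s]*)ɝ`, "g");
  let out = ipa.replace(stressed, "$1ɜː");

  // Step 2: every remaining ɝ is treated as unstressed and becomes ɚ.
  out = out.replace(/ɝ/g, "ɚ");

  // Step 3: insert a linking ɹ between ɚ and a following vowel.
  const linking = new RegExp(`ɚ(?=[${VOWEL_CHARS}])`, "g");
  return out.replace(linking, "ɚɹ");
}

console.log(fixRhotacizedVowelsSketch("ˈwɝɫd"));     // ˈwɜːɫd
console.log(fixRhotacizedVowelsSketch("ˈdɑktɝ"));    // ˈdɑktɚ
console.log(fixRhotacizedVowelsSketch("ˈsɛntʃɝiz")); // ˈsɛntʃɚɹiz
```

The order matters in this sketch: the stressed case has to be handled first, because step 2 rewrites every ɝ that is left over.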