Description
Hi!
Describe the bug
When using pt-PT neural voices (`pt-PT-RaquelNeural`, `pt-PT-DuarteNeural`, `pt-PT-FernandaNeural`) in the Node.js Speech SDK, the `synthesisWordBoundary` events fire, but the `audioOffset` value is always 0 (or non-incrementing). This makes it impossible to align text with audio for word highlighting. The same configuration works as expected with English voices (e.g. `en-US-JennyNeural`), where the `audioOffset` values increase correctly.
To Reproduce
1. Create a Speech resource in the North Europe region.
2. Install the SDK:
   ```sh
   npm install microsoft-cognitiveservices-speech-sdk
   ```
3. Run the following minimal Node.js script:
```js
import sdk from "microsoft-cognitiveservices-speech-sdk";
import fs from "fs";

const key = process.env.AZURE_TTS_KEY;
const region = "northeurope"; // your resource region

const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = "pt-PT-RaquelNeural";
speechConfig.speechSynthesisOutputFormat =
  sdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm;

// ensure events sync to audio
speechConfig.setProperty(
  sdk.PropertyId.SpeechServiceResponse_SynthesisEventsSyncToAudio,
  "true"
);

// pass null as the audio config so the SDK does not try to play to a speaker
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);

synthesizer.synthesisWordBoundary = (s, e) => {
  // audioOffset is in 100-ns ticks; divide by 10,000 for milliseconds
  console.log("Word:", e.text, "offsetMs:", e.audioOffset / 10000);
};

synthesizer.speakTextAsync(
  "O que é que o António faz todas as manhãs?",
  result => {
    // audioData is an ArrayBuffer; wrap it in a Buffer before writing
    fs.writeFileSync("out.wav", Buffer.from(result.audioData));
    synthesizer.close();
    console.log("Synthesis completed");
  },
  error => {
    console.error("ERROR:", error);
    synthesizer.close();
  }
);
```
4. Observe that `synthesisWordBoundary` fires, but every `audioOffset` is 0.
5. Change the voice to `en-US-JennyNeural` and re-run → the offsets increase as expected.
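As a quick way to compare voices, the boundary events can be collected into an array and checked for strictly increasing offsets. This is a hypothetical helper for illustration, not part of the SDK:

```javascript
// Hypothetical diagnostic: given collected { text, audioOffset } pairs
// from synthesisWordBoundary events (audioOffset in 100-ns ticks),
// report whether the offsets are strictly increasing.
function offsetsLookHealthy(boundaries) {
  if (boundaries.length === 0) return false;
  let prev = -1;
  for (const b of boundaries) {
    if (typeof b.audioOffset !== "number" || b.audioOffset <= prev) {
      return false; // repeated or non-incrementing offset
    }
    prev = b.audioOffset;
  }
  return true;
}

// What en-US-JennyNeural produces: increasing offsets
console.log(offsetsLookHealthy([
  { text: "O", audioOffset: 500000 },
  { text: "que", audioOffset: 1250000 },
])); // true

// What the pt-PT voices produce: every offset stays 0
console.log(offsetsLookHealthy([
  { text: "O", audioOffset: 0 },
  { text: "que", audioOffset: 0 },
])); // false
```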
Expected behavior
`synthesisWordBoundary` events should provide correct, increasing `audioOffset` values (in 100-ns ticks, i.e. 10,000 ticks per millisecond) corresponding to each word's start time in the audio stream.
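For context, this is the kind of word-highlighting timeline the events should allow building once the offsets are correct (a sketch using the SDK's 100-ns tick unit; the sample offsets are made up):

```javascript
// Build a word timeline from collected { text, audioOffset } pairs.
// audioOffset is in 100-nanosecond ticks; 10,000 ticks = 1 ms.
const TICKS_PER_MS = 10000;

function buildTimeline(boundaries) {
  return boundaries.map(b => ({
    word: b.text,
    startMs: b.audioOffset / TICKS_PER_MS,
  }));
}

console.log(buildTimeline([
  { text: "O", audioOffset: 500000 },
  { text: "que", audioOffset: 1250000 },
]));
// → [ { word: "O", startMs: 50 }, { word: "que", startMs: 125 } ]
```

With pt-PT voices, every `startMs` comes out as 0, so highlighting cannot be scheduled.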
Version of the Cognitive Services Speech SDK
[e.g. 1.37.0 — please insert your actual version from `package.json` or `npm ls microsoft-cognitiveservices-speech-sdk`]
Platform, Operating System, and Programming Language
- OS: Linux (Cloudways Ubuntu 20.04 LTS)
- Hardware: x64
- Programming Language: Node.js (JavaScript, v18.x)
- Also reproduced in the browser (Chrome 140 on Windows 10) with the client-side SDK.
Additional context
- Region: North Europe
- Voices tested: `pt-PT-RaquelNeural`, `pt-PT-DuarteNeural`, `pt-PT-FernandaNeural`
- The same code works correctly with `en-US-JennyNeural`.
- The property `SpeechServiceResponse_SynthesisEventsSyncToAudio` is set to `"true"`.
- PCM format used: `Riff16Khz16BitMonoPcm`.
- Logs show the events firing, but the offsets remain 0.
- This strongly suggests a service-side bug where the pt-PT neural voices do not generate timing data.
- Microsoft Learn engineer Gerald Felix confirmed this is likely a voice-model issue, not an SDK-usage problem.
Let me know if you need any additional info.
Thank you in advance, p