
Word boundary events return zero audioOffset for pt-PT neural voices in Node.js Speech SDK #2930

@pedcam

Description

Hi!

Describe the bug

When using pt-PT neural voices (pt-PT-RaquelNeural, pt-PT-DuarteNeural, pt-PT-FernandaNeural) in the Node.js Speech SDK, the synthesisWordBoundary events fire but the audioOffset value is always 0 (or non-incrementing).

This makes it impossible to align text with audio for word highlighting.
The same configuration works as expected with English voices (e.g. en-US-JennyNeural), where audioOffset values increase correctly.
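To illustrate what the highlighting logic needs from these events, here is a minimal sketch of a hypothetical helper (not part of the SDK) that turns collected word-boundary events into a millisecond timeline. It assumes only the documented convention that audioOffset is reported in 100-nanosecond ticks:

```javascript
// Hypothetical helper: convert collected wordBoundary events
// (audioOffset in 100-nanosecond ticks) into a millisecond
// timeline suitable for word highlighting.
function buildHighlightTimeline(events) {
  return events.map(e => ({
    word: e.text,
    startMs: e.audioOffset / 10000, // 10,000 ticks per millisecond
  }));
}

// With a working voice the offsets increase; with the pt-PT voices
// reported here every audioOffset comes back as 0, so every startMs is 0.
const sample = [
  { text: "O", audioOffset: 500000 },
  { text: "que", audioOffset: 1250000 },
];
console.log(buildHighlightTimeline(sample));
// [ { word: 'O', startMs: 50 }, { word: 'que', startMs: 125 } ]
```

When every offset is 0, this timeline collapses to a single instant and highlighting cannot track playback.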

To Reproduce

1. Create a Speech resource in the North Europe region.
2. Install the SDK:

npm install microsoft-cognitiveservices-speech-sdk

3. Run the following minimal Node.js script:
import sdk from "microsoft-cognitiveservices-speech-sdk";
import fs from "fs";

const key = process.env.AZURE_TTS_KEY;
const region = "northeurope"; // your resource region

const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = "pt-PT-RaquelNeural";
speechConfig.speechSynthesisOutputFormat =
  sdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm;

// ensure events sync to audio
speechConfig.setProperty(
  sdk.PropertyId.SpeechServiceResponse_SynthesisEventsSyncToAudio,
  "true"
);

const synthesizer = new sdk.SpeechSynthesizer(speechConfig);

synthesizer.synthesisWordBoundary = (s, e) => {
  // audioOffset is reported in 100-nanosecond ticks; divide by 10,000 for ms
  console.log("Word:", e.text, "offsetMs:", e.audioOffset / 10000);
};

synthesizer.speakTextAsync(
  "O que é que o António faz todas as manhãs?",
  result => {
    fs.writeFileSync("out.wav", result.audioData);
    synthesizer.close();
    console.log("Synthesis completed");
  },
  error => {
    console.error("ERROR:", error);
    synthesizer.close();
  }
);
4. Observe that synthesisWordBoundary fires, but every audioOffset is 0.

5. Change the voice to en-US-JennyNeural and re-run: offsets increase as expected.
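The steps above can be made checkable with a small, self-contained validator (a hypothetical helper, not part of the SDK) that detects the bug signature, namely offsets that never increase. Feed it the audioOffset values collected in the synthesisWordBoundary handler:

```javascript
// Hypothetical check for the bug signature: audioOffset values
// (100-ns ticks) collected from word-boundary events should be
// strictly increasing across the utterance.
function offsetsAreIncreasing(offsets) {
  return offsets.every((v, i) => i === 0 || v > offsets[i - 1]);
}

console.log(offsetsAreIncreasing([0, 0, 0]));           // false — pt-PT behavior
console.log(offsetsAreIncreasing([0, 500000, 900000])); // true  — en-US behavior
```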

Expected behavior

synthesisWordBoundary events should provide correct, increasing audioOffset values (in 100-ns ticks, convertible to ms) corresponding to each word’s start time in the audio stream.

Version of the Cognitive Services Speech SDK

[e.g. 1.37.0 — please insert your actual version from package.json or npm ls microsoft-cognitiveservices-speech-sdk]

Platform, Operating System, and Programming Language

OS: Linux (Cloudways Ubuntu 20.04 LTS)

Hardware: x64

Programming Language: Node.js (JavaScript, v18.x)

Also reproduced in browser (Chrome 140 on Windows 10) with client-side SDK.

Additional context

Region: North Europe

Voices tested: pt-PT-RaquelNeural, pt-PT-DuarteNeural, pt-PT-FernandaNeural

Same code works correctly with en-US-JennyNeural.

Property SpeechServiceResponse_SynthesisEventsSyncToAudio is set to "true".

PCM format used: Riff16Khz16BitMonoPcm.

Logs show events firing, but offsets remain 0.

This strongly suggests a service-side bug with pt-PT neural voices not generating timing data.
Microsoft Learn engineer Gerald Felix confirmed this is likely a voice-model issue, not SDK usage.

Let me know if you need any additional info.
Thank you in advance, p
