
Word boundary events return zero audioOffset for pt-PT neural voices in Node.js Speech SDK #2930

@pedcam

Description

Hi!

Describe the bug

When using pt-PT neural voices (pt-PT-RaquelNeural, pt-PT-DuarteNeural, pt-PT-FernandaNeural) in the Node.js Speech SDK, the synthesisWordBoundary events fire but the audioOffset value is always 0 (or non-incrementing).

This makes it impossible to align text with audio for word highlighting.
The same configuration works as expected with English voices (e.g. en-US-JennyNeural), where audioOffset values increase correctly.
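To illustrate what the highlighting logic needs from these events, here is a minimal sketch of a hypothetical helper (not part of the SDK) that turns collected word-boundary events into a millisecond timeline. It assumes only the documented convention that audioOffset is reported in 100-nanosecond ticks:

```javascript
// Hypothetical helper: convert collected wordBoundary events
// (audioOffset in 100-nanosecond ticks) into a millisecond
// timeline suitable for word highlighting.
function buildHighlightTimeline(events) {
  return events.map(e => ({
    word: e.text,
    startMs: e.audioOffset / 10000, // 10,000 ticks per millisecond
  }));
}

// With a working voice the offsets increase; with the pt-PT voices
// reported here every audioOffset comes back as 0, so every startMs is 0.
const sample = [
  { text: "O", audioOffset: 500000 },
  { text: "que", audioOffset: 1250000 },
];
console.log(buildHighlightTimeline(sample));
// [ { word: 'O', startMs: 50 }, { word: 'que', startMs: 125 } ]
```

When every offset is 0, this timeline collapses to a single instant and highlighting cannot track playback.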

To Reproduce

1. Create a Speech resource in the North Europe region.
2. Install the SDK:

npm install microsoft-cognitiveservices-speech-sdk

3. Run the following minimal Node.js script:
import sdk from "microsoft-cognitiveservices-speech-sdk";
import fs from "fs";

const key = process.env.AZURE_TTS_KEY;
const region = "northeurope"; // your resource region

const speechConfig = sdk.SpeechConfig.fromSubscription(key, region);
speechConfig.speechSynthesisVoiceName = "pt-PT-RaquelNeural";
speechConfig.speechSynthesisOutputFormat =
  sdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm;

// ensure events sync to audio
speechConfig.setProperty(
  sdk.PropertyId.SpeechServiceResponse_SynthesisEventsSyncToAudio,
  "true"
);

const synthesizer = new sdk.SpeechSynthesizer(speechConfig);

synthesizer.synthesisWordBoundary = (s, e) => {
  // audioOffset is reported in 100-nanosecond ticks; divide by 10,000 for ms
  console.log("Word:", e.text, "offsetMs:", e.audioOffset / 10000);
};

synthesizer.speakTextAsync(
  "O que é que o António faz todas as manhãs?",
  result => {
    fs.writeFileSync("out.wav", result.audioData);
    synthesizer.close();
    console.log("Synthesis completed");
  },
  error => {
    console.error("ERROR:", error);
    synthesizer.close();
  }
);
4. Observe that synthesisWordBoundary fires, but every audioOffset is 0.

5. Change the voice to en-US-JennyNeural and re-run: offsets increase as expected.
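The steps above can be made checkable with a small, self-contained validator (a hypothetical helper, not part of the SDK) that detects the bug signature, namely offsets that never increase. Feed it the audioOffset values collected in the synthesisWordBoundary handler:

```javascript
// Hypothetical check for the bug signature: audioOffset values
// (100-ns ticks) collected from word-boundary events should be
// strictly increasing across the utterance.
function offsetsAreIncreasing(offsets) {
  return offsets.every((v, i) => i === 0 || v > offsets[i - 1]);
}

console.log(offsetsAreIncreasing([0, 0, 0]));           // false — pt-PT behavior
console.log(offsetsAreIncreasing([0, 500000, 900000])); // true  — en-US behavior
```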

Expected behavior

synthesisWordBoundary events should provide correct, increasing audioOffset values (in 100-ns ticks, convertible to ms) corresponding to each word’s start time in the audio stream.

Version of the Cognitive Services Speech SDK

[e.g. 1.37.0 — please insert your actual version from package.json or npm ls microsoft-cognitiveservices-speech-sdk]

Platform, Operating System, and Programming Language

OS: Linux (Cloudways Ubuntu 20.04 LTS)

Hardware: x64

Programming Language: Node.js (JavaScript, v18.x)

Also reproduced in browser (Chrome 140 on Windows 10) with client-side SDK.

Additional context

Region: North Europe

Voices tested: pt-PT-RaquelNeural, pt-PT-DuarteNeural, pt-PT-FernandaNeural

Same code works correctly with en-US-JennyNeural.

Property SpeechServiceResponse_SynthesisEventsSyncToAudio is set to "true".

PCM format used: Riff16Khz16BitMonoPcm.

Logs show events firing, but offsets remain 0.

This strongly suggests a service-side bug with pt-PT neural voices not generating timing data.
Microsoft Learn engineer Gerald Felix confirmed this is likely a voice-model issue, not SDK usage.

Let me know if you need any additional info.
Thank you in advance, p
