
Predictions.convert with SpeechToText always returns empty transcription or "stream too big" error when sending WAV (PCM16) from browser #3014

@brightversion1

Description


Environment information

Framework: React (TypeScript)

AWS Amplify Version: 6.6.6 (e.g. @aws-amplify/predictions)

Browser: Chrome (latest, desktop)

OS: Windows 11

Device: Desktop

Audio Capture: navigator.mediaDevices.getUserMedia with AudioWorklet

Encoding: PCM16, WAV, 16kHz, mono

Describe the bug


# 🐞 Bug Report: Predictions.convert Speech-to-Text

## Problem Description
When calling `Predictions.convert` with short WAV audio (1–3 seconds), the result is always:

```json
{
  "fullText": ""
}
```

Occasionally, instead of empty text, the call fails with:

```
Error from AWS Predictions: Error: Your stream is too big. Reduce the frame size and try your request again
```

This happens even with very small audio clips (~70–140 KB).
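
As a rough sanity check on those sizes (back-of-the-envelope arithmetic, not from the report): 16 kHz mono PCM16 audio is a fixed 32,000 bytes per second plus a 44-byte header, so 70–140 KB corresponds to roughly 2–4.5 seconds of audio.

```ts
// Back-of-the-envelope size check for 16 kHz mono PCM16 WAV files.
const bytesPerSecond = 16000 * 2; // sampleRate * bytesPerSample (mono)
const wavSize = (seconds: number) => 44 + seconds * bytesPerSecond;
console.log(wavSize(2)); // 64044  (~63 KB)
console.log(wavSize(4)); // 128044 (~125 KB)
```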



## 🔍 Notes

- Audio was tested at 16 kHz, mono, PCM16.
- Other sample rates and stereo were also tested → still empty.
- Verified that the WAVs play correctly in the browser via an `Audio()` element (see the playback sketch below).
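
For reference, the playback check above can be done along these lines (a minimal sketch; the helper name is illustrative):

```ts
// Wrap the encoded WAV bytes in a Blob and play them back in the browser,
// confirming the file is audible and well-formed enough to decode.
function playWav(bytes: Uint8Array) {
  const blob = new Blob([bytes], { type: "audio/wav" });
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.onended = () => URL.revokeObjectURL(url);
  void audio.play();
}
```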

Reproduction steps

## 🔬 Steps to Reproduce

1. Record microphone input using `navigator.mediaDevices.getUserMedia`.
2. Capture raw PCM samples with an `AudioWorkletNode`.
3. Merge the Int16 samples and encode them into a 16-bit PCM WAV file at 16 kHz.
4. Pass the resulting `ArrayBuffer` to `Predictions.convert`.

### Recording Code

```ts
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
await audioContext.audioWorklet.addModule("/recorderWorklet.js");

const source = audioContext.createMediaStreamSource(stream);
const workletNode = new AudioWorkletNode(audioContext, "recorder-processor");

// audioBuffer is created with getBuffer() (see "Buffer Management" below)
const audioBuffer = getBuffer();

workletNode.port.onmessage = (event) => {
  const int16Array = event.data as Int16Array;
  audioBuffer.addData(int16Array);
};

source.connect(workletNode).connect(audioContext.destination);
```
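
The worklet module itself is not included in the report. A minimal sketch of what `/recorderWorklet.js` might contain, assuming it converts the Float32 frames to Int16 on the audio thread:

```js
// recorderWorklet.js — runs on the audio rendering thread.
class RecorderProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0]?.[0]; // first input, first (mono) channel
    if (channel) {
      // Clamp Float32 samples to [-1, 1] and scale to the Int16 range.
      const int16 = new Int16Array(channel.length);
      for (let i = 0; i < channel.length; i++) {
        const s = Math.max(-1, Math.min(1, channel[i]));
        int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      this.port.postMessage(int16); // received by the onmessage handler above
    }
    return true; // keep the processor alive
  }
}
registerProcessor("recorder-processor", RecorderProcessor);
```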

### Buffer Management

```ts
const getBuffer = () => {
  let buffer: Int16Array[] = [];

  // Accepts a single Int16Array chunk or an array of chunks.
  const add = (raw: Int16Array | Int16Array[]) => {
    if (Array.isArray(raw)) {
      buffer = buffer.concat(raw);
    } else {
      buffer.push(raw);
    }
    return buffer;
  };

  const reset = () => { buffer = []; };

  return {
    reset,
    addData: add,
    getData: () => buffer,
  };
};
```
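
`mergeInt16Arrays` is called below but not included in the report; a plausible implementation, assuming it simply concatenates the collected chunks in order:

```ts
// Concatenate the Int16Array chunks gathered by getBuffer() into one array.
function mergeInt16Arrays(chunks: Int16Array[]): Int16Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const merged = new Int16Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    merged.set(chunk, offset);
    offset += chunk.length;
  }
  return merged;
}
```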

### Convert to WAV and Call Predictions

```ts
import { Predictions } from "@aws-amplify/predictions";

const convertFromBuffer = async () => {
  const mergedPCM = mergeInt16Arrays(audioBuffer.getData());

  // Encode to WAV (PCM16 LE)
  const wavBytes = encodeWAV(mergedPCM, 16000);

  // Copy the exact byte range into a standalone ArrayBuffer
  const wavArrayBuffer: ArrayBuffer = wavBytes.buffer.slice(
    wavBytes.byteOffset,
    wavBytes.byteOffset + wavBytes.byteLength
  );

  console.log("WAV length (bytes):", wavBytes.byteLength);
  console.log("ArrayBuffer length:", wavArrayBuffer.byteLength);

  try {
    const { transcription } = await Predictions.convert({
      transcription: {
        source: { bytes: wavArrayBuffer },
        language: "en-US",
      },
    });

    console.log("Transcription result:", transcription);
  } catch (error) {
    console.error("AWS Predictions Error:", error);
  }
};
```
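
One way to rule out an encoding problem (an illustrative diagnostic, not part of the report) is to download the exact bytes being sent and inspect them offline, e.g. in Audacity or with `ffprobe`:

```ts
// Save the generated WAV locally so it can be inspected outside the browser.
function downloadWav(bytes: Uint8Array, filename = "capture.wav") {
  const blob = new Blob([bytes], { type: "audio/wav" });
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url);
}
```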

### WAV Encoder

```ts
// Builds a canonical 44-byte WAV header followed by little-endian PCM16 samples.
function encodeWAV(int16Array: Int16Array, sampleRate: number): Uint8Array {
  const buffer = new ArrayBuffer(44 + int16Array.length * 2);
  const view = new DataView(buffer);

  const writeString = (view: DataView, offset: number, str: string) => {
    for (let i = 0; i < str.length; i++) {
      view.setUint8(offset + i, str.charCodeAt(i));
    }
  };

  const bytesPerSample = 2;

  writeString(view, 0, "RIFF");
  view.setUint32(4, 36 + int16Array.length * bytesPerSample, true); // RIFF chunk size
  writeString(view, 8, "WAVE");
  writeString(view, 12, "fmt ");
  view.setUint32(16, 16, true);                                     // fmt chunk size
  view.setUint16(20, 1, true);                                      // audio format: 1 = PCM
  view.setUint16(22, 1, true);                                      // channels: mono
  view.setUint32(24, sampleRate, true);                             // sample rate
  view.setUint32(28, sampleRate * bytesPerSample, true);            // byte rate (mono)
  view.setUint16(32, bytesPerSample, true);                         // block align (mono)
  view.setUint16(34, 16, true);                                     // bits per sample
  writeString(view, 36, "data");
  view.setUint32(40, int16Array.length * bytesPerSample, true);     // data chunk size

  // Write the samples as little-endian Int16
  let offset = 44;
  for (let i = 0; i < int16Array.length; i++, offset += 2) {
    view.setInt16(offset, int16Array[i], true);
  }

  return new Uint8Array(buffer);
}
```
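
To double-check the header this encoder produces, here is an illustrative dump of the 44-byte canonical WAV header fields (the helper name and expected values are assumptions based on the encoder above):

```ts
// Parse and log the canonical 44-byte WAV header for manual inspection.
function dumpWavHeader(buf: ArrayBuffer) {
  const view = new DataView(buf);
  const tag = (o: number) =>
    String.fromCharCode(
      view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3)
    );
  console.log({
    riff: tag(0),                            // expected "RIFF"
    riffSize: view.getUint32(4, true),       // file size minus 8
    wave: tag(8),                            // expected "WAVE"
    fmt: tag(12),                            // expected "fmt "
    audioFormat: view.getUint16(20, true),   // 1 = PCM
    channels: view.getUint16(22, true),      // 1 = mono
    sampleRate: view.getUint32(24, true),    // 16000
    byteRate: view.getUint32(28, true),      // 32000 for 16 kHz mono PCM16
    blockAlign: view.getUint16(32, true),    // 2 for mono PCM16
    bitsPerSample: view.getUint16(34, true), // 16
    data: tag(36),                           // expected "data"
    dataSize: view.getUint32(40, true),      // PCM payload bytes
  });
}
```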

## ⚠️ Observed Behavior

- Always returns an empty transcription: `{ fullText: "" }`.
- Sometimes errors with "Your stream is too big".
- Happens even with small WAVs (2–3 seconds, 70–140 KB).

## ✅ Expected Behavior

- Short WAVs (≤3 seconds, ≤200 KB) should return valid transcriptions.
- If the format is invalid, the API should return a descriptive error instead of silently returning `{ fullText: "" }`.

## ❓ Questions for AWS Team

1. What is the maximum supported audio size/duration for `Predictions.convert`?
2. Which formats are supported? The docs suggest PCM16 WAV or FLAC. Are others (WebM/Opus, MP3) valid?
3. Why does `{ fullText: "" }` come back with no error? Does this indicate a decoding failure (e.g. a bad WAV header)?
4. Why is a ~140 KB (3-second) WAV sometimes rejected as "too big"?
