# WebSocket STT

WebSocket API reference for real-time speech-to-text using Vosk.

**Endpoint:** `ws://localhost:3000/ws/stt` (or `wss://` in production)

## Overview
The STT WebSocket accepts binary audio input and returns JSON text events. It provides real-time speech recognition powered by Vosk, running entirely on-device with no external API calls.
**Direction:** Binary audio in → JSON events out
## Input Format
The WebSocket expects raw PCM audio with the following specification:
| Property | Value |
|---|---|
| Format | Raw PCM |
| Bit depth | 16-bit |
| Byte order | Little-endian |
| Sample rate | 16,000 Hz (16 kHz) |
| Channels | Mono (1 channel) |
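For scale, one second of audio at this spec is 32,000 bytes (16,000 samples × 2 bytes × 1 channel). A minimal sketch of a conforming buffer, assuming the `ws` socket from the connection example below (real capture code follows under Client Implementation):

```javascript
// One second of silence in the expected format:
// 16,000 samples/s × 2 bytes/sample × 1 channel = 32,000 bytes.
// Typed arrays use platform byte order, which is little-endian on
// all mainstream JS platforms.
const SAMPLE_RATE = 16000;
const silence = new Int16Array(SAMPLE_RATE); // zero-filled by default
ws.send(silence.buffer); // sent as a single binary frame
```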
## Output Events

All output events are JSON objects with a `type` discriminator.
### Ready

Sent immediately after connection, once the model is loaded and the recognizer is ready.

```json
{ "type": "ready" }
```

### Partial

Sent during active speech recognition with intermediate results; these update rapidly as the user speaks.

```json
{ "type": "partial", "text": "hello how are" }
```

### Final

Sent when Vosk detects an utterance boundary (a pause in speech). Contains the finalized transcription.

```json
{ "type": "final", "text": "hello how are you" }
```

### Error

Sent when an error occurs during recognition.

```json
{ "type": "error", "message": "Model not loaded" }
```
## Client Implementation

### Audio Capture
The client captures audio from the microphone using the Web Audio API, converting from Float32 to Int16 PCM before sending over the WebSocket.
```javascript
// Create an AudioContext at the 16 kHz sample rate Vosk expects.
// (Browsers that cannot open the device at 16 kHz may need resampling.)
const audioContext = new AudioContext({ sampleRate: 16000 });

// Get the microphone stream.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(stream);

// Create a processor node (4096-sample buffer, mono in, mono out).
const processor = audioContext.createScriptProcessor(4096, 1, 1);

// Connect: microphone → processor → destination.
source.connect(processor);
processor.connect(audioContext.destination);
```
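Note that `ScriptProcessorNode` is deprecated in the Web Audio spec; it still works widely, but `AudioWorklet` is the modern replacement. A minimal sketch of the same capture path with a worklet (the `pcm-forwarder` name and inlined module are illustrative, not part of this codebase):

```javascript
// Inline a tiny worklet module that forwards raw Float32 blocks to the
// main thread. In a real app this would live in its own file.
const workletSource = `
class PcmForwarder extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0][0];
    if (channel) this.port.postMessage(channel.slice(0));
    return true; // keep the processor alive
  }
}
registerProcessor("pcm-forwarder", PcmForwarder);
`;

const moduleUrl = URL.createObjectURL(
  new Blob([workletSource], { type: "application/javascript" })
);
await audioContext.audioWorklet.addModule(moduleUrl);

const worklet = new AudioWorkletNode(audioContext, "pcm-forwarder");
source.connect(worklet);
worklet.port.onmessage = (event) => {
  // event.data is a Float32Array (128-sample blocks); convert to Int16
  // and send exactly as in the conversion code below.
};
```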
### Float32 to Int16 Conversion

Audio samples from the Web Audio API are Float32 (-1.0 to 1.0) and must be converted to Int16 (-32768 to 32767) before sending.

```javascript
processor.onaudioprocess = (event) => {
  const float32Data = event.inputBuffer.getChannelData(0);
  const int16Data = new Int16Array(float32Data.length);
  for (let i = 0; i < float32Data.length; i++) {
    // Clamp to [-1, 1], then scale. Negative and positive samples use
    // different factors because the Int16 range is asymmetric
    // (-32768 to 32767).
    const s = Math.max(-1, Math.min(1, float32Data[i]));
    int16Data[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  ws.send(int16Data.buffer);
};
```
### WebSocket Connection

```javascript
const ws = new WebSocket("ws://localhost:3000/ws/stt");
ws.binaryType = "arraybuffer";

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case "ready":
      console.log("STT ready");
      break;
    case "partial":
      // Update UI with intermediate text
      setPartialText(data.text);
      break;
    case "final":
      // Commit finalized text
      appendFinalText(data.text);
      break;
    case "error":
      console.error("STT error:", data.message);
      break;
  }
};
```
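The lifecycle below shows that recognition only begins after the server sends `ready`, so audio captured before then would be dropped. A minimal sketch of gating sends on readiness (`sttReady`, `pendingBuffers`, and `sendAudio` are hypothetical names, not part of the API):

```javascript
let sttReady = false;
const pendingBuffers = [];

// Route all outgoing audio through this helper instead of ws.send().
function sendAudio(buffer) {
  if (sttReady && ws.readyState === WebSocket.OPEN) {
    ws.send(buffer);
  } else {
    pendingBuffers.push(buffer); // hold audio until the server is ready
  }
}

// In the "ready" case of the onmessage handler above:
//   sttReady = true;
//   pendingBuffers.splice(0).forEach((b) => ws.send(b));
```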
## Lifecycle

```
1. OPEN
   └─ Server loads the active Vosk model
   └─ Server creates a recognizer instance
   └─ Server sends: { "type": "ready" }

2. MESSAGES
   └─ Client sends binary PCM audio buffers
   └─ Server feeds audio to Vosk recognizer
   └─ Server sends partial/final transcript events

3. CLOSE
   └─ Server flushes remaining audio through recognizer
   └─ Server sends any final transcript
   └─ Server cleans up recognizer and resources
```
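On the client, mirroring the CLOSE step means releasing the audio graph along with the socket. A hedged sketch, assuming the variables from the capture code above (`stopCapture` is a hypothetical helper):

```javascript
function stopCapture() {
  processor.disconnect();   // stop onaudioprocess callbacks
  source.disconnect();
  stream.getTracks().forEach((track) => track.stop()); // release the microphone
  audioContext.close();     // tear down the audio graph
  ws.close();               // triggers the server-side flush and cleanup above
}
```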
## Error Handling
| Scenario | Behavior |
|---|---|
| No model downloaded | Error event: "No STT model available" |
| Model loading fails | Error event with details, connection closed |
| Invalid audio format | Recognition produces empty/garbage results |
| WebSocket disconnects | Resources cleaned up automatically |
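Because a dropped connection is cleaned up server-side, the client can simply reconnect. A minimal reconnect-with-backoff sketch (`connectStt` is illustrative, not part of this API; a real client would also stop retrying after an intentional close):

```javascript
function connectStt(url, onEvent, attempt = 0) {
  const ws = new WebSocket(url);
  ws.binaryType = "arraybuffer";
  ws.onmessage = (e) => onEvent(JSON.parse(e.data));
  ws.onclose = () => {
    // Exponential backoff, capped at 10 s between attempts.
    const delay = Math.min(1000 * 2 ** attempt, 10000);
    setTimeout(() => connectStt(url, onEvent, attempt + 1), delay);
  };
  return ws;
}
```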