This API provides streaming speech-to-text transcriptions using WebSockets.
Endpoint:
wss://api.sully.ai/v1/audio/transcriptions/stream?account_id=1234567890&api_token=1234567890&sample_rate=16000&language=en

Headers

X-API-KEY
string
The API key to use for authentication. Required if X-API-TOKEN is not provided.
X-API-TOKEN
string
The API token to use for authentication. Required if X-API-KEY is not provided.
X-ACCOUNT-ID
string
The account ID to use for authentication.

Query parameters

language
string
default:"en"
The language of your submitted audio. See our Supported Languages documentation for a complete list of language options.
sample_rate
string
default:"16000"
The sample rate of your submitted audio. If omitted, the streaming service currently defaults to 16000. For raw, headerless audio, send the actual sample rate of your stream explicitly.
encoding
string
Specifies the encoding format of the audio being sent. Important: this parameter is required when transmitting raw, headerless audio packets. If the audio data is encapsulated within a container format, this parameter should be omitted. When sample_rate is omitted, the streaming service currently still defaults it to 16000. Supported formats:
  • linear16: 16-bit, little-endian, signed PCM WAV data
  • linear32: 32-bit, little-endian, floating-point PCM WAV data
  • flac: Free Lossless Audio Codec (FLAC) encoded data
  • alaw: A-law encoded WAV data
  • mulaw: Mu-law encoded WAV data
  • amr-nb: Adaptive Multi-Rate (AMR) narrowband codec
  • amr-wb: Adaptive Multi-Rate (AMR) wideband codec
  • opus: The Opus audio codec
  • ogg-opus: The Opus audio codec encapsulated in the Ogg container format
  • speex: An open-source, speech-specific audio codec
  • g729: G729 low-bandwidth codec (usable with raw or containerized audio)
dictation
boolean
default:"false"
Requests dictation-oriented transcript formatting for the stream.
account_id
string
The account ID to use for authentication. Required if X-ACCOUNT-ID is not provided.
api_token
string
A temporary authentication token. Required if X-API-KEY is not provided.
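The query parameters above can be assembled into the connection URL programmatically. The following is a minimal sketch; the helper name and placeholder credentials are illustrative, not part of the API:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.sully.ai/v1/audio/transcriptions/stream"

def build_stream_url(account_id, api_token, sample_rate=16000,
                     language="en", encoding=None, dictation=False):
    """Assemble the streaming endpoint URL from the query parameters."""
    params = {
        "account_id": account_id,
        "api_token": api_token,
        "sample_rate": str(sample_rate),
        "language": language,
    }
    # `encoding` is required only for raw, headerless audio;
    # omit it for containerized formats.
    if encoding is not None:
        params["encoding"] = encoding
    if dictation:
        params["dictation"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"

# Placeholder credentials -- substitute your own.
url = build_stream_url("1234567890", "1234567890", encoding="linear16")
```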

When to use

The Speech-to-Text WebSockets API is designed to generate text from partial audio input. It’s well-suited for scenarios where the input audio is being streamed or generated in chunks.

Protocol

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.

Connection Status Messages

Upon successful connection, the server sends a status message:
{
  "type": "status",
  "status": "connected",
  "timestamp": "2026-04-22T14:32:07.123Z"
}
When the connection closes:
{
  "type": "status",
  "status": "disconnected",
  "timestamp": "2026-04-22T14:32:17.123Z"
}
Important: Wait for the type: "status" / status: "connected" message before sending audio data. This ensures the server is ready to process your stream.
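The readiness check can be expressed as a small predicate applied to each incoming frame before any audio is sent. This is an illustrative sketch (the function name is ours, not part of the API):

```python
import json

def is_server_ready(raw_message: str) -> bool:
    """True once the server has sent its connected status frame."""
    msg = json.loads(raw_message)
    return msg.get("type") == "status" and msg.get("status") == "connected"
```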

Error Messages

During a live stream, the server may also send an error message:
{
  "type": "error",
  "error": "An error occurred during transcription",
  "timestamp": "2026-04-22T14:32:12.123Z"
}
An error message signals a transcription problem, but it does not always mean the session is over. Some runtime error frames are non-terminal and may be followed by later transcript or status messages, while other failures are followed by socket closure. Surface the error immediately, then keep listening for follow-up messages and reconnect when the socket actually closes.
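One way to honor this is to route each frame by its type and treat error frames as events rather than as session termination. A hypothetical dispatcher (callback names are ours) might look like:

```python
import json

def handle_frame(raw_message, on_error, on_transcript, on_status):
    """Route one server frame by its `type` field."""
    msg = json.loads(raw_message)
    kind = msg.get("type")
    if kind == "error":
        # Surface the problem immediately, but keep the socket open and
        # keep reading: later transcript/status frames may still arrive.
        on_error(msg.get("error"))
    elif kind == "transcript":
        on_transcript(msg)
    elif kind == "status":
        on_status(msg.get("status"))
    # Reconnect logic belongs with the code that observes the socket
    # actually closing, not here.
```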

Streaming input audio

The client can send messages with audio input to the server. The messages can contain the following fields:
{
  "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ=="
}
audio
string
required
A generated partial audio chunk encoded as a base64 string.
Browser MediaRecorder Notice: When using Chrome’s MediaRecorder API, the first audio chunk contains critical header information. Always send this first chunk for proper audio processing. Failing to include header information may result in transcription errors or complete failure.
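Wrapping a raw audio chunk in the JSON envelope shown above takes one base64 step. A minimal sketch (the helper name is ours):

```python
import base64
import json

def audio_message(chunk: bytes) -> str:
    """Encode one raw audio chunk as the {"audio": <base64>} message."""
    return json.dumps({"audio": base64.b64encode(chunk).decode("ascii")})
```

When feeding chunks from a browser MediaRecorder, remember that the very first chunk carries the header information and must be sent like any other.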

Streaming output audio

Transcript messages contain the following fields:
Response message
{
  "type": "transcript",
  "audio_start": 0,
  "audio_end": 3.2,
  "duration": 3.2,
  "text": "Hello, world",
  "isFinal": false, 
  "is_final": false,
  "words": [
    {
      "word": "Hello",
      "start": 0.2,
      "end": 0.7,
      "confidence": 0.9856743,
      "punctuated_word": "Hello,"
    },
    {
      "word": "world",
      "start": 0.9,
      "end": 1.2,
      "confidence": 0.9978341,
      "punctuated_word": "world"
    }
  ],
  "timestamp": "2026-04-22T14:32:10.123Z"
}
type
string
The type of response; will be “transcript” for transcription results.
audio_start
number
Start time of the audio segment in seconds.
audio_end
number
End time of the audio segment in seconds.
duration
number
Duration of the audio segment in seconds.
text
string
The processed text sequence.
isFinal
boolean
deprecated
Indicates if the generation is complete. Deprecated: Use is_final instead.
is_final
boolean
Indicates if the generation is complete.
words
array
Array of word objects with text content and timing information:
  • word: The raw word as recognized
  • start: Start time of the word in seconds
  • end: End time of the word in seconds
  • confidence: Confidence score between 0 and 1 for the word recognition
  • punctuated_word: The word with proper capitalization and punctuation
timestamp
string
ISO-formatted timestamp when the response was generated.
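Since both the current is_final field and the deprecated isFinal may appear, a consumer should prefer the former and fall back to the latter. An illustrative sketch (the helper name is ours):

```python
import json

def final_text(raw_message: str):
    """Return the transcript text for final results, None otherwise."""
    msg = json.loads(raw_message)
    if msg.get("type") != "transcript":
        return None
    # Prefer is_final; fall back to the deprecated isFinal spelling.
    if msg.get("is_final", msg.get("isFinal", False)):
        return msg["text"]
    return None
```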