Overview
Sully.ai provides two approaches for converting patient conversations to text:

| Approach | Best For | Latency |
|---|---|---|
| File Upload | Pre-recorded audio, batch processing, large files | Async (seconds to minutes) |
| Real-time Streaming | Live visits, immediate feedback, interactive transcription | Real-time (~200ms) |
File Upload
Upload pre-recorded audio files for asynchronous transcription. This approach is ideal for batch processing, large files, or when real-time feedback is not required.

Supported Formats
| Format | MIME Type | Extension |
|---|---|---|
| WAV | audio/wav | .wav |
| MP3 | audio/mpeg | .mp3 |
| FLAC | audio/flac | .flac |
| OGG | audio/ogg | .ogg |
| WebM | audio/webm | .webm |
| MP4 | audio/mp4 | .mp4 |
| M4A | audio/mp4 | .m4a |
| AAC | audio/aac | .aac |
| Opus | audio/opus | .opus |
Maximum file size: 100MB. For larger files, consider splitting into segments or using real-time streaming.
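Checking format and size on the client avoids a wasted upload. The sketch below encodes the Supported Formats table and the 100MB limit. `check_upload` is a hypothetical helper, and treating 100MB as 100,000,000 bytes is an assumption (the conservative reading); confirm the exact limit in the API reference if your files are close to it.

```python
from pathlib import Path

# Extension -> MIME type, copied from the Supported Formats table.
SUPPORTED_FORMATS = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
    ".webm": "audio/webm",
    ".mp4": "audio/mp4",
    ".m4a": "audio/mp4",
    ".aac": "audio/aac",
    ".opus": "audio/opus",
}

# Assumption: 100MB read conservatively as decimal megabytes.
MAX_UPLOAD_BYTES = 100_000_000


def check_upload(filename: str, size_bytes: int) -> str:
    """Return the MIME type to send for this file, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    mime = SUPPORTED_FORMATS.get(ext)
    if mime is None:
        raise ValueError(f"unsupported audio format: {ext or '(no extension)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"{filename} is {size_bytes} bytes; split it into segments "
            "or use real-time streaming"
        )
    return mime
```

For example, `check_upload("visit.m4a", 5_000_000)` returns `"audio/mp4"`, which you can pass as the content type of the uploaded file.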
Upload and Poll
File transcription is asynchronous. Submit your file, then poll for completion.

Dictation Formatting
To format prerecorded transcript output for dictation workflows, add the optional `dictation` field to the upload request and set it to `true`. If omitted, `dictation` defaults to `false`.
Status Lifecycle
| Status | Description |
|---|---|
| `pending` | Request received, queued for processing |
| `processing` | Actively being transcribed |
| `completed` | Transcription ready in `payload.transcription` |
| `failed` | An error occurred |
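The upload-and-poll flow can be sketched as a loop over this status lifecycle. This is a minimal sketch under stated assumptions: the status response is JSON with a top-level `status` field holding the values above and, once completed, the transcript at `payload.transcription`. The HTTP request itself is left to a caller-supplied `fetch` function because the endpoint path is not shown here; `upload_form_fields` and `poll_transcription` are hypothetical names, and sending `dictation` as the string `"true"` assumes a multipart form upload.

```python
import time
from typing import Callable, Dict, Mapping


def upload_form_fields(dictation: bool = False) -> Dict[str, str]:
    """Extra form fields for the upload request (hypothetical helper).

    Omitting `dictation` is the same as false, so it is only sent when
    dictation formatting is wanted.
    """
    return {"dictation": "true"} if dictation else {}


def poll_transcription(
    fetch: Callable[[], Mapping],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping:
    """Call `fetch` until the job is completed; raise if it fails or times out."""
    waited = 0.0
    while True:
        job = fetch()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"transcription failed: {job}")
        if status not in ("pending", "processing"):
            raise RuntimeError(f"unexpected status: {status!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

A fixed polling interval keeps the sketch simple; for long recordings you may prefer to lengthen the interval over time, or use webhooks to be notified when transcriptions complete.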
Real-time Streaming
Stream audio in real time during patient visits for immediate transcription feedback. This approach uses WebSockets to send audio chunks and receive transcription segments as they are processed.

Connection Flow
Get a Streaming Token
Before connecting to the WebSocket, obtain a short-lived token from the stream token endpoint.

WebSocket URL
Connect to the streaming endpoint with your token and audio parameters:

| Parameter | Required | Description |
|---|---|---|
| `sample_rate` | No | Audio sample rate in Hz (e.g., 16000, 44100). If omitted, the streaming service currently defaults to 16000. For raw, headerless audio, send the actual sample rate explicitly. |
| `encoding` | Raw audio only | Audio encoding. Required for raw, headerless audio; omit for containerized audio. |
| `account_id` | Yes | Your Sully account ID |
| `api_token` | Yes | Token from the stream token endpoint |
| `language` | No | BCP47 language tag (e.g., `en`, `es`, `multi`) |
| `dictation` | No | Set to `true` to request dictation-oriented transcript formatting |
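The parameters above can be assembled into the connection URL with the standard library, which also handles escaping of the token. A sketch under stated assumptions: `STREAM_URL` is a placeholder, not the real endpoint, and `build_stream_url` is a hypothetical helper. It enforces the raw-versus-containerized rule for `encoding` and `sample_rate`; the accepted `encoding` values are not listed on this page, so pass whatever the API reference specifies.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Placeholder only: copy the real streaming endpoint from the API reference.
STREAM_URL = "wss://streaming.example.invalid/stream"


def build_stream_url(
    account_id: str,
    api_token: str,
    *,
    raw_audio: bool = False,
    sample_rate: Optional[int] = None,
    encoding: Optional[str] = None,
    language: Optional[str] = None,
    dictation: bool = False,
) -> str:
    # Raw, headerless audio needs both values; containerized audio omits encoding.
    if raw_audio and (encoding is None or sample_rate is None):
        raise ValueError("raw audio requires both encoding and sample_rate")
    if not raw_audio and encoding is not None:
        raise ValueError("omit encoding for containerized audio")

    params: Dict[str, str] = {"account_id": account_id, "api_token": api_token}
    if sample_rate is not None:
        params["sample_rate"] = str(sample_rate)  # service defaults to 16000 if omitted
    if encoding is not None:
        params["encoding"] = encoding
    if language is not None:
        params["language"] = language
    if dictation:
        params["dictation"] = "true"
    return f"{STREAM_URL}?{urlencode(params)}"
```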
For raw, headerless audio, send both `encoding` and `sample_rate`. For containerized audio, omit `encoding`. If `sample_rate` is omitted, the streaming service currently defaults to 16000.

Message Format
- Stream ready: wait for the `status: connected` message before sending audio.
- `text`: the transcribed text for the current segment.
- `is_final`: the canonical finality flag for the current segment.
- `isFinal`: a compatibility alias for `is_final`.
- `type: "error"`: indicates a transcription problem. Some runtime errors are non-terminal, while other failures are followed by socket closure.
Basic WebSocket Connection
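The message rules above can be handled independently of any particular WebSocket client. This minimal sketch assumes each frame is a JSON text message with the fields at the top level and that the ready message looks like `{"status": "connected"}`; `StreamSession` and `Segment` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    text: str
    is_final: bool


class StreamSession:
    """Transport-agnostic handling of frames received on the streaming socket."""

    def __init__(self) -> None:
        self.ready = False                 # set once `status: connected` arrives
        self.errors: List[dict] = []       # error frames surfaced to the caller
        self.final_text: List[str] = []    # text of finalized segments

    def can_send_audio(self) -> bool:
        return self.ready

    def handle(self, raw: str) -> Optional[Segment]:
        msg = json.loads(raw)
        if not isinstance(msg, dict):
            return None
        if msg.get("status") == "connected":
            self.ready = True
            return None
        if msg.get("type") == "error":
            # Record it but keep listening: not every error frame is terminal.
            # A real socket close is reported by the transport, not here.
            self.errors.append(msg)
            return None
        if "text" not in msg:
            return None  # message types this sketch does not model
        # Prefer the canonical flag; fall back to the compatibility alias.
        final = msg.get("is_final", msg.get("isFinal", False))
        segment = Segment(text=msg["text"], is_final=bool(final))
        if segment.is_final:
            self.final_text.append(segment.text)
        return segment
```

With a client such as the third-party `websockets` package, you would pass each received frame to `session.handle(...)` and start sending audio chunks only once `can_send_audio()` returns `True`. Non-final segments are typically shown as provisional text and replaced when the final version of the segment arrives.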
Production Streaming
Real-time audio streaming in production requires handling network interruptions, reconnection, and audio buffering. This section covers patterns for reliable streaming.

Key Challenges
- Network interruptions - Mobile networks and WiFi can drop unexpectedly
- Token expiration - Streaming tokens have limited validity
- Audio continuity - Buffering audio during reconnection to prevent data loss
- State recovery - Resuming transcription context after reconnection
- Error frames - Some server error messages are non-terminal, while others precede disconnects
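Two of these challenges, token expiration and audio continuity, can be handled with small helpers. In this sketch, `TokenCache` and `ReconnectBuffer` are hypothetical names, `fetch` stands in for your call to the stream token endpoint, and the token lifetime `ttl_s` is an assumption you should take from the token response or API reference. Dropping the oldest audio when the buffer is full is a design choice, not an API requirement.

```python
import time
from collections import deque
from typing import Callable, Iterator, Optional


class TokenCache:
    """Reuses a streaming token until shortly before it expires."""

    def __init__(self, fetch: Callable[[], str], ttl_s: float,
                 margin_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # your call to the stream token endpoint
        self._ttl = ttl_s            # assumed token lifetime in seconds
        self._margin = margin_s      # refresh this long before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = self._clock() + self._ttl
        return self._token

    def invalidate(self) -> None:
        """Call after a 401 so the next get() fetches a fresh token."""
        self._token = None


class ReconnectBuffer:
    """Holds audio chunks while disconnected and replays them in order.

    Bounded by max_bytes; when full, the oldest audio is dropped first.
    """

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._chunks: deque = deque()
        self._size = 0
        self.dropped_bytes = 0

    def push(self, chunk: bytes) -> None:
        self._chunks.append(chunk)
        self._size += len(chunk)
        while self._size > self.max_bytes and self._chunks:
            oldest = self._chunks.popleft()
            self._size -= len(oldest)
            self.dropped_bytes += len(oldest)

    def drain(self) -> Iterator[bytes]:
        """Yield buffered chunks oldest-first, emptying the buffer."""
        while self._chunks:
            chunk = self._chunks.popleft()
            self._size -= len(chunk)
            yield chunk
```

For sizing, 16-bit mono PCM at 16000 Hz is 32,000 bytes per second, so holding 30 seconds of such audio takes 960,000 bytes. After reconnecting, drain the buffer into the new socket once it reports `status: connected`, before sending live audio.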
Reconnection with Exponential Backoff
Never reconnect immediately after a failure. Use exponential backoff with jitter so that clients recovering from the same outage do not all reconnect at once (the thundering herd problem).

Production WebSocket Implementation
A production implementation needs to handle reconnection, audio buffering, and error recovery. Reconnect on close events and connection failures. Surface `type: "error"` messages to the caller immediately, but do not assume every error frame is terminal: some runtime errors are followed by later stream messages, while other failures are followed by socket closure.

Error Recovery Strategies
| Error | Recovery Strategy |
|---|---|
| Connection timeout | Retry with backoff, check network |
| Token expired (401) | Fetch new token, reconnect |
| Rate limited (429) | Use Retry-After header, increase backoff |
| Server error (5xx) | Retry with backoff |
| WebSocket error message | Surface to caller, keep listening for follow-up messages, reconnect if the socket closes |
| Invalid audio format | Check sample rate, encoding |
| Network disconnect | Reconnect with buffered audio |
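The retry rows of this table, together with the exponential backoff with jitter described earlier, can be sketched as one function. This sketch uses the "full jitter" variant (a random delay between zero and the exponential ceiling). It assumes your client reports the HTTP status from the token request or WebSocket handshake, or `None` for timeouts and dropped networks, and that `Retry-After` has already been parsed into seconds. `backoff_delay` and `recovery_action` are hypothetical names. WebSocket error frames are not covered here because they do not necessarily end the connection.

```python
import random
from typing import Optional, Tuple


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    attempt = min(attempt, 32)  # keep 2**attempt small enough to convert to float
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return (rng or random).uniform(0.0, ceiling)


def recovery_action(status: Optional[int], attempt: int,
                    retry_after_s: Optional[float] = None,
                    rng: Optional[random.Random] = None) -> Tuple[str, float]:
    """Map a failed connection attempt to (action, delay_seconds)."""
    if status == 401:
        # Expired token: fetch a new one, then reconnect after a backoff delay.
        return "refresh_token", backoff_delay(attempt, rng=rng)
    if status == 429:
        # Rate limited: raise the backoff ceiling one step, never undercut Retry-After.
        wait = backoff_delay(attempt + 1, rng=rng)
        if retry_after_s is not None:
            wait = max(wait, retry_after_s)
        return "retry", wait
    if status is None or 500 <= status < 600:
        # Timeouts, network drops, and server errors: plain backoff.
        return "retry", backoff_delay(attempt, rng=rng)
    # Other client errors (such as a rejected audio format) need a fix, not a retry.
    return "fail", 0.0
```

The `attempt` counter should reset to zero after a connection succeeds, so a later outage starts again from short delays.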
Language Support
Sully.ai supports transcription in multiple languages using BCP47 language tags.

Supported Languages
| Language | Tag | Regional Variants |
|---|---|---|
| English | en | en-US, en-GB, en-AU |
| Spanish | es | es-US, es-ES, es-MX |
| Chinese | zh | zh-CN, zh-TW |
| French | fr | fr-FR, fr-CA |
| German | de | de-DE |
| Portuguese | pt | pt-BR, pt-PT |
| Japanese | ja | ja-JP |
| Korean | ko | ko-KR |
Multilingual Mode
For conversations that switch between languages, use `language=multi`.
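Choosing the `language` value can be reduced to a small rule: one expected language means its tag, several distinct languages mean `multi`. In this sketch, `language_param` is a hypothetical helper and the list of expected languages is a hypothetical input (for example, from a patient profile). It sends only the primary subtag from the Supported Languages table; whether regional variants such as `en-US` are accepted directly is not stated here, so check the API reference before sending them.

```python
from typing import Iterable

# Primary language subtags from the Supported Languages table.
SUPPORTED_LANGUAGES = {"en", "es", "zh", "fr", "de", "pt", "ja", "ko"}


def language_param(expected: Iterable[str]) -> str:
    """Return the `language` value for a visit given the BCP47 tags you expect."""
    primaries = set()
    for tag in expected:
        primary = tag.split("-")[0].lower()  # "es-MX" -> "es"
        if primary not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {tag}")
        primaries.add(primary)
    if not primaries:
        raise ValueError("expected at least one language")
    # Single-language mode filters out other languages, so mixed visits use "multi".
    return primaries.pop() if len(primaries) == 1 else "multi"
```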
Language in Streaming
Specify `language` as a query parameter when connecting to the WebSocket. Audio in languages other than the specified language is filtered out, so use `multi` if your conversations include multiple languages.

Choosing Upload vs Stream
Use this decision matrix to select the right approach:

| Criterion | File Upload | Real-time Stream |
|---|---|---|
| Use Case | Pre-recorded audio, batch processing | Live patient visits |
| Latency Requirement | Seconds to minutes acceptable | Immediate feedback needed |
| File Size | Up to 100MB | N/A (continuous stream) |
| Network Reliability | Needs connectivity only while uploading and polling | Requires a stable connection for the whole session |
| Implementation Complexity | Simple (HTTP upload + polling) | Complex (WebSocket + reconnection) |
| Offline Support | Upload when online | Requires active connection |
When to Use File Upload
- Processing recorded audio from devices or archives
- Batch transcription of multiple files
- Integration with systems that produce audio files
- Environments with unreliable network connectivity (upload when stable)
- Backend processing pipelines
When to Use Real-time Streaming
- Live transcription during patient visits
- Providing immediate visual feedback to clinicians
- Interactive applications where users see text as they speak
- Reducing perceived latency in clinical workflows
- Mobile applications with microphone access
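The decision matrix and the lists above can be condensed into a small routing rule. This is a sketch, not an official policy: `AudioJob` and `choose_approach` are hypothetical names, and the 100MB limit is again read conservatively as 100,000,000 bytes.

```python
from dataclasses import dataclass
from typing import Optional

MAX_UPLOAD_BYTES = 100_000_000  # conservative reading of the 100MB limit


@dataclass
class AudioJob:
    live: bool                          # audio is captured during the visit
    needs_immediate_text: bool          # clinicians watch text as they speak
    file_size_bytes: Optional[int] = None  # set for pre-recorded audio


def choose_approach(job: AudioJob) -> str:
    """Return "stream", "upload", or "split_then_upload"."""
    if job.live or job.needs_immediate_text:
        return "stream"
    if job.file_size_bytes is not None and job.file_size_bytes > MAX_UPLOAD_BYTES:
        # Larger files must be split into segments before upload.
        return "split_then_upload"
    return "upload"
```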
Next Steps
Generate Notes
Convert transcriptions into structured clinical notes
Webhooks
Get notified when transcriptions complete
TypeScript SDK
Full SDK reference for Node.js applications
Python SDK
Full SDK reference for Python applications