
Overview

Sully.ai provides two approaches for converting patient conversations to text:
| Approach | Best For | Latency |
| --- | --- | --- |
| File Upload | Pre-recorded audio, batch processing, large files | Async (seconds to minutes) |
| Real-time Streaming | Live visits, immediate feedback, interactive transcription | Real-time (~200ms) |
Both approaches produce the same high-quality medical transcription output that can be passed to note generation.

File Upload

Upload pre-recorded audio files for asynchronous transcription. This approach is ideal for batch processing, large files, or when real-time feedback is not required.

Supported Formats

| Format | MIME Type | Extension |
| --- | --- | --- |
| WAV | audio/wav | .wav |
| MP3 | audio/mpeg | .mp3 |
| FLAC | audio/flac | .flac |
| OGG | audio/ogg | .ogg |
| WebM | audio/webm | .webm |
| MP4 | audio/mp4 | .mp4 |
| M4A | audio/mp4 | .m4a |
| AAC | audio/aac | .aac |
| Opus | audio/opus | .opus |
Maximum file size: 100MB. For larger files, consider splitting into segments or using real-time streaming.
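The 100MB limit can be enforced client-side before uploading. A minimal sketch (the helper names are ours, and the constant mirrors the limit stated above):

```typescript
import * as fs from 'fs';

const MAX_UPLOAD_BYTES = 100 * 1024 * 1024; // 100MB limit from above

// Returns true when a payload of this size fits within the upload limit.
function isWithinUploadLimit(sizeBytes: number, maxBytes = MAX_UPLOAD_BYTES): boolean {
  return sizeBytes <= maxBytes;
}

// Check a file on disk before calling the upload endpoint.
function assertUploadable(path: string): void {
  const { size } = fs.statSync(path);
  if (!isWithinUploadLimit(size)) {
    throw new Error(
      `${path} is ${size} bytes; split it into segments or use real-time streaming`
    );
  }
}
```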

Upload and Poll

File transcription is asynchronous. Submit your file, then poll for completion.
import SullyAI from '@sullyai/sullyai';
import * as fs from 'fs';

const client = new SullyAI();

// 1. Upload audio file
const transcription = await client.audio.transcriptions.create({
  audio: fs.createReadStream('patient-visit.mp3'),
});

console.log(`Transcription ID: ${transcription.transcriptionId}`);

// 2. Poll until complete
let result = await client.audio.transcriptions.retrieve(
  transcription.transcriptionId
);

while (result.status === 'STATUS_PROCESSING') {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  result = await client.audio.transcriptions.retrieve(
    transcription.transcriptionId
  );
}

if (result.status === 'STATUS_ERROR') {
  throw new Error('Transcription failed');
}

console.log('Transcript:', result.payload?.transcription);
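The loop above polls indefinitely; a stalled job would keep it spinning forever. One way to bound it is a small generic helper like this sketch (the helper and its option names are ours, not part of the SDK):

```typescript
interface PollOptions {
  intervalMs: number;  // delay between polls
  maxAttempts: number; // give up after this many polls
}

// Calls fetchStatus repeatedly until isDone returns true or attempts run out.
async function pollUntil<T>(
  fetchStatus: () => Promise<T>,
  isDone: (result: T) => boolean,
  { intervalMs, maxAttempts }: PollOptions
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchStatus();
    if (isDone(result)) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Polling gave up after ${maxAttempts} attempts`);
}
```

With the SDK above this might be invoked as `pollUntil(() => client.audio.transcriptions.retrieve(id), (r) => r.status !== 'STATUS_PROCESSING', { intervalMs: 2000, maxAttempts: 60 })`.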

Status Lifecycle

| Status | Description |
| --- | --- |
| pending | Request received, queued for processing |
| processing | Actively being transcribed |
| completed | Transcription ready in payload.transcription |
| failed | An error occurred |
For production applications, use webhooks instead of polling to receive notifications when transcription completes.
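The exact webhook payload shape is not documented in this section; assuming it carries the transcription ID and a status like the table above (field names here are assumptions to verify against the webhooks reference), a receiving endpoint might parse it like this:

```typescript
// Assumed webhook payload shape -- confirm against the webhooks reference.
interface TranscriptionWebhookPayload {
  transcriptionId: string;
  status: string;
}

// Parses a raw webhook body and reports whether the job reached a terminal state.
function parseTranscriptionWebhook(rawBody: string): {
  payload: TranscriptionWebhookPayload;
  isTerminal: boolean;
} {
  const payload = JSON.parse(rawBody) as TranscriptionWebhookPayload;
  // Anything past pending/processing (completed or failed) is terminal
  const isTerminal = payload.status !== 'pending' && payload.status !== 'processing';
  return { payload, isTerminal };
}
```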

Real-time Streaming

Stream audio in real-time during patient visits for immediate transcription feedback. This approach uses WebSockets to send audio chunks and receive transcription segments as they are processed.

Connection Flow

1. Get token     POST /v1/audio/transcriptions/stream/token
2. Connect       wss://api.sully.ai/v1/audio/transcriptions/stream?...
3. Send audio    { "audio": "<base64-encoded-audio>" }
4. Receive text  { "text": "...", "isFinal": true/false }
5. Close         ws.close()

Get a Streaming Token

Before connecting to the WebSocket, obtain a short-lived token:
const tokenResponse = await fetch(
  'https://api.sully.ai/v1/audio/transcriptions/stream/token',
  {
    method: 'POST',
    headers: {
      'X-API-Key': process.env.SULLY_API_KEY!,
      'X-Account-Id': process.env.SULLY_ACCOUNT_ID!,
    },
  }
);

const { data: { token } } = await tokenResponse.json();

WebSocket URL

Connect to the streaming endpoint with your token and audio parameters:
wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id={accountId}&api_token={token}
| Parameter | Required | Description |
| --- | --- | --- |
| sample_rate | Yes | Audio sample rate in Hz (e.g., 16000, 44100) |
| account_id | Yes | Your Sully account ID |
| api_token | Yes | Token from the stream token endpoint |
| language | No | BCP47 language tag (e.g., en, es, multi) |
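The query string is easiest to assemble with URLSearchParams, which also handles escaping. A small sketch (the function name is ours; parameter names follow the table above):

```typescript
// Builds the streaming WebSocket URL from the documented query parameters.
function buildStreamUrl(opts: {
  sampleRate: number;
  accountId: string;
  apiToken: string;
  language?: string;
}): string {
  const params = new URLSearchParams({
    sample_rate: String(opts.sampleRate),
    account_id: opts.accountId,
    api_token: opts.apiToken,
  });
  if (opts.language) params.set('language', opts.language); // optional
  return `wss://api.sully.ai/v1/audio/transcriptions/stream?${params}`;
}
```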

Message Format

Sending audio:
{ "audio": "<base64-encoded-audio-chunk>" }
Receiving transcription:
{
  "text": "The patient reports feeling tired",
  "isFinal": false
}
  • text: The transcribed text for the current segment
  • isFinal: When true, this segment is complete and will not change

Basic WebSocket Connection

// Get streaming token first (see above)
const token = await getStreamingToken();
const accountId = process.env.SULLY_ACCOUNT_ID!;

// Connect to WebSocket
const ws = new WebSocket(
  `wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id=${accountId}&api_token=${token}`
);

// Track transcription segments
const segments: string[] = [];
let currentIndex = 0;

ws.onopen = () => {
  console.log('Connected to transcription stream');
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.text) {
    segments[currentIndex] = data.text;

    // isFinal indicates this segment is complete
    if (data.isFinal) {
      console.log(`Segment ${currentIndex}: ${data.text}`);
      currentIndex++;
    }
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = (event) => {
  console.log(`Connection closed: ${event.code} ${event.reason}`);
  const fullTranscript = segments.join(' ');
  console.log('Full transcript:', fullTranscript);
};

// Send audio data (from microphone, file, etc.)
function sendAudio(audioBuffer: ArrayBuffer) {
  // Build the binary string byte-by-byte: spreading a large chunk into
  // String.fromCharCode can overflow the call stack
  const bytes = new Uint8Array(audioBuffer);
  let binary = '';
  for (let i = 0; i < bytes.byteLength; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  ws.send(JSON.stringify({ audio: btoa(binary) }));
}
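Browser microphone audio typically arrives as Float32 samples in [-1, 1] from the Web Audio API, while streaming pipelines generally expect 16-bit signed PCM. A hedged sketch of the conversion (confirm the expected encoding and sample rate against the streaming reference before relying on it):

```typescript
// Converts Web Audio Float32 samples ([-1, 1]) to 16-bit signed PCM.
function float32ToPCM16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range samples, then scale negatives by 0x8000 and
    // positives by 0x7FFF to use the full 16-bit range
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return pcm;
}

// In a browser this would feed sendAudio, roughly:
//   const ctx = new AudioContext({ sampleRate: 16000 });
//   ...in a ScriptProcessor/AudioWorklet callback...
//   sendAudio(float32ToPCM16(inputBuffer.getChannelData(0)).buffer);
```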

Production Streaming

Real-time audio streaming in production requires handling network interruptions, reconnection, and audio buffering. This section provides battle-tested patterns for reliable streaming.

Key Challenges

  1. Network interruptions - Mobile networks and WiFi can drop unexpectedly
  2. Token expiration - Streaming tokens have limited validity
  3. Audio continuity - Buffering audio during reconnection to prevent data loss
  4. State recovery - Resuming transcription context after reconnection

Reconnection with Exponential Backoff

Never reconnect immediately after a failure. Use exponential backoff with jitter to prevent thundering herd problems:
interface BackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

function calculateBackoff(
  attempt: number,
  config: BackoffConfig
): number {
  const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
  // Add jitter: random value between 0-25% of delay
  const jitter = cappedDelay * Math.random() * 0.25;
  return cappedDelay + jitter;
}

Production WebSocket Implementation

The following implementation handles reconnection, audio buffering, and error recovery:
import { EventEmitter } from 'events';

interface StreamConfig {
  accountId: string;
  apiKey: string;
  sampleRate: number;
  language?: string;
  maxReconnectAttempts?: number;
  baseReconnectDelayMs?: number;
  maxReconnectDelayMs?: number;
  audioBufferMaxSize?: number;
}

interface TranscriptionSegment {
  index: number;
  text: string;
  isFinal: boolean;
}

type ConnectionState =
  | 'disconnected'
  | 'connecting'
  | 'connected'
  | 'reconnecting';

class ProductionTranscriptionStream extends EventEmitter {
  private ws: WebSocket | null = null;
  private state: ConnectionState = 'disconnected';
  private reconnectAttempt = 0;
  private audioBuffer: string[] = [];
  private segments: string[] = [];
  private currentSegmentIndex = 0;
  private token: string | null = null;
  private abortController: AbortController | null = null;

  private readonly config: Required<StreamConfig>;

  constructor(config: StreamConfig) {
    super();
    this.config = {
      maxReconnectAttempts: 5,
      baseReconnectDelayMs: 1000,
      maxReconnectDelayMs: 30000,
      audioBufferMaxSize: 100, // Buffer up to 100 audio chunks
      language: 'en',
      ...config,
    };
  }

  async connect(signal?: AbortSignal): Promise<void> {
    if (signal?.aborted) {
      throw new Error('Connection aborted');
    }

    this.abortController = new AbortController();
    this.state = 'connecting';
    this.emit('stateChange', this.state);

    try {
      // Get fresh token
      this.token = await this.fetchToken();
      await this.establishConnection();
    } catch (error) {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      throw error;
    }
  }

  private async fetchToken(): Promise<string> {
    const response = await fetch(
      'https://api.sully.ai/v1/audio/transcriptions/stream/token',
      {
        method: 'POST',
        headers: {
          'X-API-Key': this.config.apiKey,
          'X-Account-Id': this.config.accountId,
        },
        signal: this.abortController?.signal,
      }
    );

    if (!response.ok) {
      throw new Error(`Token fetch failed: ${response.status}`);
    }

    const { data } = await response.json();
    return data.token;
  }

  private async establishConnection(): Promise<void> {
    return new Promise((resolve, reject) => {
      const params = new URLSearchParams({
        sample_rate: this.config.sampleRate.toString(),
        account_id: this.config.accountId,
        api_token: this.token!,
      });

      if (this.config.language) {
        params.set('language', this.config.language);
      }

      const url = `wss://api.sully.ai/v1/audio/transcriptions/stream?${params}`;
      this.ws = new WebSocket(url);

      const connectionTimeout = setTimeout(() => {
        this.ws?.close();
        reject(new Error('Connection timeout'));
      }, 10000);

      this.ws.onopen = () => {
        clearTimeout(connectionTimeout);
        this.state = 'connected';
        this.reconnectAttempt = 0;
        this.emit('stateChange', this.state);
        this.emit('connected');

        // Flush buffered audio
        this.flushAudioBuffer();
        resolve();
      };

      this.ws.onmessage = (event) => {
        this.handleMessage(event.data);
      };

      this.ws.onerror = (error) => {
        clearTimeout(connectionTimeout);
        this.emit('error', error);
      };

      this.ws.onclose = (event) => {
        clearTimeout(connectionTimeout);
        // Capture before handleDisconnect mutates the state, otherwise a
        // failed initial connection would never reject this promise
        const wasConnecting = this.state === 'connecting';
        this.handleDisconnect(event);
        if (wasConnecting) {
          reject(new Error(`Connection closed: ${event.code}`));
        }
      };
    });
  }

  private handleMessage(data: string): void {
    try {
      const message = JSON.parse(data);

      if (message.error) {
        this.emit('error', new Error(message.error));
        return;
      }

      if (message.text !== undefined) {
        this.segments[this.currentSegmentIndex] = message.text;

        const segment: TranscriptionSegment = {
          index: this.currentSegmentIndex,
          text: message.text,
          isFinal: message.isFinal ?? false,
        };

        this.emit('transcription', segment);

        if (message.isFinal) {
          this.currentSegmentIndex++;
        }
      }
    } catch (error) {
      this.emit('error', new Error(`Failed to parse message: ${data}`));
    }
  }

  private async handleDisconnect(event: CloseEvent): Promise<void> {
    const wasConnected = this.state === 'connected';
    this.ws = null;

    // Normal closure or intentional disconnect
    if (event.code === 1000 || this.state === 'disconnected') {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      this.emit('disconnected', { code: event.code, reason: event.reason });
      return;
    }

    // Unexpected disconnect - attempt reconnection (including while a
    // reconnect attempt is in flight, so failures chain into the next retry)
    if (
      (wasConnected || this.state === 'reconnecting') &&
      this.reconnectAttempt < this.config.maxReconnectAttempts
    ) {
      await this.attemptReconnect();
    } else {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      this.emit('disconnected', {
        code: event.code,
        reason: event.reason,
        reconnectFailed: true,
      });
    }
  }

  private async attemptReconnect(): Promise<void> {
    this.state = 'reconnecting';
    this.emit('stateChange', this.state);

    const delay = this.calculateBackoff();
    this.emit('reconnecting', {
      attempt: this.reconnectAttempt + 1,
      maxAttempts: this.config.maxReconnectAttempts,
      delayMs: delay,
    });

    await this.sleep(delay);
    this.reconnectAttempt++;

    try {
      // Get fresh token for reconnection
      this.token = await this.fetchToken();
      await this.establishConnection();
    } catch (error) {
      this.emit('error', error);
      // Will trigger another reconnect attempt via onclose handler
    }
  }

  private calculateBackoff(): number {
    const exponentialDelay =
      this.config.baseReconnectDelayMs * Math.pow(2, this.reconnectAttempt);
    const cappedDelay = Math.min(
      exponentialDelay,
      this.config.maxReconnectDelayMs
    );
    const jitter = cappedDelay * Math.random() * 0.25;
    return Math.floor(cappedDelay + jitter);
  }

  sendAudio(audioData: ArrayBuffer | Uint8Array): void {
    const base64Audio = this.arrayBufferToBase64(audioData);

    if (this.state === 'connected' && this.ws?.readyState === WebSocket.OPEN) {
      // Send immediately if connected
      this.ws.send(JSON.stringify({ audio: base64Audio }));
    } else if (
      this.state === 'reconnecting' ||
      this.state === 'connecting'
    ) {
      // Buffer audio during reconnection
      this.bufferAudio(base64Audio);
    }
    // Drop audio if disconnected (not reconnecting)
  }

  private bufferAudio(base64Audio: string): void {
    this.audioBuffer.push(base64Audio);

    // Prevent unbounded buffer growth
    while (this.audioBuffer.length > this.config.audioBufferMaxSize) {
      this.audioBuffer.shift();
      this.emit('bufferOverflow');
    }
  }

  private flushAudioBuffer(): void {
    if (this.audioBuffer.length === 0) return;

    const bufferedCount = this.audioBuffer.length;
    this.emit('bufferFlush', { count: bufferedCount });

    for (const base64Audio of this.audioBuffer) {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ audio: base64Audio }));
      }
    }

    this.audioBuffer = [];
  }

  private arrayBufferToBase64(buffer: ArrayBuffer | Uint8Array): string {
    const bytes = buffer instanceof Uint8Array ? buffer : new Uint8Array(buffer);
    let binary = '';
    for (let i = 0; i < bytes.byteLength; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  getTranscript(): string {
    return this.segments.join(' ');
  }

  getState(): ConnectionState {
    return this.state;
  }

  disconnect(): void {
    this.state = 'disconnected';
    this.abortController?.abort();
    this.ws?.close(1000, 'Client disconnect');
    this.ws = null;
    this.audioBuffer = [];
  }
}

// Usage example
const stream = new ProductionTranscriptionStream({
  accountId: process.env.SULLY_ACCOUNT_ID!,
  apiKey: process.env.SULLY_API_KEY!,
  sampleRate: 16000,
  language: 'en',
});

stream.on('stateChange', (state) => {
  console.log(`Connection state: ${state}`);
});

stream.on('transcription', (segment) => {
  if (segment.isFinal) {
    console.log(`[Final] ${segment.text}`);
  } else {
    console.log(`[Interim] ${segment.text}`);
  }
});

stream.on('reconnecting', ({ attempt, maxAttempts, delayMs }) => {
  console.log(`Reconnecting (${attempt}/${maxAttempts}) in ${delayMs}ms`);
});

stream.on('error', (error) => {
  console.error('Stream error:', error);
});

await stream.connect();

// Send audio from microphone, file, etc.
// stream.sendAudio(audioChunk);

// When done
// stream.disconnect();

Error Recovery Strategies

| Error | Recovery Strategy |
| --- | --- |
| Connection timeout | Retry with backoff, check network |
| Token expired (401) | Fetch new token, reconnect |
| Rate limited (429) | Use Retry-After header, increase backoff |
| Server error (5xx) | Retry with backoff |
| Invalid audio format | Check sample rate, encoding |
| Network disconnect | Reconnect with buffered audio |
Always implement a maximum reconnection limit. Infinite reconnection loops can drain device batteries and create unnecessary server load.
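The table above can be encoded as a small dispatch on the HTTP status returned by the token endpoint. The action names are ours, and mapping 400 to an audio-configuration problem is an assumption to verify against the API reference:

```typescript
type RecoveryAction =
  | 'refresh_token'
  | 'honor_retry_after'
  | 'retry_with_backoff'
  | 'fix_audio_config';

// Maps an HTTP status from the token endpoint to a recovery action.
function recoveryFor(status: number): RecoveryAction {
  if (status === 401) return 'refresh_token';     // token expired
  if (status === 429) return 'honor_retry_after'; // rate limited
  if (status >= 500) return 'retry_with_backoff'; // server error
  if (status === 400) return 'fix_audio_config';  // assumed: invalid audio params
  return 'retry_with_backoff';                    // default: retry conservatively
}
```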

Language Support

Sully.ai supports transcription in multiple languages using BCP47 language tags.

Supported Languages

| Language | Tag | Regional Variants |
| --- | --- | --- |
| English | en | en-US, en-GB, en-AU |
| Spanish | es | es-US, es-ES, es-MX |
| Chinese | zh | zh-CN, zh-TW |
| French | fr | fr-FR, fr-CA |
| German | de | de-DE |
| Portuguese | pt | pt-BR, pt-PT |
| Japanese | ja | ja-JP |
| Korean | ko | ko-KR |
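When accepting a language from user input, it can help to normalize regional variants down to the base tags in the table. The helper below is ours, not part of the SDK; multi is passed through for multilingual mode:

```typescript
// Base tags from the supported-languages table, plus 'multi'.
const SUPPORTED_BASE_TAGS = new Set([
  'en', 'es', 'zh', 'fr', 'de', 'pt', 'ja', 'ko', 'multi',
]);

// Normalizes a BCP47 tag (e.g. "en-US") to its supported base tag, or null.
function normalizeLanguage(tag: string): string | null {
  const base = tag.toLowerCase().split('-')[0];
  return SUPPORTED_BASE_TAGS.has(base) ? base : null;
}
```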

Multilingual Mode

For conversations that switch between languages, use language=multi:
// File upload with multilingual support
const transcription = await client.audio.transcriptions.create({
  audio: fs.createReadStream('multilingual-visit.mp3'),
  language: 'multi',
});

Language in Streaming

Specify language when connecting to the WebSocket:
wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id={id}&api_token={token}&language=es
Audio in languages other than the one you specify will be filtered out. Use multi if your conversations include multiple languages.

Choosing Upload vs Stream

Use this decision matrix to select the right approach:
| Criterion | File Upload | Real-time Stream |
| --- | --- | --- |
| Use Case | Pre-recorded audio, batch processing | Live patient visits |
| Latency Requirement | Seconds to minutes acceptable | Immediate feedback needed |
| File Size | Any size up to 100MB | N/A (continuous stream) |
| Network Reliability | Single request | Requires stable connection |
| Implementation Complexity | Simple (HTTP upload + polling) | Complex (WebSocket + reconnection) |
| Offline Support | Upload when online | Requires active connection |

When to Use File Upload

  • Processing recorded audio from devices or archives
  • Batch transcription of multiple files
  • Integration with systems that produce audio files
  • Environments with unreliable network connectivity (upload when stable)
  • Backend processing pipelines

When to Use Real-time Streaming

  • Live transcription during patient visits
  • Providing immediate visual feedback to clinicians
  • Interactive applications where users see text as they speak
  • Reducing perceived latency in clinical workflows
  • Mobile applications with microphone access
Many applications use both approaches: real-time streaming for live visits with immediate feedback, and file upload for processing any recordings that were captured offline.

Next Steps