
Overview

Sully.ai provides two approaches for converting patient conversations to text:
| Approach | Best For | Latency |
| --- | --- | --- |
| File Upload | Pre-recorded audio, batch processing, large files | Async (seconds to minutes) |
| Real-time Streaming | Live visits, immediate feedback, interactive transcription | Real-time (~200ms) |
Both approaches produce the same high-quality medical transcription output that can be passed to note generation.

File Upload

Upload pre-recorded audio files for asynchronous transcription. This approach is ideal for batch processing, large files, or when real-time feedback is not required.

Supported Formats

| Format | MIME Type | Extension |
| --- | --- | --- |
| WAV | audio/wav | .wav |
| MP3 | audio/mpeg | .mp3 |
| FLAC | audio/flac | .flac |
| OGG | audio/ogg | .ogg |
| WebM | audio/webm | .webm |
| MP4 | audio/mp4 | .mp4 |
| M4A | audio/mp4 | .m4a |
| AAC | audio/aac | .aac |
| Opus | audio/opus | .opus |
Maximum file size: 100MB. For larger files, consider splitting into segments or using real-time streaming.
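The 100MB limit can be enforced client-side before uploading. A minimal sketch (the helper names are ours, and the constant mirrors the limit stated above):

```typescript
import * as fs from 'fs';

const MAX_UPLOAD_BYTES = 100 * 1024 * 1024; // 100MB limit from above

// Returns true when a payload of this size fits within the upload limit.
function isWithinUploadLimit(sizeBytes: number, maxBytes = MAX_UPLOAD_BYTES): boolean {
  return sizeBytes <= maxBytes;
}

// Check a file on disk before calling the upload endpoint.
function assertUploadable(path: string): void {
  const { size } = fs.statSync(path);
  if (!isWithinUploadLimit(size)) {
    throw new Error(
      `${path} is ${size} bytes; split it into segments or use real-time streaming`
    );
  }
}
```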

Upload and Poll

File transcription is asynchronous. Submit your file, then poll for completion.
import SullyAI from '@sullyai/sullyai';
import * as fs from 'fs';

const client = new SullyAI();

// 1. Upload audio file
const transcription = await client.audio.transcriptions.create({
  audio: fs.createReadStream('patient-visit.mp3'),
});

console.log(`Transcription ID: ${transcription.transcriptionId}`);

// 2. Poll until complete
let result = await client.audio.transcriptions.retrieve(
  transcription.transcriptionId
);

while (result.status === 'STATUS_PROCESSING') {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  result = await client.audio.transcriptions.retrieve(
    transcription.transcriptionId
  );
}

if (result.status === 'STATUS_ERROR') {
  throw new Error('Transcription failed');
}

console.log('Transcript:', result.payload?.transcription);
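The loop above polls indefinitely; a stalled job would keep it spinning forever. One way to bound it is a small generic helper like this sketch (the helper and its option names are ours, not part of the SDK):

```typescript
interface PollOptions {
  intervalMs: number;  // delay between polls
  maxAttempts: number; // give up after this many polls
}

// Calls fetchStatus repeatedly until isDone returns true or attempts run out.
async function pollUntil<T>(
  fetchStatus: () => Promise<T>,
  isDone: (result: T) => boolean,
  { intervalMs, maxAttempts }: PollOptions
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchStatus();
    if (isDone(result)) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Polling gave up after ${maxAttempts} attempts`);
}
```

With the SDK above this might be invoked as `pollUntil(() => client.audio.transcriptions.retrieve(id), (r) => r.status !== 'STATUS_PROCESSING', { intervalMs: 2000, maxAttempts: 60 })`.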

Status Lifecycle

| Status | Description |
| --- | --- |
| pending | Request received, queued for processing |
| processing | Actively being transcribed |
| completed | Transcription ready in payload.transcription |
| failed | An error occurred |
For production applications, use webhooks instead of polling to receive notifications when transcription completes.
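The exact webhook payload shape is not documented in this section; assuming it carries the transcription ID and a status like the table above (field names here are assumptions to verify against the webhooks reference), a receiving endpoint might parse it like this:

```typescript
// Assumed webhook payload shape -- confirm against the webhooks reference.
interface TranscriptionWebhookPayload {
  transcriptionId: string;
  status: string;
}

// Parses a raw webhook body and reports whether the job reached a terminal state.
function parseTranscriptionWebhook(rawBody: string): {
  payload: TranscriptionWebhookPayload;
  isTerminal: boolean;
} {
  const payload = JSON.parse(rawBody) as TranscriptionWebhookPayload;
  // Anything past pending/processing (completed or failed) is terminal
  const isTerminal = payload.status !== 'pending' && payload.status !== 'processing';
  return { payload, isTerminal };
}
```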

Real-time Streaming

Stream audio in real-time during patient visits for immediate transcription feedback. This approach uses WebSockets to send audio chunks and receive transcription segments as they are processed.

Connection Flow

1. Get token     POST /v1/audio/transcriptions/stream/token
2. Connect       wss://api.sully.ai/v1/audio/transcriptions/stream?...
3. Send audio    { "audio": "<base64-encoded-audio>" }
4. Receive text  { "text": "...", "isFinal": true/false }
5. Close         ws.close()

Get a Streaming Token

Before connecting to the WebSocket, obtain a short-lived token:
const tokenResponse = await fetch(
  'https://api.sully.ai/v1/audio/transcriptions/stream/token',
  {
    method: 'POST',
    headers: {
      'X-API-Key': process.env.SULLY_API_KEY!,
      'X-Account-Id': process.env.SULLY_ACCOUNT_ID!,
    },
  }
);

const { data: { token } } = await tokenResponse.json();

WebSocket URL

Connect to the streaming endpoint with your token and audio parameters:
wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id={accountId}&api_token={token}
| Parameter | Required | Description |
| --- | --- | --- |
| sample_rate | Yes | Audio sample rate in Hz (e.g., 16000, 44100) |
| account_id | Yes | Your Sully account ID |
| api_token | Yes | Token from the stream token endpoint |
| language | No | BCP47 language tag (e.g., en, es, multi) |
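The query string is easiest to assemble with URLSearchParams, which also handles escaping. A small sketch (the function name is ours; parameter names follow the table above):

```typescript
// Builds the streaming WebSocket URL from the documented query parameters.
function buildStreamUrl(opts: {
  sampleRate: number;
  accountId: string;
  apiToken: string;
  language?: string;
}): string {
  const params = new URLSearchParams({
    sample_rate: String(opts.sampleRate),
    account_id: opts.accountId,
    api_token: opts.apiToken,
  });
  if (opts.language) params.set('language', opts.language); // optional
  return `wss://api.sully.ai/v1/audio/transcriptions/stream?${params}`;
}
```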

Message Format

Sending audio:
{ "audio": "<base64-encoded-audio-chunk>" }
Receiving transcription:
{
  "text": "The patient reports feeling tired",
  "isFinal": false
}
  • text: The transcribed text for the current segment
  • isFinal: When true, this segment is complete and will not change

Basic WebSocket Connection

// Get streaming token first (see above)
const token = await getStreamingToken();
const accountId = process.env.SULLY_ACCOUNT_ID!;

// Connect to WebSocket
const ws = new WebSocket(
  `wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id=${accountId}&api_token=${token}`
);

// Track transcription segments
const segments: string[] = [];
let currentIndex = 0;

ws.onopen = () => {
  console.log('Connected to transcription stream');
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.text) {
    segments[currentIndex] = data.text;

    // isFinal indicates this segment is complete
    if (data.isFinal) {
      console.log(`Segment ${currentIndex}: ${data.text}`);
      currentIndex++;
    }
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = (event) => {
  console.log(`Connection closed: ${event.code} ${event.reason}`);
  const fullTranscript = segments.join(' ');
  console.log('Full transcript:', fullTranscript);
};

// Send audio data (from microphone, file, etc.)
function sendAudio(audioBuffer: ArrayBuffer) {
  // Build the binary string byte-by-byte: spreading a large chunk into
  // String.fromCharCode can overflow the call stack
  const bytes = new Uint8Array(audioBuffer);
  let binary = '';
  for (let i = 0; i < bytes.byteLength; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  ws.send(JSON.stringify({ audio: btoa(binary) }));
}
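Browser microphone audio typically arrives as Float32 samples in [-1, 1] from the Web Audio API, while streaming pipelines generally expect 16-bit signed PCM. A hedged sketch of the conversion (confirm the expected encoding and sample rate against the streaming reference before relying on it):

```typescript
// Converts Web Audio Float32 samples ([-1, 1]) to 16-bit signed PCM.
function float32ToPCM16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range samples, then scale negatives by 0x8000 and
    // positives by 0x7FFF to use the full 16-bit range
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return pcm;
}

// In a browser this would feed sendAudio, roughly:
//   const ctx = new AudioContext({ sampleRate: 16000 });
//   ...in a ScriptProcessor/AudioWorklet callback...
//   sendAudio(float32ToPCM16(inputBuffer.getChannelData(0)).buffer);
```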

Production Streaming

Real-time audio streaming in production requires handling network interruptions, reconnection, and audio buffering. This section provides battle-tested patterns for reliable streaming.

Key Challenges

  1. Network interruptions - Mobile networks and WiFi can drop unexpectedly
  2. Token expiration - Streaming tokens have limited validity
  3. Audio continuity - Buffering audio during reconnection to prevent data loss
  4. State recovery - Resuming transcription context after reconnection

Reconnection with Exponential Backoff

Never reconnect immediately after a failure. Use exponential backoff with jitter to prevent thundering herd problems:
interface BackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

function calculateBackoff(
  attempt: number,
  config: BackoffConfig
): number {
  const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
  // Add jitter: random value between 0-25% of delay
  const jitter = cappedDelay * Math.random() * 0.25;
  return cappedDelay + jitter;
}

Production WebSocket Implementation

The following implementation handles reconnection, audio buffering, and error recovery:
import { EventEmitter } from 'events';

interface StreamConfig {
  accountId: string;
  apiKey: string;
  sampleRate: number;
  language?: string;
  maxReconnectAttempts?: number;
  baseReconnectDelayMs?: number;
  maxReconnectDelayMs?: number;
  audioBufferMaxSize?: number;
}

interface TranscriptionSegment {
  index: number;
  text: string;
  isFinal: boolean;
}

type ConnectionState =
  | 'disconnected'
  | 'connecting'
  | 'connected'
  | 'reconnecting';

class ProductionTranscriptionStream extends EventEmitter {
  private ws: WebSocket | null = null;
  private state: ConnectionState = 'disconnected';
  private reconnectAttempt = 0;
  private audioBuffer: string[] = [];
  private segments: string[] = [];
  private currentSegmentIndex = 0;
  private token: string | null = null;
  private abortController: AbortController | null = null;

  private readonly config: Required<StreamConfig>;

  constructor(config: StreamConfig) {
    super();
    this.config = {
      maxReconnectAttempts: 5,
      baseReconnectDelayMs: 1000,
      maxReconnectDelayMs: 30000,
      audioBufferMaxSize: 100, // Buffer up to 100 audio chunks
      language: 'en',
      ...config,
    };
  }

  async connect(signal?: AbortSignal): Promise<void> {
    if (signal?.aborted) {
      throw new Error('Connection aborted');
    }

    this.abortController = new AbortController();
    this.state = 'connecting';
    this.emit('stateChange', this.state);

    try {
      // Get fresh token
      this.token = await this.fetchToken();
      await this.establishConnection();
    } catch (error) {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      throw error;
    }
  }

  private async fetchToken(): Promise<string> {
    const response = await fetch(
      'https://api.sully.ai/v1/audio/transcriptions/stream/token',
      {
        method: 'POST',
        headers: {
          'X-API-Key': this.config.apiKey,
          'X-Account-Id': this.config.accountId,
        },
        signal: this.abortController?.signal,
      }
    );

    if (!response.ok) {
      throw new Error(`Token fetch failed: ${response.status}`);
    }

    const { data } = await response.json();
    return data.token;
  }

  private async establishConnection(): Promise<void> {
    return new Promise((resolve, reject) => {
      const params = new URLSearchParams({
        sample_rate: this.config.sampleRate.toString(),
        account_id: this.config.accountId,
        api_token: this.token!,
      });

      if (this.config.language) {
        params.set('language', this.config.language);
      }

      const url = `wss://api.sully.ai/v1/audio/transcriptions/stream?${params}`;
      this.ws = new WebSocket(url);

      const connectionTimeout = setTimeout(() => {
        this.ws?.close();
        reject(new Error('Connection timeout'));
      }, 10000);

      this.ws.onopen = () => {
        clearTimeout(connectionTimeout);
        this.state = 'connected';
        this.reconnectAttempt = 0;
        this.emit('stateChange', this.state);
        this.emit('connected');

        // Flush buffered audio
        this.flushAudioBuffer();
        resolve();
      };

      this.ws.onmessage = (event) => {
        this.handleMessage(event.data);
      };

      this.ws.onerror = (error) => {
        clearTimeout(connectionTimeout);
        this.emit('error', error);
      };

      this.ws.onclose = (event) => {
        clearTimeout(connectionTimeout);
        // Capture before handleDisconnect mutates the state, otherwise a
        // failed initial connection would never reject this promise
        const wasConnecting = this.state === 'connecting';
        this.handleDisconnect(event);
        if (wasConnecting) {
          reject(new Error(`Connection closed: ${event.code}`));
        }
      };
    });
  }

  private handleMessage(data: string): void {
    try {
      const message = JSON.parse(data);

      if (message.error) {
        this.emit('error', new Error(message.error));
        return;
      }

      if (message.text !== undefined) {
        this.segments[this.currentSegmentIndex] = message.text;

        const segment: TranscriptionSegment = {
          index: this.currentSegmentIndex,
          text: message.text,
          isFinal: message.isFinal ?? false,
        };

        this.emit('transcription', segment);

        if (message.isFinal) {
          this.currentSegmentIndex++;
        }
      }
    } catch (error) {
      this.emit('error', new Error(`Failed to parse message: ${data}`));
    }
  }

  private async handleDisconnect(event: CloseEvent): Promise<void> {
    const wasConnected = this.state === 'connected';
    this.ws = null;

    // Normal closure or intentional disconnect
    if (event.code === 1000 || this.state === 'disconnected') {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      this.emit('disconnected', { code: event.code, reason: event.reason });
      return;
    }

    // Unexpected disconnect - attempt reconnection (including while a
    // reconnect attempt is in flight, so failures chain into the next retry)
    if (
      (wasConnected || this.state === 'reconnecting') &&
      this.reconnectAttempt < this.config.maxReconnectAttempts
    ) {
      await this.attemptReconnect();
    } else {
      this.state = 'disconnected';
      this.emit('stateChange', this.state);
      this.emit('disconnected', {
        code: event.code,
        reason: event.reason,
        reconnectFailed: true,
      });
    }
  }

  private async attemptReconnect(): Promise<void> {
    this.state = 'reconnecting';
    this.emit('stateChange', this.state);

    const delay = this.calculateBackoff();
    this.emit('reconnecting', {
      attempt: this.reconnectAttempt + 1,
      maxAttempts: this.config.maxReconnectAttempts,
      delayMs: delay,
    });

    await this.sleep(delay);
    this.reconnectAttempt++;

    try {
      // Get fresh token for reconnection
      this.token = await this.fetchToken();
      await this.establishConnection();
    } catch (error) {
      this.emit('error', error);
      // Will trigger another reconnect attempt via onclose handler
    }
  }

  private calculateBackoff(): number {
    const exponentialDelay =
      this.config.baseReconnectDelayMs * Math.pow(2, this.reconnectAttempt);
    const cappedDelay = Math.min(
      exponentialDelay,
      this.config.maxReconnectDelayMs
    );
    const jitter = cappedDelay * Math.random() * 0.25;
    return Math.floor(cappedDelay + jitter);
  }

  sendAudio(audioData: ArrayBuffer | Uint8Array): void {
    const base64Audio = this.arrayBufferToBase64(audioData);

    if (this.state === 'connected' && this.ws?.readyState === WebSocket.OPEN) {
      // Send immediately if connected
      this.ws.send(JSON.stringify({ audio: base64Audio }));
    } else if (
      this.state === 'reconnecting' ||
      this.state === 'connecting'
    ) {
      // Buffer audio during reconnection
      this.bufferAudio(base64Audio);
    }
    // Drop audio if disconnected (not reconnecting)
  }

  private bufferAudio(base64Audio: string): void {
    this.audioBuffer.push(base64Audio);

    // Prevent unbounded buffer growth
    while (this.audioBuffer.length > this.config.audioBufferMaxSize) {
      this.audioBuffer.shift();
      this.emit('bufferOverflow');
    }
  }

  private flushAudioBuffer(): void {
    if (this.audioBuffer.length === 0) return;

    const bufferedCount = this.audioBuffer.length;
    this.emit('bufferFlush', { count: bufferedCount });

    for (const base64Audio of this.audioBuffer) {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ audio: base64Audio }));
      }
    }

    this.audioBuffer = [];
  }

  private arrayBufferToBase64(buffer: ArrayBuffer | Uint8Array): string {
    const bytes = buffer instanceof Uint8Array ? buffer : new Uint8Array(buffer);
    let binary = '';
    for (let i = 0; i < bytes.byteLength; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  getTranscript(): string {
    return this.segments.join(' ');
  }

  getState(): ConnectionState {
    return this.state;
  }

  disconnect(): void {
    this.state = 'disconnected';
    this.abortController?.abort();
    this.ws?.close(1000, 'Client disconnect');
    this.ws = null;
    this.audioBuffer = [];
  }
}

// Usage example
const stream = new ProductionTranscriptionStream({
  accountId: process.env.SULLY_ACCOUNT_ID!,
  apiKey: process.env.SULLY_API_KEY!,
  sampleRate: 16000,
  language: 'en',
});

stream.on('stateChange', (state) => {
  console.log(`Connection state: ${state}`);
});

stream.on('transcription', (segment) => {
  if (segment.isFinal) {
    console.log(`[Final] ${segment.text}`);
  } else {
    console.log(`[Interim] ${segment.text}`);
  }
});

stream.on('reconnecting', ({ attempt, maxAttempts, delayMs }) => {
  console.log(`Reconnecting (${attempt}/${maxAttempts}) in ${delayMs}ms`);
});

stream.on('error', (error) => {
  console.error('Stream error:', error);
});

await stream.connect();

// Send audio from microphone, file, etc.
// stream.sendAudio(audioChunk);

// When done
// stream.disconnect();

Error Recovery Strategies

| Error | Recovery Strategy |
| --- | --- |
| Connection timeout | Retry with backoff, check network |
| Token expired (401) | Fetch new token, reconnect |
| Rate limited (429) | Use Retry-After header, increase backoff |
| Server error (5xx) | Retry with backoff |
| Invalid audio format | Check sample rate, encoding |
| Network disconnect | Reconnect with buffered audio |
Always implement a maximum reconnection limit. Infinite reconnection loops can drain device batteries and create unnecessary server load.
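The table above can be encoded as a small dispatch on the HTTP status returned by the token endpoint. The action names are ours, and mapping 400 to an audio-configuration problem is an assumption to verify against the API reference:

```typescript
type RecoveryAction =
  | 'refresh_token'
  | 'honor_retry_after'
  | 'retry_with_backoff'
  | 'fix_audio_config';

// Maps an HTTP status from the token endpoint to a recovery action.
function recoveryFor(status: number): RecoveryAction {
  if (status === 401) return 'refresh_token';     // token expired
  if (status === 429) return 'honor_retry_after'; // rate limited
  if (status >= 500) return 'retry_with_backoff'; // server error
  if (status === 400) return 'fix_audio_config';  // assumed: invalid audio params
  return 'retry_with_backoff';                    // default: retry conservatively
}
```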

Language Support

Sully.ai supports transcription in multiple languages using BCP47 language tags.

Supported Languages

| Language | Tag | Regional Variants |
| --- | --- | --- |
| English | en | en-US, en-GB, en-AU |
| Spanish | es | es-US, es-ES, es-MX |
| Chinese | zh | zh-CN, zh-TW |
| French | fr | fr-FR, fr-CA |
| German | de | de-DE |
| Portuguese | pt | pt-BR, pt-PT |
| Japanese | ja | ja-JP |
| Korean | ko | ko-KR |
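When accepting a language from user input, it can help to normalize regional variants down to the base tags in the table. The helper below is ours, not part of the SDK; multi is passed through for multilingual mode:

```typescript
// Base tags from the supported-languages table, plus 'multi'.
const SUPPORTED_BASE_TAGS = new Set([
  'en', 'es', 'zh', 'fr', 'de', 'pt', 'ja', 'ko', 'multi',
]);

// Normalizes a BCP47 tag (e.g. "en-US") to its supported base tag, or null.
function normalizeLanguage(tag: string): string | null {
  const base = tag.toLowerCase().split('-')[0];
  return SUPPORTED_BASE_TAGS.has(base) ? base : null;
}
```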

Multilingual Mode

For conversations that switch between languages, use language=multi:
// File upload with multilingual support
const transcription = await client.audio.transcriptions.create({
  audio: fs.createReadStream('multilingual-visit.mp3'),
  language: 'multi',
});

Language in Streaming

Specify language when connecting to the WebSocket:
wss://api.sully.ai/v1/audio/transcriptions/stream?sample_rate=16000&account_id={id}&api_token={token}&language=es
Audio in languages other than the one you specify will be filtered out. Use multi if your conversations include multiple languages.

Choosing Upload vs Stream

Use this decision matrix to select the right approach:
| Criterion | File Upload | Real-time Stream |
| --- | --- | --- |
| Use Case | Pre-recorded audio, batch processing | Live patient visits |
| Latency Requirement | Seconds to minutes acceptable | Immediate feedback needed |
| File Size | Any size up to 100MB | N/A (continuous stream) |
| Network Reliability | Single request | Requires stable connection |
| Implementation Complexity | Simple (HTTP upload + polling) | Complex (WebSocket + reconnection) |
| Offline Support | Upload when online | Requires active connection |

When to Use File Upload

  • Processing recorded audio from devices or archives
  • Batch transcription of multiple files
  • Integration with systems that produce audio files
  • Environments with unreliable network connectivity (upload when stable)
  • Backend processing pipelines

When to Use Real-time Streaming

  • Live transcription during patient visits
  • Providing immediate visual feedback to clinicians
  • Interactive applications where users see text as they speak
  • Reducing perceived latency in clinical workflows
  • Mobile applications with microphone access
Many applications use both approaches: real-time streaming for live visits with immediate feedback, and file upload for processing any recordings that were captured offline.

Next Steps