📖 Documentation

📖 Overview

KWIKmotion AI Live Captions is an enterprise WebSocket API providing real-time speech-to-text transcription and multi-language translation. Designed for broadcast media, live streaming, and professional applications.

Key Features

  • Real-Time Transcription: Sub-second latency speech-to-text
  • Multi-Language Translation: Simultaneous translation to multiple languages
  • High Accuracy: Advanced AI processing for superior results
  • 50+ Languages: Comprehensive global language support
  • Flexible Audio: Dynamic chunk sizes supported
  • Production-Ready: Enterprise reliability and error handling
🏆 Industry-Leading Arabic Transcription: We achieve a Word Error Rate (WER) of less than 6% in Arabic—the most advanced Arabic transcription technology available today. Our system delivers exceptional accuracy for Arabic broadcast media and live streaming applications.
🌟 Language Excellence: We excel in English, French, and Dutch with exceptional accuracy rates. Need a special language? Contact us to train a custom ASR model specifically for your language requirements.
🎯 Single-Pass Processing: Send audio chunks of your desired duration (2-10 seconds recommended). The system processes each chunk with optimized single-pass transcription for maximum speed and accuracy.

🔌 WebSocket Connection

Endpoint

wss://your-server-address:PORT/
📝 Note: The server address and PORT number will be provided to you during integration. The connection uses secure WebSocket (WSS) with SSL/TLS encryption.

Protocol

  • Protocol: Secure WebSocket (WSS) with SSL/TLS
  • Message Format: JSON for control, Binary for audio
  • Keepalive: 20-second ping interval, 30-second timeout
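
With the Python websockets library used in the examples below, the client can mirror these keepalive values (a minimal sketch; the endpoint and token are placeholders):

import asyncio
import websockets

async def connect_with_keepalive():
    # Match the documented keepalive: ping every 20 s, 30 s timeout.
    # (websockets >= 14 renamed extra_headers to additional_headers.)
    async with websockets.connect(
        "wss://your-server-address:PORT/",                           # placeholder endpoint
        extra_headers={"Authorization": "Bearer YOUR_TOKEN_HERE"},   # placeholder token
        ping_interval=20,
        ping_timeout=30,
    ) as ws:
        ...  # proceed with StartRecognition (see Message Protocol)

asyncio.run(connect_with_keepalive())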

Connection Flow

1. Connect to the secure WebSocket endpoint (wss://) with the Authorization header
2. Send the StartRecognition message
3. Wait for the RecognitionStarted confirmation
4. Send audio data (one chunk at a time)
5. Receive transcripts and translations
6. Send EndOfStream when done
7. Receive EndOfTranscript
8. Close the connection
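
A condensed, runnable sketch of this flow (the endpoint and token are placeholders; fuller examples appear under Implementation Examples):

import asyncio
import json
import websockets

async def run_session(audio: bytes):
    uri = "wss://your-server-address:PORT/"                # placeholder endpoint
    headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}  # placeholder token
    # Step 1: connect with the Authorization header
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Step 2: start the session
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
            "transcription_config": {"language": "en"},
        }))
        # Step 3: wait for RecognitionStarted
        assert json.loads(await ws.recv())["message"] == "RecognitionStarted"
        # Step 4: send audio one chunk at a time (6 s = 192,000 bytes at 16 kHz)
        chunk_size = 16000 * 6 * 2
        offsets = range(0, len(audio), chunk_size)
        for i in offsets:
            await ws.send(audio[i:i + chunk_size])
        # Step 6: signal end of audio
        await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": len(offsets)}))
        # Steps 5 and 7: receive results until EndOfTranscript, then close (step 8)
        while True:
            msg = json.loads(await ws.recv())
            print(msg)
            if msg["message"] == "EndOfTranscript":
                break

asyncio.run(run_session(open("audio.raw", "rb").read()))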

🔐 Authentication

🔑 Authentication Required

All connections to the KWIKmotion AI Live Captions API require a valid authentication token. You must subscribe to the service to obtain your authentication token.

Overview

The KWIKmotion AI Live Captions API uses bearer token authentication to secure access to the service. All API connections must include a valid authentication token in the HTTP headers during the WebSocket handshake.

Authentication Method

Include the Authorization header with your bearer token when establishing the WebSocket connection:

| Header | Format | Example |
|---|---|---|
| Authorization | Bearer <token> | Bearer eyJhbGciOiJSU0EtT0FFU... |
⚠️ Important Format Requirements
  • The token must be prefixed with Bearer (note the space after "Bearer")
  • The header name is Authorization (case-sensitive in some libraries)
  • The header must be included in the initial WebSocket handshake request
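
For example, in Python (the token value is a placeholder):

token = "YOUR_TOKEN_HERE"  # placeholder; substitute your real token

headers = {
    # Note the single space between "Bearer" and the token
    "Authorization": f"Bearer {token}",
}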

Token Management

Your authentication token:

  • ✅ Is validated during the initial WebSocket handshake (before session starts)
  • ✅ Is validated once per connection (not for each message)
  • ✅ Can be used for multiple simultaneous connections (up to your subscription limits)
  • ✅ Remains valid for the duration specified in your subscription
💡 Token Expiry During Session

If your token expires during an active session, the current session will continue until you disconnect. You will need a valid token to establish a new connection.

Authentication Error Responses

If authentication fails, you will receive an error message immediately after connection:

Missing Authorization Header

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Authentication required: Missing Authorization header",
  "code": 4001,
  "timestamp": 1730406000.123
}

Invalid Token Format

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Authentication required: Invalid Authorization header format",
  "code": 4001,
  "timestamp": 1730406000.123
}

Authentication Failed (Invalid/Expired Token)

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "token_expired",
  "code": 401,
  "timestamp": 1730406000.123,
  "details": {
    "error": true,
    "message": "token_expired"
  }
}

Insufficient Permissions

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Insufficient permissions for ailivecaptioning service",
  "code": 403,
  "timestamp": 1730406000.123,
  "details": {
    "error": true,
    "message": "subscription_required"
  }
}
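
A minimal sketch for surfacing these failures right after connecting (the helper name is illustrative; the message fields follow the payloads above, and `ws` is an open connection as in the examples below):

import json

async def check_first_message(ws) -> dict:
    # Authentication failures arrive as the first server message.
    msg = json.loads(await ws.recv())
    if msg.get("message") == "Error" and msg.get("type") == "authentication_error":
        raise RuntimeError(f"Authentication failed ({msg['code']}): {msg['reason']}")
    return msg  # otherwise, the normal response (e.g. RecognitionStarted)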

Obtaining an Authentication Token

To use the KWIKmotion AI Live Captions service, you must first subscribe and obtain an authentication token:

  • New Subscriptions: Contact sales@whitepeaks.fr to purchase access to the service
  • Existing Customers: Contact your White Peaks Solutions account manager
  • Technical Support: For technical assistance, email support@whitepeaks.fr

After subscribing, you will receive your unique authentication token via email.

🔐 Security Best Practices
  • Store tokens securely: Use environment variables or secrets management systems
  • Never commit tokens: Do not include tokens in your source code or version control
  • Rotate tokens periodically: Contact your account manager for token rotation
  • Use separate tokens: Request different tokens for dev/staging/production environments
  • Monitor authentication: Log authentication failures in your application for security auditing
  • Secure transmission: Always use secure connections (wss://) in production
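
For instance, reading the token from an environment variable instead of hard-coding it (the variable name KWIK_TOKEN is illustrative):

import os

token = os.environ["KWIK_TOKEN"]  # raises KeyError if the variable is unset
headers = {"Authorization": f"Bearer {token}"}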

🎵 Audio Format

Required Specifications

| Parameter | Value | Description |
|---|---|---|
| Sample Rate | 16,000 Hz | 16 kHz (recommended) |
| Channels | 1 (mono) | Mono audio required |
| Bit Depth | 16-bit | Standard PCM |
| Encoding | PCM S16LE | Signed 16-bit little-endian |

Audio Chunk Sizes

The system supports flexible, dynamic chunk sizes. Send one audio chunk at a time:

| Duration | Bytes (16 kHz mono) | Use Case |
|---|---|---|
| 2 seconds | 64,000 | Low latency |
| 4 seconds | 128,000 | Balanced |
| 6 seconds | 192,000 | Recommended |
| 10 seconds | 320,000 | Longer context |
Formula: Bytes = sample_rate × duration × 2
At 16kHz: Bytes = 16,000 × duration × 2
Example: 6 seconds = 16,000 × 6 × 2 = 192,000 bytes
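
The same arithmetic as a small helper (2 bytes per sample for 16-bit PCM):

def chunk_bytes(duration_s: float, sample_rate: int = 16000) -> int:
    """Bytes per chunk of 16-bit (2-byte) mono PCM."""
    return int(sample_rate * duration_s * 2)

assert chunk_bytes(6) == 192_000   # recommended
assert chunk_bytes(2) == 64_000    # low latency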

Supported Encodings

  • pcm_s16le - 16-bit signed little-endian (recommended)
  • pcm_f32le - 32-bit float little-endian
  • mulaw - 8-bit μ-law (telephony)
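
If your source is a WAV file, the standard-library wave module can check these specifications and return the raw pcm_s16le payload (a sketch; resampling to 16 kHz, if needed, must happen beforehand with an external tool):

import wave

def wav_to_raw_pcm(path: str) -> bytes:
    """Extract raw pcm_s16le bytes from a 16 kHz mono 16-bit WAV file."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz"
        assert w.getnchannels() == 1, "expected mono"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        return w.readframes(w.getnframes())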

📨 Message Protocol

Client → Server Messages

1. StartRecognition (Required, JSON)

Initialize a new transcription session.

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en"
  },
  "translation_config": {
    "target_languages": ["fr", "es", "de"]
  }
}
🔐 Authentication: Authentication is handled via the Authorization header during the WebSocket handshake, not in the message body. See the Authentication section for details.

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Must be "StartRecognition" |
| audio_format.type | string | Yes | "raw" or "file" |
| audio_format.encoding | string | Yes | "pcm_s16le", "pcm_f32le", or "mulaw" |
| audio_format.sample_rate | integer | Yes | Sample rate in Hz (16000 recommended) |
| transcription_config.language | string | Yes | Source language ISO 639-1 code |
| translation_config.target_languages | string[] | No | Target language codes (optional) |

2. Binary Audio Data (Required, Binary)

Send one audio chunk at a time in binary format.

Audio Sending Pattern:

Send binary audio chunks sequentially:

Each chunk: Your chosen duration (2-10 seconds recommended)
Format: Raw PCM audio bytes
Processing: Each chunk processed independently with automatic word boundary handling

Example: 6-second chunks

# At 16kHz mono 16-bit:
Chunk duration: 6 seconds = 192,000 bytes

# Send each chunk sequentially:
Chunk 0: Send 192,000 bytes
Chunk 1: Send 192,000 bytes
Chunk 2: Send 192,000 bytes
...

Example: Variable durations

# You can vary chunk sizes:
Chunk 0: 4 seconds = 128,000 bytes
Chunk 1: 6 seconds = 192,000 bytes
Chunk 2: 2 seconds = 64,000 bytes
...
💡 Flexible Durations: You can use any duration from 1-10 seconds. 6 seconds is recommended for optimal balance of latency and accuracy.

3. EndOfStream (Required, JSON)

Signal end of audio stream.

{
  "message": "EndOfStream",
  "last_seq_no": 10
}
| Field | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Must be "EndOfStream" |
| last_seq_no | integer | Yes | Number of segments sent |

Server → Client Messages

1. RecognitionStarted (JSON)

{
  "message": "RecognitionStarted",
  "session_id": "session_1761130389621"
}

2. AudioAdded (JSON)

{
  "message": "AudioAdded",
  "seq_no": 0
}

Sent after each audio chunk is received and queued for processing.
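
One optional pattern, not required by the protocol, is to use these acknowledgements to pace uploads (a sketch; a production client would dispatch the other message types instead of skipping them):

import json

async def send_paced(ws, chunks):
    # Wait for the AudioAdded ack before sending the next chunk.
    for chunk in chunks:
        await ws.send(chunk)
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("message") == "AudioAdded":
                break  # acknowledged; send the next chunk
            # AddTranscript / AddTranslation may arrive here; handle as needed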

3. AddTranscript (JSON)

{
  "message": "AddTranscript",
  "metadata": {
    "transcript": "This is the transcribed text",
    "start_time": 0.0,
    "end_time": 4.0,
    "language": "en",
    "chunk_index": 0,
    "word_count": 10
  }
}

4. AddTranslation (JSON)

{
  "message": "AddTranslation",
  "metadata": {
    "translation": "C'est le texte traduit",
    "target_language": "fr",
    "start_time": 0.0,
    "end_time": 4.0
  }
}

You'll receive one per target language per transcript.
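
For example, translations can be grouped with their transcript segment by start_time (a sketch using the field names above):

from collections import defaultdict

translations = defaultdict(dict)  # start_time -> {target_language: text}

def on_translation(msg: dict) -> None:
    meta = msg["metadata"]
    translations[meta["start_time"]][meta["target_language"]] = meta["translation"]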

5. EndOfTranscript (JSON)

{
  "message": "EndOfTranscript",
  "reason": "no_more_audio"
}

6. Error (JSON)

{
  "message": "Error",
  "type": "invalid_model",
  "reason": "Unsupported language: xyz",
  "code": 4004,
  "timestamp": 1729728000.123
}

Error Codes

| Code | Type | Description |
|---|---|---|
| 4001 | invalid_message | Malformed message or invalid input |
| 4004 | invalid_model | Unsupported language code |
| 1008 | policy_violation | Server at capacity (session limit) |
| 1011 | internal_error | Server processing error |
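
A sketch of one way to react to these codes: treat 1008 and 1011 as transient and retry with backoff, and everything else as fatal (the policy is illustrative, not part of the API):

import asyncio

RETRYABLE_CODES = {1008, 1011}  # capacity and internal errors

async def should_retry(error: dict, attempt: int) -> bool:
    """Return True after backing off if the error is worth retrying."""
    if error.get("code") in RETRYABLE_CODES:
        await asyncio.sleep(min(2 ** attempt, 30))  # capped exponential backoff
        return True
    raise RuntimeError(f"{error.get('type')}: {error.get('reason')}")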

🌍 Supported Languages

The API supports 50+ languages for both transcription and translation using ISO 639-1 codes (a few languages, such as the Kurdish variants, use three-letter ISO 639-3 codes).

🏆 Language Performance Highlights:
  • Arabic: Industry-leading WER < 6% - The most advanced Arabic ASR available
  • English: Exceptional accuracy with broadcast-quality recognition
  • French: Superior performance for European French and Canadian French
  • Dutch: Excellent accuracy for Netherlands and Belgian Dutch
  • Custom Languages: We can train custom ASR models for your specific language needs - Contact us
| Language | ISO Code | Native Name |
|---|---|---|
| Afrikaans | af | Afrikaans |
| Arabic | ar | العربية |
| Armenian | hy | Հայերեն |
| Azerbaijani | az | Azərbaycan |
| Belarusian | be | Беларуская |
| Bosnian | bs | Bosanski |
| Bulgarian | bg | Български |
| Catalan | ca | Català |
| Chinese (Simplified) | zh | 中文 |
| Croatian | hr | Hrvatski |
| Czech | cs | Čeština |
| Danish | da | Dansk |
| Dutch | nl | Nederlands |
| English | en | English |
| Estonian | et | Eesti |
| Finnish | fi | Suomi |
| French | fr | Français |
| Galician | gl | Galego |
| German | de | Deutsch |
| Greek | el | Ελληνικά |
| Hebrew | he | עברית |
| Hindi | hi | हिन्दी |
| Hungarian | hu | Magyar |
| Icelandic | is | Íslenska |
| Indonesian | id | Bahasa Indonesia |
| Italian | it | Italiano |
| Japanese | ja | 日本語 |
| Kannada | kn | ಕನ್ನಡ |
| Kazakh | kk | Қазақ |
| Korean | ko | 한국어 |
| Kurdish (Kurmanji) | kmr | Kurdî |
| Kurdish (Sorani) | ckb | کوردی |
| Latvian | lv | Latviešu |
| Lithuanian | lt | Lietuvių |
| Macedonian | mk | Македонски |
| Malay | ms | Bahasa Melayu |
| Maori | mi | Māori |
| Marathi | mr | मराठी |
| Nepali | ne | नेपाली |
| Norwegian | no | Norsk |
| Persian (Farsi) | fa | فارسی |
| Polish | pl | Polski |
| Portuguese | pt | Português |
| Romanian | ro | Română |
| Russian | ru | Русский |
| Serbian | sr | Српски |
| Slovak | sk | Slovenčina |
| Slovenian | sl | Slovenščina |
| Spanish | es | Español |
| Swahili | sw | Kiswahili |
| Swedish | sv | Svenska |
| Tagalog | tl | Tagalog |
| Tamil | ta | தமிழ் |
| Thai | th | ไทย |
| Turkish | tr | Türkçe |
| Ukrainian | uk | Українська |
| Urdu | ur | اردو |
| Vietnamese | vi | Tiếng Việt |
| Welsh | cy | Cymraeg |

All languages support both transcription and translation to/from any other supported language.

💻 Implementation Examples

Python Example (Step-by-Step)

#!/usr/bin/env python3
import asyncio
import websockets
import json

async def transcribe_and_translate():
    uri = "wss://your-server-address:PORT"
    token = "YOUR_TOKEN_HERE"  # Your authentication token
    
    # Authentication is required - add Authorization header
    headers = {
        'Authorization': f'Bearer {token}'
    }
    
    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Step 1: Start recognition
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {
                "type": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 16000
            },
            "transcription_config": {
                "language": "en"  # English
            },
            "translation_config": {
                "target_languages": ["fr", "es", "de"]
            }
        }))
        
        # Step 2: Wait for confirmation
        response = json.loads(await ws.recv())
        print(f"Session ID: {response['session_id']}")
        
        # Step 3: Read audio file (16kHz mono PCM)
        with open("audio.raw", "rb") as f:
            audio_data = f.read()
        
        # Step 4: Configure chunk size (example: 6 seconds)
        chunk_duration = 6  # Choose any duration (2-10 seconds)
        chunk_size = 16000 * chunk_duration * 2  # 192,000 bytes for 6s
        
        # Step 5: Send audio chunks sequentially
        chunk_num = 0
        for i in range(0, len(audio_data), chunk_size):
            # Send audio chunk
            chunk = audio_data[i:i + chunk_size]
            await ws.send(chunk)
            print(f"Sent chunk {chunk_num}: {len(chunk)} bytes")
            chunk_num += 1
        
        # Step 6: Signal end of audio
        await ws.send(json.dumps({
            "message": "EndOfStream",
            "last_seq_no": chunk_num
        }))
        
        # Step 7: Listen for results until EndOfTranscript arrives
        while True:
            msg = json.loads(await ws.recv())
            
            if msg["message"] == "AddTranscript":
                print(f"Transcript: {msg['metadata']['transcript']}")
            
            elif msg["message"] == "AddTranslation":
                lang = msg['metadata']['target_language']
                text = msg['metadata']['translation']
                print(f"Translation ({lang}): {text}")
            
            elif msg["message"] == "EndOfTranscript":
                break

asyncio.run(transcribe_and_translate())

JavaScript Example (Browser)

Note: The browser WebSocket API doesn't support custom headers. For browser-based implementations, pass the authentication token as a query parameter or use a server-side proxy.
// Option 1: Token as query parameter
const token = 'YOUR_TOKEN_HERE';
const ws = new WebSocket(`wss://your-server-address:PORT?token=${token}`);

ws.onopen = () => {
    // Start recognition
    ws.send(JSON.stringify({
        message: 'StartRecognition',
        audio_format: {
            type: 'raw',
            encoding: 'pcm_s16le',
            sample_rate: 16000
        },
        transcription_config: {
            language: 'en'
        },
        translation_config: {
            target_languages: ['fr', 'es']
        }
    }));
};

ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    
    if (msg.message === 'RecognitionStarted') {
        console.log('Session:', msg.session_id);
        
        // Send audio chunks (6 s each); audioBuffer is assumed to be an
        // ArrayBuffer of raw PCM obtained elsewhere (file decode or mic capture)
        sendAudioChunks(audioBuffer);
    }
    else if (msg.message === 'AddTranscript') {
        console.log('Transcript:', msg.metadata.transcript);
    }
    else if (msg.message === 'AddTranslation') {
        console.log(`Translation (${msg.metadata.target_language}):`, msg.metadata.translation);
    }
};

function sendAudioChunks(audioBuffer) {
    const chunkSize = 16000 * 6 * 2;  // 6 seconds = 192,000 bytes
    let chunkNum = 0;
    
    for (let i = 0; i < audioBuffer.byteLength; i += chunkSize) {
        // Send audio chunk
        const chunk = audioBuffer.slice(i, i + chunkSize);
        ws.send(chunk);
        console.log(`Sent chunk ${chunkNum}: ${chunk.byteLength} bytes`);
        chunkNum++;
    }
    
    // End session
    ws.send(JSON.stringify({
        message: 'EndOfStream',
        last_seq_no: chunkNum
    }));
}

⚠️ Error Handling

Error Types

| Type | Code | Description | Solution |
|---|---|---|---|
| invalid_message | 4001 | Malformed JSON or invalid format | Fix message structure |
| invalid_model | 4004 | Unsupported language code | Use valid ISO 639-1 code |
| internal_error | 1011 | Server processing error | Retry or contact support |

Common Issues

No transcripts received

  • Verify audio format matches configuration (16kHz, mono, PCM S16LE)
  • Ensure you're sending binary audio data (not base64 or JSON)
  • Check audio contains speech (not silence)
  • Verify chunk size is reasonable (2-10 seconds recommended)

Translation not working

  • Ensure target_languages array is not empty
  • Use valid ISO 639-1 codes (lowercase, 2-letter)
  • Verify transcript is not empty

📊 Performance

Latency

| Operation | Typical Latency |
|---|---|
| Transcription | < 1.5 seconds |
| Translation (per language) | < 0.5 seconds |
| Total (with 3 translations) | < 2.0 seconds |

Limits

  • Connection Keepalive: Ping every 20 seconds, 30-second timeout if no response
  • Network Bandwidth: ~340 kbps upload recommended (for 6-second chunks). Raw 16 kHz, 16-bit mono PCM is 256 kbps (16,000 samples/s × 2 bytes × 8 bits), so the recommendation leaves headroom for WebSocket and TLS overhead.

FAQ

Q: What chunk duration should I use?

A: 6 seconds is recommended for optimal balance. Shorter (2-4s) for lower latency, longer (8-10s) for better context.

Q: Can I vary chunk sizes during a session?

A: Yes, the system adapts to different chunk sizes dynamically.

Q: How does word boundary handling work?

A: The system automatically handles word boundaries between chunks using 1-second audio buffering and smart deduplication.

Q: How many languages can I translate to simultaneously?

A: No hard limit, but 3-5 languages recommended for optimal performance.

Q: What's the transcription accuracy?

A: For broadcast-quality audio, expect >95% word accuracy. For Arabic specifically, we achieve a Word Error Rate (WER) of less than 6%, making it the most advanced Arabic transcription system available. We also excel in English, French, and Dutch.

Q: Can you support additional languages not listed?

A: Yes! We can train custom ASR (Automatic Speech Recognition) models for specific languages tailored to your needs. Contact our technical team at support@whitepeaks.fr to discuss custom language model training.

Q: Can I use this for live streaming?

A: Yes! Designed for real-time applications with sub-2-second total latency.

🔒 Privacy & Compliance

GDPR Compliance

KWIKmotion AI Live Captions is fully compliant with the General Data Protection Regulation (GDPR). We prioritize your data privacy and security:

  • No Audio Storage: We do not store, record, or keep copies of audio data sent to our service. Audio is processed in real-time and immediately discarded after transcription.
  • No Text Storage: Generated transcripts and translations are not stored on our servers. All text processing occurs in memory and is delivered directly to you via WebSocket.
  • No Logging of Content: We do not log or retain the actual content of your transcripts or translations for any purpose.
  • Session Data Only: We only retain minimal session metadata (connection timestamps, session IDs) necessary for service operation, which is automatically purged after session termination.
  • Data Processing Location: Audio and text processing occurs in real-time on our servers and is immediately discarded after transmission to your client.
  • AI-Powered Processing: Transcription and translation are performed using artificial intelligence (AI) models. We do not hold any responsibility or guarantee the authenticity, accuracy, or completeness of the generated content. AI-generated content may contain errors or inaccuracies.
  • Delicate Content Notice: If you are processing sensitive, legal, medical, or other delicate content, we strongly recommend that you inform your audience that transcription and translation services are provided via AI technology and may not be 100% accurate. It is your responsibility to review and verify any AI-generated content before use.

ISO 27001 Information Security Management

Our service adheres to ISO 27001 standards for information security management:

  • Session Separation for Audio: Each audio session is completely isolated from other sessions. Audio data from one session cannot access or interfere with audio data from another session, ensuring complete data isolation and privacy.
  • Access Controls: Authentication and authorization mechanisms protect your sessions.
  • Real-Time Processing: Audio and text are processed in-memory only, with no persistent storage.
  • Regular Security Audits: Our infrastructure undergoes regular security assessments and compliance reviews.
🛡️ Your Data, Your Control

All audio and text data flows directly through our system without retention. You maintain full control over your data at all times. For questions about our privacy practices, contact: support@whitepeaks.fr

📞 Contact & Support

White Peaks Solutions SAS

🚀 Interactive API Reference

WebSocket API Reference
View the complete API specification with all message types, parameters, and examples.

📥 Download OpenAPI Spec: live-captions-openapi.json