📖 Documentation

📖 Overview

KWIKmotion AI Live Captions is an enterprise WebSocket API providing real-time speech-to-text transcription and multi-language translation. Designed for broadcast media, live streaming, and professional applications.

Key Features

  • Real-Time Transcription: Sub-second latency speech-to-text
  • Multi-Language Translation: Simultaneous translation to multiple languages
  • High Accuracy: Advanced AI processing for superior results
  • 50+ Languages: Comprehensive global language support
  • Flexible Audio: Dynamic chunk sizes supported
  • Production-Ready: Enterprise reliability and error handling
🏆 Industry-Leading Arabic Transcription: We achieve a Word Error Rate (WER) of less than 6% in Arabic—the most advanced Arabic transcription technology available today. Our system delivers exceptional accuracy for Arabic broadcast media and live streaming applications.
🌟 Language Excellence: We excel in English, French, and Dutch with exceptional accuracy rates. Need a special language? Contact us to train a custom ASR model specifically for your language requirements.
🎯 Single-Pass Processing: Send audio chunks of your desired duration (2-10 seconds recommended). The system processes each chunk with optimized single-pass transcription for maximum speed and accuracy.

🔌 WebSocket Connection

Endpoint

wss://your-server-address:PORT/
📝 Note: The server address and PORT number will be provided to you during integration. The connection uses secure WebSocket (WSS) with SSL/TLS encryption.

Protocol

  • Protocol: Secure WebSocket (WSS) with SSL/TLS
  • Message Format: JSON for control, Binary for audio
  • Keepalive: 20-second ping interval, 30-second timeout
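
With the Python websockets library used in the examples below, the client can mirror these keepalive values (a minimal sketch; the endpoint and token are placeholders):

import asyncio
import websockets

async def connect_with_keepalive():
    # Match the documented keepalive: ping every 20 s, 30 s timeout.
    # (websockets >= 14 renamed extra_headers to additional_headers.)
    async with websockets.connect(
        "wss://your-server-address:PORT/",                           # placeholder endpoint
        extra_headers={"Authorization": "Bearer YOUR_TOKEN_HERE"},   # placeholder token
        ping_interval=20,
        ping_timeout=30,
    ) as ws:
        ...  # proceed with StartRecognition (see Message Protocol)

asyncio.run(connect_with_keepalive())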

Connection Flow

1. Connect to the secure WebSocket endpoint (wss://) with the Authorization header
2. Send the StartRecognition message
3. Wait for the RecognitionStarted confirmation
4. Send audio data (one chunk at a time)
5. Receive transcripts and translations
6. Send EndOfStream when done
7. Receive EndOfTranscript
8. Close the connection
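
A condensed, runnable sketch of this flow (the endpoint and token are placeholders; fuller examples appear under Implementation Examples):

import asyncio
import json
import websockets

async def run_session(audio: bytes):
    uri = "wss://your-server-address:PORT/"                # placeholder endpoint
    headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}  # placeholder token
    # Step 1: connect with the Authorization header
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Step 2: start the session
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
            "transcription_config": {"language": "en"},
        }))
        # Step 3: wait for RecognitionStarted
        assert json.loads(await ws.recv())["message"] == "RecognitionStarted"
        # Step 4: send audio one chunk at a time (6 s = 192,000 bytes at 16 kHz)
        chunk_size = 16000 * 6 * 2
        offsets = range(0, len(audio), chunk_size)
        for i in offsets:
            await ws.send(audio[i:i + chunk_size])
        # Step 6: signal end of audio
        await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": len(offsets)}))
        # Steps 5 and 7: receive results until EndOfTranscript, then close (step 8)
        while True:
            msg = json.loads(await ws.recv())
            print(msg)
            if msg["message"] == "EndOfTranscript":
                break

asyncio.run(run_session(open("audio.raw", "rb").read()))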

🔐 Authentication

🔑 Authentication Required

All connections to the KWIKmotion AI Live Captions API require a valid authentication token. You must subscribe to the service to obtain your authentication token.

Overview

The KWIKmotion AI Live Captions API uses bearer token authentication to secure access to the service. All API connections must include a valid authentication token in the HTTP headers during the WebSocket handshake.

Authentication Method

Include the Authorization header with your bearer token when establishing the WebSocket connection:

| Header | Format | Example |
|---|---|---|
| Authorization | Bearer <token> | Bearer eyJhbGciOiJSU0EtT0FFU... |
⚠️ Important Format Requirements
  • The token must be prefixed with Bearer (note the space after "Bearer")
  • The header name is Authorization (case-sensitive in some libraries)
  • The header must be included in the initial WebSocket handshake request
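
For example, in Python (the token value is a placeholder):

token = "YOUR_TOKEN_HERE"  # placeholder; substitute your real token

headers = {
    # Note the single space between "Bearer" and the token
    "Authorization": f"Bearer {token}",
}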

Token Management

Your authentication token:

  • ✅ Is validated during the initial WebSocket handshake (before session starts)
  • ✅ Is validated once per connection (not for each message)
  • ✅ Can be used for multiple simultaneous connections (up to your subscription limits)
  • ✅ Remains valid for the duration specified in your subscription
💡 Token Expiry During Session

If your token expires during an active session, the current session will continue until you disconnect. You will need a valid token to establish a new connection.

Authentication Error Responses

If authentication fails, you will receive an error message immediately after connection:

Missing Authorization Header

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Authentication required: Missing Authorization header",
  "code": 4001,
  "timestamp": 1730406000.123
}

Invalid Token Format

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Authentication required: Invalid Authorization header format",
  "code": 4001,
  "timestamp": 1730406000.123
}

Authentication Failed (Invalid/Expired Token)

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "token_expired",
  "code": 401,
  "timestamp": 1730406000.123,
  "details": {
    "error": true,
    "message": "token_expired"
  }
}

Insufficient Permissions

{
  "message": "Error",
  "type": "authentication_error",
  "reason": "Insufficient permissions for ailivecaptioning service",
  "code": 403,
  "timestamp": 1730406000.123,
  "details": {
    "error": true,
    "message": "subscription_required"
  }
}
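
A minimal sketch for surfacing these failures right after connecting (the helper name is illustrative; the message fields follow the payloads above, and `ws` is an open connection as in the examples below):

import json

async def check_first_message(ws) -> dict:
    # Authentication failures arrive as the first server message.
    msg = json.loads(await ws.recv())
    if msg.get("message") == "Error" and msg.get("type") == "authentication_error":
        raise RuntimeError(f"Authentication failed ({msg['code']}): {msg['reason']}")
    return msg  # otherwise, the normal response (e.g. RecognitionStarted)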

Obtaining an Authentication Token

To use the KWIKmotion AI Live Captions service, you must first subscribe and obtain an authentication token:

  • New Subscriptions: Contact sales@whitepeaks.fr to purchase access to the service
  • Existing Customers: Contact your White Peaks Solutions account manager
  • Technical Support: For technical assistance, email support@whitepeaks.fr

After subscribing, you will receive your unique authentication token via email.

🔐 Security Best Practices
  • Store tokens securely: Use environment variables or secrets management systems
  • Never commit tokens: Do not include tokens in your source code or version control
  • Rotate tokens periodically: Contact your account manager for token rotation
  • Use separate tokens: Request different tokens for dev/staging/production environments
  • Monitor authentication: Log authentication failures in your application for security auditing
  • Secure transmission: Always use secure connections (wss://) in production
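
For instance, reading the token from an environment variable instead of hard-coding it (the variable name KWIK_TOKEN is illustrative):

import os

token = os.environ["KWIK_TOKEN"]  # raises KeyError if the variable is unset
headers = {"Authorization": f"Bearer {token}"}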

🎵 Audio Format

Required Specifications

| Parameter | Value | Description |
|---|---|---|
| Sample Rate | 16,000 Hz | 16 kHz (recommended) |
| Channels | 1 (mono) | Mono audio required |
| Bit Depth | 16-bit | Standard PCM |
| Encoding | PCM S16LE | Signed 16-bit little-endian |

Audio Chunk Sizes

The system supports flexible, dynamic chunk sizes. Send one audio chunk at a time:

| Duration | Bytes (16 kHz mono) | Use Case |
|---|---|---|
| 2 seconds | 64,000 | Low latency |
| 4 seconds | 128,000 | Balanced |
| 6 seconds | 192,000 | Recommended |
| 10 seconds | 320,000 | Longer context |
Formula: Bytes = sample_rate × duration × 2
At 16kHz: Bytes = 16,000 × duration × 2
Example: 6 seconds = 16,000 × 6 × 2 = 192,000 bytes
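
The same arithmetic as a small helper (2 bytes per sample for 16-bit PCM):

def chunk_bytes(duration_s: float, sample_rate: int = 16000) -> int:
    """Bytes per chunk of 16-bit (2-byte) mono PCM."""
    return int(sample_rate * duration_s * 2)

assert chunk_bytes(6) == 192_000   # recommended
assert chunk_bytes(2) == 64_000    # low latency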

Supported Encodings

  • pcm_s16le - 16-bit signed little-endian (recommended)
  • pcm_f32le - 32-bit float little-endian
  • mulaw - 8-bit μ-law (telephony)
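
If your source is a WAV file, the standard-library wave module can check these specifications and return the raw pcm_s16le payload (a sketch; resampling to 16 kHz, if needed, must happen beforehand with an external tool):

import wave

def wav_to_raw_pcm(path: str) -> bytes:
    """Extract raw pcm_s16le bytes from a 16 kHz mono 16-bit WAV file."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz"
        assert w.getnchannels() == 1, "expected mono"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        return w.readframes(w.getnframes())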

📨 Message Protocol

Client → Server Messages

1. StartRecognition (Required, JSON)

Initialize a new transcription session.

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en"
  },
  "translation_config": {
    "target_languages": ["fr", "es", "de"]
  }
}
🔐 Authentication: Authentication is handled via the Authorization header during the WebSocket handshake, not in the message body. See the Authentication section for details.

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Must be "StartRecognition" |
| audio_format.type | string | Yes | "raw" or "file" |
| audio_format.encoding | string | Yes | "pcm_s16le", "pcm_f32le", or "mulaw" |
| audio_format.sample_rate | integer | Yes | Sample rate in Hz (16000 recommended) |
| transcription_config.language | string | Yes | Source language ISO 639-1 code |
| translation_config.target_languages | string[] | No | Target language codes (optional) |

2. Binary Audio Data (Required, Binary)

Send one audio chunk at a time in binary format.

Audio Sending Pattern:

Send binary audio chunks sequentially:

Each chunk: Your chosen duration (2-10 seconds recommended)
Format: Raw PCM audio bytes
Processing: Each chunk processed independently with automatic word boundary handling

Example: 6-second chunks

# At 16kHz mono 16-bit:
Chunk duration: 6 seconds = 192,000 bytes

# Send each chunk sequentially:
Chunk 0: Send 192,000 bytes
Chunk 1: Send 192,000 bytes
Chunk 2: Send 192,000 bytes
...

Example: Variable durations

# You can vary chunk sizes:
Chunk 0: 4 seconds = 128,000 bytes
Chunk 1: 6 seconds = 192,000 bytes
Chunk 2: 2 seconds = 64,000 bytes
...
💡 Flexible Durations: You can use any duration from 1-10 seconds. 6 seconds is recommended for optimal balance of latency and accuracy.

3. EndOfStream (Required, JSON)

Signal end of audio stream.

{
  "message": "EndOfStream",
  "last_seq_no": 10
}
| Field | Type | Required | Description |
|---|---|---|---|
| message | string | Yes | Must be "EndOfStream" |
| last_seq_no | integer | Yes | Number of segments sent |

Server → Client Messages

1. RecognitionStarted (JSON)

{
  "message": "RecognitionStarted",
  "session_id": "session_1761130389621"
}

2. AudioAdded (JSON)

{
  "message": "AudioAdded",
  "seq_no": 0
}

Sent after each audio chunk is received and queued for processing.
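
One optional pattern, not required by the protocol, is to use these acknowledgements to pace uploads (a sketch; a production client would dispatch the other message types instead of skipping them):

import json

async def send_paced(ws, chunks):
    # Wait for the AudioAdded ack before sending the next chunk.
    for chunk in chunks:
        await ws.send(chunk)
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("message") == "AudioAdded":
                break  # acknowledged; send the next chunk
            # AddTranscript / AddTranslation may arrive here; handle as needed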

3. AddTranscript (JSON)

{
  "message": "AddTranscript",
  "metadata": {
    "transcript": "This is the transcribed text",
    "start_time": 0.0,
    "end_time": 4.0,
    "language": "en",
    "chunk_index": 0,
    "word_count": 10
  }
}

4. AddTranslation (JSON)

{
  "message": "AddTranslation",
  "metadata": {
    "translation": "C'est le texte traduit",
    "target_language": "fr",
    "start_time": 0.0,
    "end_time": 4.0
  }
}

You'll receive one per target language per transcript.
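
For example, translations can be grouped with their transcript segment by start_time (a sketch using the field names above):

from collections import defaultdict

translations = defaultdict(dict)  # start_time -> {target_language: text}

def on_translation(msg: dict) -> None:
    meta = msg["metadata"]
    translations[meta["start_time"]][meta["target_language"]] = meta["translation"]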

5. EndOfTranscript (JSON)

{
  "message": "EndOfTranscript",
  "reason": "no_more_audio"
}

6. Error (JSON)

{
  "message": "Error",
  "type": "invalid_model",
  "reason": "Unsupported language: xyz",
  "code": 4004,
  "timestamp": 1729728000.123
}

Error Codes

| Code | Type | Description |
|---|---|---|
| 4001 | invalid_message | Malformed message or invalid input |
| 4004 | invalid_model | Unsupported language code |
| 1008 | policy_violation | Server at capacity (session limit) |
| 1011 | internal_error | Server processing error |
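
A sketch of one way to react to these codes: treat 1008 and 1011 as transient and retry with backoff, and everything else as fatal (the policy is illustrative, not part of the API):

import asyncio

RETRYABLE_CODES = {1008, 1011}  # capacity and internal errors

async def should_retry(error: dict, attempt: int) -> bool:
    """Return True after backing off if the error is worth retrying."""
    if error.get("code") in RETRYABLE_CODES:
        await asyncio.sleep(min(2 ** attempt, 30))  # capped exponential backoff
        return True
    raise RuntimeError(f"{error.get('type')}: {error.get('reason')}")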

🌍 Supported Languages

The API supports 50+ languages for both transcription and translation using ISO 639-1 codes (a few languages, such as the Kurdish variants, use three-letter ISO 639-3 codes).

🏆 Language Performance Highlights:
  • Arabic: Industry-leading WER < 6% - The most advanced Arabic ASR available
  • English: Exceptional accuracy with broadcast-quality recognition
  • French: Superior performance for European French and Canadian French
  • Dutch: Excellent accuracy for Netherlands and Belgian Dutch
  • Custom Languages: We can train custom ASR models for your specific language needs - Contact us
| Language | ISO Code | Native Name |
|---|---|---|
| Afrikaans | af | Afrikaans |
| Arabic | ar | العربية |
| Armenian | hy | Հայերեն |
| Azerbaijani | az | Azərbaycan |
| Belarusian | be | Беларуская |
| Bosnian | bs | Bosanski |
| Bulgarian | bg | Български |
| Catalan | ca | Català |
| Chinese (Simplified) | zh | 中文 |
| Croatian | hr | Hrvatski |
| Czech | cs | Čeština |
| Danish | da | Dansk |
| Dutch | nl | Nederlands |
| English | en | English |
| Estonian | et | Eesti |
| Finnish | fi | Suomi |
| French | fr | Français |
| Galician | gl | Galego |
| German | de | Deutsch |
| Greek | el | Ελληνικά |
| Hebrew | he | עברית |
| Hindi | hi | हिन्दी |
| Hungarian | hu | Magyar |
| Icelandic | is | Íslenska |
| Indonesian | id | Bahasa Indonesia |
| Italian | it | Italiano |
| Japanese | ja | 日本語 |
| Kannada | kn | ಕನ್ನಡ |
| Kazakh | kk | Қазақ |
| Korean | ko | 한국어 |
| Kurdish (Kurmanji) | kmr | Kurdî |
| Kurdish (Sorani) | ckb | کوردی |
| Latvian | lv | Latviešu |
| Lithuanian | lt | Lietuvių |
| Macedonian | mk | Македонски |
| Malay | ms | Bahasa Melayu |
| Maori | mi | Māori |
| Marathi | mr | मराठी |
| Nepali | ne | नेपाली |
| Norwegian | no | Norsk |
| Persian (Farsi) | fa | فارسی |
| Polish | pl | Polski |
| Portuguese | pt | Português |
| Romanian | ro | Română |
| Russian | ru | Русский |
| Serbian | sr | Српски |
| Slovak | sk | Slovenčina |
| Slovenian | sl | Slovenščina |
| Spanish | es | Español |
| Swahili | sw | Kiswahili |
| Swedish | sv | Svenska |
| Tagalog | tl | Tagalog |
| Tamil | ta | தமிழ் |
| Thai | th | ไทย |
| Turkish | tr | Türkçe |
| Ukrainian | uk | Українська |
| Urdu | ur | اردو |
| Vietnamese | vi | Tiếng Việt |
| Welsh | cy | Cymraeg |

All languages support both transcription and translation to/from any other supported language.

💻 Implementation Examples

Python Example (Step-by-Step)

#!/usr/bin/env python3
import asyncio
import websockets
import json

async def transcribe_and_translate():
    uri = "wss://your-server-address:PORT"
    token = "YOUR_TOKEN_HERE"  # Your authentication token
    
    # Authentication is required - add Authorization header
    headers = {
        'Authorization': f'Bearer {token}'
    }
    
    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Step 1: Start recognition
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {
                "type": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 16000
            },
            "transcription_config": {
                "language": "en"  # English
            },
            "translation_config": {
                "target_languages": ["fr", "es", "de"]
            }
        }))
        
        # Step 2: Wait for confirmation
        response = json.loads(await ws.recv())
        print(f"Session ID: {response['session_id']}")
        
        # Step 3: Read audio file (16kHz mono PCM)
        with open("audio.raw", "rb") as f:
            audio_data = f.read()
        
        # Step 4: Configure chunk size (example: 6 seconds)
        chunk_duration = 6  # Choose any duration (2-10 seconds)
        chunk_size = 16000 * chunk_duration * 2  # 192,000 bytes for 6s
        
        # Step 5: Send audio chunks sequentially
        chunk_num = 0
        for i in range(0, len(audio_data), chunk_size):
            # Send audio chunk
            chunk = audio_data[i:i + chunk_size]
            await ws.send(chunk)
            print(f"Sent chunk {chunk_num}: {len(chunk)} bytes")
            chunk_num += 1
        
        # Step 6: Signal end of audio
        await ws.send(json.dumps({
            "message": "EndOfStream",
            "last_seq_no": chunk_num
        }))
        
        # Step 7: Listen for results until EndOfTranscript arrives
        while True:
            msg = json.loads(await ws.recv())
            
            if msg["message"] == "AddTranscript":
                print(f"Transcript: {msg['metadata']['transcript']}")
            
            elif msg["message"] == "AddTranslation":
                lang = msg['metadata']['target_language']
                text = msg['metadata']['translation']
                print(f"Translation ({lang}): {text}")
            
            elif msg["message"] == "EndOfTranscript":
                break

asyncio.run(transcribe_and_translate())

JavaScript Example (Browser)

Note: The browser WebSocket API doesn't support custom headers. For browser-based implementations, pass the authentication token as a query parameter or use a server-side proxy.
// Option 1: Token as query parameter
const token = 'YOUR_TOKEN_HERE';
const ws = new WebSocket(`wss://your-server-address:PORT?token=${token}`);

ws.onopen = () => {
    // Start recognition
    ws.send(JSON.stringify({
        message: 'StartRecognition',
        audio_format: {
            type: 'raw',
            encoding: 'pcm_s16le',
            sample_rate: 16000
        },
        transcription_config: {
            language: 'en'
        },
        translation_config: {
            target_languages: ['fr', 'es']
        }
    }));
};

ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    
    if (msg.message === 'RecognitionStarted') {
        console.log('Session:', msg.session_id);
        
        // Send audio chunks (6 s each); audioBuffer is assumed to be an
        // ArrayBuffer of raw PCM obtained elsewhere (file decode or mic capture)
        sendAudioChunks(audioBuffer);
    }
    else if (msg.message === 'AddTranscript') {
        console.log('Transcript:', msg.metadata.transcript);
    }
    else if (msg.message === 'AddTranslation') {
        console.log(`Translation (${msg.metadata.target_language}):`, msg.metadata.translation);
    }
};

function sendAudioChunks(audioBuffer) {
    const chunkSize = 16000 * 6 * 2;  // 6 seconds = 192,000 bytes
    let chunkNum = 0;
    
    for (let i = 0; i < audioBuffer.byteLength; i += chunkSize) {
        // Send audio chunk
        const chunk = audioBuffer.slice(i, i + chunkSize);
        ws.send(chunk);
        console.log(`Sent chunk ${chunkNum}: ${chunk.byteLength} bytes`);
        chunkNum++;
    }
    
    // End session
    ws.send(JSON.stringify({
        message: 'EndOfStream',
        last_seq_no: chunkNum
    }));
}

⚠️ Error Handling

Error Types

| Type | Code | Description | Solution |
|---|---|---|---|
| invalid_message | 4001 | Malformed JSON or invalid format | Fix message structure |
| invalid_model | 4004 | Unsupported language code | Use valid ISO 639-1 code |
| internal_error | 1011 | Server processing error | Retry or contact support |

Common Issues

No transcripts received

  • Verify audio format matches configuration (16kHz, mono, PCM S16LE)
  • Ensure you're sending binary audio data (not base64 or JSON)
  • Check audio contains speech (not silence)
  • Verify chunk size is reasonable (2-10 seconds recommended)

Translation not working

  • Ensure target_languages array is not empty
  • Use valid ISO 639-1 codes (lowercase, 2-letter)
  • Verify transcript is not empty

📊 Performance

Latency

| Operation | Typical Latency |
|---|---|
| Transcription | < 1.5 seconds |
| Translation (per language) | < 0.5 seconds |
| Total (with 3 translations) | < 2.0 seconds |

Limits

  • Connection Keepalive: Ping every 20 seconds, 30-second timeout if no response
  • Network Bandwidth: ~340 kbps upload recommended (for 6-second chunks). Raw 16 kHz, 16-bit mono PCM is 256 kbps (16,000 samples/s × 2 bytes × 8 bits), so the recommendation leaves headroom for WebSocket and TLS overhead.

FAQ

Q: What chunk duration should I use?

A: 6 seconds is recommended for optimal balance. Shorter (2-4s) for lower latency, longer (8-10s) for better context.

Q: Can I vary chunk sizes during a session?

A: Yes, the system adapts to different chunk sizes dynamically.

Q: How does word boundary handling work?

A: The system automatically handles word boundaries between chunks using 1-second audio buffering and smart deduplication.

Q: How many languages can I translate to simultaneously?

A: No hard limit, but 3-5 languages recommended for optimal performance.

Q: What's the transcription accuracy?

A: For broadcast-quality audio, expect >95% word accuracy. For Arabic specifically, we achieve a Word Error Rate (WER) of less than 6%, making it the most advanced Arabic transcription system available. We also excel in English, French, and Dutch.

Q: Can you support additional languages not listed?

A: Yes! We can train custom ASR (Automatic Speech Recognition) models for specific languages tailored to your needs. Contact our technical team at support@whitepeaks.fr to discuss custom language model training.

Q: Can I use this for live streaming?

A: Yes! Designed for real-time applications with sub-2-second total latency.

🔒 Privacy & Compliance

GDPR Compliance

KWIKmotion AI Live Captions is fully compliant with the General Data Protection Regulation (GDPR). We prioritize your data privacy and security:

  • No Audio Storage: We do not store, record, or keep copies of audio data sent to our service. Audio is processed in real-time and immediately discarded after transcription.
  • No Text Storage: Generated transcripts and translations are not stored on our servers. All text processing occurs in memory and is delivered directly to you via WebSocket.
  • No Logging of Content: We do not log or retain the actual content of your transcripts or translations for any purpose.
  • Session Data Only: We only retain minimal session metadata (connection timestamps, session IDs) necessary for service operation, which is automatically purged after session termination.
  • Data Processing Location: Audio and text processing occurs in real-time on our servers and is immediately discarded after transmission to your client.
  • AI-Powered Processing: Transcription and translation are performed using artificial intelligence (AI) models. We do not hold any responsibility or guarantee the authenticity, accuracy, or completeness of the generated content. AI-generated content may contain errors or inaccuracies.
  • Delicate Content Notice: If you are processing sensitive, legal, medical, or other delicate content, we strongly recommend that you inform your audience that transcription and translation services are provided via AI technology and may not be 100% accurate. It is your responsibility to review and verify any AI-generated content before use.

ISO 27001 Information Security Management

Our service adheres to ISO 27001 standards for information security management:

  • Session Separation for Audio: Each audio session is completely isolated from other sessions. Audio data from one session cannot access or interfere with audio data from another session, ensuring complete data isolation and privacy.
  • Access Controls: Authentication and authorization mechanisms protect your sessions.
  • Real-Time Processing: Audio and text are processed in-memory only, with no persistent storage.
  • Regular Security Audits: Our infrastructure undergoes regular security assessments and compliance reviews.
🛡️ Your Data, Your Control

All audio and text data flows directly through our system without retention. You maintain full control over your data at all times. For questions about our privacy practices, contact: support@whitepeaks.fr

📞 Contact & Support

White Peaks Solutions SAS

🚀 Interactive API Reference

WebSocket API Reference
View the complete API specification with all message types, parameters, and examples.

📥 Download OpenAPI Spec: live-captions-openapi.json