🐙 Octopus: Towards Building the Arabic Speech LLM Suite
📢 Overview
Octopus is a bilingual Audio-Language Model (Audio-LLM) family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:
- Automatic Speech Recognition (ASR) for Arabic & English 🗣️
- Speech Translation (Arabic → English and vice versa) 🌍
- Arabic Dialect Identification (DID) 🏷️
The lightweight variant, TinyOctopus, maintains the same modular design but is optimized for efficiency on smaller GPUs.
🧩 Architecture
Core Components
The Octopus family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.
Audio Encoders
- Distil-Whisper (distil-large-v3) → lightweight frozen encoder producing compact speech embeddings.
- Whisper-large-v3 → high-capacity encoder for robust transcription and multilingual coverage.
- BEATs (Microsoft) → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
Alignment & Fusion
- Cross-Attention Projection Layer → a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
Language / Decoder Models
- DeepSeek 1.5B → efficient generative decoder for reasoning, dialogue, and translation.
- LLaMA 3.2 1B → compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
- ALLaM 13B → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
Together, these components enable the Octopus line, from TinyOctopus (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full ALLaM-Octopus (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio-understanding and speech-to-text reasoning tasks across Arabic and English.
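To make the alignment layer concrete, here is a minimal PyTorch sketch assuming a learned-query cross-attention design; the class name, dimensions, and query count are illustrative assumptions, not the released configuration:

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Sketch of the trainable bridge between audio and text spaces.

    Learnable query tokens attend over projected audio features; the fused
    output can be consumed by the LLM decoder as prefix embeddings.
    """

    def __init__(self, audio_dim=1280, text_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, text_dim))  # learned queries
        self.audio_proj = nn.Linear(audio_dim, text_dim)  # map audio dim -> text dim
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, audio_frames, audio_dim) from Whisper/BEATs
        kv = self.audio_proj(audio_feats)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)  # (batch, num_queries, text_dim)
        return fused  # prepended to the decoder's text-token embeddings
```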
📚 Training Datasets
The Octopus models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling ≈25,000 hours of high-quality data for ASR, translation, and dialect identification.
| Task / Domain | Dataset | Train (h) | Dev (h) | Description |
|---|---|---|---|---|
| ASR (Arabic) | QASR | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
| ASR (Arabic) | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. |
| ASR (English) | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
| ASR (English) | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
| ASR (Ar–En Code-Switching) | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
| Translation (Ar→En) | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
| Translation (Ar→En) | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
| Dialect Identification | ADI17 | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
Total Coverage: ≈25,000 hours of speech across Arabic, English, and mixed-language domains — enabling broad generalization for ASR, translation, and dialect identification.
These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel Arabic–English pairs enabling bilingual text generation and translation.
🧮 Model Weights & Resources
The full set of model weights (including large checkpoints) is publicly available here:
➡️ Octopus Model Weights
⚙️ Installation & Usage
💻 Install Dependencies
```bash
pip install -r requirements.txt
```
🔊 Inference
```python
from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file

# Supported tasks: "asr" (transcription), "translation" (Ar→En), "dialect" (dialect ID)
output = transcribe(audio_path, task="asr")
print("Generated Text:", output)
```
🧪 Evaluation Results
🎙️ ASR Performance (WER ↓ / CER ↓)
| Dataset | Ar-Octopus | Bilingual-Octopus | Trans-Octopus | Whisper-large-v3 | SeamlessM4T |
|---|---|---|---|---|---|
| MGB2 (Arabic) | 16.5 / 6.5 | 15.2 / 6.8 | 13.3 / 5.9 | 16.2 / 7.9 | 17.2 / 8.4 |
| test-clean (English) | 82.5 / 92.4 | 2.6 / 1.4 | 67.3 / 79.4 | 2.86 / 0.98 | 2.68 / 0.88 |
| test-other (English) | 86.9 / 95.1 | 5.1 / 3.4 | 71.5 / 87.8 | 5.00 / 2.05 | 5.07 / 1.94 |
| tedlium (English) | 101.9 / 77.4 | 5.1 / 3.9 | 85.2 / 63.6 | 11.9 / 4.4 | 86.5 / 62.2 |
| Escwa (Code-Switched) | 42.5 / 26.3 | 40.8 / 27.1 | 41.8 / 25.1 | 47.3 / 31.0 | 52.0 / 35.3 |
| Mixat-ALL (Code-Switched) | 22.0 / 9.0 | 23.4 / 10.3 | 34.1 / 10.6 | 29.0 / 15.0 | 32.8 / 16.9 |
| Mixat-CS (Code-Switched) | 26.4 / 12.4 | 28.5 / 14.9 | 27.8 / 13.3 | 34.8 / 20.6 | 38.2 / 21.8 |
| In-house Long-form | 25.4 / 13.0 | 24.9 / 12.5 | 24.1 / 12.1 | 26.7 / 15.2 | 29.3 / 18.6 |
Adding language tokens to the bilingual and translation variants yielded up to an 86% improvement on the English test sets.
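The paired scores above are word- and character-level error rates; a minimal sanity check with the jiwer package (our choice of tooling here, not necessarily the evaluation stack behind these numbers):

```python
import jiwer  # pip install jiwer

reference = "welcome dear viewers to a new episode of the program"
hypothesis = "welcome dear viewer to new episode of the program"

# WER = (substitutions + deletions + insertions) / reference word count;
# CER applies the same computation at the character level.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```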
🪶 Tiny-Octopus & Fine-Tuning (WER ↓ / CER ↓)
| Dataset | TinyOctopus LLaMA-3 1B | Fine-tuned LLaMA-3 1B | TinyOctopus DeepSeek 1.5B | Fine-tuned DeepSeek 1.5B |
|---|---|---|---|---|
| MGB2 (Arabic) | 22.6 / 15.7 | 16.1 / 9.5 | 23.2 / 15.8 | 15.5 / 9.2 |
| test-clean (English) | 7.5 / 5.7 | 3.1 / 1.3 | 7.7 / 5.8 | 7.6 / 5.7 |
| test-other (English) | 11.3 / 8.0 | 6.9 / 3.5 | 11.5 / 8.2 | 11.3 / 8.0 |
| Escwa (Code-Switched) | 42.5 / 26.9 | 40.3 / 24.4 | 43.6 / 27.8 | 41.8 / 26.3 |
| Mixat-All | 35.2 / 19.6 | 34.1 / 19.3 | 37.1 / 21.1 | 35.5 / 19.9 |
| Mixat-CS | 40.2 / 24.2 | 36.2 / 21.4 | 41.2 / 25.2 | 39.9 / 24.2 |
| In-house Long-files | 44.3 / 29.1 | 42.8 / 26.9 | 47.0 / 32.7 | 43.7 / 31.5 |
Code-switch TTS augmentation yielded an ≈20% WER reduction across the multilingual evaluation sets.
🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
| Model / System | CoVoST2 (Ar→En) | FLEURS (Ar→En) |
|---|---|---|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | 23.9 / 0.56 |
| Trans-Octopus | 38.6 / 0.64 | 23.2 / 0.58 |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |
Trans-Octopus achieves the best BLEU and BERT-F1 on CoVoST2 and competitive results on FLEURS, surpassing SeamlessM4T in low-resource conditions.
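To compute metrics of this kind, here is a minimal sketch using sacrebleu and bert-score; the package choice is an assumption on our part, and the exact scoring configuration behind the table may differ:

```python
import sacrebleu
from bert_score import score

hypotheses = ["I took a loan of a certain amount of money to pay off the debt"]
references = ["I borrowed a sum of money to settle the debt"]

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERT-F1 between hypothesis and reference sentences.
_, _, f1 = score(hypotheses, references, lang="en")
print(f"BERT-F1: {f1.mean().item():.2f}")
```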
🏷️ Dialect Identification
For dialect identification, the Tiny-Octopus models achieved 87–89% accuracy across all 17 dialects in ADI-17.
The confusion matrices reveal clear separation among the Gulf, Levantine, North-African, and Egyptian clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
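A minimal sketch of how such accuracy scores and confusion matrices can be produced with scikit-learn (the labels below are illustrative, not the full ADI-17 set):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Predicted vs. gold dialect labels for a handful of utterances
# (illustrative labels only; ADI-17 covers 17 dialects).
gold = ["KSA", "EGY", "LEV", "KSA", "MOR"]
pred = ["KSA", "EGY", "KSA", "KSA", "MOR"]

labels = sorted(set(gold))
print("Accuracy:", accuracy_score(gold, pred))
print(confusion_matrix(gold, pred, labels=labels))
```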
🎧 Examples
Example 1: Speech Recognition (Arabic & English)
🎵 Audio Input (Arabic):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
(English: "Welcome, dear viewers, to a new episode of the program 'The Economy and the People'.")
🎵 Audio Input (English):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
NO IT'S NOT TOO SOON
Example 2: Arabic to English Translation
🎵 Audio Input:
📝 User Prompt:
Translate the following Arabic speech into English or قم بترجمة المقطع للإنجليزية
💡 System Response:
I took a loan a certain amount of money to pay off the debt
Example 3: Dialect Identification
🎵 Audio Input:
📝 User Prompt:
Identify the dialect of the given speech or ماهي لهجة المتحدث؟
💡 System Response:
KSA
🌳 Model Tree for ArabicSpeech/Octopus
Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B