@@ -6,9 +6,22 @@ license: apache-2.0
 datasets:
 - KBLab/rixvox-v2
 ---
-## KB-Whisper Medium
-The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech.
 ### Usage
@@ -41,4 +54,49 @@ generate_kwargs = {"task": "transcribe", "language": "sv"}
 res = pipe("audio.mp3",
            chunk_length_s=30,
            generate_kwargs={"task": "transcribe", "language": "sv"})
-```

 datasets:
 - KBLab/rixvox-v2
 ---
+## KB-Whisper Medium
+The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).
+| Model size  |   | FLEURS | CommonVoice | NST  |
+|------------|---------|--------|-------------|------|
+| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
+|            | OpenAI  | 59.2   | 67.8   | 85.2   |
+| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
+|            | OpenAI  | 39.6   | 52.1   | 53.4   |
+| [small](https://huggingface.co/KBLab/kb-whisper-small)      | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
+|            | OpenAI  | 20.6   | 26.4   | 26.4   |
+| [medium](https://huggingface.co/KBLab/kb-whisper-medium)     | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
+|            | OpenAI  | 12.1   | 15.8   | 17.1   |
+| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)   | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
+|            | OpenAI  | 7.8    | 9.5    | 11.3    |
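As a rough check of the 47% figure quoted above, the sketch below averages the relative WER reduction of the large-v3 row across the three test sets. Treating the headline number as the unweighted mean of per-dataset relative reductions is an assumption; the text does not say how the average was computed. The function names are illustrative.

```python
# WER (%) for the large-v3 row of the table above, as (KBLab, OpenAI) pairs.
LARGE_V3_WER = {
    "FLEURS": (5.4, 7.8),
    "CommonVoice": (4.1, 9.5),
    "NST": (5.2, 11.3),
}


def relative_wer_reduction(kb_wer: float, openai_wer: float) -> float:
    """Fraction by which WER drops going from the OpenAI to the KBLab model."""
    return (openai_wer - kb_wer) / openai_wer


def mean_relative_reduction(rows: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of per-dataset relative reductions (assumed averaging)."""
    reductions = [relative_wer_reduction(kb, oa) for kb, oa in rows.values()]
    return sum(reductions) / len(reductions)


if __name__ == "__main__":
    for name, (kb, oa) in LARGE_V3_WER.items():
        print(f"{name}: {relative_wer_reduction(kb, oa):.1%}")
    print(f"mean: {mean_relative_reduction(LARGE_V3_WER):.1%}")
```

Under that assumption the mean comes out at roughly 47%, which is consistent with the stated figure.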
 ### Usage
 res = pipe("audio.mp3",
            chunk_length_s=30,
            generate_kwargs={"task": "transcribe", "language": "sv"})
+```
+### Training data
+Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. Training ran in two stages, each applying different quality filters and different thresholds for those filters.
+Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
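The Stage 2 thresholds can be read as a simple predicate over per-sample quality scores. The sketch below is illustrative only: `SampleScores` and `passes_stage2` are hypothetical names, not part of the released training code. It assumes the scores are already computed on a 0 to 1 scale and that the CER threshold applies separately to the first and the last 10 characters.

```python
from dataclasses import dataclass


@dataclass
class SampleScores:
    """Hypothetical per-sample quality scores, each on a 0-1 scale."""
    bleu: float
    weighted_rouge_n: float
    cer_head: float  # CER over the first 10 characters (assumed interpretation)
    cer_tail: float  # CER over the last 10 characters (assumed interpretation)


def passes_stage2(s: SampleScores) -> bool:
    """Return True if a sample meets all Stage 2 thresholds quoted above."""
    return (
        s.bleu >= 0.7
        and s.weighted_rouge_n >= 0.7
        and s.cer_head <= 0.2
        and s.cer_tail <= 0.2
    )
```

A sample with `bleu=0.8`, `weighted_rouge_n=0.75`, `cer_head=0.1`, `cer_tail=0.0` would pass. Raising either CER above 0.2, or dropping BLEU or ROUGE-N below 0.7, would exclude it from Stage 2.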
+| Dataset      | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
+|-------------|--------------------------|--------------|
+| Subtitles   | 34,261                   | 3,110        |
+| Riksdag     | 21,949                   | 5,119        |
+| ISOF        | 54                       | 54           |
+| NST         | 250                      | 250          |
+| **Total**   | **56,514**               | **8,533**    |
+Loading our models through Hugging Face gives you the **Stage 2** checkpoint by default. We have also uploaded the checkpoints from our continued pretraining (Stage 1) and tagged them. You can load one of these checkpoints by specifying the `revision`, for example [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The default Stage 2 model is tagged `standard`.
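A minimal sketch of selecting a tagged checkpoint with the `revision` argument of `from_pretrained` in `transformers`. The `load_checkpoint` helper and the `STAGE_REVISIONS` mapping are illustrative names. The example uses `KBLab/kb-whisper-large` because that is the repository linked above; that the same tags exist on the other model sizes is an assumption.

```python
# Tags named in the text above, keyed by training stage (mapping is illustrative).
STAGE_REVISIONS = {
    "stage1": "pretrained-checkpoint",  # continued-pretraining checkpoint
    "stage2": "standard",               # default finetuned model
}


def load_checkpoint(model_id: str = "KBLab/kb-whisper-large",
                    revision: str = "standard"):
    """Load a tagged checkpoint; downloads weights from the Hugging Face Hub."""
    # Imported lazily so this helper can be defined without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, revision=revision)
    processor = AutoProcessor.from_pretrained(model_id, revision=revision)
    return model, processor
```

For example, `load_checkpoint(revision=STAGE_REVISIONS["stage1"])` would fetch the Stage 1 checkpoint instead of the default.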
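+### Evaluation
+Word Error Rate (WER, in %) on each test set. Lower is better, and bold marks the better score in each pair.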
+| Model size  |  | FLEURS | CommonVoice | NST  |
+|------------|---------|--------|-------------|------|
+| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
+|            | OpenAI  | 59.2   | 67.8   | 85.2   |
+| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
+|            | OpenAI  | 39.6   | 52.1   | 53.4   |
+| [small](https://huggingface.co/KBLab/kb-whisper-small)      | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
+|            | OpenAI  | 20.6   | 26.4   | 26.4   |
+| [medium](https://huggingface.co/KBLab/kb-whisper-medium)     | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
+|            | OpenAI  | 12.1   | 15.8   | 17.1   |
+| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)   | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
+|            | OpenAI  | 7.8    | 9.5    | 11.3    |
+BLEU score on the same test sets. Higher is better.
+| Model size  |   | FLEURS | CommonVoice | NST  |
+|------------|---------|--------|-------------|------|
+| tiny       | KBLab   | **76.6**  | **73.7**  | **74.3**  |
+|            | OpenAI  | 26.9   | 21.1   | 24.0   |
+| base       | KBLab   | **83.2**   | **79.9**   | **78.3**   |
+|            | OpenAI  | 41.1   | 32.5   | 36.9   |
+| small      | KBLab   | **86.6**   | **83.5**   | **79.6**   |
+|            | OpenAI  | 64.0   | 56.5   | 58.2   |
+| medium     | KBLab   | **87.6**   | **85.0**   | **80.2**   |
+|            | OpenAI  | 77.1   | 70.1   | 68.9   |
+| large-v3   | KBLab   | **89.8**   | **87.2**   | **81.1**   |
+|            | OpenAI  | 84.9    | 79.1    | 75.1    |