 ---
+pretty_name: "WhisperKit ASR Evaluation Results"
+tags:
+- whisper
+- whisperkit
+- coreml
+- asr
+- quantized
 ---
+# WhisperKit Evaluation Results
+## Dataset: `librispeech`
+### Quality Evaluation
+| Model                                                                                                                                                                   |   WER |   QoI (%) |   File Size (MB) |
+|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|----------:|-----------------:|
+| [WhisperOpenAIAPI/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2)               |  2.85 |     100   |             3100 |
+| [WhisperKit/openai_whisper-large-v2](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2)                           |  3.28 |      96.6 |             3100 |
+| [WhisperKit/openai_whisper-large-v2_1050MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_1050MB)             |  3.32 |      95   |             1050 |
+| [WhisperKit/openai_whisper-large-v2_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo)               |  3.24 |      96.6 |             3100 |
+| [WhisperKit/openai_whisper-large-v2_turbo_1022MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v2_turbo_1022MB) |  3.33 |      94.9 |             1022 |
+| [whisper.cpp/openai_whisper-large-v2-q5_0](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/whisper.cpp/openai_whisper-large-v2-q5_0)               |  2.8  |      96.6 |             1080 |
+| [WhisperKit/openai_whisper-small](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-small)                                 |  3.98 |      82.9 |              483 |
+| [WhisperKit/openai_whisper-base](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-base)                                   |  6.11 |      67.1 |              145 |
+| [WhisperKit/openai_whisper-tiny](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-tiny)                                   |  8.94 |      52.4 |               66 |
+| [WhisperKit/openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3)                           |  2.48 |      95.2 |             3100 |
+| [WhisperKit/openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/WhisperKit/openai_whisper-large-v3_turbo)               |  2.44 |      95.4 |             3100 |
+| [openai_whisper-large-v3_turbo_1018MB](https://huggingface.co/argmaxinc/whisperkit-coreml-staging/tree/main/openai_whisper-large-v3_turbo_1018MB)                       |  2.49 |      94.8 |             1018 |
+### Quality-of-Inference (QoI) Certification
+We believe that rigorously measuring the quality of inference is necessary for developers and
+enterprises to make informed decisions when opting to use optimized or compressed variants of
+Whisper models in production. The current measurements are between reference and optimized
+WhisperKit models. We plan to extend the scope of this measurement to other Whisper
+implementations soon so developers can certify the behavior change (if any) caused by
+switching between WhisperKit and these implementations (or migrating from them).
+In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
+which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
+backwards-compatibility on the test distribution and avoids "perceived regressions", the phenomenon
+where known per-example behavior changes after a code/model update and causes divergence in
+downstream code or breaks the user experience itself (even if dataset averages stay flat
+across updates). Pseudocode for `qoi`:
+```python
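+# Pseudocode: `wer` scores a transcription against the example's ground-truth text.
+qoi = []
+for example in dataset:
+    # An example is a no-regression if the optimized model is at least as accurate
+    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
+    qoi.append(no_regression)
+# Percentage of examples without a regression
+qoi = (sum(qoi) / len(qoi)) * 100.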
+```
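The pseudocode above can be made concrete with a small, dependency-free sketch. It assumes each example is an `(audio, transcript)` pair and that each model is a callable returning text. `word_error_rate` and `compute_qoi` are hypothetical helper names used for illustration, not part of whisperkittools, and the WER here is a plain word-level edit distance without the text normalization a real evaluation would apply.

```python
from typing import Callable, Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def compute_qoi(
    examples: Iterable[Tuple[str, str]],
    reference_model: Callable[[str], str],
    optimized_model: Callable[[str], str],
) -> float:
    """Percentage of examples where the optimized model's WER is no worse."""
    flags = []
    for audio, transcript in examples:
        ref_wer = word_error_rate(transcript, reference_model(audio))
        opt_wer = word_error_rate(transcript, optimized_model(audio))
        flags.append(opt_wer <= ref_wer)
    if not flags:
        raise ValueError("dataset is empty")
    return 100.0 * sum(flags) / len(flags)


# Toy stand-ins for real models: dictionaries mapping an audio id to a transcription.
toy_dataset = [("a", "the cat sat"), ("b", "hello world")]
toy_reference = {"a": "the cat sat", "b": "hello word"}.get
toy_optimized = {"a": "the cat sat down", "b": "hello world"}.get
print(compute_qoi(toy_dataset, toy_reference, toy_optimized))  # 50.0
```

In this toy run the optimized model improves example `b` but regresses on example `a`, so `qoi` is 50% even though the average WER improves (from 0.25 to about 0.17). This is the per-example strictness described above.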
+We define the reference model as the default float16 precision Core ML model that is generated by
+whisperkittools. This reference model matches the accuracy of the original PyTorch model
+on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
+as our testing set for Whisper. We are actively expanding our test set coverage to `earnings22`
+(120 hours of long English audio clips with various accents). We anticipate that developers who use Whisper in production have
+their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the
+same measurements on such custom test sets. Please see [Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset)
+for details.
+### Reproducing Results
+Results on this page are generated by our cluster of Apple Silicon Macs. We use them as self-hosted runners on
+GitHub Actions as our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
+we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
+run identical [evaluation jobs](#evaluation)
+locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
+evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than
+1 day to complete the same evaluation.
+Glossary:
+- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
+as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).
+- `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
+`_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.
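As an illustration of the naming convention in the glossary, the sketch below splits a variant name into its base model, a `_turbo` flag, and the file-size budget in MB. `parse_variant` and `VariantInfo` are hypothetical names introduced here. The sketch assumes the suffixes appear in the order seen in the table above (`_turbo` before `_<size>MB`) and is not an official parser.

```python
import re
from typing import NamedTuple, Optional


class VariantInfo(NamedTuple):
    base: str                # e.g. "openai_whisper-large-v2"
    turbo: bool              # True if the name carries the `_turbo` suffix
    size_mb: Optional[int]   # total file size from the `_*MB` suffix, if present


_SIZE_SUFFIX = re.compile(r"_(\d+)MB$")


def parse_variant(name: str) -> VariantInfo:
    """Parse names such as 'WhisperKit/openai_whisper-large-v2_turbo_1022MB'."""
    name = name.split("/")[-1]  # drop an optional folder prefix
    size_mb = None
    match = _SIZE_SUFFIX.search(name)
    if match:
        size_mb = int(match.group(1))
        name = name[: match.start()]
    turbo = name.endswith("_turbo")
    if turbo:
        name = name[: -len("_turbo")]
    return VariantInfo(base=name, turbo=turbo, size_mb=size_mb)


info = parse_variant("WhisperKit/openai_whisper-large-v2_turbo_1022MB")
print(info.base, info.turbo, info.size_mb)  # openai_whisper-large-v2 True 1022
```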