Update README.md
README.md
CHANGED
@@ -26,7 +26,7 @@ developed through the collaboration bewteen
|
|
26 |
[Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
|
27 |
Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
|
28 |
we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model for Japanese and English ASR, while we translate the
|
29 |
-
transcription into English and Japanese by
|
30 |
We employ [ReazonSpeech](https://huggingface.co/datasets/japanese-asr/ja_asr.reazon_speech_all) for Japanese ASR and Japanese speech to English text translation,
|
31 |
and [Multilingual LibriSpeech](https://huggingface.co/datasets/japanese-asr/en_asr.mls) for English ASR and English speech to Japanese text translation.
|
32 |
Kotoba-whisper-bilingual's loss objective consists of cross-entropy on both of ASR and translation tasks, while KL divergence loss only for ASR task.
|
@@ -40,15 +40,15 @@ it inherits the benefit of the improved latency compared to [openai/whisper-larg
|
|
40 |
## Evaluation
|
41 |
|
42 |
We compare our kotoba-whisper-bilingual with OpenAI whisper models, kotoba-whisper models, and cascaded models for translation.
|
43 |
-
Worth noting that kotoba-whisper-bilingual is the only model that can do Japanese and English ASR and translation between Japanese and English
|
44 |
-
OpenAI whisper is not trained for English to Japanese translation, and other models are specific to the Task (eg. kotoba-whisper is Japanese ASR and
|
45 |
distil whisper is English ASR only).
|
46 |
|
47 |
### Speech2Text Translation (Japanese->English)
|
48 |
|
49 |
| model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)| [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
|
50 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
|
51 |
-
| [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 73.9 | 98.7 |
|
52 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 64.3 | 67.1 |
|
53 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 65.4 | 68.9 |
|
54 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 65.6 | 67.4 |
|
@@ -66,7 +66,7 @@ distil whisper is English ASR only).
|
|
66 |
|
67 |
| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation)| [Fleurs (En->JA)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|
68 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
|
69 |
-
| [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 69.1 | 74.4 |
|
70 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 62.4 | 63.5 |
|
71 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 64.4 | 67.2 |
|
72 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 62.4 | 62.9 |
|
@@ -84,7 +84,7 @@ distil whisper is English ASR only).
|
|
84 |
|
85 |
| model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|
86 |
|:--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------:|
|
87 |
-
| [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 9.8 | 9.3 | 16.8 |
|
88 |
| [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 9.2 | 8.4 | 11.6 |
|
89 |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.4 | 8.5 | 12.2 |
|
90 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.5 | 7.1 | 14.9 |
|
@@ -102,7 +102,7 @@ distil whisper is English ASR only).
|
|
102 |
|
103 |
| model | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (ami) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (earnings22) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (librispeech) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (tedlium) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (voxpopuli) |
|
104 |
|:----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------:|-----------------------------------------------------------------------------------:|------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------:|----------------------------------------------------------------------------------:|
|
105 |
-
| [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 16.7 | 15.3 | 2.4 | 4.1 | 8.3 |
|
106 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 17.9 | 14.9 | 2.1 | 3.8 | 12.7 |
|
107 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 18.9 | 16.7 | 2.3 | 4.9 | 7.7 |
|
108 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 18.8 | 14.9 | 2.6 | 4.2 | 7.7 |
|
@@ -119,7 +119,7 @@ Following table shows the mean inference time in second averaged over 10 trials
|
|
119 |
|
120 |
| model | 10 | 30 | 60 | 300 |
|
121 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|------:|------:|------:|
|
122 |
-
| [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 0.041 | 0.111 | 0.214 | 1.077 |
|
123 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
|
124 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.24 | 0.348 | 1.515 |
|
125 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17 | 0.245 | 0.348 | 1.882 |
|
|
|
26 |
[Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
|
27 |
Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
|
28 |
we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model for Japanese and English ASR, while we translate the
|
29 |
+
transcription into English and Japanese by external LLM to obtain training dataset for speech-to-text translation.
|
30 |
We employ [ReazonSpeech](https://huggingface.co/datasets/japanese-asr/ja_asr.reazon_speech_all) for Japanese ASR and Japanese speech to English text translation,
|
31 |
and [Multilingual LibriSpeech](https://huggingface.co/datasets/japanese-asr/en_asr.mls) for English ASR and English speech to Japanese text translation.
|
32 |
Kotoba-whisper-bilingual's loss objective consists of cross-entropy on both of ASR and translation tasks, while KL divergence loss only for ASR task.
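
The loss objective described in the line above (cross-entropy on both tasks, KL divergence only on ASR) can be illustrated with a small per-token sketch. This is not the project's training code: the function names, the `kl_weight` parameter, and the teacher-to-student direction of the KL term are assumptions made for illustration, and plain Python lists stand in for model output distributions.

```python
import math


def cross_entropy(student_probs, target_index):
    """Negative log-likelihood of the reference token under the student."""
    return -math.log(student_probs[target_index])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) over one token's vocabulary distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def training_loss(student_probs, target_index, task,
                  teacher_probs=None, kl_weight=1.0):
    """Per-token loss: cross-entropy for every task, plus KL for ASR only.

    `task` is "asr" or "translation" (hypothetical labels). `kl_weight`
    is an assumed hyperparameter; the README does not state a weighting.
    """
    loss = cross_entropy(student_probs, target_index)
    if task == "asr":
        if teacher_probs is None:
            raise ValueError("ASR samples need teacher probabilities for the KL term")
        loss += kl_weight * kl_divergence(teacher_probs, student_probs)
    return loss


# Translation sample: cross-entropy only.
translation_loss = training_loss([0.7, 0.2, 0.1], 0, "translation")

# ASR sample: cross-entropy plus KL against the teacher distribution.
asr_loss = training_loss([0.7, 0.2, 0.1], 0, "asr",
                         teacher_probs=[0.9, 0.05, 0.05])
```

A plausible reason the KL term is restricted to ASR is that the Whisper teacher only produces distributions for transcription, while the translation targets come from an external LLM as text, leaving no teacher distribution to match.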
|
|
|
40 |
## Evaluation
|
41 |
|
42 |
We compare our kotoba-whisper-bilingual with OpenAI whisper models, kotoba-whisper models, and cascaded models for translation.
|
43 |
+
**Worth noting that kotoba-whisper-bilingual is the only model that can do Japanese and English ASR and speech-to-text translation between Japanese and English**, as
|
44 |
+
OpenAI whisper is not trained for English to Japanese speech-to-text translation, and other models are specific to the Task (eg. kotoba-whisper is Japanese ASR and
|
45 |
distil whisper is English ASR only).
|
46 |
|
47 |
### Speech2Text Translation (Japanese->English)
|
48 |
|
49 |
| model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)| [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
|
50 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
|
51 |
+
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 73.9 | 98.7 |
|
52 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 64.3 | 67.1 |
|
53 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 65.4 | 68.9 |
|
54 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 65.6 | 67.4 |
|
|
|
66 |
|
67 |
| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation)| [Fleurs (En->JA)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|
68 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
|
69 |
+
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 69.1 | 74.4 |
|
70 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 62.4 | 63.5 |
|
71 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 64.4 | 67.2 |
|
72 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 62.4 | 62.9 |
|
|
|
84 |
|
85 |
| model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|
86 |
|:--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------:|
|
87 |
+
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 9.8 | 9.3 | 16.8 |
|
88 |
| [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 9.2 | 8.4 | 11.6 |
|
89 |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.4 | 8.5 | 12.2 |
|
90 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 8.5 | 7.1 | 14.9 |
|
|
|
102 |
|
103 |
| model | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (ami) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (earnings22) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (librispeech) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (tedlium) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (voxpopuli) |
|
104 |
|:----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------:|-----------------------------------------------------------------------------------:|------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------:|----------------------------------------------------------------------------------:|
|
105 |
+
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 16.7 | 15.3 | 2.4 | 4.1 | 8.3 |
|
106 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 17.9 | 14.9 | 2.1 | 3.8 | 12.7 |
|
107 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 18.9 | 16.7 | 2.3 | 4.9 | 7.7 |
|
108 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 18.8 | 14.9 | 2.6 | 4.2 | 7.7 |
|
|
|
119 |
|
120 |
| model | 10 | 30 | 60 | 300 |
|
121 |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|------:|------:|------:|
|
122 |
+
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 0.041 | 0.111 | 0.214 | 1.077 |
|
123 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
|
124 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.24 | 0.348 | 1.515 |
|
125 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17 | 0.245 | 0.348 | 1.882 |
|
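
The hunk context for the last table mentions mean inference time in seconds averaged over 10 trials. A minimal sketch of that kind of measurement is shown below; `mean_inference_time` and `run_inference` are hypothetical names, and the README's actual benchmarking setup (hardware, batching, warm-up) is not given in this diff.

```python
import time
from statistics import mean


def mean_inference_time(run_inference, n_trials=10):
    """Return the mean wall-clock time in seconds of `run_inference` over `n_trials` calls.

    `run_inference` is any zero-argument callable, for example a closure
    that feeds one fixed audio clip to a model.
    """
    durations = []
    for _ in range(n_trials):
        start = time.perf_counter()
        run_inference()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stand-in workload so the sketch runs without any model installed.
def dummy_inference():
    sum(i * i for i in range(10_000))


elapsed = mean_inference_time(dummy_inference, n_trials=10)
```

When timing a GPU model this way, the callable should block until the result is ready, otherwise the measurement only covers the time to launch the work.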