---
license: apache-2.0
language: ja
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: CommonVoice 8.0 (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: kotoba-tech/kotoba-whisper-v1.0
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: CommonVoice_8.0 (Japanese)
      type: japanese-asr/ja_asr.common_voice_8_0
    metrics:
    - name: WER
      type: WER
      value: TBA
    - name: CER
      type: CER
      value: TBA
  - task:
      type: automatic-speech-recognition
    dataset:
      name: ReazonSpeech (Test)
      type: japanese-asr/ja_asr.reazonspeech_test
    metrics:
    - name: WER
      type: WER
      value: TBA
    - name: CER
      type: CER
      value: TBA
  - task:
      type: automatic-speech-recognition
    dataset:
      name: JSUT Basic5000
      type: japanese-asr/ja_asr.jsut_basic5000
    metrics:
    - name: WER
      type: WER
      value: TBA
    - name: CER
      type: CER
      value: TBA
---
					
						

# Kotoba-Whisper-v1.1
_Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
(i) improved timestamps obtained with [stable-ts](https://github.com/jianfch/stable-ts) and (ii) punctuation added with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are applied seamlessly to the transcription predicted by [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0).
The pipeline has been developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
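
To illustrate what the punctuation stage does, below is a minimal standalone sketch of punctuation restoration with the punctuators library. It assumes the library's `PunctCapSegModelONNX` API and the multilingual `pcs_47lang` checkpoint (installed with `pip install punctuators`), and it is an illustration rather than part of the integrated pipeline itself:

```python
# Hedged sketch: standalone punctuation restoration with the punctuators library.
# Assumes `pip install punctuators` and the multilingual "pcs_47lang" checkpoint.
from punctuators.models import PunctCapSegModelONNX

# Load the ONNX punctuation/segmentation model.
punctuator = PunctCapSegModelONNX.from_pretrained("pcs_47lang")

# Unpunctuated ASR output (one string per utterance).
raw_transcriptions = ["今日はいい天気ですね明日も晴れるといいですが"]

# `infer` returns, for each input string, a list of punctuated sentences.
for sentences in punctuator.infer(raw_transcriptions):
    print(sentences)
```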
					
						

## Transformers Usage
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers.

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate
```
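
Since the model requires Transformers 4.39 or later, you can optionally confirm the installed version before proceeding. This is a minimal sketch using the `packaging` helper that ships with Transformers' dependencies; it is not part of the original instructions:

```python
# Hedged sketch: verify the installed Transformers version meets the >= 4.39 requirement.
from packaging import version
import transformers

assert version.parse(transformers.__version__) >= version.parse("4.39.0"), (
    f"transformers {transformers.__version__} is too old for Kotoba-Whisper-v1.1; "
    "run `pip install --upgrade transformers`."
)
print(transformers.__version__)
```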
					
						

### Transcription
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files as follows:
					
						

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
```
					
						

- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
```
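
Because the pipeline is called with `return_timestamps=True`, the result also carries segment-level timestamps. The sketch below, which continues from the snippet above, reads them assuming the standard 🤗 Transformers ASR pipeline output format (a `"text"` field plus a list of `"chunks"`); the exact segment boundaries produced by the stable-ts postprocessing may differ:

```python
# Hedged sketch: iterate over segment-level timestamps in the pipeline output.
# Assumes the standard ASR pipeline output format: {"text": ..., "chunks": [...]}.
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)

print(result["text"])  # full transcription
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # (start_sec, end_sec); end may be None for the final chunk
    print(f"[{start} - {end}] {chunk['text']}")
```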
					
						

### Transcription with Prompt
Kotoba-Whisper can generate transcriptions with a prompt as follows:
					
						

```python
import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
print(text)
# 81歳、力強い走りに変わってきます。

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the ASR pipeline prepends the prompt to the transcription, so remove it
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
```
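
Note that `prompt_ids` stays in `generate_kwargs` after the prompted call. If you want to run further un-prompted transcriptions with the same dictionary, a minimal sketch (continuing the snippet above) is to drop the key first:

```python
# Hedged sketch: remove the prompt before running further un-prompted transcriptions.
generate_kwargs.pop("prompt_ids", None)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)["text"]
print(text)  # back to the un-prompted output
```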
					
						

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
					
						

```bash
pip install flash-attn --no-build-isolation
```
					
						

Then pass `attn_implementation="flash_attention_2"` in `model_kwargs` (forwarded to `from_pretrained`) when loading the model:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
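
If you want the attention backend to fall back automatically when Flash Attention 2 is not installed, a minimal sketch using the `is_flash_attn_2_available` helper from Transformers (an addition to the card's instructions, not part of them) would be:

```python
# Hedged sketch: pick Flash Attention 2 only when it is actually available.
import torch
from transformers.utils import is_flash_attn_2_available

attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model_kwargs = {"attn_implementation": attn_implementation} if torch.cuda.is_available() else {}
```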
					
						


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).