Update README.md
    	
        README.md
    CHANGED
    
    | @@ -123,53 +123,49 @@ class to transcribe short-form audio files (< 30-seconds) as follows: | |
| 123 |  | 
| 124 | 
             
            ```python
         | 
| 125 | 
             
            import torch
         | 
| 126 | 
            -
            from transformers import  | 
| 127 | 
             
            from datasets import load_dataset, Audio
         | 
| 128 |  | 
| 129 | 
             
            # config
         | 
| 130 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 131 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 132 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
|  | |
|  | |
| 133 |  | 
| 134 | 
             
            # load model
         | 
| 135 | 
            -
            model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
         | 
| 136 | 
            -
            model.to(device)
         | 
| 137 | 
            -
            processor = AutoProcessor.from_pretrained(model_id)
         | 
| 138 | 
             
            pipe = pipeline(
         | 
| 139 | 
             
                "automatic-speech-recognition",
         | 
| 140 | 
            -
                model= | 
| 141 | 
            -
                tokenizer=processor.tokenizer,
         | 
| 142 | 
            -
                feature_extractor=processor.feature_extractor,
         | 
| 143 | 
            -
                max_new_tokens=128,
         | 
| 144 | 
             
                torch_dtype=torch_dtype,
         | 
| 145 | 
             
                device=device,
         | 
|  | |
| 146 | 
             
            )
         | 
| 147 |  | 
| 148 | 
             
            # load sample audio & downsample to 16kHz
         | 
| 149 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
| 150 | 
            -
            dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
         | 
| 151 | 
             
            sample = dataset[0]["audio"]
         | 
| 152 |  | 
| 153 | 
             
            # run inference
         | 
| 154 | 
            -
            result = pipe(sample)
         | 
| 155 | 
             
            print(result["text"])
         | 
| 156 | 
             
            ```
         | 
| 157 |  | 
| 158 | 
             
            - To transcribe a local audio file, pass the path to the file when you call the pipeline (make sure the audio is sampled at 16kHz):
         | 
| 159 | 
             
            ```diff
         | 
| 160 | 
            -
            - result = pipe(sample)
         | 
| 161 | 
            -
            + result = pipe("audio.mp3")
         | 
| 162 | 
             
            ```
         | 
| 163 |  | 
| 164 | 
             
            - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
         | 
| 165 | 
             
            ```python
         | 
| 166 | 
            -
            result = pipe(sample, return_timestamps=True)
         | 
| 167 | 
             
            print(result["chunks"])
         | 
| 168 | 
             
            ```
         | 
| 169 |  | 
| 170 | 
            -
             | 
| 171 | 
            -
            Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered 
         | 
| 172 | 
             
            inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
         | 
|  | |
| 173 | 
             
            The sequential long-form algorithm should be used in either of the following scenarios:
         | 
| 174 |  | 
| 175 | 
             
            1. Transcription accuracy is the most important factor, and latency is less of a consideration
         | 
| @@ -180,41 +176,6 @@ described [below](#chunked-long-form). For a detailed explanation of the differe | |
| 180 | 
             
            the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf). The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) 
         | 
| 181 | 
             
            class can be used to transcribe long audio files with the sequential algorithm as follows: 
         | 
| 182 |  | 
| 183 | 
            -
            ```python
         | 
| 184 | 
            -
            import torch
         | 
| 185 | 
            -
            import numpy as np
         | 
| 186 | 
            -
            from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
         | 
| 187 | 
            -
            from datasets import load_dataset
         | 
| 188 | 
            -
             | 
| 189 | 
            -
            # config
         | 
| 190 | 
            -
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 191 | 
            -
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 192 | 
            -
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
| 193 | 
            -
             | 
| 194 | 
            -
            # load model
         | 
| 195 | 
            -
            model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
         | 
| 196 | 
            -
            model.to(device)
         | 
| 197 | 
            -
            processor = AutoProcessor.from_pretrained(model_id)
         | 
| 198 | 
            -
            pipe = pipeline(
         | 
| 199 | 
            -
                "automatic-speech-recognition",
         | 
| 200 | 
            -
                model=model,
         | 
| 201 | 
            -
                tokenizer=processor.tokenizer,
         | 
| 202 | 
            -
                feature_extractor=processor.feature_extractor,
         | 
| 203 | 
            -
                max_new_tokens=128,
         | 
| 204 | 
            -
                torch_dtype=torch_dtype,
         | 
| 205 | 
            -
                device=device,
         | 
| 206 | 
            -
            )
         | 
| 207 | 
            -
             | 
| 208 | 
            -
            # load sample audio (concatenate instances to create a long audio)
         | 
| 209 | 
            -
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
| 210 | 
            -
            dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
         | 
| 211 | 
            -
            sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
         | 
| 212 | 
            -
             | 
| 213 | 
            -
            # run inference
         | 
| 214 | 
            -
            result = pipe(sample)
         | 
| 215 | 
            -
            print(result["text"])
         | 
| 216 | 
            -
            ```
         | 
| 217 | 
            -
             | 
| 218 |  | 
| 219 | 
             
            ### Chunked Long-Form
         | 
| 220 | 
             
            This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, 
         | 
| @@ -224,37 +185,33 @@ is optimal. To activate batching over long audio files, pass the argument `batch | |
| 224 |  | 
| 225 | 
             
            ```python
         | 
| 226 | 
             
            import torch
         | 
| 227 | 
            -
            from transformers import  | 
| 228 | 
             
            from datasets import load_dataset
         | 
| 229 |  | 
| 230 | 
             
            # config
         | 
| 231 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 232 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 233 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
|  | |
|  | |
| 234 |  | 
| 235 | 
             
            # load model
         | 
| 236 | 
            -
            model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
         | 
| 237 | 
            -
            model.to(device)
         | 
| 238 | 
            -
            processor = AutoProcessor.from_pretrained(model_id)
         | 
| 239 | 
             
            pipe = pipeline(
         | 
| 240 | 
             
                "automatic-speech-recognition",
         | 
| 241 | 
            -
                model= | 
| 242 | 
            -
                tokenizer=processor.tokenizer,
         | 
| 243 | 
            -
                feature_extractor=processor.feature_extractor,
         | 
| 244 | 
            -
                max_new_tokens=128,
         | 
| 245 | 
            -
                chunk_length_s=25,
         | 
| 246 | 
            -
                batch_size=16,
         | 
| 247 | 
             
                torch_dtype=torch_dtype,
         | 
| 248 | 
             
                device=device,
         | 
|  | |
|  | |
|  | |
| 249 | 
             
            )
         | 
| 250 |  | 
| 251 | 
             
            # load sample audio (concatenate instances to create a long audio)
         | 
| 252 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
| 253 | 
            -
             | 
| 254 | 
            -
            sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate'], "path": "tmp"}
         | 
| 255 |  | 
| 256 | 
             
            # run inference
         | 
| 257 | 
            -
            result = pipe(sample)
         | 
| 258 | 
             
            print(result["text"])
         | 
| 259 | 
             
            ```
         | 
| 260 |  | 
| @@ -263,34 +220,41 @@ Kotoba-whisper can generate transcription with prompting as below: | |
| 263 |  | 
| 264 | 
             
            ```python
         | 
| 265 | 
             
            import torch
         | 
| 266 | 
            -
            from transformers import  | 
| 267 | 
             
            from datasets import load_dataset, Audio
         | 
| 268 |  | 
| 269 | 
             
            # config
         | 
| 270 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 271 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 272 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
|  | |
|  | |
| 273 |  | 
| 274 | 
             
            # load model
         | 
| 275 | 
            -
             | 
| 276 | 
            -
             | 
| 277 | 
            -
             | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 278 |  | 
| 279 | 
             
            # load sample audio & downsample to 16kHz
         | 
| 280 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
| 281 | 
            -
            dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
         | 
| 282 | 
            -
            input_features = processor(dataset[10]["audio"]["array"], return_tensors="pt").input_features
         | 
| 283 |  | 
| 284 | 
             
            # --- Without prompt ---
         | 
| 285 | 
            -
             | 
| 286 | 
            -
            print( | 
| 287 | 
            -
            #  | 
| 288 |  | 
| 289 | 
             
            # --- With prompt ---: Let's change `81` to `91`.
         | 
| 290 | 
            -
             | 
| 291 | 
            -
             | 
| 292 | 
            -
             | 
| 293 | 
            -
            #  | 
|  | |
|  | |
| 294 | 
             
            ```
         | 
| 295 |  | 
| 296 | 
             
            ### Additional Speed & Memory Improvements
         | 
| @@ -310,31 +274,8 @@ pip install flash-attn --no-build-isolation | |
| 310 | 
             
            Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
         | 
| 311 |  | 
| 312 | 
             
            ```diff
         | 
| 313 | 
            -
            -  | 
| 314 | 
            -
            +  | 
| 315 | 
            -
            ```
         | 
| 316 | 
            -
             | 
| 317 | 
            -
            #### Torch Scaled Dot-Product Attention (SDPA)
         | 
| 318 | 
            -
             | 
| 319 | 
            -
            If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html). 
         | 
| 320 | 
            -
            This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check 
         | 
| 321 | 
            -
            whether you have a compatible PyTorch version, run the following Python code snippet:
         | 
| 322 | 
            -
             | 
| 323 | 
            -
            ```python
         | 
| 324 | 
            -
            from transformers.utils import is_torch_sdpa_available
         | 
| 325 | 
            -
             | 
| 326 | 
            -
            print(is_torch_sdpa_available())
         | 
| 327 | 
            -
            ```
         | 
| 328 | 
            -
             | 
| 329 | 
            -
            If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it 
         | 
| 330 | 
            -
            returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/)
         | 
| 331 | 
            -
             | 
| 332 | 
            -
            Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying 
         | 
| 333 | 
            -
            `attn_implementation="sdpa"` as follows:
         | 
| 334 | 
            -
             | 
| 335 | 
            -
            ```diff
         | 
| 336 | 
            -
            - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
         | 
| 337 | 
            -
            + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
         | 
| 338 | 
             
            ```
         | 
| 339 |  | 
| 340 |  | 
|  | |
| 123 |  | 
| 124 | 
             
            ```python
         | 
| 125 | 
             
            import torch
         | 
| 126 | 
            +
            from transformers import pipeline
         | 
| 127 | 
             
            from datasets import load_dataset, Audio
         | 
| 128 |  | 
| 129 | 
             
            # config
         | 
| 130 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 131 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 132 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
| 133 | 
            +
            model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
         | 
| 134 | 
            +
            generate_kwargs = {"language": "japanese", "task": "transcribe"}
         | 
| 135 |  | 
| 136 | 
             
            # load model
         | 
|  | |
|  | |
|  | |
| 137 | 
             
            pipe = pipeline(
         | 
| 138 | 
             
                "automatic-speech-recognition",
         | 
| 139 | 
            +
                model=model_id,
         | 
|  | |
|  | |
|  | |
| 140 | 
             
                torch_dtype=torch_dtype,
         | 
| 141 | 
             
                device=device,
         | 
| 142 | 
            +
                model_kwargs=model_kwargs
         | 
| 143 | 
             
            )
         | 
| 144 |  | 
| 145 | 
             
            # load sample audio & downsample to 16kHz
         | 
| 146 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
|  | |
| 147 | 
             
            sample = dataset[0]["audio"]
         | 
| 148 |  | 
| 149 | 
             
            # run inference
         | 
| 150 | 
            +
            result = pipe(sample, generate_kwargs=generate_kwargs)
         | 
| 151 | 
             
            print(result["text"])
         | 
| 152 | 
             
            ```
         | 
| 153 |  | 
| 154 | 
             
            - To transcribe a local audio file, pass the path to the file when you call the pipeline (make sure the audio is sampled at 16kHz):
         | 
| 155 | 
             
            ```diff
         | 
| 156 | 
            +
            - result = pipe(sample, generate_kwargs=generate_kwargs)
         | 
| 157 | 
            +
            + result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
         | 
| 158 | 
             
            ```
         | 
| 159 |  | 
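
The 16kHz requirement mentioned above can be checked before calling the pipeline. The following is a minimal sketch that assumes the local file is an uncompressed PCM WAV, since Python's standard `wave` module cannot read MP3; the helper names `wav_sample_rate` and `is_16khz` are hypothetical and not part of the README.

```python
import wave

TARGET_RATE = 16_000  # sampling rate (Hz) that the README asks for


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path: str) -> bool:
    """True when the WAV file at `path` is already sampled at 16kHz."""
    return wav_sample_rate(path) == TARGET_RATE
```

If `is_16khz("audio.wav")` returns `False`, the file would need resampling with an audio tool of your choice before transcription.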
| 160 | 
             
            - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
         | 
| 161 | 
             
            ```python
         | 
| 162 | 
            +
            result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
         | 
| 163 | 
             
            print(result["chunks"])
         | 
| 164 | 
             
            ```
         | 
| 165 |  | 
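
As a rough illustration of how the `"chunks"` output might be consumed, the sketch below formats each segment as one line of text. It assumes each chunk is a dict with a `"timestamp"` pair of `(start, end)` seconds and a `"text"` string, and that the end of the last chunk may be `None`. The `example` list is made-up data, not real model output, and `format_chunks` is a hypothetical helper.

```python
def format_chunks(chunks):
    """Turn timestamped chunks into '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end timestamp can be missing for the final segment.
        end_label = f"{end:6.2f}" if end is not None else "   end"
        lines.append(f"[{start:6.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in for result["chunks"].
example = [
    {"timestamp": (0.0, 2.5), "text": " first segment"},
    {"timestamp": (2.5, None), "text": " second segment"},
]

for line in format_chunks(example):
    print(line)
```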
| 166 | 
            +
            ***Sequential Long-Form:*** Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered 
         | 
|  | |
| 167 | 
             
            inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
         | 
| 168 | 
            +
            By default, if a long audio file is passed to the model, it is transcribed with the sequential long-form algorithm.
         | 
| 169 | 
             
            The sequential long-form algorithm should be used in either of the following scenarios:
         | 
| 170 |  | 
| 171 | 
             
            1. Transcription accuracy is the most important factor, and latency is less of a consideration
         | 
|  | |
| 176 | 
             
            the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf). The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) 
         | 
| 177 | 
             
            class can be used to transcribe long audio files with the sequential algorithm as follows: 
         | 
| 178 |  | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 179 |  | 
| 180 | 
             
            ### Chunked Long-Form
         | 
| 181 | 
             
            This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, 
         | 
|  | |
| 185 |  | 
| 186 | 
             
            ```python
         | 
| 187 | 
             
            import torch
         | 
| 188 | 
            +
            from transformers import pipeline
         | 
| 189 | 
             
            from datasets import load_dataset
         | 
| 190 |  | 
| 191 | 
             
            # config
         | 
| 192 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 193 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 194 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
| 195 | 
            +
            model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
         | 
| 196 | 
            +
            generate_kwargs = {"language": "japanese", "task": "transcribe"}
         | 
| 197 |  | 
| 198 | 
             
            # load model
         | 
|  | |
|  | |
|  | |
| 199 | 
             
            pipe = pipeline(
         | 
| 200 | 
             
                "automatic-speech-recognition",
         | 
| 201 | 
            +
                model=model_id,
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 202 | 
             
                torch_dtype=torch_dtype,
         | 
| 203 | 
             
                device=device,
         | 
| 204 | 
            +
                model_kwargs=model_kwargs,
         | 
| 205 | 
            +
                chunk_length_s=25,
         | 
| 206 | 
            +
                batch_size=16
         | 
| 207 | 
             
            )
         | 
| 208 |  | 
| 209 | 
             
            # load sample audio (concatenate instances to create a long audio)
         | 
| 210 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
| 211 | 
            +
            sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate']}
         | 
|  | |
| 212 |  | 
| 213 | 
             
            # run inference
         | 
| 214 | 
            +
            result = pipe(sample, generate_kwargs=generate_kwargs)
         | 
| 215 | 
             
            print(result["text"])
         | 
| 216 | 
             
            ```
         | 
| 217 |  | 
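
The snippet above builds its long sample with `np.concatenate`, so it presumably also needs `import numpy as np` alongside the other imports. The concatenation step on its own can be sketched as follows, using synthetic silent clips instead of the dataset; `concat_audio` is a hypothetical helper, and the check that all clips share one sampling rate is an added safeguard rather than something the README does.

```python
import numpy as np


def concat_audio(samples):
    """Join several {"array", "sampling_rate"} dicts into one long sample."""
    rates = {s["sampling_rate"] for s in samples}
    if len(rates) != 1:
        raise ValueError(f"mixed sampling rates: {sorted(rates)}")
    return {
        "array": np.concatenate([s["array"] for s in samples]),
        "sampling_rate": rates.pop(),
    }


# Three one-second clips of silence at 16kHz (synthetic stand-ins).
clips = [
    {"array": np.zeros(16_000, dtype=np.float32), "sampling_rate": 16_000}
    for _ in range(3)
]
long_sample = concat_audio(clips)
duration = len(long_sample["array"]) / long_sample["sampling_rate"]
print(f"{duration:.1f} seconds")
```

A dict of this shape, with `"array"` and `"sampling_rate"` keys, is the same kind of input the snippet above passes to `pipe`.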
|  | |
| 220 |  | 
| 221 | 
             
            ```python
         | 
| 222 | 
             
            import torch
         | 
| 223 | 
            +
            from transformers import pipeline
         | 
| 224 | 
             
            from datasets import load_dataset, Audio
         | 
| 225 |  | 
| 226 | 
             
            # config
         | 
| 227 | 
             
            model_id = "kotoba-tech/kotoba-whisper-v1.0"
         | 
| 228 | 
             
            torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
         | 
| 229 | 
             
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
         | 
| 230 | 
            +
            model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
         | 
| 231 | 
            +
            generate_kwargs = {"language": "japanese", "task": "transcribe"}
         | 
| 232 |  | 
| 233 | 
             
            # load model
         | 
| 234 | 
            +
            pipe = pipeline(
         | 
| 235 | 
            +
                "automatic-speech-recognition",
         | 
| 236 | 
            +
                model=model_id,
         | 
| 237 | 
            +
                torch_dtype=torch_dtype,
         | 
| 238 | 
            +
                device=device,
         | 
| 239 | 
            +
                model_kwargs=model_kwargs
         | 
| 240 | 
            +
            )
         | 
| 241 | 
            +
             | 
| 242 |  | 
| 243 | 
             
            # load sample audio & downsample to 16kHz
         | 
| 244 | 
             
            dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
         | 
|  | |
|  | |
| 245 |  | 
| 246 | 
             
            # --- Without prompt ---
         | 
| 247 | 
            +
            result = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)
         | 
| 248 | 
            +
            print(result['text'])
         | 
| 249 | 
            +
            # 81歳、力強い走りに変わってきます。
         | 
| 250 |  | 
| 251 | 
             
            # --- With prompt ---: Let's change `81` to `91`.
         | 
| 252 | 
            +
            prompt = "91歳"
         | 
| 253 | 
            +
            generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
         | 
| 254 | 
            +
            result = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)
         | 
| 255 | 
            +
            result['text'] = result['text'][1 + len(prompt) + 1:]  # prompt has been added at the beginning of the output now, so remove it.
         | 
| 256 | 
            +
            print(result['text'])
         | 
| 257 | 
            +
            # あっぶったでもスルガさん、91歳、力強い走りに変わってきます。
         | 
| 258 | 
             
            ```
         | 
| 259 |  | 
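
The slice `result['text'][1 + len(prompt) + 1:]` in the snippet above drops one leading character, the prompt, and one following character. A slightly more defensive variant, sketched here under the assumption that the prompt appears at that exact offset, only strips when the prompt is actually found; `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(text: str, prompt: str) -> str:
    """Remove a leading '<char><prompt><char>' prefix if it is present."""
    offset = 1  # assumed: one character (for example a space) precedes the prompt
    if text[offset:offset + len(prompt)] == prompt:
        # Skip the leading character, the prompt, and the separator after it.
        return text[offset + len(prompt) + 1:]
    # Prompt not echoed at the expected position: leave the text untouched.
    return text
```

With this helper, output that was generated without a prompt passes through unchanged instead of losing its first characters.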
| 260 | 
             
            ### Additional Speed & Memory Improvements
         | 
|  | |
| 274 | 
             
            Then pass `attn_implementation="flash_attention_2"` to `from_pretrained` (here, through the pipeline's `model_kwargs`):
         | 
| 275 |  | 
| 276 | 
             
            ```diff
         | 
| 277 | 
            +
            - model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
         | 
| 278 | 
            +
            + model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 279 | 
             
            ```
         | 
| 280 |  | 
| 281 |  | 
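
The choice between the two attention implementations can also be made at runtime. The sketch below is one possible way to do it, assuming that the presence of an importable `flash_attn` package is a good enough signal that Flash Attention 2 can be used; it does not check GPU compatibility. `build_model_kwargs` is a hypothetical helper, and `cuda_available` would typically come from `torch.cuda.is_available()`.

```python
import importlib.util


def build_model_kwargs(cuda_available: bool, prefer_flash: bool = True) -> dict:
    """Pick an attention implementation for the pipeline's model_kwargs."""
    if not cuda_available:
        # Mirror the snippets above: no attention override on CPU.
        return {}
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and has_flash:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}
```

The returned dict can be passed as `model_kwargs` to `pipeline(...)` in the same way as in the snippets above.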

