---
language: is
license: mit
library_name: peft
tags:
- openai
- whisper
- whisper-large-v3
- automatic-speech-recognition
- asr
- icelandic
- lora
- peft
- speech
base_model: openai/whisper-large-v3
datasets:
- language-and-voice-lab/raddromur_icelandic_speech_22_09 # Fictitious ID for clarity, actual data is local
- language-and-voice-lab/samromur_milljon
metrics:
- wer
- cer
model-index:
- name: whisper-large-v3-lora-is
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Samrómur Milljón (female_18to49_yrs subset)
      type: language-and-voice-lab/samromur_milljon
      config: is
      split: female_18to49_yrs (1000 samples)
    metrics:
    - name: WER
      type: wer
      value: 33.07
    - name: CER
      type: cer
      value: 10.59
---
# LoRA Fine-tuned Whisper Large v3 for Icelandic ASR
This repository contains a LoRA (Low-Rank Adaptation) adapter for the `openai/whisper-large-v3` model, fine-tuned for Automatic Speech Recognition (ASR) in Icelandic.
The fine-tuning was performed on the "Raddrómur Icelandic Speech 22.09" corpus, and the adapter was evaluated on a subset of the "Samrómur Milljón" dataset.
## Model Description
* **Base Model:** `openai/whisper-large-v3`
* **Fine-tuning Method:** LoRA (Parameter-Efficient Fine-Tuning) using the `peft` library.
* **Language:** Icelandic (is)
* **Task:** Automatic Speech Recognition (transcription)
## Fine-tuning Data
* **Dataset Name:** Raddrómur Icelandic Speech 22.09
* **Source:** Language and Voice Laboratory (LVL) at Reykjavík University (RU)
* **Description:** Approximately 49 hours of Icelandic speech sourced from radio podcasts (primarily RÚV). The audio is 16kHz mono FLAC, with transcriptions automatically aligned.
* **License:** [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
## Evaluation
The fine-tuned adapter was evaluated against the base `openai/whisper-large-v3` model on a 1000-sample subset of the `female_18to49_yrs` split from the `language-and-voice-lab/samromur_milljon` dataset.
**Evaluation Metrics (Lower is Better):**
| Model | WER (%) | CER (%) |
| :------------------- | :-----: | :-----: |
| Base Model | 34.15 | 11.05 |
| Fine-tuned Adapter | 33.07 | 10.59 |
*(Note: No stereo files were detected in the evaluation subset, and both evaluation runs completed without errors.)*
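For readers unfamiliar with the two metrics, the following is a minimal, dependency-free sketch of the standard definitions: WER is the word-level edit distance divided by the number of reference words, and CER is the same at character level. This card does not state which tool or text normalization produced the scores above, so the function names and the example sentences here are illustrative assumptions, not the evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent for a single utterance."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent for a single utterance."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentences (not taken from the evaluation data).
reference = "hún fór heim í gær"
hypothesis = "hún fór heima í gær"
print(f"WER: {wer(reference, hypothesis):.2f}%")  # one wrong word out of five
print(f"CER: {cer(reference, hypothesis):.2f}%")  # one extra character out of 18
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances rather than averaging per-utterance rates.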
**Interpretation:** The fine-tuned LoRA adapter demonstrates a modest improvement over the base `whisper-large-v3` model on this specific Icelandic evaluation subset. The Word Error Rate (WER) was reduced by approximately 1.08 points (absolute), and the Character Error Rate (CER) was reduced by approximately 0.46 points (absolute). Further evaluation on larger or different test sets could provide more comprehensive insights.
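The absolute reductions quoted above can be checked directly from the results table, and the same arithmetic gives the relative reductions. The snippet below is a small sketch of that calculation; the helper name `reduction` is introduced here only for illustration.

```python
def reduction(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = base - tuned
    relative = 100.0 * absolute / base
    return absolute, relative


# (base model, fine-tuned adapter) scores from the evaluation table.
results = {"WER": (34.15, 33.07), "CER": (11.05, 10.59)}

for name, (base, tuned) in results.items():
    absolute, relative = reduction(base, tuned)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
# WER: 1.08 points absolute, 3.2% relative
# CER: 0.46 points absolute, 4.2% relative
```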
## How to Use
This LoRA adapter is intended to be used with the base `openai/whisper-large-v3` model.
First, ensure you have the necessary libraries installed:
```bash
# Using pip
pip install transformers peft torch accelerate soundfile librosa
# Or using uv
uv pip install transformers peft torch accelerate soundfile librosa
```
Then, you can load the base model and apply the LoRA adapter from the Hugging Face Hub like this:
```python
import librosa  # Or your preferred audio loading library
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# --- Configuration ---
BASE_MODEL_ID = "openai/whisper-large-v3"
# Replace with your actual Hugging Face Hub ID for the adapter
# For example, if you pushed it to "jonasaise/whisper-large-v3-lora-is"
ADAPTER_HUB_ID = "jonasaise/your-repo-name"  # <--- CHANGE THIS
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# On GPU, use the precision the model was trained/evaluated with.
# On CPU, fall back to float32, since half precision is poorly supported there.
if DEVICE == "cuda":
    MODEL_PRECISION = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    MODEL_PRECISION = torch.float32
TARGET_LANGUAGE = "is"
TASK = "transcribe"

# --- 1. Load Processor ---
# The processor is loaded from the base model. If that ever fails, try loading
# it from ADAPTER_HUB_ID instead (less common for Whisper).
processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID, language=TARGET_LANGUAGE, task=TASK)

# --- 2. Load Base Model ---
print(f"Loading base model: {BASE_MODEL_ID}...")
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=MODEL_PRECISION,
    low_cpu_mem_usage=True,
    attn_implementation="sdpa",  # Recommended for speed if supported, or remove/use "eager"
)
print("Base model loaded.")

# --- 3. Load LoRA Adapter ---
print(f"Loading LoRA adapter from: {ADAPTER_HUB_ID}...")
# This loads the adapter weights and applies them to the base model
model = PeftModel.from_pretrained(base_model, ADAPTER_HUB_ID)
model = model.to(DEVICE)
model.eval()  # Set to evaluation mode
print("LoRA adapter loaded and applied. Model is on device:", DEVICE)

# --- 4. Prepare Your Audio ---
# Replace "path/to/your/icelandic_audio.wav" with the actual path to your audio file
AUDIO_FILE_PATH = "path/to/your/icelandic_audio.wav"  # <--- CHANGE THIS
try:
    # Load audio and resample to 16kHz mono
    speech_array, sampling_rate = librosa.load(AUDIO_FILE_PATH, sr=16000, mono=True)
except Exception as e:
    # Stop here: nothing below can run without audio
    raise SystemExit(f"Error loading audio file {AUDIO_FILE_PATH}: {e}")
print(f"Audio loaded and resampled to 16kHz mono. Duration: {len(speech_array) / sampling_rate:.2f}s")

# Process audio to get log-Mel input features
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
# Move the features to the model's device and match its precision
input_features = input_features.to(DEVICE, dtype=MODEL_PRECISION)
print("Input features prepared.")

# --- 5. Generate Transcription ---
# Use the model's existing generation_config as a base
generation_config = model.generation_config
generation_config.language = TARGET_LANGUAGE
generation_config.task = TASK
generation_config.forced_decoder_ids = None  # Let generate() build the prompt from language/task
generation_config.suppress_tokens = []  # Clear any suppressed tokens

print("Generating transcription...")
# inference_mode disables gradient tracking; autocast is only enabled on GPU
with torch.inference_mode(), torch.autocast(
    device_type=DEVICE, dtype=MODEL_PRECISION, enabled=(DEVICE == "cuda")
):
    predicted_ids = model.generate(input_features=input_features, generation_config=generation_config)

# --- 6. Decode Transcription ---
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("-" * 30)
print(f"Transcription: {transcription}")
print("-" * 30)
```
## Training Procedure
This section details the setup and hyperparameters used for fine-tuning the LoRA adapter.
### Data Preprocessing
The fine-tuning script (`finetune_whisper_ice_lora.py`) performs the following preprocessing steps on the Raddrómur dataset:
1. Loads audio file paths and transcriptions from the `metadata.tsv` file.
2. Constructs full paths to audio files, accounting for the nested directory structure (e.g., `<DATA_DIR>/speech/<podcast_name_dir>/<podcast_id_dir>/<filename.flac>`).
3. Casts audio to 16kHz mono (though Raddrómur is already in this format).
4. Splits the dataset into training and test/validation sets (e.g., 90/10 split).
5. Uses the `WhisperProcessor` to:
    * Convert audio arrays into log-Mel input features.
    * Tokenize the Icelandic transcriptions into label IDs.
6. Pads sequences dynamically within each batch using a `DataCollatorSpeechSeq2SeqWithPadding`.
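The dynamic padding in the last step can be illustrated without any dependencies. Whisper's log-Mel features have a fixed size per clip, so in practice it is the tokenized labels that differ in length within a batch. The sketch below pads label sequences to the longest one in the batch using `-100`, the value PyTorch's cross-entropy loss ignores by default. It is an assumption about what the collator does, not the code from `finetune_whisper_ice_lora.py`; the function name and token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # Default ignore_index of PyTorch's cross-entropy loss


def pad_labels(batch_labels, ignore_index=IGNORE_INDEX):
    """Right-pad each label sequence to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [ignore_index] * (max_len - len(labels)) for labels in batch_labels]


# Hypothetical token IDs for three transcriptions of different lengths.
batch = [[11, 12, 13, 14], [21, 22], [31, 32, 33]]
for row in pad_labels(batch):
    print(row)
# [11, 12, 13, 14]
# [21, 22, -100, -100]
# [31, 32, 33, -100]
```

Padding per batch rather than to a global maximum keeps batches of short transcriptions small, and masking with `-100` keeps the padded positions out of the loss.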
### Fine-tuning Hyperparameters & Setup
The model was fine-tuned using the following configuration:
* **Base Model:** `openai/whisper-large-v3`
* **Fine-tuning Method:** LoRA (Low-Rank Adaptation) using `peft`.
    * `r` (Rank of LoRA matrices): 32 (example, *adjust if different*)
    * `lora_alpha`: 64 (example, *adjust if different*)
    * `target_modules`: `["q_proj", "v_proj"]` (example, *adjust if different*)
    * `lora_dropout`: 0.05 (example, *adjust if different*)
* **Precision:** BFloat16 (`bf16=True` in `Seq2SeqTrainingArguments`).
* **Optimizer:** AdamW 8-bit (`optim="adamw_8bit"` in `Seq2SeqTrainingArguments`, requires `bitsandbytes`).
* **Learning Rate:** e.g., `1e-5` (*adjust to your actual value*).
* **Batch Size (Per Device):** e.g., `4` (*adjust to your final successful value*).
* **Gradient Accumulation Steps:** e.g., `8` (*adjust to your final successful value*).
* **Effective Batch Size:** (Per-Device Batch Size) \* (Gradient Accumulation Steps) \* (Number of GPUs)
* **Number of Epochs:** 3 (or `max_steps` if that was used).
* **Warmup Steps:** e.g., 10% of total steps (*adjust to your actual value*).
* **Attention Implementation:** Scaled Dot Product Attention (`attn_implementation="sdpa"` during model loading).
* **Gradient Checkpointing:** Enabled (`model.gradient_checkpointing_enable()`).
* **Logging:** Weights & Biases (`report_to=["wandb"]`).
* **Evaluation Strategy during Training:** Evaluated every `eval_steps` (e.g., 36 steps, *adjust to your final value*).
* **Language & Task:** Icelandic (`is`), Transcribe (`transcribe`).
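As a worked example of the effective batch size formula above, the sketch below plugs in the example values from this list (per-device batch size 4, 8 accumulation steps, 2 GPUs, 3 epochs, 10% warmup). The number of training examples is not given in this card, so `num_examples=10_000` is a purely hypothetical placeholder, and the step count is an approximation of what a trainer would compute.

```python
import math


def schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs, warmup_ratio):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


print(schedule(
    num_examples=10_000,  # Hypothetical; the real training set size is not stated here
    per_device_batch=4,
    grad_accum=8,
    num_gpus=2,
    epochs=3,
    warmup_ratio=0.1,
))
# (64, 471, 48)
```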
### Compute Infrastructure
* **Hardware:** NVIDIA DGX A100. Training initially targeted 5 GPUs; the final successful run used 2 GPUs (device IDs `6,7`).
* **Software:**
    * Python 3.10
    * PyTorch
    * `transformers`
    * `datasets`
    * `peft`
    * `accelerate` (via `torchrun`)
    * `uv` (for environment management)
## Intended Use
This fine-tuned LoRA adapter is intended to improve the performance of `openai/whisper-large-v3` for transcribing general Icelandic speech. It is particularly suited for:
* Transcribing Icelandic audio content similar in nature to radio podcasts (the primary source of the Raddrómur fine-tuning data).
* Use cases where improved accuracy on Icelandic-specific vocabulary, names, and nuances is desired over the base multilingual model.
* Applications requiring efficient fine-tuning and deployment, leveraging the small footprint of LoRA adapters.
## Limitations and Bias
* **Domain Specificity:** The fine-tuning dataset (Raddrómur) primarily consists of relatively clean radio podcast speech. Performance on other domains of Icelandic speech (e.g., highly noisy environments, strong accents not represented in Raddrómur, spontaneous conversational speech, children's speech) may vary.
* **Base Model Biases:** The base `openai/whisper-large-v3` model has its own inherent limitations and potential biases (e.g., demographic performance differences, sensitivity to certain audio characteristics). These may still be present or be amplified/mitigated to some extent by this fine-tuning.
* **Evaluation Subset:** The reported evaluation metrics are based on a 1000-sample subset of a specific demographic split (`female_18to49_yrs`) from the Samrómur Milljón dataset. Performance might differ on the full dataset, other splits, or other Icelandic evaluation benchmarks.
* **LoRA Limitations:** LoRA keeps the base weights frozen and trains only a small number of added low-rank parameters. It might not capture all the nuances that full fine-tuning could, but it offers a significant reduction in computational cost.
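To give a sense of scale for the LoRA point above, the following sketch estimates the number of trainable parameters. It assumes the example LoRA settings listed under Training Procedure (`r=32`, targeting `q_proj` and `v_proj`) and the published `whisper-large-v3` dimensions (hidden size 1280, 32 encoder layers, 32 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block). If the actual adapter used different settings, the numbers change accordingly.

```python
def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


D_MODEL = 1280             # whisper-large-v3 hidden size
ENCODER_LAYERS = 32
DECODER_LAYERS = 32
RANK = 32                  # Example value from this card; adjust if different
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# One attention block per encoder layer, two (self + cross) per decoder layer.
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION
trainable = adapted_modules * lora_params_per_module(D_MODEL, D_MODEL, RANK)

print(f"Adapted modules: {adapted_modules}")
print(f"Trainable LoRA parameters: {trainable:,}")
# Adapted modules: 192
# Trainable LoRA parameters: 15,728,640
```

Under these assumptions the adapter trains roughly 1% as many parameters as the base model holds (about 1.55 billion), which is why the adapter file is small compared with the full checkpoint.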
### Recommendations
Users should be aware of the above limitations. It is recommended to:
* Test the model on a diverse set of Icelandic audio relevant to the specific application before deployment.
* Consider further fine-tuning or domain adaptation if performance on a specific out-of-domain task is critical.
* Be mindful of potential biases when using the model in sensitive applications.
## License
* **This Adapter:** MIT, as declared in the `license` field of this model card's metadata.
* **Base Model (`openai/whisper-large-v3`):** The license of the original Whisper model applies to the base weights.
* **Datasets Used:**
* Raddrómur Icelandic Speech 22.09: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
* Samrómur Milljón: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## Acknowledgements
* The Language and Voice Laboratory (LVL) at Reykjavík University for creating the Raddrómur and Samrómur Milljón datasets.
* The Language Technology Programme for Icelandic 2019-2023, managed by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture, for funding the dataset creation.
* OpenAI for the Whisper model.
* Hugging Face for the `transformers`, `datasets`, `evaluate`, `peft`, and `accelerate` libraries.
* The Weights & Biases platform for experiment tracking.
* Astral for the `uv` tool.
## Citations
If you use this adapter or build upon this work, please consider citing the original datasets and the base model:
1. **Raddrómur Dataset:**
Mena, Carlos et al. "Raddrómur Icelandic Speech 22.09". Web Download. Reykjavik University: Language and Voice Lab, 2022.
2. **Samrómur Milljón Dataset:**
```bibtex
@inproceedings{mena2024samromur,
title={Samr{\'o}mur Millj{\'o}n: An ASR Corpus of One Million Verified Read Prompts in Icelandic},
author={Mena, Carlos Daniel Hernandez and Gunnarsson, {\TH}orsteinn Da{\dh}i and Gu{\dh}nason, J{\'o}n},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages={14305--14312},
year={2024}
}
```
3. **Whisper Model:**
```bibtex
@inproceedings{radford2023robust,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
booktitle={International Conference on Machine Learning},
pages={28492--28518},
year={2023},
organization={PMLR}
}
```