---
language:
- en
- es
- fr
- it
- de
- pt
library_name: nemo
datasets:
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
- librispeech_asr
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
- ctc
- entity-tagging
- speaker-attributes
license: cc-by-4.0
---

# Meta ASR English

This model is a fine-tuned version of NVIDIA's Parakeet CTC 0.6B model, extended with entity tagging, speaker-attribute prediction, intent detection, and support for six European languages.

## Model Details

- **Base Model**: Parakeet CTC 0.6B (FastConformer)
- **Fine-tuned on**: A mix of Common Voice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
- **Languages**: English, Spanish, French, Italian, German, Portuguese
- **Additional Features**: Entity tagging, speaker attributes (age, gender, emotion), and intent detection

## Output Format

The model provides rich transcriptions including:

- Entity tags (PERSON_NAME, ORGANIZATION, etc.)
- Speaker attributes (AGE, GENDER, EMOTION)
- Intent classification
- Language-specific transcription

Example output:

```
ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
```
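
Downstream code usually needs the plain transcript separated from the tags. The following is a minimal sketch of one way to do that, not an official parser. The tag conventions it relies on (`ENTITY_<TYPE> ... END` spans and standalone `AGE_`, `GER_`, `EMOTION_`, `INTENT_` tokens) are inferred only from the example output above, and the function name `parse_tagged_transcript` is hypothetical.

```python
import re

# Assumed tag conventions, inferred only from the example output:
#   ENTITY_<TYPE> ... END                  wraps an entity span
#   AGE_*, GER_*, EMOTION_*, INTENT_*      are standalone attribute tokens
ENTITY_SPAN = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
ATTRIBUTE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


def parse_tagged_transcript(tagged: str) -> dict:
    """Split a tagged transcript into plain text, entities, and attributes."""
    entities = [
        {"type": m.group(1), "text": m.group(2)}
        for m in ENTITY_SPAN.finditer(tagged)
    ]
    attributes = {m.group(1): m.group(2) for m in ATTRIBUTE.finditer(tagged)}

    # Remove the markup: unwrap entity spans, then drop attribute tokens.
    text = ENTITY_SPAN.sub(lambda m: m.group(2), tagged)
    text = ATTRIBUTE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return {"text": text, "entities": entities, "attributes": attributes}


example = (
    "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
    "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
    "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
parsed = parse_tagged_transcript(example)
print(parsed["text"])
print(parsed["entities"])
print(parsed["attributes"])
```

If the model emits tag families beyond those shown in the example, the `ATTRIBUTE` pattern would need to be extended accordingly.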

## Usage

```python
import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')

# Transcribe audio
transcription = asr_model.transcribe(['path/to/audio.wav'])
print(transcription[0])
```
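
Depending on the installed NeMo version, `transcribe` may return plain strings or hypothesis objects that carry the transcript in a `text` attribute. The helper below is a small sketch that accepts either shape; `to_text` and `texts_from` are hypothetical names, and the `text` attribute is an assumption to verify against your NeMo release.

```python
from typing import Any, Iterable, List


def to_text(result: Any) -> str:
    """Return the transcript string from a single transcribe() result."""
    if isinstance(result, str):
        return result
    # Hypothesis-like object: assumed to expose the transcript as `.text`.
    text = getattr(result, "text", None)
    if text is None:
        raise TypeError(f"Unsupported result type: {type(result).__name__}")
    return text


def texts_from(results: Iterable[Any]) -> List[str]:
    """Normalize a list of transcribe() results to plain strings."""
    return [to_text(r) for r in results]
```

With this in place, `texts_from(asr_model.transcribe(['path/to/audio.wav']))[0]` would yield a string under either return style.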

## Training Data

The model was fine-tuned on:

- Common Voice dataset (6 European languages)
- People's Speech English corpus
- Indian-accented English
- LibriSpeech corpus (en, es, fr, it, pt)

## Model Architecture

The model is based on the FastConformer [1] architecture, which uses 8x depthwise-separable convolutional downsampling, and is trained with CTC loss.
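
To make the two architectural points concrete, here is an illustrative sketch, not the NeMo implementation. It estimates how many encoder frames 8x downsampling produces, assuming a 10 ms feature hop (an assumption; the exact count also depends on convolution padding), and shows the standard greedy CTC collapse rule (merge repeated labels, then drop blanks). Both function names and the blank index used in the example are hypothetical.

```python
from typing import List, Sequence


def approx_encoder_frames(duration_ms: int, hop_ms: int = 10, downsampling: int = 8) -> int:
    """Approximate encoder output length for an utterance.

    Assumes one acoustic feature frame every `hop_ms` milliseconds and
    ignores convolution padding effects, so treat the result as an estimate.
    """
    feature_frames = -(-duration_ms // hop_ms)   # ceiling division
    return -(-feature_frames // downsampling)    # ceiling division


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Collapse per-frame argmax labels into a CTC output sequence."""
    collapsed: List[int] = []
    previous = None
    for token in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank_id:
            collapsed.append(token)
        previous = token
    return collapsed


# A 10-second clip: about 1000 feature frames, about 125 encoder frames.
print(approx_encoder_frames(10_000))

# Blank (here id 0) separates the two genuine occurrences of label 1.
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0], blank_id=0))
```

Under these assumptions each encoder frame covers roughly 80 ms of audio, which is why the blank symbol is needed to represent repeated tokens that fall in adjacent frames.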

## License

This model is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)