WhissleAI/masr-en-0.6b

ksingla025 committed on Jun 1

Commit 4b69b24 · verified · 1 Parent(s): 81fab41

Upload README.md

Files changed (1)

README.md +86 -0

README.md ADDED

	@@ -0,0 +1,86 @@

+---
+language:
+- en
+- es
+- fr
+- it
+- de
+- pt
+library_name: nemo
+datasets:
+- mozilla-foundation/common_voice_8_0
+- MLCommons/peoples_speech
+- librispeech_asr
+thumbnail: null
+tags:
+- automatic-speech-recognition
+- speech
+- audio
+- FastConformer
+- Conformer
+- pytorch
+- NeMo
+- hf-asr-leaderboard
+- ctc
+- entity-tagging
+- speaker-attributes
+license: cc-by-4.0
+---
+# Meta ASR English
+This model is a fine-tuned version of NVIDIA's Parakeet CTC 0.6B model, enhanced with entity tagging, speaker attributes, and multi-language support for European languages.
+## Model Details
+- **Base Model**: Parakeet CTC 0.6B (FastConformer)
+- **Fine-tuned on**: A mix of CommonVoice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
+- **Languages**: English, Spanish, French, Italian, German, Portuguese
+- **Additional Features**: Entity tagging, speaker attributes (age, gender, emotion), and intent detection
+## Output Format
+The model provides rich transcriptions including:
+- Entity tags (PERSON_NAME, ORGANIZATION, etc.)
+- Speaker attributes (AGE, GENDER, EMOTION)
+- Intent classification
+- Language-specific transcription
+Example output:
+```
+ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
+```
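The tagged line above can be split back into plain text, entity spans, and utterance-level attributes. The following is a minimal sketch only: the tag grammar (`ENTITY_<TYPE> ... END` spans plus `AGE_`, `GER_`, `EMOTION_`, and `INTENT_` tags) is inferred from the single example in this card, and `parse_tagged_transcript` is a hypothetical helper, not part of NeMo or of this model's tooling.

```python
import re

# Assumed tag grammar, inferred only from the example output in this card:
#   ENTITY_<TYPE> <words...> END        marks an entity span
#   AGE_*, GER_*, EMOTION_*, INTENT_*   are utterance-level tags
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
META_RE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


def parse_tagged_transcript(tagged):
    """Split a tagged transcript into clean text, entities, and attributes."""
    # Each entity is a (type, surface text) pair, in order of appearance.
    entities = [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(tagged)]
    # Utterance-level tags keyed by their prefix, e.g. {"AGE": "45_60"}.
    attributes = {m.group(1): m.group(2) for m in META_RE.finditer(tagged)}
    # Drop the entity markers but keep the words they wrap.
    text = ENTITY_RE.sub(lambda m: m.group(2), tagged)
    # Remove the utterance-level tags, then normalize whitespace.
    text = META_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return {"text": text, "entities": entities, "attributes": attributes}


example = (
    "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
    "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
    "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
parsed = parse_tagged_transcript(example)
print(parsed["text"])
print(parsed["entities"])
print(parsed["attributes"])
```

If the model emits tag types beyond those shown in the example, the two patterns would need to be extended accordingly.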
+## Usage
+```python
+import nemo.collections.asr as nemo_asr
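+# Load the model from the Hugging Face Hub
+asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')
+# Transcribe audio: pass a list of file paths, get one result per file
+transcription = asr_model.transcribe(['path/to/audio.wav'])
+# Print the result for the first (and only) file
+print(transcription[0])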
+```
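Depending on the installed NeMo version, each element returned by `transcribe()` may be a plain string or a hypothesis object that exposes the transcript through a `.text` attribute. The helper below is a small sketch that accepts either form. `hypothesis_text` is a hypothetical name introduced here for illustration, and the card does not say which NeMo version it targets.

```python
def hypothesis_text(result):
    """Return the transcript string from one element of transcribe()'s output.

    Assumption: the element is either a str or an object with a `.text`
    attribute holding the transcript.
    """
    if isinstance(result, str):
        return result
    text = getattr(result, "text", None)
    if text is None:
        raise TypeError(f"Unsupported transcription result: {type(result).__name__}")
    return text
```

With this in place, `print(hypothesis_text(transcription[0]))` would print the transcript in either case.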
+## Training Data
+The model was fine-tuned on:
+- CommonVoice dataset (6 European languages)
+- People's Speech English corpus
+- Indian-accented English
+- LibriSpeech corpus (en, es, fr, it, pt)
+## Model Architecture
+The model is based on the FastConformer [1] architecture, which applies 8x depthwise-separable convolutional downsampling to the input features, and is trained with CTC loss.
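To make the two ideas above concrete, here is a rough sketch of what 8x downsampling means for sequence length and how greedy CTC decoding collapses per-frame predictions. It assumes a 10 ms feature hop, which this card does not state, and it ignores padding details, so the frame count is approximate. Both function names are hypothetical, and this is generic CTC greedy decoding, not necessarily the decoding strategy NeMo uses for this model.

```python
def encoder_frames(duration_ms, hop_ms=10, subsampling=8):
    """Approximate encoder output length for an utterance.

    Assumption: features are computed every `hop_ms` milliseconds and the
    encoder reduces the frame rate by `subsampling` (8x for FastConformer).
    """
    feature_frames = duration_ms // hop_ms
    # Ceiling division: a partial final group still yields one output frame.
    return -(-feature_frames // subsampling)


def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax token ids into an output token sequence.

    Standard CTC rule: merge consecutive repeats, then drop blank tokens.
    """
    output = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            output.append(token_id)
        previous = token_id
    return output


# A 10-second clip: 1000 feature frames at a 10 ms hop, 125 frames after 8x.
print(encoder_frames(10_000))
# Blank (id 0) separates the two genuine occurrences of token 1.
print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0], blank_id=0))
```

Under these assumptions each encoder frame covers roughly 80 ms of audio, which is why the CTC output sequence is much shorter than the input feature sequence.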
+## License
+This model is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
+## References
+[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+[2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)