Commit 4b69b24 by ksingla025 · verified · 1 Parent(s): 81fab41

Upload README.md

Files changed (1)
  1. README.md +86 -0
README.md ADDED
@@ -0,0 +1,86 @@
+ ---
+ language:
+ - en
+ - es
+ - fr
+ - it
+ - de
+ - pt
+ library_name: nemo
+ datasets:
+ - mozilla-foundation/common_voice_8_0
+ - MLCommons/peoples_speech
+ - librispeech_asr
+ thumbnail: null
+ tags:
+ - automatic-speech-recognition
+ - speech
+ - audio
+ - FastConformer
+ - Conformer
+ - pytorch
+ - NeMo
+ - hf-asr-leaderboard
+ - ctc
+ - entity-tagging
+ - speaker-attributes
+ license: cc-by-4.0
+ ---
+
+ # Meta ASR English
+
+ This model is a fine-tuned version of NVIDIA's Parakeet CTC 0.6B model, extended with entity tagging, speaker attributes, and support for six European languages.
+
+ ## Model Details
+
+ - **Base Model**: Parakeet CTC 0.6B (FastConformer)
+ - **Fine-tuned on**: A mix of CommonVoice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
+ - **Languages**: English, Spanish, French, Italian, German, Portuguese
+ - **Additional Features**: Entity tagging, speaker attributes (age, gender, emotion), and intent detection
+
+ ## Output Format
+
+ The model provides rich transcriptions including:
+ - Entity tags (PERSON_NAME, ORGANIZATION, etc.)
+ - Speaker attributes (AGE, GENDER, EMOTION)
+ - Intent classification
+ - Language-specific transcription
+
+ Example output:
+ ```
+ ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
+ ```
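
The tagged output above can be split back into plain text, entity spans, and speaker attributes. The following is a minimal sketch that assumes the tag layout seen in the single example (`ENTITY_<TYPE> ... END` spans and trailing `AGE_`, `GER_`, `EMOTION_`, `INTENT_` tokens). The function name `parse_tagged_transcript` and both regular expressions are illustrative assumptions, not part of the model or of NeMo, and other tag types may need extra handling.

```python
import re

# Assumed layout: "ENTITY_<TYPE> some words END" marks an entity span.
ENTITY_SPAN = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
# Assumed layout: utterance-level attributes such as AGE_45_60 or GER_MALE.
ATTRIBUTE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


def parse_tagged_transcript(tagged):
    """Split a tagged transcript into plain text, entities, and attributes."""
    # Collect (entity_type, surface_text) pairs in order of appearance.
    entities = [(m.group(1), m.group(2)) for m in ENTITY_SPAN.finditer(tagged)]
    # Collect attribute tokens keyed by their prefix.
    attributes = {m.group(1): m.group(2) for m in ATTRIBUTE.finditer(tagged)}
    # Keep the entity words, drop the ENTITY_/END markers and attribute tokens.
    text = ENTITY_SPAN.sub(lambda m: m.group(2), tagged)
    text = ATTRIBUTE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return {"text": text, "entities": entities, "attributes": attributes}


example = (
    "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
    "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
    "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
parsed = parse_tagged_transcript(example)
print(parsed["text"])
print(parsed["entities"])
print(parsed["attributes"])
```

Keeping the raw tagged string alongside the parsed result is a reasonable choice when downstream code may need tags this sketch does not recognize.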
+
+ ## Usage
+
+ ```python
+ import nemo.collections.asr as nemo_asr
+
+ # Load model
+ asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')
+
+ # Transcribe audio
+ transcription = asr_model.transcribe(['path/to/audio.wav'])
+ print(transcription[0])
+ ```
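
Depending on the installed NeMo version, `transcribe` may return plain strings or hypothesis objects that carry the transcript in a `text` attribute. The helper below is a small sketch for handling both cases; `to_text` is a hypothetical name introduced here, and the assumption about the `text` attribute should be checked against the NeMo release in use.

```python
def to_text(result):
    """Return the transcript from a plain string or an object with a .text attribute."""
    if isinstance(result, str):
        return result
    # Assumption: non-string results expose the transcript as `.text`.
    text = getattr(result, "text", None)
    if text is None:
        raise TypeError(f"Cannot extract a transcript from {type(result).__name__}")
    return text


# Stand-in for a hypothesis-like object, used here only for illustration.
class FakeHypothesis:
    def __init__(self, text):
        self.text = text


print(to_text("hola mundo"))
print(to_text(FakeHypothesis("bonjour le monde")))
```

With this in place, `to_text(transcription[0])` would yield a string in either case.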
+
+ ## Training Data
+
+ The model was fine-tuned on:
+ - CommonVoice dataset (6 European languages)
+ - People's Speech English corpus
+ - Indian-accented English
+ - LibriSpeech corpus (en, es, fr, it, pt)
+
+ ## Model Architecture
+
+ The model is based on the FastConformer [1] architecture, which uses 8x depthwise-separable convolutional downsampling, and is trained with CTC loss.
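
To make the 8x downsampling and the CTC objective concrete, here is a rough sketch of the arithmetic and of greedy CTC decoding. It assumes a 10 ms feature hop (a common default that this card does not state) and a caller-supplied blank id; the function names are illustrative, and the exact frame count in NeMo may differ slightly because of padding.

```python
import math


def encoder_frames(duration_s, hop_ms=10.0, downsampling=8):
    """Approximate number of encoder output frames for an audio clip."""
    # Assumption: one feature frame every `hop_ms` milliseconds.
    feature_frames = int(duration_s * 1000 / hop_ms)
    # 8x downsampling leaves roughly one output frame per 8 feature frames.
    return math.ceil(feature_frames / downsampling)


def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax ids: merge repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# A 10 s clip gives 1000 feature frames and about 125 encoder frames (80 ms each).
print(encoder_frames(10.0))
# A blank (0) between two 1s keeps them as separate tokens.
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0], blank_id=0))
```

Under these assumptions each encoder frame covers about 80 ms of audio, which is why the inline tags and words share a comparatively short output sequence.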
+
+ ## License
+
+ This model is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
+
+ ## References
+
+ [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+
+ [2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)