WhissleAI/meta-stt-en-0.6b

	---
	language:
	- en
	- es
	- fr
	- it
	- de
	- pt
	library_name: nemo
	datasets:
	- mozilla-foundation/common_voice_8_0
	- MLCommons/peoples_speech
	- librispeech_asr
	thumbnail: null
	tags:
	- automatic-speech-recognition
	- speech
	- audio
	- FastConformer
	- Conformer
	- pytorch
	- NeMo
	- hf-asr-leaderboard
	- ctc
	- entity-tagging
	- speaker-attributes
	license: cc-by-4.0
	---

	# Meta ASR English

	This model is a fine-tuned version of NVIDIA's Parakeet CTC 0.6B model, extended with entity tagging, speaker attributes, intent detection, and support for six European languages.

	## Model Details

	- Base Model: Parakeet CTC 0.6B (FastConformer)
	- Fine-tuned on: A mix of CommonVoice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
	- Languages: English, Spanish, French, Italian, German, Portuguese
	- Additional Features: Entity tagging, speaker attributes (age, gender, emotion), and intent detection

	## Output Format

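	The model emits the transcript and its annotations as a single tagged string, which includes:
	- Entity tags wrapped around the spoken text (`ENTITY_PERSON_NAME ... END`, `ENTITY_ORGANIZATION ... END`, etc.)
	- Speaker attributes for age, gender, and emotion (`AGE_*`, `GER_*`, and `EMOTION_*` in the example below)
	- Intent classification (`INTENT_*`)
	- Language-specific transcription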

	Example output:
	```
	ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
	```
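
For downstream use it can help to split the tagged string into plain text, entities, and speaker attributes. The following is a minimal sketch, not part of the model or NeMo: the tag patterns are inferred only from the single example above (`ENTITY_<TYPE> ... END` spans and `AGE_`, `GER_`, `EMOTION_`, `INTENT_` tags), and `parse_tagged_transcript` and `ParsedTranscript` are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass, field

# Assumed tag shapes, inferred only from the example output in this card.
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
ATTRIBUTE_RE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


@dataclass
class ParsedTranscript:
    text: str                                        # transcript with all tags removed
    entities: list = field(default_factory=list)     # (entity type, surface text) pairs
    attributes: dict = field(default_factory=dict)   # tag prefix -> tag value


def parse_tagged_transcript(raw: str) -> ParsedTranscript:
    """Split a tagged transcript into plain text, entities, and attributes."""
    entities = [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(raw)]
    attributes = {m.group(1): m.group(2) for m in ATTRIBUTE_RE.finditer(raw)}

    # Keep the entity surface text, drop the ENTITY_<TYPE> and END markers.
    text = ENTITY_RE.sub(lambda m: m.group(2), raw)
    # Drop the speaker-attribute and intent tags.
    text = ATTRIBUTE_RE.sub("", text)
    # Collapse the whitespace left behind by the removed tags.
    text = re.sub(r"\s+", " ", text).strip()
    return ParsedTranscript(text=text, entities=entities, attributes=attributes)
```

Real outputs may contain tag types or layouts that this sketch does not cover, so the patterns should be checked against actual transcriptions before relying on them.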

	## Usage

	```python
	import nemo.collections.asr as nemo_asr

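	# Load the model from the Hugging Face Hub (requires the NeMo toolkit)
	asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')

	# Transcribe a list of audio files; one result is returned per input file
	transcription = asr_model.transcribe(['path/to/audio.wav'])
	print(transcription[0])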
	```
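
Depending on the installed NeMo version, `transcribe` may return plain strings or hypothesis objects that carry the transcript in a `text` attribute. This behavior is an assumption to verify against your NeMo release. The hypothetical helper below (not part of NeMo) normalizes both cases to a list of strings:

```python
from typing import Any, List


def to_texts(results: List[Any]) -> List[str]:
    """Return plain transcript strings from a list of transcription results."""
    texts = []
    for item in results:
        if isinstance(item, str):
            # Result is already a plain transcript string.
            texts.append(item)
        else:
            # Assumption: non-string results expose the transcript as `.text`;
            # fall back to the object's string form if they do not.
            texts.append(str(getattr(item, "text", item)))
    return texts
```

With the snippet above, `to_texts(transcription)[0]` would then yield the tagged transcript string for the first audio file.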

	## Training Data

	The model was fine-tuned on:
	- CommonVoice dataset (6 European languages)
	- People's Speech English corpus
	- Indian-accented English
	- LibriSpeech corpus (en, es, fr, it, pt)

	## Model Architecture

	The model is based on the FastConformer [1] architecture, which uses 8x depthwise-separable convolutional downsampling, and is trained with CTC loss.
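
To make the two ingredients concrete, here is a small sketch of what 8x downsampling means for the encoder frame rate and of how greedy CTC decoding collapses per-frame predictions. It is illustrative only and does not reproduce the model's actual decoder: the 10 ms feature hop is an assumption (it is not stated in this card), and the function names and token IDs are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

FEATURE_HOP_MS = 10      # assumption: 10 ms hop between input feature frames
SUBSAMPLING_FACTOR = 8   # from this card: 8x convolutional downsampling


def encoder_frame_ms(hop_ms: int = FEATURE_HOP_MS,
                     factor: int = SUBSAMPLING_FACTOR) -> int:
    """Audio duration covered by one encoder output frame, in milliseconds."""
    return hop_ms * factor


def approx_encoder_frames(duration_s: float) -> int:
    """Approximate number of encoder frames for a clip (ignores padding effects)."""
    return int(duration_s * 1000 // encoder_frame_ms())


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Greedy CTC post-processing: merge repeated IDs, then drop blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]
```

Under these assumptions each encoder frame spans 80 ms, so a 10-second clip yields roughly 125 frames, and each frame contributes at most one token after the collapse step.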

	## License

	This model is licensed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

	## References

	[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
	[2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)