---
language:
  - en
  - es
  - fr
  - it
  - de
  - pt
library_name: nemo
datasets:
  - mozilla-foundation/common_voice_8_0
  - MLCommons/peoples_speech
  - librispeech_asr
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - FastConformer
  - Conformer
  - pytorch
  - NeMo
  - hf-asr-leaderboard
  - ctc
  - entity-tagging
  - speaker-attributes
license: cc-by-4.0
---

# Meta ASR English

This model is a fine-tuned ASR-CTC model enhanced with entity tagging, speaker attributes, and multi-language support for European languages.

## Model Details

- Fine-tuned on: a mix of CommonVoice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
- Languages: English, Spanish, French, Italian, German, Portuguese
- Additional features: entity tagging, speaker attributes (age, gender, emotion), and intent detection

## Output Format

The model provides rich transcriptions including:

- Entity tags (PERSON_NAME, ORGANIZATION, etc.)
- Speaker attributes (AGE, GENDER, EMOTION)
- Intent classification
- Language-specific transcription

Example output:

```
ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
```
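The tags are emitted inline in the transcript, so downstream code needs to pull them back out. Below is a minimal, hypothetical parser for this format: the `ENTITY_* ... END` convention and the `AGE_`/`GER_`/`EMOTION_`/`INTENT_` prefixes are taken from the example above, and the function name is illustrative rather than part of the model's API.

```python
import re

# Hypothetical helper (not part of the model's API): split a tagged
# transcript into entities, speaker attributes, intent, and plain text.
# Tag names follow the example output above.
def parse_tagged_transcript(text: str) -> dict:
    entities = [
        {"type": m.group(1), "text": m.group(2).strip()}
        for m in re.finditer(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END", text)
    ]
    speaker = {
        "age": next(iter(re.findall(r"\bAGE_([0-9_]+)", text)), None),
        "gender": next(iter(re.findall(r"\bGER_([A-Z]+)\b", text)), None),
        "emotion": next(iter(re.findall(r"\bEMOTION_([A-Z]+)\b", text)), None),
    }
    intent = next(iter(re.findall(r"\bINTENT_([A-Z_]+)\b", text)), None)

    # Strip all tags to recover the plain transcription.
    plain = re.sub(r"ENTITY_[A-Z_]+\s+|\s+END\b", " ", text)
    plain = re.sub(r"\b(AGE_[0-9_]+|GER_[A-Z]+|EMOTION_[A-Z]+|INTENT_[A-Z_]+)\b", "", plain)
    plain = re.sub(r"\s+([.,!?])", r"\1", plain)
    return {
        "entities": entities,
        "speaker": speaker,
        "intent": intent,
        "text": re.sub(r"\s+", " ", plain).strip(),
    }

example = ("ENTITY_PERSON_NAME Robert Hoke END was educated at the "
           "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
           "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM")
print(parse_tagged_transcript(example))
```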

## Usage

```python
import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')

# Transcribe audio
transcription = asr_model.transcribe(['path/to/audio.wav'])
print(transcription[0])
```
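For more than a few files, transcription can be batched. This is a minimal sketch assuming the `asr_model` object from the snippet above and NeMo's standard `batch_size` argument to `transcribe()`; the file paths are placeholders.

```python
# Transcribe several files in one call; batch_size controls how many
# files are decoded per forward pass, so adjust it to fit GPU memory.
audio_files = ['audio1.wav', 'audio2.wav', 'audio3.wav']
transcriptions = asr_model.transcribe(audio_files, batch_size=4)
for path, text in zip(audio_files, transcriptions):
    print(path, '->', text)
```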

## Training Data

The model was fine-tuned on:

- CommonVoice dataset (6 European languages)
- People's Speech English corpus
- Indian-accented English
- LibriSpeech corpus (en, es, fr, it, pt)
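For further fine-tuning on similar data with NeMo, training samples are usually described by a JSON-lines manifest. The sketch below uses NeMo's standard manifest fields (`audio_filepath`, `duration`, `text`); embedding the entity and attribute tags directly in the `text` field mirrors the output format above and is an assumption, not a documented requirement.

```python
import json

# One JSON object per line, using NeMo's standard manifest fields.
# Placing the tags inside "text" is an assumption based on the
# example output shown earlier.
sample = {
    "audio_filepath": "path/to/audio.wav",
    "duration": 4.2,
    "text": "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
            "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
            "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM",
}

with open("train_manifest.json", "w") as f:
    f.write(json.dumps(sample) + "\n")
```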

## Model Architecture

Based on the FastConformer [1] architecture with 8x depthwise-separable convolutional downsampling, trained using CTC loss.
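The encoder hyperparameters (subsampling factor, number of layers, model dimension) travel with the checkpoint's config and can be inspected after loading; a minimal sketch assuming the `asr_model` object from the Usage section:

```python
from omegaconf import OmegaConf

# Print the encoder section of the config bundled with the checkpoint
# (FastConformer layer count, d_model, subsampling settings, etc.).
print(OmegaConf.to_yaml(asr_model.cfg.encoder))
```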

## License

This model is licensed under the CC-BY-4.0 license.

## References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] NVIDIA NeMo Toolkit