---
library_name: nemo
---

# CHiME8 DASR NeMo Baseline Models
					
						
- The model files in this repository are the models used in the paper [The CHiME-7 Challenge: System Description and Performance of NeMo Team’s DASR System](https://arxiv.org/pdf/2310.12378.pdf).
- These models are required to run the CHiME8-DASR baseline: [CHiME8-DASR-Baseline NeMo](https://github.com/chimechallenge/C8DASR-Baseline-NeMo/tree/main/scripts/chime8).
- The VAD, diarization, and ASR models are all based on the [NVIDIA NeMo Conversational AI Toolkit](https://github.com/NVIDIA/NeMo); a short download sketch follows this list.

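For convenience, here is a minimal sketch of pulling all four checkpoint files from this repository with the `huggingface_hub` client. The baseline recipe linked above also handles this step, so the snippet is illustrative only.

```python
# Minimal sketch: downloading the baseline checkpoints from this repository.
# Assumes `pip install huggingface_hub`; filenames are the ones listed in the
# sections below.
from huggingface_hub import hf_hub_download

REPO_ID = "chime-dasr/nemo_baseline_models"
FILES = [
    "MarbleNet_frame_VAD_chime7_Acrobat.nemo",
    "MSDD_v2_PALO_100ms_intrpl_3scales.nemo",
    "FastConformerXL-RNNT-chime7-GSS-finetuned.nemo",
    "ASR_LM_chime7_only.kenlm",
]

local_paths = {name: hf_hub_download(repo_id=REPO_ID, filename=name) for name in FILES}
for name, path in local_paths.items():
    print(f"{name} -> {path}")
```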
					
						
## 1. Voice Activity Detection (VAD) Model
### **[MarbleNet_frame_VAD_chime7_Acrobat.nemo](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MarbleNet_frame_VAD_chime7_Acrobat.nemo)**
- This model is based on the [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad); a generic sketch of frame-level VAD post-processing is shown after this list.
- For validation, we use a dataset comprising the CHiME-6 development subset as well as 50 hours of simulated audio data.
- The simulated data is generated with the [NeMo multi-speaker data simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb) from the [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html).
- The multi-speaker data simulation results in a total of 2,000 hours of audio, of which approximately 30% is silence.
- Model training incorporates [SpecAugment](https://arxiv.org/abs/1904.08779) and noise augmentation with the [MUSAN noise dataset](https://arxiv.org/abs/1510.08484).

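The checkpoint above is a frame-level VAD model, i.e., it emits a speech probability per frame. The sketch below is a generic illustration of how such frame probabilities can be turned into speech segments with hysteresis thresholding; it is not NeMo's exact post-processing, and the frame shift and thresholds are assumed example values.

```python
# Illustrative sketch (not NeMo's exact post-processing): converting per-frame
# speech probabilities from a frame-level VAD model into (start, end) segments.
from typing import List, Tuple

def probs_to_segments(
    probs: List[float],
    frame_shift: float = 0.02,  # seconds per frame (assumed example value)
    onset: float = 0.5,         # threshold to enter a speech segment
    offset: float = 0.3,        # threshold to leave a speech segment
) -> List[Tuple[float, float]]:
    """Hysteresis thresholding of frame probabilities into speech segments."""
    segments, start, in_speech = [], 0.0, False
    for i, p in enumerate(probs):
        t = i * frame_shift
        if not in_speech and p >= onset:
            in_speech, start = True, t
        elif in_speech and p < offset:
            in_speech = False
            segments.append((start, t))
    if in_speech:
        segments.append((start, len(probs) * frame_shift))
    return segments

print(probs_to_segments([0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.9]))
# [(0.02, 0.08), (0.12, 0.16)]
```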
					
						
## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
### **[MSDD_v2_PALO_100ms_intrpl_3scales.nemo](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MSDD_v2_PALO_100ms_intrpl_3scales.nemo)**

Our DASR system builds on a speaker diarization system based on the multi-scale diarization decoder (MSDD).
- MSDD Reference: [Park et al. (2022)](https://arxiv.org/pdf/2203.15974.pdf)
- The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
- TitaNet Reference: [Koluguri et al. (2022)](https://arxiv.org/abs/2110.04410)
- The TitaNet model is included in the [MSDD-v2 .nemo checkpoint file](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MSDD_v2_PALO_100ms_intrpl_3scales.nemo).
- Unlike the original MSDD, which uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
- This neural model generates logit values indicating speaker existence; a toy sketch of this architecture is shown after this list.
- Our diarization model is trained on approximately 3,000 hours of simulated audio mixtures produced by the same multi-speaker data simulator used for VAD model training, drawing from the VoxCeleb1&2 and LibriSpeech datasets.
- LibriSpeech References: [OpenSLR Download](https://www.openslr.org/12), [Panayotov et al. (2015)](https://ieeexplore.ieee.org/document/7178964)
- MUSAN is also used to add background noise, focusing on music and broadband noise.

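To make the architecture description concrete, here is a toy PyTorch sketch of a four-layer Transformer head (hidden size 384) that maps concatenated multi-scale embedding features to per-frame speaker-existence logits. It is not the actual MSDD-v2 implementation; the input dimension, number of attention heads, and maximum speaker count are illustrative assumptions.

```python
# Toy sketch (not the actual MSDD-v2 code): a four-layer Transformer that maps
# multi-scale speaker-embedding features to per-frame speaker-existence logits.
import torch
import torch.nn as nn

class ToySpeakerExistenceHead(nn.Module):
    def __init__(self, input_dim: int = 3 * 192, hidden: int = 384,
                 num_layers: int = 4, max_speakers: int = 4):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, max_speakers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, input_dim) concatenated multi-scale features
        x = self.encoder(self.proj(feats))
        return self.head(x)  # logits: (batch, frames, max_speakers)

model = ToySpeakerExistenceHead()
logits = model(torch.randn(1, 50, 3 * 192))  # e.g. 50 frames at 100 ms resolution
print(logits.shape)  # torch.Size([1, 50, 4])
```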
					
						
## 3. Automatic Speech Recognition (ASR) Model
### **[FastConformerXL-RNNT-chime7-GSS-finetuned.nemo](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/FastConformerXL-RNNT-chime7-GSS-finetuned.nemo)**
- This ASR model is based on the [NeMo FastConformer XL model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
- Single-channel audio generated by the multi-channel front-end (Guided Source Separation, GSS) is transcribed with a 0.6B-parameter Conformer-based transducer (RNNT) model; a transcription sketch is shown after this list.
- Model Reference: [Gulati et al. (2020)](https://arxiv.org/abs/2005.08100)
- The model was initialized from a publicly available NeMo checkpoint.
- NeMo Checkpoint: [NGC Model Card: Conformer Transducer XL](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge)
- This model was then fine-tuned on the CHiME-7 train and dev sets, which include the CHiME-6 and Mixer6 training subsets, after processing the data through the multi-channel ASR front-end using ground-truth diarization.
- Fine-Tuning Details:
  - Fine-tuning duration: 35,000 updates
  - Batch size: 128

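A minimal transcription sketch is shown below. The model class (`EncDecRNNTBPEModel`) and the audio file name are assumptions for illustration; in the baseline, the input is the single-channel output of the GSS front-end.

```python
# Minimal sketch: transcribing GSS-processed single-channel audio with the
# fine-tuned FastConformer XL transducer checkpoint.
# The model class and the audio path are assumptions for illustration.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(
    "FastConformerXL-RNNT-chime7-GSS-finetuned.nemo"
)
asr_model.eval()

# One or more mono 16 kHz WAV files produced by the GSS front-end (placeholder name).
hypotheses = asr_model.transcribe(["session01_gss_segment.wav"])
print(hypotheses)
```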
					
						
## 4. Language Model for ASR Decoding: KenLM Model
### **[ASR_LM_chime7_only.kenlm](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/ASR_LM_chime7_only.kenlm)**

- This KenLM model is trained solely on the CHiME7-DASR datasets (Mixer6, CHiME6, DipCo).
- We apply a word-piece-level N-gram language model built on byte-pair-encoding (BPE) tokens.
- This approach uses the SentencePiece and KenLM toolkits and is based on the transcriptions of the CHiME-7 train and dev sets; a training sketch is shown after this list.
- SentencePiece: [Kudo and Richardson (2018)](https://arxiv.org/abs/1808.06226)
- KenLM: [KenLM GitHub repository](https://github.com/kpu/kenlm)
- The token sets of our ASR and LM models were matched to ensure consistency.
- To combine several N-gram models with equal weights, we used the OpenGrm library.
- OpenGrm: [Roark et al. (2012)](https://aclanthology.org/P12-3011/)
- MAES decoding, which accelerates the decoding process, was employed for the transducer.
- MAES Decoding: [Kim et al. (2020)](https://ieeexplore.ieee.org/document/9250505)
- As expected, integrating the beam-search decoder with the language model significantly enhances performance compared to the purely end-to-end model without an external LM.

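For illustration, the sketch below builds a word-piece N-gram LM whose tokens match an ASR model's BPE vocabulary: transcripts are encoded with a SentencePiece tokenizer and the result is passed to KenLM's `lmplz`. This is not the exact recipe used to produce `ASR_LM_chime7_only.kenlm` (NeMo ships its own KenLM training utilities); the tokenizer path, corpus path, and N-gram order are placeholders.

```python
# Illustrative sketch (not the exact baseline recipe): training a word-piece
# N-gram LM on BPE-encoded transcripts so that LM and ASR token sets match.
# Assumes `pip install sentencepiece` and KenLM's `lmplz`/`build_binary` on PATH;
# file paths are placeholders.
import subprocess
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="asr_tokenizer/tokenizer.model")  # placeholder path

# Encode each transcript line into BPE pieces separated by spaces.
with open("chime7_train_dev_transcripts.txt") as fin, open("corpus_bpe.txt", "w") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")

# Train a 4-gram KenLM on the BPE corpus (the order is an illustrative choice),
# then compile it to a binary file.
subprocess.run(["lmplz", "-o", "4", "--text", "corpus_bpe.txt", "--arpa", "lm_bpe.arpa"], check=True)
subprocess.run(["build_binary", "lm_bpe.arpa", "lm_bpe_example.kenlm"], check=True)
```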