|  | --- | 
					
						
						|  | license: cc-by-4.0 | 
					
						
						|  | library_name: nemo | 
					
						
						|  | datasets: | 
					
						
						|  | - fisher_english | 
					
						
						|  | - NIST_SRE_2004-2010 | 
					
						
						|  | - librispeech | 
					
						
						|  | - ami_meeting_corpus | 
					
						
						|  | - voxconverse_v0.3 | 
					
						
						|  | - icsi | 
					
						
						|  | - aishell4 | 
					
						
						|  | - dihard_challenge-3-dev | 
					
						
						|  | - NIST_SRE_2000-Disc8_split1 | 
					
						
						|  | - Alimeeting-train | 
					
						
						|  | - DiPCo | 
					
						
						|  | thumbnail: null | 
					
						
						|  | tags: | 
					
						
						|  | - speaker-diarization | 
					
						
						|  | - speaker-recognition | 
					
						
						|  | - speech | 
					
						
						|  | - audio | 
					
						
						|  | - Transformer | 
					
						
						|  | - FastConformer | 
					
						
						|  | - Conformer | 
					
						
						|  | - NEST | 
					
						
						|  | - pytorch | 
					
						
						|  | - NeMo | 
					
						
						|  | widget: | 
					
						
						|  | - example_title: Librispeech sample 1 | 
					
						
						|  | src: https://cdn-media.huggingface.co/speech_samples/sample1.flac | 
					
						
						|  | - example_title: Librispeech sample 2 | 
					
						
						|  | src: https://cdn-media.huggingface.co/speech_samples/sample2.flac | 
					
						
						|  | model-index: | 
					
						
						|  | - name: diar_streaming_sortformer_4spk-v2 | 
					
						
						|  | results: | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: DIHARD III Eval (1-4 spk) | 
					
						
						|  | type: dihard3-eval-1to4spks | 
					
						
						|  | config: with_overlap_collar_0.0s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: eval-1to4spks | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 13.24 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: DIHARD III Eval (5-9 spk) | 
					
						
						|  | type: dihard3-eval-5to9spks | 
					
						
						|  | config: with_overlap_collar_0.0s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: eval-5to9spks | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 42.56 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: DIHARD III Eval (full) | 
					
						
						|  | type: dihard3-eval | 
					
						
						|  | config: with_overlap_collar_0.0s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: eval | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 18.91 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk) | 
					
						
						|  | type: CALLHOME-part2-2spk | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2-2spk | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 6.57 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk) | 
					
						
						|  | type: CALLHOME-part2-3spk | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2-3spk | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 10.05 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk) | 
					
						
						|  | type: CALLHOME-part2-4spk | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2-4spk | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 12.44 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk) | 
					
						
						|  | type: CALLHOME-part2-5spk | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2-5spk | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 21.68 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk) | 
					
						
						|  | type: CALLHOME-part2-6spk | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2-6spk | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 28.74 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: CALLHOME (NIST-SRE-2000 Disc8) part2 (full) | 
					
						
						|  | type: CALLHOME-part2 | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: part2 | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 10.70 | 
					
						
						|  | - task: | 
					
						
						|  | name: Speaker Diarization | 
					
						
						|  | type: speaker-diarization-with-post-processing | 
					
						
						|  | dataset: | 
					
						
						|  | name: call_home_american_english_speech | 
					
						
						|  | type: CHAES_2spk_109sessions | 
					
						
						|  | config: with_overlap_collar_0.25s | 
					
						
						|  | input_buffer_length: 1.04s | 
					
						
						|  | split: ch109 | 
					
						
						|  | metrics: | 
					
						
						|  | - name: Test DER | 
					
						
						|  | type: der | 
					
						
						|  | value: 4.88 | 
					
						
						|  | metrics: | 
					
						
						|  | - der | 
					
						
						|  | pipeline_tag: audio-classification | 
					
						
						|  | --- | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | # Streaming Sortformer Diarizer 4spk v2 | 
					
						
						|  |  | 
					
						
						|  | <style> | 
					
						
						|  | img { | 
					
						
						|  | display: inline; | 
					
						
						|  | } | 
					
						
						|  | </style> | 
					
						
						|  |  | 
					
						
					
						
						|  |  | 
					
						
						|  | This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with objectives that differ from those of existing end-to-end diarization models. | 
					
						
						|  |  | 
					
						
						|  | <div align="center"> | 
					
						
						|  | <img src="figures/sortformer_intro.png" width="750" /> | 
					
						
						|  | </div> | 
					
						
						|  |  | 
					
						
						|  | [Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers. | 
					
						
						|  | <div align="center"> | 
					
						
						|  | <img src="figures/streaming_sortformer_ani.gif" width="1400" /> | 
					
						
						|  | </div> | 
					
						
						|  |  | 
					
						
						|  | Sortformer resolves the permutation problem in diarization by following the arrival-time order of the speech segments from each speaker. | 
					
						
						|  |  | 
					
						
						|  | ## Model Architecture | 
					
						
						|  |  | 
					
						
						|  | Streaming Sortformer employs the pre-encode layer in the Fast-Conformer to generate the speaker cache. At each step, the speaker cache is filtered to retain only the high-quality speaker cache vectors. | 
					
						
						|  |  | 
					
						
						|  | <div align="center"> | 
					
						
						|  | <img src="figures/streaming_steps.png" width="1400" /> | 
					
						
						|  | </div> | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | Aside from the speaker-cache management part, Streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder, followed by an 18-layer Transformer[5] encoder with a hidden size of 192 and two feedforward layers with 4 sigmoid outputs for each input frame at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2]. | 
					
						
						|  |  | 
					
						
						|  | <div align="center"> | 
					
						
						|  | <img src="figures/sortformer-v1-model.png" width="450" /> | 
					
						
						|  | </div> | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ## NVIDIA NeMo | 
					
						
						|  |  | 
					
						
						|  | To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[6]. We recommend installing it after you have installed Cython and the latest PyTorch version. | 
					
						
						|  |  | 
					
						
						|  | ``` | 
					
						
						|  | apt-get update && apt-get install -y libsndfile1 ffmpeg | 
					
						
						|  | pip install Cython packaging | 
					
						
						|  | pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr] | 
					
						
						|  | ``` | 
					
						
						|  |  | 
					
						
						|  | ## How to Use this Model | 
					
						
						|  |  | 
					
						
						|  | The model is available for use in the NeMo Framework[6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. | 
					
						
						|  |  | 
					
						
						|  | ### Loading the Model | 
					
						
						|  |  | 
					
						
						|  | ```python3 | 
					
						
						|  | from nemo.collections.asr.models import SortformerEncLabelModel | 
					
						
						|  |  | 
					
						
						|  | # load model from Hugging Face model card directly (You need a Hugging Face token) | 
					
						
						|  | diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2") | 
					
						
						|  |  | 
					
						
						|  | # If you have a downloaded model in "/path/to/diar_streaming_sortformer_4spk-v2.nemo", load model from a downloaded file | 
					
						
						|  | diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_streaming_sortformer_4spk-v2.nemo", map_location='cuda', strict=False) | 
					
						
						|  |  | 
					
						
						|  | # switch to inference mode | 
					
						
						|  | diar_model.eval() | 
					
						
						|  | ``` | 
					
						
						|  |  | 
					
						
						|  | ### Input Format | 
					
						
						|  | Input to Sortformer can be an individual audio file: | 
					
						
						|  | ```python3 | 
					
						
						|  | audio_input="/path/to/multispeaker_audio1.wav" | 
					
						
						|  | ``` | 
					
						
						|  | or a list of paths to audio files: | 
					
						
						|  | ```python3 | 
					
						
						|  | audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"] | 
					
						
						|  | ``` | 
					
						
						|  | or a jsonl manifest file: | 
					
						
						|  | ```python3 | 
					
						
						|  | audio_input="/path/to/multispeaker_manifest.json" | 
					
						
						|  | ``` | 
					
						
						|  | where each line is a dictionary containing the following fields: | 
					
						
						|  | ```yaml | 
					
						
						|  | # Example of a line in `multispeaker_manifest.json` | 
					
						
						|  | { | 
					
						
						|  | "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file | 
					
						
						|  | "offset": 0, # offset (start) time of the input audio | 
					
						
						|  | "duration": 600,  # duration of the audio, can be set to `null` if using NeMo main branch | 
					
						
						|  | } | 
					
						
						|  | { | 
					
						
						|  | "audio_filepath": "/path/to/multispeaker_audio2.wav", | 
					
						
						|  | "offset": 900, | 
					
						
						|  | "duration": 580, | 
					
						
						|  | } | 
					
						
						|  | ``` | 
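As a minimal sketch of how such a manifest could be produced, the snippet below writes one JSON object per line using only the Python standard library. The helper name `write_manifest` and the output file name are illustrative assumptions, not part of NeMo; the field names follow the example above.

```python
import json
from pathlib import Path


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSON Lines) to manifest_path.

    `write_manifest` is a hypothetical helper used for illustration only.
    """
    with Path(manifest_path).open("w", encoding="utf-8") as f:
        for entry in entries:
            record = {
                "audio_filepath": entry["audio_filepath"],
                "offset": entry.get("offset", 0),
                # A missing duration is serialized as JSON null.
                "duration": entry.get("duration"),
            }
            f.write(json.dumps(record) + "\n")


# Paths and values mirror the manifest example above.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]
write_manifest(entries, "multispeaker_manifest.json")
```

Using `json.dumps` keeps each record on a single line and avoids trailing commas or comments, neither of which is valid JSON.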
					
						
						|  |  | 
					
						
						|  | ### Setting up Streaming Configuration | 
					
						
						|  |  | 
					
						
						|  | Streaming configuration is defined by the following parameters, all measured in **80ms frames**: | 
					
						
						|  | * **CHUNK_SIZE**: The number of frames in a processing chunk. | 
					
						
						|  | * **RIGHT_CONTEXT**: The number of future frames attached after the chunk. | 
					
						
						|  | * **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue. | 
					
						
						|  | * **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache. | 
					
						
						|  | * **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache. | 
					
						
						|  |  | 
					
						
						|  | Here are recommended configurations for different scenarios: | 
					
						
						|  | | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** | | 
					
						
						|  | | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- | | 
					
						
						|  | | very high latency | 30.4s       | 0.002   | 340            | 40                | 40            | 300               | 188                    | | 
					
						
						|  | | high latency      | 10.0s       | 0.005   | 124            | 1                 | 124           | 124               | 188                    | | 
					
						
						|  | | low latency       | 1.04s       | 0.093   | 6              | 7                 | 188           | 144               | 188                    | | 
					
						
						|  | | ultra low latency | 0.32s       | 0.180   | 3              | 1                 | 188           | 144               | 188                    | | 
					
						
						|  |  | 
					
						
						|  | For clarity on the metrics used in the table: | 
					
						
						|  | * **Latency**: Refers to **Input Buffer Latency**, calculated as (**CHUNK_SIZE** + **RIGHT_CONTEXT**) frames multiplied by the 80 ms frame duration. This value does not include computational processing time. | 
					
						
						|  | * **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU. | 
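The two metrics can be checked with a few lines of arithmetic. The sketch below recomputes the latency column from the CHUNK_SIZE and RIGHT_CONTEXT values in the table, assuming 80 ms per frame as stated above. The function names are illustrative and not part of NeMo.

```python
FRAME_SEC = 0.08  # each streaming frame covers 80 ms of audio

# (CHUNK_SIZE, RIGHT_CONTEXT) pairs copied from the table above
CONFIGS = {
    "very high latency": (340, 40),
    "high latency": (124, 1),
    "low latency": (6, 7),
    "ultra low latency": (3, 1),
}


def input_buffer_latency(chunk_size, right_context, frame_sec=FRAME_SEC):
    """Return the input buffer latency in seconds."""
    return (chunk_size + right_context) * frame_sec


def real_time_factor(processing_sec, audio_sec):
    """Return RTF: processing time divided by audio duration."""
    return processing_sec / audio_sec


for name, (chunk_size, right_context) in CONFIGS.items():
    print(f"{name}: {input_buffer_latency(chunk_size, right_context):.2f}s")
```

An RTF below 1 means the audio is processed faster than real time. For example, an RTF of 0.093 corresponds to about 5.6 seconds of processing per minute of audio.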
					
						
						|  |  | 
					
						
						|  | To set streaming configuration, use: | 
					
						
						|  | ```python3 | 
					
						
						|  | diar_model.sortformer_modules.chunk_len = CHUNK_SIZE | 
					
						
						|  | diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT | 
					
						
						|  | diar_model.sortformer_modules.fifo_len = FIFO_SIZE | 
					
						
						|  | diar_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD | 
					
						
						|  | diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE | 
					
						
						|  | diar_model.sortformer_modules._check_streaming_parameters() | 
					
						
						|  | ``` | 
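One possible way to switch between the recommended configurations is to keep them in a dictionary and copy the chosen one onto the module object. This is a sketch only: `STREAMING_PRESETS` and `apply_streaming_preset` are hypothetical names, while the attribute names and `_check_streaming_parameters()` are the ones used in the snippet above. A stand-in object replaces the real model so the example runs without NeMo installed.

```python
from types import SimpleNamespace

# Values copied from the recommended-configuration table (units: 80 ms frames).
STREAMING_PRESETS = {
    "very_high_latency": dict(chunk_len=340, chunk_right_context=40, fifo_len=40,
                              spkcache_update_period=300, spkcache_len=188),
    "high_latency": dict(chunk_len=124, chunk_right_context=1, fifo_len=124,
                         spkcache_update_period=124, spkcache_len=188),
    "low_latency": dict(chunk_len=6, chunk_right_context=7, fifo_len=188,
                        spkcache_update_period=144, spkcache_len=188),
    "ultra_low_latency": dict(chunk_len=3, chunk_right_context=1, fifo_len=188,
                              spkcache_update_period=144, spkcache_len=188),
}


def apply_streaming_preset(sortformer_modules, preset_name):
    """Copy one preset onto the module object, then run its own validation."""
    preset = STREAMING_PRESETS[preset_name]
    for attr, value in preset.items():
        setattr(sortformer_modules, attr, value)
    sortformer_modules._check_streaming_parameters()
    return preset


# Stand-in so the sketch runs without NeMo; with a loaded model,
# pass `diar_model.sortformer_modules` instead.
stub_modules = SimpleNamespace(_check_streaming_parameters=lambda: None)
apply_streaming_preset(stub_modules, "low_latency")
print(stub_modules.chunk_len, stub_modules.chunk_right_context)
```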
					
						
						|  |  | 
					
						
						|  | ### Getting Diarization Results | 
					
						
						|  | To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use: | 
					
						
						|  | ```python3 | 
					
						
						|  | predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1) | 
					
						
						|  | ``` | 
					
						
						|  | To obtain tensors of speaker activity probabilities, use: | 
					
						
						|  | ```python3 | 
					
						
						|  | predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True) | 
					
						
						|  | ``` | 
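The returned segments can then be post-processed with ordinary Python. The sketch below sums speaking time per speaker. It assumes each segment is a string holding the three fields described above, separated by commas and/or whitespace; the exact separator and speaker label format are assumptions, and the example strings are made up for illustration.

```python
import re
from collections import defaultdict


def parse_segment(segment):
    """Split one segment string into (begin_sec, end_sec, speaker_label)."""
    begin, end, speaker = re.split(r"[,\s]+", segment.strip())
    return float(begin), float(end), speaker


def speaking_time_per_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        begin, end, speaker = parse_segment(segment)
        totals[speaker] += end - begin
    return dict(totals)


# Made-up segments for a single recording, not real model output.
example_segments = [
    "0.00 3.52 speaker_0",
    "3.60, 7.20, speaker_1",
    "7.20 9.00 speaker_0",
]
print(speaking_time_per_speaker(example_segments))
```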
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ### Input | 
					
						
						|  |  | 
					
						
						|  | This model accepts single-channel (mono) audio sampled at 16,000 Hz. | 
					
						
						|  | - The actual input tensor is a Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal. | 
					
						
						|  | - For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 x 1 matrix. | 
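Before running diarization, it can be useful to confirm that a file matches this input format. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; `describe_wav` is a hypothetical helper, and the silent test file is generated only so the example runs on its own.

```python
import wave


def describe_wav(path, expected_rate=16000, expected_channels=1):
    """Report sample rate, channel count, and sample count of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "num_samples": wav.getnframes(),  # Ns in the description above
        }
    info["duration_sec"] = info["num_samples"] / info["sample_rate"]
    info["matches_model_input"] = (
        info["sample_rate"] == expected_rate
        and info["channels"] == expected_channels
    )
    return info


# Write 10 seconds of 16-bit mono silence at 16 kHz as a stand-in recording.
with wave.open("silence_10s.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 160000)

print(describe_wav("silence_10s.wav"))
```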
					
						
						|  |  | 
					
						
						|  | ### Output | 
					
						
						|  |  | 
					
						
						|  | The output of the model is a T x S matrix, where: | 
					
						
						|  | - S is the maximum number of speakers (in this model, S = 4). | 
					
						
						|  | - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio. | 
					
						
						|  | Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range.  For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds. | 
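To illustrate how such a matrix maps to time, the sketch below converts frame indices to time ranges and turns probabilities into speaker segments with a fixed threshold. It assumes 0-based frame indices and a threshold of 0.5. This plain thresholding is a simplification for illustration and is not the post-processing used in NeMo; the toy matrix is made up.

```python
FRAME_SEC = 0.08  # duration of one output frame in seconds


def frame_to_time_range(frame_index, frame_sec=FRAME_SEC):
    """Return (begin_sec, end_sec) for a 0-based frame index."""
    return frame_index * frame_sec, (frame_index + 1) * frame_sec


def probs_to_segments(probs, threshold=0.5, frame_sec=FRAME_SEC):
    """Convert a T x S list of probabilities into (begin, end, speaker) tuples."""
    segments = []
    num_speakers = len(probs[0]) if probs else 0
    for spk in range(num_speakers):
        start = None  # frame index where the current active run began
        for t, row in enumerate(probs):
            active = row[spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((start * frame_sec, t * frame_sec, spk))
                start = None
        if start is not None:  # run still open at the end of the matrix
            segments.append((start * frame_sec, len(probs) * frame_sec, spk))
    return sorted(segments)


# Toy 4-frame, 4-speaker matrix: speaker 0 talks first, then speaker 1.
toy_probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
print(probs_to_segments(toy_probs))
```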
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ## Train and evaluate Sortformer diarizer using NeMo | 
					
						
						|  | ### Training | 
					
						
						|  |  | 
					
						
						|  | Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4. | 
					
						
						|  | The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml). | 
					
						
						|  |  | 
					
						
						|  | ### Inference | 
					
						
						|  |  | 
					
						
						|  | Inference with Sortformer diarizer models, with or without post-processing, can be run using the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). Provide the post-processing YAML configs in the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset. | 
					
						
						|  |  | 
					
						
						|  | ### Technical Limitations | 
					
						
						|  |  | 
					
						
						|  | - The model operates in a streaming mode (online mode). | 
					
						
						|  | - It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers. | 
					
						
						|  | - While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings. | 
					
						
						|  | - The model was trained on publicly available speech datasets, primarily in English. As a result: | 
					
						
						|  | * Performance may degrade on non-English speech. | 
					
						
						|  | * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions. | 
					
						
						|  |  | 
					
						
						|  | ## Datasets | 
					
						
						|  |  | 
					
						
						|  | Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7]. | 
					
						
						|  | All the datasets listed in this model card use the same labeling method, the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files used for model training was processed for speaker diarization training purposes. | 
					
						
						|  | Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods. | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ### Training Datasets (Real conversations) | 
					
						
						|  | - Fisher English (LDC) | 
					
						
						|  | - AMI Meeting Corpus | 
					
						
						|  | - VoxConverse-v0.3 | 
					
						
						|  | - ICSI | 
					
						
						|  | - AISHELL-4 | 
					
						
						|  | - Third DIHARD Challenge Development (LDC) | 
					
						
						|  | - 2000 NIST Speaker Recognition Evaluation, split1 (LDC) | 
					
						
						|  | - DiPCo | 
					
						
						|  | - AliMeeting | 
					
						
						|  |  | 
					
						
						|  | ### Training Datasets (Used to simulate audio mixtures) | 
					
						
						|  | - 2004-2010 NIST Speaker Recognition Evaluation (LDC) | 
					
						
						|  | - Librispeech | 
					
						
						|  |  | 
					
						
						|  | ## Performance | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ### Evaluation data specifications | 
					
						
						|  |  | 
					
						
						|  | | **Dataset**                | **Number of speakers** | **Number of Sessions** | | 
					
						
						|  | |----------------------------|------------------------|------------------------| | 
					
						
						|  | | **DIHARD III Eval <=4spk** | 1-4                    | 219                    | | 
					
						
						|  | | **DIHARD III Eval >=5spk** | 5-9                    | 40                     | | 
					
						
						|  | | **DIHARD III Eval full**   | 1-9                    | 259                    | | 
					
						
						|  | | **CALLHOME-part2 2spk**    | 2                      | 148                    | | 
					
						
						|  | | **CALLHOME-part2 3spk**    | 3                      | 74                     | | 
					
						
						|  | | **CALLHOME-part2 4spk**    | 4                      | 20                     | | 
					
						
						|  | | **CALLHOME-part2 5spk**    | 5                      | 5                      | | 
					
						
						|  | | **CALLHOME-part2 6spk**    | 6                      | 3                      | | 
					
						
						|  | | **CALLHOME-part2 full**    | 2-6                    | 250                    | | 
					
						
						|  | | **CH109**                  | 2                      | 109                    | | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ### Diarization Error Rate (DER) | 
					
						
						|  |  | 
					
						
						|  | * All evaluations include overlapping speech. | 
					
						
						|  | * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109. | 
					
						
						|  | * Post-Processing (PP) is optimized on two different held-out dataset splits. | 
					
						
						|  | - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval | 
					
						
						|  | - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109 | 
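For reference, DER is conventionally defined as the total duration of false alarm, missed speech, and speaker confusion divided by the total duration of reference speech. The symbols below are our own notation and do not come from the evaluation scripts:

```latex
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{ref}}} \times 100\%
```

Here $T_{\mathrm{FA}}$ is the false alarm duration, $T_{\mathrm{miss}}$ the missed speech duration, $T_{\mathrm{conf}}$ the speaker confusion duration, and $T_{\mathrm{ref}}$ the total reference speech duration. The collar tolerance excludes a window around each reference segment boundary from scoring.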
					
						
						|  |  | 
					
						
						|  | | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** | | 
					
						
						|  | |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------| | 
					
						
						|  | | 30.4s       | no   | 14.63                      | 40.74                      | 19.68                    | 6.27                    | 10.27                   | 12.30                   | 19.08                   | 28.09                   | 10.50                   | 5.03      | | 
					
						
						|  | | 30.4s       | yes  | 13.45                      | 41.40                      | 18.85                    | 5.34                    | 9.22                    | 11.29                   | 18.84                   | 27.29                   | 9.54                    | 4.61      | | 
					
						
						|  | | 10.0s       | no   | 14.90                      | 41.06                      | 19.96                    | 6.96                    | 11.05                   | 12.93                   | 20.47                   | 28.10                   | 11.21                   | 5.28      | | 
					
						
						|  | | 10.0s       | yes  | 13.75                      | 41.41                      | 19.10                    | 6.05                    | 9.88                    | 11.72                   | 19.66                   | 27.37                   | 10.15                   | 4.80      | | 
					
						
						|  | | 1.04s       | no   | 14.49                      | 42.22                      | 19.85                    | 7.51                    | 11.45                   | 13.75                   | 23.22                   | 29.22                   | 11.89                   | 5.37      | | 
					
						
						|  | | 1.04s       | yes  | 13.24                      | 42.56                      | 18.91                    | 6.57                    | 10.05                   | 12.44                   | 21.68                   | 28.74                   | 10.70                   | 4.88      | | 
					
						
						|  | | 0.32s       | no   | 14.64                      | 43.47                      | 20.19                    | 8.63                    | 12.91                   | 16.19                   | 29.40                   | 30.60                   | 13.57                   | 6.46      | | 
					
						
						|  | | 0.32s       | yes  | 13.44                      | 43.73                      | 19.28                    | 6.91                    | 10.45                   | 13.70                   | 27.04                   | 28.58                   | 11.38                   | 5.27      | | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ## NVIDIA Riva: Deployment | 
					
						
						|  |  | 
					
						
						|  | Streaming Sortformer is deployed via NVIDIA RIVA ASR - [Speech Recognition with Speaker Diarization](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html#speech-recognition-with-speaker-diarization) | 
					
						
						|  |  | 
					
						
						|  | [NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. | 
					
						
						|  | Additionally, Riva provides: | 
					
						
						|  |  | 
					
						
						|  | * World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours | 
					
						
						|  | * Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of the acoustic model, language model, and inverse text normalization | 
					
						
						|  | * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support | 
					
						
						|  |  | 
					
						
						|  | For more information on NVIDIA Riva, see the [list of supported models](https://huggingface.co/models?other=Riva). | 
					
						
						|  | Also check out the [Riva live demo](https://developer.nvidia.com/riva#demos). | 
					
						
						|  |  | 
					
						
						|  |  | 
					
						
						|  | ## References | 
					
						
						|  | [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656) | 
					
						
						|  |  | 
					
						
						|  | [2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446) | 
					
						
						|  |  | 
					
						
						|  | [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106) | 
					
						
						|  |  | 
					
						
						|  | [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084) | 
					
						
						|  |  | 
					
						
						|  | [5] [Attention is all you need](https://arxiv.org/abs/1706.03762) | 
					
						
						|  |  | 
					
						
						|  | [6] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) | 
					
						
						|  |  | 
					
						
						|  | [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371) | 
					
						
						|  |  | 
					
						
						|  | ## License | 
					
						
						|  |  | 
					
						
						|  | Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode) license. By downloading the public release version of the model, you accept the terms and conditions of the CC-BY-4.0 license. | 
					
						
						|  |  |