Update README.md
README.md CHANGED

@@ -178,7 +178,7 @@ img {
 </style>
 
 [](#model-architecture)
-| [](#datasets) -->
+| [](#model-architecture)
 <!-- | [](#datasets) -->
 
 This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

@@ -230,7 +230,7 @@ The model is available for use in the NeMo Framework[6], and can be used as a pr
 
 ### Loading the Model
 
-```
+```python3
 from nemo.collections.asr.models import SortformerEncLabelModel
 
 # load the model directly from the Hugging Face model card (you need a Hugging Face token)
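The hunk above shows only the import and a leading comment; the checkpoint-loading call itself sits in the README lines elided between this hunk and the next. As a hedged sketch of what that step typically looks like with NeMo's `from_pretrained` (the model-card name below is a placeholder, not taken from this diff):

```python3
from nemo.collections.asr.models import SortformerEncLabelModel

# Placeholder card name -- substitute the actual Hugging Face model card for
# this checkpoint; it is not visible in the diff context above.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/<model-card-name>")

# Switch to inference mode, matching the `diar_model.eval()` context line in the next hunk.
diar_model.eval()
```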

@@ -245,15 +245,15 @@ diar_model.eval()
 
 ### Input Format
 Input to Sortformer can be an individual audio file:
-```
+```python3
 audio_input="/path/to/multispeaker_audio1.wav"
 ```
 or a list of paths to audio files:
-```
+```python3
 audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
 ```
 or a JSONL manifest file:
-```
+```python3
 audio_input="/path/to/multispeaker_manifest.json"
 ```
 where each line is a dictionary containing the following fields:
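The dictionary fields themselves fall in the README lines elided between this hunk and the next, so they are not visible here. As an illustration only, with field names assumed from common NeMo diarization manifests rather than taken from this diff, a one-line JSONL entry could be produced like this:

```python3
import json

# Assumed manifest fields (not confirmed by this diff): common NeMo diarization
# manifests use at least an audio path plus optional offset/duration.
entry = {
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path from the examples above
    "offset": 0.0,       # start offset within the file, in seconds
    "duration": None,    # None typically means "use the full file"
}

# JSONL: exactly one JSON object per line.
with open("/path/to/multispeaker_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```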

@@ -271,6 +271,45 @@ where each line is a dictionary containing the following fields:
 }
 ```
 
+### Setting up Streaming Configuration
+
+Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
+* **CHUNK_SIZE**: The number of frames in a processing chunk.
+* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
+* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
+* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
+* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.
+
+Here are recommended configurations for different scenarios:
+| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
+| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
+| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
+| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
+| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
+
+For clarity on the metrics used in the table:
+* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
+* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
+
+To set the streaming configuration, use:
+```python3
+diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
+diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
+diar_model.sortformer_modules.fifo_len = FIFO_SIZE
+diar_model.sortformer_modules.spkcache_refresh_rate = UPDATE_PERIOD
+diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
+```
+
+### Getting Diarization Results
+To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
+```python3
+predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
+```
+To obtain tensors of speaker activity probabilities, use:
+```python3
+predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
+```
+
 
 ### Input
 
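Two details in the added section are easy to sanity-check. First, the **Latency** column of the configuration table is plain arithmetic over the frame counts: with 80ms frames, Input Buffer Latency = (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08s. A minimal sketch reproducing all three presets:

```python3
# Reproduce the Latency column of the table above from its frame counts.
# One frame is 80 ms; input buffer latency = CHUNK_SIZE + RIGHT_CONTEXT frames.
FRAME_SEC = 0.08

presets = {
    "high latency":      (124, 1),  # (CHUNK_SIZE, RIGHT_CONTEXT)
    "low latency":       (6, 7),
    "ultra low latency": (3, 1),
}

for name, (chunk_size, right_context) in presets.items():
    latency_sec = (chunk_size + right_context) * FRAME_SEC
    print(f"{name}: {latency_sec:.2f}s")  # -> 10.00s, 1.04s, 0.32s
```

Second, the 'begin_seconds, end_seconds, speaker_index' format of `diarize()` output suggests a straightforward post-processing loop. A hedged sketch, assuming one list of segment strings per input file and whitespace-separated fields (neither the container type nor the separator is confirmed by this diff):

```python3
# `diar_model` and `audio_input` are defined in the sections above.
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)

for file_segments in predicted_segments:   # assumed: one list per input file
    for segment in file_segments:          # assumed shape: "0.08 3.12 speaker_0"
        begin_s, end_s, speaker = segment.split()
        print(f"{speaker}: {float(begin_s):.2f}s - {float(end_s):.2f}s")
```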

@@ -345,17 +384,6 @@ Data collection methods vary across individual datasets. For example, the above
 | **CALLHOME-part2 full** | 2-6 | 250 |
 | **CH109** | 2 | 109 |
 
-### Latency setups and Real Time Factor (RTF)
-
-* **Configuration Parameters**: Each setup is defined by its **Chunk Size**, **Right Context**, **FIFO Queue**, **Update Period**, and **Speaker Cache**. The value for each parameter represents the number of 80ms frames.
-* **Latency**: Refers to **Input Buffer Latency**, calculated as **Chunk Size** + **Right Context**. This value excludes computational processing time.
-* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
-
-| **Latency** | **Chunk Size** | **Right Context** | **FIFO Queue** | **Update Period** | **Speaker Cache** | **RTF** |
-|-------------|----------------|-------------------|----------------|-------------------|-------------------|---------|
-| 10.0s | 124 | 1 | 124 | 124 | 188 | 0.005 |
-| 1.04s | 6 | 7 | 188 | 144 | 188 | 0.093 |
-| 0.32s | 3 | 1 | 188 | 144 | 188 | 0.180 |
 
 ### Diarization Error Rate (DER)
 