Update README.md
README.md CHANGED

@@ -178,7 +178,7 @@ img {
 </style>
 
 [](#model-architecture)
-| [](#datasets) -->
+| [](#model-architecture)
 <!-- | [](#datasets) -->
 
 This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

@@ -230,7 +230,7 @@ The model is available for use in the NeMo Framework[6], and can be used as a pr
 
 ### Loading the Model
 
-```
+```python3
 from nemo.collections.asr.models import SortformerEncLabelModel
 
 # load the model directly from the Hugging Face model card (you need a Hugging Face token)
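The hunk above shows only the import and a leading comment; the checkpoint-loading call itself sits in the README lines elided between this hunk and the next. As a hedged sketch of what that step typically looks like with NeMo's `from_pretrained` (the model-card name below is a placeholder, not taken from this diff):

```python3
from nemo.collections.asr.models import SortformerEncLabelModel

# Placeholder card name -- substitute the actual Hugging Face model card for
# this checkpoint; it is not visible in the diff context above.
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/<model-card-name>")

# Switch to inference mode, matching the `diar_model.eval()` context line in the next hunk.
diar_model.eval()
```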

@@ -245,15 +245,15 @@ diar_model.eval()
 
 ### Input Format
 Input to Sortformer can be an individual audio file:
-```
+```python3
 audio_input="/path/to/multispeaker_audio1.wav"
 ```
 or a list of paths to audio files:
-```
+```python3
 audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
 ```
 or a JSONL manifest file:
-```
+```python3
 audio_input="/path/to/multispeaker_manifest.json"
 ```
 where each line is a dictionary containing the following fields:
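The dictionary fields themselves fall in the README lines elided between this hunk and the next, so they are not visible here. As an illustration only, with field names assumed from common NeMo diarization manifests rather than taken from this diff, a one-line JSONL entry could be produced like this:

```python3
import json

# Assumed manifest fields (not confirmed by this diff): common NeMo diarization
# manifests use at least an audio path plus optional offset/duration.
entry = {
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path from the examples above
    "offset": 0.0,       # start offset within the file, in seconds
    "duration": None,    # None typically means "use the full file"
}

# JSONL: exactly one JSON object per line.
with open("/path/to/multispeaker_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```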

@@ -271,6 +271,45 @@ where each line is a dictionary containing the following fields:
 }
 ```
 
+### Setting up Streaming Configuration
+
+Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
+* **CHUNK_SIZE**: The number of frames in a processing chunk.
+* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
+* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
+* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
+* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.
+
+Here are recommended configurations for different scenarios:
+| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
+| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
+| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
+| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
+| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
+
+For clarity on the metrics used in the table:
+* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT**. This value does not include computational processing time.
+* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
+
+To set the streaming configuration, use:
+```python3
+diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
+diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
+diar_model.sortformer_modules.fifo_len = FIFO_SIZE
+diar_model.sortformer_modules.spkcache_refresh_rate = UPDATE_PERIOD
+diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
+```
+
+### Getting Diarization Results
+To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
+```python3
+predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
+```
+To obtain tensors of speaker activity probabilities, use:
+```python3
+predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
+```
+
 
 ### Input
 
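Two details in the added section are easy to sanity-check. First, the **Latency** column of the configuration table is plain arithmetic over the frame counts: with 80ms frames, Input Buffer Latency = (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08s. A minimal sketch reproducing all three presets:

```python3
# Reproduce the Latency column of the table above from its frame counts.
# One frame is 80 ms; input buffer latency = CHUNK_SIZE + RIGHT_CONTEXT frames.
FRAME_SEC = 0.08

presets = {
    "high latency":      (124, 1),  # (CHUNK_SIZE, RIGHT_CONTEXT)
    "low latency":       (6, 7),
    "ultra low latency": (3, 1),
}

for name, (chunk_size, right_context) in presets.items():
    latency_sec = (chunk_size + right_context) * FRAME_SEC
    print(f"{name}: {latency_sec:.2f}s")  # -> 10.00s, 1.04s, 0.32s
```

Second, the 'begin_seconds, end_seconds, speaker_index' format of `diarize()` output suggests a straightforward post-processing loop. A hedged sketch, assuming one list of segment strings per input file and whitespace-separated fields (neither the container type nor the separator is confirmed by this diff):

```python3
# `diar_model` and `audio_input` are defined in the sections above.
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)

for file_segments in predicted_segments:   # assumed: one list per input file
    for segment in file_segments:          # assumed shape: "0.08 3.12 speaker_0"
        begin_s, end_s, speaker = segment.split()
        print(f"{speaker}: {float(begin_s):.2f}s - {float(end_s):.2f}s")
```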

@@ -345,17 +384,6 @@ Data collection methods vary across individual datasets. For example, the above
 | **CALLHOME-part2 full** | 2-6 | 250 |
 | **CH109** | 2 | 109 |
 
-### Latency setups and Real Time Factor (RTF)
-
-* **Configuration Parameters**: Each setup is defined by its **Chunk Size**, **Right Context**, **FIFO Queue**, **Update Period**, and **Speaker Cache**. The value for each parameter represents the number of 80ms frames.
-* **Latency**: Refers to **Input Buffer Latency**, calculated as **Chunk Size** + **Right Context**. This value excludes computational processing time.
-* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.
-
-| **Latency** | **Chunk Size** | **Right Context** | **FIFO Queue** | **Update Period** | **Speaker Cache** | **RTF** |
-|-------------|----------------|-------------------|----------------|-------------------|-------------------|---------|
-| 10.0s | 124 | 1 | 124 | 124 | 188 | 0.005 |
-| 1.04s | 6 | 7 | 188 | 144 | 188 | 0.093 |
-| 0.32s | 3 | 1 | 188 | 144 | 188 | 0.180 |
 
 ### Diarization Error Rate (DER)
 