Update README.md (#3)
- Update README.md (ded15229032d2c71e3a611bfb5c503e56dd7ccb7)
Co-authored-by: Ivan Medennikov <[email protected]>
README.md
CHANGED
@@ -45,7 +45,7 @@ model-index:
|
|
45 |
metrics:
|
46 |
- name: Test DER
|
47 |
type: der
|
48 |
-
value: 13.
|
49 |
- task:
|
50 |
name: Speaker Diarization
|
51 |
type: speaker-diarization-with-post-processing
|
@@ -58,7 +58,7 @@ model-index:
|
|
58 |
metrics:
|
59 |
- name: Test DER
|
60 |
type: der
|
61 |
-
value: 42.
|
62 |
- task:
|
63 |
name: Speaker Diarization
|
64 |
type: speaker-diarization-with-post-processing
|
@@ -71,7 +71,7 @@ model-index:
|
|
71 |
metrics:
|
72 |
- name: Test DER
|
73 |
type: der
|
74 |
-
value: 18.
|
75 |
- task:
|
76 |
name: Speaker Diarization
|
77 |
type: speaker-diarization-with-post-processing
|
@@ -84,7 +84,7 @@ model-index:
|
|
84 |
metrics:
|
85 |
- name: Test DER
|
86 |
type: der
|
87 |
-
value: 6.
|
88 |
- task:
|
89 |
name: Speaker Diarization
|
90 |
type: speaker-diarization-with-post-processing
|
@@ -97,7 +97,7 @@ model-index:
|
|
97 |
metrics:
|
98 |
- name: Test DER
|
99 |
type: der
|
100 |
-
value: 10.
|
101 |
- task:
|
102 |
name: Speaker Diarization
|
103 |
type: speaker-diarization-with-post-processing
|
@@ -110,7 +110,7 @@ model-index:
|
|
110 |
metrics:
|
111 |
- name: Test DER
|
112 |
type: der
|
113 |
-
value: 12.
|
114 |
- task:
|
115 |
name: Speaker Diarization
|
116 |
type: speaker-diarization-with-post-processing
|
@@ -123,7 +123,7 @@ model-index:
|
|
123 |
metrics:
|
124 |
- name: Test DER
|
125 |
type: der
|
126 |
-
value:
|
127 |
- task:
|
128 |
name: Speaker Diarization
|
129 |
type: speaker-diarization-with-post-processing
|
@@ -136,7 +136,7 @@ model-index:
|
|
136 |
metrics:
|
137 |
- name: Test DER
|
138 |
type: der
|
139 |
-
value:
|
140 |
- task:
|
141 |
name: Speaker Diarization
|
142 |
type: speaker-diarization-with-post-processing
|
@@ -149,7 +149,7 @@ model-index:
|
|
149 |
metrics:
|
150 |
- name: Test DER
|
151 |
type: der
|
152 |
-
value: 10.
|
153 |
- task:
|
154 |
name: Speaker Diarization
|
155 |
type: speaker-diarization-with-post-processing
|
@@ -162,7 +162,7 @@ model-index:
|
|
162 |
metrics:
|
163 |
- name: Test DER
|
164 |
type: der
|
165 |
-
value:
|
166 |
metrics:
|
167 |
- der
|
168 |
pipeline_tag: audio-classification
|
@@ -187,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
|
|
187 |
<img src="figures/sortformer_intro.png" width="750" />
|
188 |
</div>
|
189 |
|
190 |
-
[Streaming Sortformer](https://arxiv.org/abs/
|
191 |
<div align="center">
|
192 |
<img src="figures/streaming_sortformer_ani.gif" width="1400" />
|
193 |
</div>
|
@@ -205,7 +205,7 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
|
|
205 |
|
206 |
Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
|
207 |
Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
|
208 |
-
and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/
|
209 |
|
210 |
<div align="center">
|
211 |
<img src="figures/sortformer-v1-model.png" width="450" />
|
@@ -283,6 +283,7 @@ Streaming configuration is defined by the following parameters, all measured in
|
|
283 |
Here are recommended configurations for different scenarios:
|
284 |
| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
|
285 |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
|
|
|
286 |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
|
287 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
|
288 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
|
@@ -390,17 +391,19 @@ Data collection methods vary across individual datasets. For example, the above
|
|
390 |
* All evaluations include overlapping speech.
|
391 |
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
|
392 |
* Post-Processing (PP) is optimized on two different held-out dataset splits.
|
393 |
-
- [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/
|
394 |
-
- [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/
|
395 |
|
396 |
| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
397 |
|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
398 |
-
|
|
399 |
-
|
|
400 |
-
|
|
401 |
-
|
|
402 |
-
|
|
403 |
-
|
|
|
|
|
|
404 |
|
405 |
|
406 |
## NVIDIA Riva: Deployment
|
@@ -419,7 +422,7 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
|
|
419 |
## References
|
420 |
[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
|
421 |
|
422 |
-
[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/
|
423 |
|
424 |
[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
|
425 |
|
|
|
45 |
metrics:
|
46 |
- name: Test DER
|
47 |
type: der
|
48 |
+
value: 13.24
|
49 |
- task:
|
50 |
name: Speaker Diarization
|
51 |
type: speaker-diarization-with-post-processing
|
|
|
58 |
metrics:
|
59 |
- name: Test DER
|
60 |
type: der
|
61 |
+
value: 42.56
|
62 |
- task:
|
63 |
name: Speaker Diarization
|
64 |
type: speaker-diarization-with-post-processing
|
|
|
71 |
metrics:
|
72 |
- name: Test DER
|
73 |
type: der
|
74 |
+
value: 18.91
|
75 |
- task:
|
76 |
name: Speaker Diarization
|
77 |
type: speaker-diarization-with-post-processing
|
|
|
84 |
metrics:
|
85 |
- name: Test DER
|
86 |
type: der
|
87 |
+
value: 6.57
|
88 |
- task:
|
89 |
name: Speaker Diarization
|
90 |
type: speaker-diarization-with-post-processing
|
|
|
97 |
metrics:
|
98 |
- name: Test DER
|
99 |
type: der
|
100 |
+
value: 10.05
|
101 |
- task:
|
102 |
name: Speaker Diarization
|
103 |
type: speaker-diarization-with-post-processing
|
|
|
110 |
metrics:
|
111 |
- name: Test DER
|
112 |
type: der
|
113 |
+
value: 12.44
|
114 |
- task:
|
115 |
name: Speaker Diarization
|
116 |
type: speaker-diarization-with-post-processing
|
|
|
123 |
metrics:
|
124 |
- name: Test DER
|
125 |
type: der
|
126 |
+
value: 21.68
|
127 |
- task:
|
128 |
name: Speaker Diarization
|
129 |
type: speaker-diarization-with-post-processing
|
|
|
136 |
metrics:
|
137 |
- name: Test DER
|
138 |
type: der
|
139 |
+
value: 28.74
|
140 |
- task:
|
141 |
name: Speaker Diarization
|
142 |
type: speaker-diarization-with-post-processing
|
|
|
149 |
metrics:
|
150 |
- name: Test DER
|
151 |
type: der
|
152 |
+
value: 10.70
|
153 |
- task:
|
154 |
name: Speaker Diarization
|
155 |
type: speaker-diarization-with-post-processing
|
|
|
162 |
metrics:
|
163 |
- name: Test DER
|
164 |
type: der
|
165 |
+
value: 4.88
|
166 |
metrics:
|
167 |
- der
|
168 |
pipeline_tag: audio-classification
|
|
|
187 |
<img src="figures/sortformer_intro.png" width="750" />
|
188 |
</div>
|
189 |
|
190 |
+
[Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
|
191 |
<div align="center">
|
192 |
<img src="figures/streaming_sortformer_ani.gif" width="1400" />
|
193 |
</div>
|
|
|
205 |
|
206 |
Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
|
207 |
Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
|
208 |
+
and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].
|
209 |
|
210 |
<div align="center">
|
211 |
<img src="figures/sortformer-v1-model.png" width="450" />
|
|
|
283 |
Here are recommended configurations for different scenarios:
|
284 |
| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
|
285 |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
|
286 |
+
| very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
|
287 |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
|
288 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
|
289 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
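
The **Latency** column appears to follow directly from **CHUNK_SIZE** and **RIGHT_CONTEXT**. The sketch below assumes that one frame spans 80 ms and that latency is the chunk length plus the right context. Neither assumption is stated in this excerpt, but together they reproduce all four rows of the table. The helper name and the `FRAME_SEC` constant are illustrative and not part of NeMo.

```python
# Assumption: one frame = 80 ms. Inferred from the table values, not quoted from the README.
FRAME_SEC = 0.08

# (CHUNK_SIZE, RIGHT_CONTEXT) per configuration, copied from the table above.
CONFIGS = {
    "very high latency": (340, 40),
    "high latency": (124, 1),
    "low latency": (6, 7),
    "ultra low latency": (3, 1),
}


def input_buffer_latency(chunk_size: int, right_context: int,
                         frame_sec: float = FRAME_SEC) -> float:
    """Hypothetical helper: seconds of audio buffered before a chunk is processed."""
    return round((chunk_size + right_context) * frame_sec, 2)


# Latency in seconds for every recommended configuration.
LATENCIES = {name: input_buffer_latency(*frames) for name, frames in CONFIGS.items()}

if __name__ == "__main__":
    for name, seconds in LATENCIES.items():
        print(f"{name}: {seconds:.2f}s")
```

Under these assumptions, the other columns (**FIFO_SIZE**, **UPDATE_PERIOD**, **SPEAKER_CACHE_SIZE**) do not enter the latency figure. This is consistent with the low and ultra low latency rows, which share those three values but differ in latency.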
|
|
|
391 |
* All evaluations include overlapping speech.
|
392 |
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
|
393 |
* Post-Processing (PP) is optimized on two different held-out dataset splits.
|
394 |
+
- [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
|
395 |
+
- [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109
|
396 |
|
397 |
| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
398 |
|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
399 |
+
| 30.4s | no | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
|
400 |
+
| 30.4s | yes | 13.45 | 41.40 | 18.85 | 5.34 | 9.22 | 11.29 | 18.84 | 27.29 | 9.54 | 4.61 |
|
401 |
+
| 10.0s | no | 14.90 | 41.06 | 19.96 | 6.96 | 11.05 | 12.93 | 20.47 | 28.10 | 11.21 | 5.28 |
|
402 |
+
| 10.0s | yes | 13.75 | 41.41 | 19.10 | 6.05 | 9.88 | 11.72 | 19.66 | 27.37 | 10.15 | 4.80 |
|
403 |
+
| 1.04s | no | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
|
404 |
+
| 1.04s | yes | 13.24 | 42.56 | 18.91 | 6.57 | 10.05 | 12.44 | 21.68 | 28.74 | 10.70 | 4.88 |
|
405 |
+
| 0.32s | no | 14.64 | 43.47 | 20.19 | 8.63 | 12.91 | 16.19 | 29.40 | 30.60 | 13.57 | 6.46 |
|
406 |
+
| 0.32s | yes | 13.44 | 43.73 | 19.28 | 6.91 | 10.45 | 13.70 | 27.04 | 28.58 | 11.38 | 5.27 |
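
To see how much post-processing changes DER at each latency, the rows above can be compared pairwise. The sketch below copies only the three "full"/CH109 columns from the table and computes the absolute reduction in percentage points. The `pp_gain` helper and the dictionary layout are illustrative, not part of any NeMo tooling.

```python
# DER (%) without ("no") and with ("yes") post-processing, copied from the table above.
# Tuple order matches DATASETS.
DATASETS = ("DIHARD III Eval full", "CALLHOME-part2 full", "CH109")
DER = {
    "30.4s": {"no": (19.68, 10.50, 5.03), "yes": (18.85, 9.54, 4.61)},
    "10.0s": {"no": (19.96, 11.21, 5.28), "yes": (19.10, 10.15, 4.80)},
    "1.04s": {"no": (19.85, 11.89, 5.37), "yes": (18.91, 10.70, 4.88)},
    "0.32s": {"no": (20.19, 13.57, 6.46), "yes": (19.28, 11.38, 5.27)},
}


def pp_gain(latency: str) -> dict:
    """Hypothetical helper: absolute DER reduction (percentage points) from post-processing."""
    row = DER[latency]
    return {
        name: round(without_pp - with_pp, 2)
        for name, without_pp, with_pp in zip(DATASETS, row["no"], row["yes"])
    }


if __name__ == "__main__":
    for latency in DER:
        print(latency, pp_gain(latency))
```

These three columns improve with post-processing at every latency. The **DIHARD III Eval >=5spk** column, not included in the sketch, moves the other way in the table (for example 40.74 to 41.40 at 30.4s).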
|
407 |
|
408 |
|
409 |
## NVIDIA Riva: Deployment
|
|
|
422 |
## References
|
423 |
[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
|
424 |
|
425 |
+
[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
|
426 |
|
427 |
[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
|
428 |
|