taejinp imedennikov committed on
Commit 4b8741e · verified · 1 Parent(s): 8a9bb29

Update README.md (#3)

- Update README.md (ded15229032d2c71e3a611bfb5c503e56dd7ccb7)


Co-authored-by: Ivan Medennikov <[email protected]>

Files changed (1)
  1. README.md +24 -21
README.md CHANGED
@@ -45,7 +45,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 13.32
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -58,7 +58,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 42.61
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -71,7 +71,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 18.97
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -84,7 +84,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 6.43
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -97,7 +97,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 10.26
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -110,7 +110,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 12.40
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -123,7 +123,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 24.41
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -136,7 +136,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 27.78
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -149,7 +149,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 10.79
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing
@@ -162,7 +162,7 @@ model-index:
  metrics:
  - name: Test DER
  type: der
- value: 5.09
  metrics:
  - der
  pipeline_tag: audio-classification
@@ -187,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
  <img src="figures/sortformer_intro.png" width="750" />
  </div>
 
- [Streaming Sortformer](https://arxiv.org/abs/25XX.XXXXX)[2] approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
  <div align="center">
  <img src="figures/streaming_sortformer_ani.gif" width="1400" />
  </div>
@@ -205,7 +205,7 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
 
  Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
  Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
- and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/25XX.XXXXX)[2].
 
  <div align="center">
  <img src="figures/sortformer-v1-model.png" width="450" />
@@ -283,6 +283,7 @@ Streaming configuration is defined by the following parameters, all measured in
  Here are recommended configurations for different scenarios:
  | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
  | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
  | high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
  | low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
  | ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
@@ -390,17 +391,19 @@ Data collection methods vary across individual datasets. For example, the above
  * All evaluations include overlapping speech.
  * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
  * Post-Processing (PP) is optimized on two different held-out dataset splits.
- - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml) for DIHARD III Eval
- - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml) for CALLHOME-part2 and CH109
 
  | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
  |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
- | 10.0s | no | 14.79 | 41.06 | 19.88 | 6.80 | 11.27 | 12.21 | 21.12 | 27.84 | 11.10 | 5.27 |
- | 10.0s | yes | 13.67 | 41.45 | 19.02 | 6.06 | 10.01 | 11.22 | 20.34 | 26.97 | 10.09 | 4.82 |
- | 1.04s | no | 14.57 | 42.12 | 19.89 | 7.35 | 11.57 | 13.83 | 25.81 | 29.06 | 12.00 | 5.59 |
- | 1.04s | yes | 13.32 | 42.61 | 18.97 | 6.43 | 10.26 | 12.40 | 24.41 | 27.78 | 10.79 | 5.09 |
- | 0.32s | no | 14.63 | 43.76 | 20.25 | 8.60 | 13.23 | 16.08 | 28.10 | 30.63 | 13.66 | 6.60 |
- | 0.32s | yes | 13.43 | 43.98 | 19.32 | 6.86 | 10.84 | 13.64 | 25.78 | 28.58 | 11.50 | 5.41 |
 
 
  ## NVIDIA Riva: Deployment
@@ -419,7 +422,7 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
  ## References
  [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
 
- [2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/25XX.XXXXX)
 
  [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
 
 
  metrics:
  - name: Test DER
  type: der
+ value: 13.24
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 42.56
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 18.91
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 6.57
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 10.05
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 12.44
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 21.68
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 28.74
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 10.70
  - task:
  name: Speaker Diarization
  type: speaker-diarization-with-post-processing

  metrics:
  - name: Test DER
  type: der
+ value: 4.88
  metrics:
  - der
  pipeline_tag: audio-classification
 
  <img src="figures/sortformer_intro.png" width="750" />
  </div>
 
+ [Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
  <div align="center">
  <img src="figures/streaming_sortformer_ani.gif" width="1400" />
  </div>
 
 
  Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
  Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
+ and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].
 
  <div align="center">
  <img src="figures/sortformer-v1-model.png" width="450" />
 
  Here are recommended configurations for different scenarios:
  | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
  | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
+ | very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
  | high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
  | low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
  | ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
 
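Review note on the configuration hunk above: the Latency column looks derivable from CHUNK_SIZE and RIGHT_CONTEXT alone. The sketch below is a sanity check, not NeMo code. It assumes one frame spans 0.08 s (the hunk header is truncated before the unit, so this value is inferred from the rows themselves) and that latency equals `CHUNK_SIZE + RIGHT_CONTEXT` frames. `FRAME_SEC` and `latency_seconds` are illustrative names.

```python
# Assumption: one frame = 0.08 s. The diff truncates the unit in the hunk
# header, so this is inferred from the table rows, not quoted from the README.
FRAME_SEC = 0.08


def latency_seconds(chunk_size: int, right_context: int) -> float:
    """Assumed latency model: (chunk + right-context) frames, in seconds."""
    return round((chunk_size + right_context) * FRAME_SEC, 2)


# name -> (CHUNK_SIZE, RIGHT_CONTEXT, latency in seconds listed in the table)
CONFIGS = {
    "very high latency": (340, 40, 30.4),
    "high latency": (124, 1, 10.0),
    "low latency": (6, 7, 1.04),
    "ultra low latency": (3, 1, 0.32),
}

for name, (chunk, right, listed) in CONFIGS.items():
    computed = latency_seconds(chunk, right)
    status = "ok" if computed == listed else "MISMATCH"
    print(f"{name}: computed {computed:.2f}s, listed {listed:.2f}s [{status}]")
```

Under these assumptions the added "very high latency" row (340 + 40 frames, 30.4s) is consistent with the three existing rows.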
  * All evaluations include overlapping speech.
  * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
  * Post-Processing (PP) is optimized on two different held-out dataset splits.
+ - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
+ - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109
 
  | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
  |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
+ | 30.4s | no | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
+ | 30.4s | yes | 13.45 | 41.40 | 18.85 | 5.34 | 9.22 | 11.29 | 18.84 | 27.29 | 9.54 | 4.61 |
+ | 10.0s | no | 14.90 | 41.06 | 19.96 | 6.96 | 11.05 | 12.93 | 20.47 | 28.10 | 11.21 | 5.28 |
+ | 10.0s | yes | 13.75 | 41.41 | 19.10 | 6.05 | 9.88 | 11.72 | 19.66 | 27.37 | 10.15 | 4.80 |
+ | 1.04s | no | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
+ | 1.04s | yes | 13.24 | 42.56 | 18.91 | 6.57 | 10.05 | 12.44 | 21.68 | 28.74 | 10.70 | 4.88 |
+ | 0.32s | no | 14.64 | 43.47 | 20.19 | 8.63 | 12.91 | 16.19 | 29.40 | 30.60 | 13.57 | 6.46 |
+ | 0.32s | yes | 13.44 | 43.73 | 19.28 | 6.91 | 10.45 | 13.70 | 27.04 | 28.58 | 11.38 | 5.27 |
 
 
  ## NVIDIA Riva: Deployment
 
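Review note on the DER table hunk above: the removed and added `1.04s | yes` rows carry the same ten numbers as the `model-index` values changed at the top of this diff, so a per-column delta summarizes what the commit changes in the reported metrics. The sketch below copies those two rows verbatim; `der_deltas` and the shortened column labels are illustrative names, not part of the README.

```python
# DER (%) at 1.04s latency with post-processing, copied from the removed (-)
# and added (+) table rows in this diff. Column labels are shortened here.
COLUMNS = [
    "DIHARD3 <=4spk", "DIHARD3 >=5spk", "DIHARD3 full",
    "CH-part2 2spk", "CH-part2 3spk", "CH-part2 4spk",
    "CH-part2 5spk", "CH-part2 6spk", "CH-part2 full", "CH109",
]
OLD = [13.32, 42.61, 18.97, 6.43, 10.26, 12.40, 24.41, 27.78, 10.79, 5.09]
NEW = [13.24, 42.56, 18.91, 6.57, 10.05, 12.44, 21.68, 28.74, 10.70, 4.88]


def der_deltas(old: list[float], new: list[float]) -> list[float]:
    """Return new - old per column; a negative value means DER went down."""
    return [round(n - o, 2) for o, n in zip(old, new)]


for column, delta in zip(COLUMNS, der_deltas(OLD, NEW)):
    print(f"{column:>15}: {delta:+.2f}")
```

Seven of the ten columns move down, with the largest change on CALLHOME-part2 5spk (24.41 to 21.68); the 2spk, 4spk, and 6spk columns move up slightly.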
  ## References
  [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
 
+ [2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
 
  [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)