Update README.md
README.md
CHANGED
|
@@ -18,7 +18,9 @@ padding: 0;
|
|
| 18 |
| [](#model-architecture)
|
| 19 |
| [](#datasets)
|
| 20 |
|
| 21 |
-
The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis.
|
|
|
|
|
|
|
| 22 |
|
| 23 |
| Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
|
| 24 |
|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
|
|
@@ -102,7 +104,7 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
|
|
| 102 |
|
| 103 |
## Performance
|
| 104 |
|
| 105 |
-
We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR
|
| 106 |
|
| 107 |
| Dataset | ViSQOL |PESQ |ESTOI |Mel Distance |STFT Distance|SI-SDR|
|
| 108 |
|:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
|
|
|
|
| 18 |
| [](#model-architecture)
|
| 19 |
| [](#datasets)
|
| 20 |
|
| 21 |
+
The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis.
|
| 22 |
+
|
| 23 |
+
The model works with full-bandwidth 22.05 kHz speech. It may perform worse on low-bandwidth speech (e.g., 16 kHz speech upsampled to 22.05 kHz) or on non-speech audio.
|
| 24 |
|
| 25 |
| Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
|
| 26 |
|:-----------:|:----------:|:----------:|:-----------:|:-------------:|:-----------:|:------------:|
|
|
|
|
| 104 |
|
| 105 |
## Performance
|
| 106 |
|
| 107 |
+
We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and [SI-SDR](https://arxiv.org/abs/1811.02508) for phase reconstruction accuracy. Metrics are reported on the test sets of both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
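
For readers unfamiliar with SI-SDR, the following is a minimal NumPy sketch of the standard definition (project the estimate onto the reference, then compare target energy to residual energy in dB). It is an illustration only and not necessarily the implementation used to produce the numbers in this README; the function name `si_sdr`, the mean removal step, and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Assumption: remove the mean from both signals before comparing them.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target

    # Ratio of target energy to residual energy, expressed in decibels.
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


if __name__ == "__main__":
    # Synthetic 1-second tone at 22.05 kHz standing in for a reference waveform.
    rng = np.random.default_rng(0)
    t = np.arange(22050) / 22050
    ref = np.sin(2 * np.pi * 220 * t)
    decoded = ref + 0.1 * rng.standard_normal(ref.shape)
    print(f"SI-SDR: {si_sdr(decoded, ref):.2f} dB")
```

Because the reference is rescaled before the comparison, multiplying the decoded signal by a constant gain leaves the score essentially unchanged, whereas time misalignment between the two waveforms lowers it sharply.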
|
| 108 |
|
| 109 |
| Dataset | ViSQOL |PESQ |ESTOI |Mel Distance |STFT Distance|SI-SDR|
|
| 110 |
|:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
|