utter-project/mHuBERT-147

Feature Extraction


mzboito committed on Jun 4, 2024

Commit fd7154c (verified) · 1 Parent(s): 2b9a070

Update README.md

Files changed (1)

README.md +15 -14


@@ -125,7 +125,7 @@ language:
 ## mHuBERT-147 models
-mHuBERT-147 are compact and competitive multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
 This repository contains:
 * Fairseq checkpoint (original);
@@ -133,19 +133,6 @@ This repository contains:
 * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
-# Citing
-```
-@inproceedings{boito2024mhubert,
-author={Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu},
-title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
-year=2024,
-booktitle={Interspeech 2024},
-}
-```
 # Additional Information
@@ -159,6 +146,7 @@ Please note that since training, there were CommonVoice removal requests. This m
 **Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
 # Datasets Included
 For ASR/ST/TTS datasets, only train set is used.
@@ -178,6 +166,19 @@ For ASR/ST/TTS datasets, only train set is used.
 * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
 * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)
 # Funding
 This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.

 ## mHuBERT-147 models
+mHuBERT-147 are compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages.
 This repository contains:
 * Fairseq checkpoint (original);
 * Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).
 # Additional Information
 **Languages present not indexed by Huggingface:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb) Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akwapen Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
 # Datasets Included
 For ASR/ST/TTS datasets, only train set is used.
 * [VoxLingua107](https://bark.phon.ioc.ee/voxlingua107/)
 * [VoxPopuli](https://github.com/facebookresearch/voxpopuli/)
+# Citing
+```
+@inproceedings{boito2024mhubert,
+author={Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu},
+title={{mHuBERT-147: A Compact Multilingual HuBERT Model}},
+year=2024,
+booktitle={Interspeech 2024},
+}
+```
 # Funding
 This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) under grant number 101070631.