speech-test committed on
Commit b4d59ba
1 Parent(s): 2deeec5
Files changed (6)
  1. .gitignore +1 -0
  2. README.template.md +209 -0
  3. dataset_script.py +247 -0
  4. generate_datasets.py +96 -0
  5. languages.ftl +161 -0
  6. test.py +5 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ common_voice_*
README.template.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ pretty_name: {{NAME}}
+ annotations_creators:
+ - crowdsourced
+ language_creators:
+ - crowdsourced
+ languages:
+ {{LANGUAGES}}
+ licenses:
+ - cc0-1.0
+ multilinguality:
+ - multilingual
+ size_categories:
+ {{SIZES}}
+ source_datasets:
+ - extended|common_voice
+ task_categories:
+ - speech-processing
+ task_ids:
+ - automatic-speech-recognition
+ paperswithcode_id: common-voice
+ extra_gated_prompt: "By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset."
+ ---
+
+ # Dataset Card for {{NAME}}
+
+ ## Table of Contents
+ - [Dataset Description](#dataset-description)
+   - [Dataset Summary](#dataset-summary)
+   - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+   - [Languages](#languages)
+ - [Dataset Structure](#dataset-structure)
+   - [Data Instances](#data-instances)
+   - [Data Fields](#data-fields)
+   - [Data Splits](#data-splits)
+ - [Dataset Creation](#dataset-creation)
+   - [Curation Rationale](#curation-rationale)
+   - [Source Data](#source-data)
+   - [Annotations](#annotations)
+   - [Personal and Sensitive Information](#personal-and-sensitive-information)
+ - [Considerations for Using the Data](#considerations-for-using-the-data)
+   - [Social Impact of Dataset](#social-impact-of-dataset)
+   - [Discussion of Biases](#discussion-of-biases)
+   - [Other Known Limitations](#other-known-limitations)
+ - [Additional Information](#additional-information)
+   - [Dataset Curators](#dataset-curators)
+   - [Licensing Information](#licensing-information)
+   - [Citation Information](#citation-information)
+   - [Contributions](#contributions)
+
+ ## Dataset Description
+
+ - **Homepage:** https://commonvoice.mozilla.org/en/datasets
+ - **Repository:** https://github.com/common-voice/common-voice
+ - **Paper:** https://arxiv.org/abs/1912.06670
+ - **Leaderboard:** https://paperswithcode.com/dataset/common-voice
+ - **Point of Contact:** [Anton Lozhkov](mailto:[email protected])
+
+ ### Dataset Summary
+
+ The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file.
+ Many of the {{TOTAL_HRS}} recorded hours in the dataset also include demographic metadata like age, sex, and accent
+ that can help improve the accuracy of speech recognition engines.
+
+ The dataset currently consists of {{VAL_HRS}} validated hours in {{NUM_LANGS}} languages, but more voices and languages are always being added.
+ Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.
+
+ ### Supported Tasks and Leaderboards
+
+ The results for models trained on the Common Voice datasets are available via the
+ [Papers with Code Leaderboards](https://paperswithcode.com/dataset/common-voice).
+
+ ### Languages
+
+ ```
+ {{LANGUAGES_HUMAN}}
+ ```
+
+ ## Dataset Structure
+
+ ### Data Instances
+
+ A typical data point comprises the `path` to the audio file and its `sentence`.
+ Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.
+
+ ```python
+ {
+     'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
+     'path': 'et/clips/common_voice_et_18318995.mp3',
+     'audio': {
+         'path': 'et/clips/common_voice_et_18318995.mp3',
+         'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
+         'sampling_rate': 48000
+     },
+     'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
+     'up_votes': 2,
+     'down_votes': 0,
+     'age': 'twenties',
+     'gender': 'male',
+     'accent': '',
+     'locale': 'et',
+     'segment': ''
+ }
+ ```
+
+ ### Data Fields
+
+ `client_id` (`string`): An id for which client (voice) made the recording
+
+ `path` (`string`): The path to the audio file
+
+ `audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
+
+ `sentence` (`string`): The sentence the user was prompted to speak
+
+ `up_votes` (`int64`): How many upvotes the audio file has received from reviewers
+
+ `down_votes` (`int64`): How many downvotes the audio file has received from reviewers
+
+ `age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)
+
+ `gender` (`string`): The gender of the speaker
+
+ `accent` (`string`): Accent of the speaker
+
+ `locale` (`string`): The locale of the speaker
+
+ `segment` (`string`): Usually an empty field
+
+ ### Data Splits
+
+ The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.
+
+ The validated data is data that has been reviewed and received enough upvotes to be considered of high quality.
+
+ The invalidated data is data that has been reviewed and received downvotes indicating that it is of low quality.
+
+ The reported data is data that has been reported for various reasons.
+
+ The other data is data that has not yet been reviewed.
+
+ The dev, test and train splits are drawn from data that has been reviewed and deemed of high quality.
+
+ ## Dataset Creation
+
+ ### Curation Rationale
+
+ [Needs More Information]
+
+ ### Source Data
+
+ #### Initial Data Collection and Normalization
+
+ [Needs More Information]
+
+ #### Who are the source language producers?
+
+ [Needs More Information]
+
+ ### Annotations
+
+ #### Annotation process
+
+ [Needs More Information]
+
+ #### Who are the annotators?
+
+ [Needs More Information]
+
+ ### Personal and Sensitive Information
+
+ The dataset consists of voice recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
+
+ ## Considerations for Using the Data
+
+ ### Social Impact of Dataset
+
+ The dataset consists of voice recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
+
+ ### Discussion of Biases
+
+ [More Information Needed]
+
+ ### Other Known Limitations
+
+ [More Information Needed]
+
+ ## Additional Information
+
+ ### Dataset Curators
+
+ [More Information Needed]
+
+ ### Licensing Information
+
+ Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
+
+ ### Citation Information
+
+ ```
+ @inproceedings{commonvoice:2020,
+   author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+   title = {Common Voice: A Massively-Multilingual Speech Corpus},
+   booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+   pages = {4211--4215},
+   year = 2020
+ }
+ ```
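
The `audio` field note in the card above (decode on access; prefer `dataset[0]["audio"]` over `dataset["audio"][0]`) is easiest to see in code. A minimal sketch, assuming a dataset folder produced locally by `generate_datasets.py` (e.g. `common_voice_9_0`) and a valid Hugging Face auth token; the 16 kHz cast at the end is an illustrative extra, not something this commit does:

```python
from datasets import Audio, load_dataset

# Load one language config from the locally generated dataset folder.
cv = load_dataset("./common_voice_9_0", "et", split="test", use_auth_token=True)

# Index the row first, then the "audio" key: only this one clip gets decoded.
sample = cv[0]["audio"]
print(sample["sampling_rate"], len(sample["array"]))  # 48000, <number of samples>

# Optionally resample by casting the column; decoding still happens lazily per row.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
print(cv[0]["audio"]["sampling_rate"])  # 16000
```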
dataset_script.py ADDED
@@ -0,0 +1,247 @@
+ # coding=utf-8
+ # Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ Common Voice Dataset"""
+
+
+ import csv
+ import os
+ import urllib.parse
+
+ import datasets
+ import requests
+ from datasets.tasks import AutomaticSpeechRecognition
+ from datasets.utils.py_utils import size_str
+ from huggingface_hub import HfApi, HfFolder
+
+ from .languages import LANGUAGES
+ from .release_stats import STATS
+
+ _CITATION = """\
+ @inproceedings{commonvoice:2020,
+   author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+   title = {Common Voice: A Massively-Multilingual Speech Corpus},
+   booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+   pages = {4211--4215},
+   year = 2020
+ }
+ """
+
+ _HOMEPAGE = "https://commonvoice.mozilla.org/en/datasets"
+
+ _LICENSE = "https://creativecommons.org/publicdomain/zero/1.0/"
+
+ _API_URL = "https://commonvoice.mozilla.org/api/v1"
+
+
+ class CommonVoiceConfig(datasets.BuilderConfig):
+     """BuilderConfig for CommonVoice."""
+
+     def __init__(self, name, version, **kwargs):
+         self.language = kwargs.pop("language", None)
+         self.release_date = kwargs.pop("release_date", None)
+         self.num_clips = kwargs.pop("num_clips", None)
+         self.num_speakers = kwargs.pop("num_speakers", None)
+         self.validated_hr = kwargs.pop("validated_hr", None)
+         self.total_hr = kwargs.pop("total_hr", None)
+         self.size_bytes = kwargs.pop("size_bytes", None)
+         self.size_human = size_str(self.size_bytes)
+         description = (
+             f"Common Voice speech to text dataset in {self.language} released on {self.release_date}. "
+             f"The dataset comprises {self.validated_hr} hours of validated transcribed speech data "
+             f"out of {self.total_hr} hours in total from {self.num_speakers} speakers. "
+             f"The dataset contains {self.num_clips} audio clips and has a size of {self.size_human}."
+         )
+         super(CommonVoiceConfig, self).__init__(
+             name=name, version=datasets.Version(version), description=description, **kwargs
+         )
+
+
+ class CommonVoice(datasets.GeneratorBasedBuilder):
+     DEFAULT_CONFIG_NAME = "en"
+     DEFAULT_WRITER_BATCH_SIZE = 1000
+
+     BUILDER_CONFIGS = [
+         CommonVoiceConfig(
+             name=lang,
+             version=STATS["version"],
+             language=LANGUAGES[lang],
+             release_date=STATS["date"],
+             num_clips=lang_stats["clips"],
+             num_speakers=lang_stats["users"],
+             validated_hr=float(lang_stats["validHrs"]),
+             total_hr=float(lang_stats["totalHrs"]),
+             size_bytes=int(lang_stats["size"]),
+         )
+         for lang, lang_stats in STATS["locales"].items()
+     ]
+
+     def _info(self):
+         total_languages = len(STATS["locales"])
+         total_valid_hours = STATS["totalValidHrs"]
+         description = (
+             "Common Voice is Mozilla's initiative to help teach machines how real people speak. "
+             f"The dataset currently consists of {total_valid_hours} validated hours of speech "
+             f"in {total_languages} languages, but more voices and languages are always added."
+         )
+         features = datasets.Features(
+             {
+                 "client_id": datasets.Value("string"),
+                 "path": datasets.Value("string"),
+                 "audio": datasets.features.Audio(sampling_rate=48_000),
+                 "sentence": datasets.Value("string"),
+                 "up_votes": datasets.Value("int64"),
+                 "down_votes": datasets.Value("int64"),
+                 "age": datasets.Value("string"),
+                 "gender": datasets.Value("string"),
+                 "accent": datasets.Value("string"),
+                 "locale": datasets.Value("string"),
+                 "segment": datasets.Value("string"),
+             }
+         )
+
+         return datasets.DatasetInfo(
+             description=description,
+             features=features,
+             supervised_keys=None,
+             homepage=_HOMEPAGE,
+             license=_LICENSE,
+             citation=_CITATION,
+             version=self.config.version,
+             # task_templates=[
+             #     AutomaticSpeechRecognition(audio_file_path_column="path", transcription_column="sentence")
+             # ],
+         )
+
+     def _get_bundle_url(self, locale, url_template):
+         # path = encodeURIComponent(path)
+         path = url_template.replace("{locale}", locale)
+         path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
+         # use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
+         # response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
+         response = requests.get(f"{_API_URL}/bucket/dataset/{path}", timeout=10.0).json()
+         return response["url"]
+
+     def _log_download(self, locale, bundle_version, auth_token):
+         if isinstance(auth_token, bool):
+             auth_token = HfFolder().get_token()
+         whoami = HfApi().whoami(auth_token)
+         email = whoami["email"] if "email" in whoami else ""
+         payload = {"email": email, "locale": locale, "dataset": bundle_version}
+         requests.post(f"{_API_URL}/{locale}/downloaders", json=payload).json()
+
+     def _split_generators(self, dl_manager):
+         """Returns SplitGenerators."""
+         hf_auth_token = dl_manager.download_config.use_auth_token
+         if hf_auth_token is None:
+             raise ConnectionError("Please set use_auth_token=True or use_auth_token='<TOKEN>' to download this dataset")
+
+         bundle_url_template = STATS["bundleURLTemplate"]
+         bundle_version = bundle_url_template.split("/")[0]
+         dl_manager.download_config.ignore_url_params = True
+
+         self._log_download(self.config.name, bundle_version, hf_auth_token)
+         archive_path = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
+         local_extracted_archive = dl_manager.extract(archive_path) if not dl_manager.is_streaming else None
+
+         if self.config.version < datasets.Version("5.0.0"):
+             path_to_data = ""
+         else:
+             path_to_data = "/".join([bundle_version, self.config.name])
+         path_to_clips = "/".join([path_to_data, "clips"]) if path_to_data else "clips"
+
+         return [
+             datasets.SplitGenerator(
+                 name=datasets.Split.TRAIN,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "train.tsv"]) if path_to_data else "train.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.TEST,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "test.tsv"]) if path_to_data else "test.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.VALIDATION,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "dev.tsv"]) if path_to_data else "dev.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name="other",
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "other.tsv"]) if path_to_data else "other.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name="invalidated",
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join(
+                         [path_to_data, "invalidated.tsv"]) if path_to_data else "invalidated.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+         ]
+
+     def _generate_examples(self, local_extracted_archive, archive_iterator, metadata_filepath, path_to_clips):
+         """Yields examples."""
+         data_fields = list(self._info().features.keys())
+         metadata = {}
+         metadata_found = False
+         for path, f in archive_iterator:
+             if path == metadata_filepath:
+                 metadata_found = True
+                 lines = (line.decode("utf-8") for line in f)
+                 reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
+                 for row in reader:
+                     # set absolute path for mp3 audio file
+                     if not row["path"].endswith(".mp3"):
+                         row["path"] += ".mp3"
+                     row["path"] = os.path.join(path_to_clips, row["path"])
+                     # accent -> accents in CV 8.0
+                     if "accents" in row:
+                         row["accent"] = row["accents"]
+                         del row["accents"]
+                     # if data is incomplete, fill with empty values
+                     for field in data_fields:
+                         if field not in row:
+                             row[field] = ""
+                     metadata[row["path"]] = row
+             elif path.startswith(path_to_clips):
+                 assert metadata_found, "Found audio clips before the metadata TSV file."
+                 if not metadata:
+                     break
+                 if path in metadata:
+                     result = metadata[path]
+                     # set the audio feature and the path to the extracted file
+                     result["audio"] = {"path": path, "bytes": f.read()}
+                     result["path"] = os.path.join(local_extracted_archive, path) if local_extracted_archive else None
+
+                     yield path, result
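
The builder configs above are driven entirely by the `STATS` dict that `generate_datasets.py` writes into `release_stats.py`. A sketch of the shape the script expects, with the keys taken from how they are read in the code; every value below is a made-up placeholder, not real release data:

```python
# Hypothetical release_stats.py contents (placeholder values only).
STATS = {
    "name": "Common Voice Corpus 9.0",   # filled into {{NAME}} in the README template
    "version": "9.0.0",                  # becomes the datasets.Version of each config
    "date": "2022-04-27",                # release_date in each config description
    "bundleURLTemplate": "cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-{locale}.tar.gz",
    "totalHrs": 20000,                   # replaces {{TOTAL_HRS}}
    "totalValidHrs": 15000,              # replaces {{VAL_HRS}} and feeds _info()
    "locales": {
        "et": {
            "clips": 30000,              # num_clips; also drives size_categories
            "users": 1000,               # num_speakers
            "validHrs": 32.0,            # validated_hr
            "totalHrs": 43.0,            # total_hr
            "size": 1200000000,          # bytes, rendered via size_str()
        },
        # ... one entry per language code in LANGUAGES
    },
}
```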
generate_datasets.py ADDED
@@ -0,0 +1,96 @@
+ import json
+ import os
+ import shutil
+
+ import requests
+
+ RELEASE_STATS_URL = (
+     "https://commonvoice.mozilla.org/dist/releases/{}.json"
+ )
+ VERSIONS = [
+     {"semver": "1.0.0", "name": "common_voice_1_0", "release": "cv-corpus-1"},
+     {"semver": "2.0.0", "name": "common_voice_2_0", "release": "cv-corpus-2"},
+     {"semver": "3.0.0", "name": "common_voice_3_0", "release": "cv-corpus-3"},
+     {"semver": "4.0.0", "name": "common_voice_4_0", "release": "cv-corpus-4-2019-12-10"},
+     {"semver": "5.0.0", "name": "common_voice_5_0", "release": "cv-corpus-5-2020-06-22"},
+     {"semver": "5.1.0", "name": "common_voice_5_1", "release": "cv-corpus-5.1-2020-06-22"},
+     {"semver": "6.0.0", "name": "common_voice_6_0", "release": "cv-corpus-6.0-2020-12-11"},
+     {"semver": "6.1.0", "name": "common_voice_6_1", "release": "cv-corpus-6.1-2020-12-11"},
+     {"semver": "7.0.0", "name": "common_voice_7_0", "release": "cv-corpus-7.0-2021-07-21"},
+     {"semver": "8.0.0", "name": "common_voice_8_0", "release": "cv-corpus-8.0-2022-01-19"},
+     {"semver": "9.0.0", "name": "common_voice_9_0", "release": "cv-corpus-9.0-2022-04-27"},
+ ]
+
+
+ def num_to_size(num: int):
+     if num < 1000:
+         return "n<1K"
+     elif num < 10_000:
+         return "1K<n<10K"
+     elif num < 100_000:
+         return "10K<n<100K"
+     elif num < 1_000_000:
+         return "100K<n<1M"
+     elif num < 10_000_000:
+         return "1M<n<10M"
+     elif num < 100_000_000:
+         return "10M<n<100M"
+     elif num < 1_000_000_000:
+         return "100M<n<1B"
+
+
+ def get_language_names():
+     # source: https://github.com/common-voice/common-voice/blob/release-v1.71.0/web/locales/en/messages.ftl
+     languages = {}
+     with open("languages.ftl") as fin:
+         for line in fin:
+             lang_code, lang_name = line.strip().split(" = ")
+             languages[lang_code] = lang_name
+
+     return languages
+
+
+ def main():
+     language_names = get_language_names()
+
+     for version in VERSIONS:
+         stats_url = RELEASE_STATS_URL.format(version["release"])
+         release_stats = requests.get(stats_url).text
+         release_stats = json.loads(release_stats)
+         release_stats["version"] = version["semver"]
+
+         dataset_path = version["name"]
+         os.makedirs(dataset_path, exist_ok=True)
+         with open(f"{dataset_path}/release_stats.py", "w") as fout:
+             fout.write("STATS = " + str(release_stats))
+
+         with open("README.template.md", "r") as fin:
+             readme = fin.read()
+         readme = readme.replace("{{NAME}}", release_stats["name"])
+
+         locales = sorted(release_stats["locales"].keys())
+         languages = [f"- {loc}" for loc in locales]
+         readme = readme.replace("{{LANGUAGES}}", "\n".join(languages))
+
+         sizes = [f" {loc}:\n - {num_to_size(release_stats['locales'][loc]['clips'])}" for loc in locales]
+         readme = readme.replace("{{SIZES}}", "\n".join(sizes))
+
+         languages_human = sorted([language_names[loc] for loc in locales])
+         readme = readme.replace("{{LANGUAGES_HUMAN}}", ", ".join(languages_human))
+
+         readme = readme.replace("{{TOTAL_HRS}}", str(release_stats["totalHrs"]))
+         readme = readme.replace("{{VAL_HRS}}", str(release_stats["totalValidHrs"]))
+         readme = readme.replace("{{NUM_LANGS}}", str(len(locales)))
+
+         with open(f"{dataset_path}/README.md", "w") as fout:
+             fout.write(readme)
+         with open(f"{dataset_path}/languages.py", "w") as fout:
+             fout.write("LANGUAGES = " + str(language_names))
+
+         shutil.copy("dataset_script.py", f"{dataset_path}/{dataset_path}.py")
+
+
+ if __name__ == "__main__":
+     main()
+
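
The two helpers above can be exercised on their own; a small sketch, assuming it is run from the repository root so that `languages.ftl` is found on the working-directory path (the clip counts are arbitrary examples):

```python
# Quick check of the generate_datasets.py helpers; importing the module does
# not run main() thanks to the __main__ guard.
from generate_datasets import get_language_names, num_to_size

print(num_to_size(29_000))     # "10K<n<100K"
print(num_to_size(1_500_000))  # "1M<n<10M"

names = get_language_names()   # parses "code = Name" lines from languages.ftl
print(names["et"])             # "Estonian"
```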
languages.ftl ADDED
@@ -0,0 +1,161 @@
+ ab = Abkhaz
+ ace = Acehnese
+ ady = Adyghe
+ af = Afrikaans
+ am = Amharic
+ an = Aragonese
+ ar = Arabic
+ arn = Mapudungun
+ as = Assamese
+ ast = Asturian
+ az = Azerbaijani
+ ba = Bashkir
+ bas = Basaa
+ be = Belarusian
+ bg = Bulgarian
+ bn = Bengali
+ br = Breton
+ bs = Bosnian
+ bxr = Buryat
+ ca = Catalan
+ cak = Kaqchikel
+ ckb = Central Kurdish
+ cnh = Hakha Chin
+ co = Corsican
+ cs = Czech
+ cv = Chuvash
+ cy = Welsh
+ da = Danish
+ de = German
+ dsb = Sorbian, Lower
+ dv = Dhivehi
+ el = Greek
+ en = English
+ eo = Esperanto
+ es = Spanish
+ et = Estonian
+ eu = Basque
+ fa = Persian
+ ff = Fulah
+ fi = Finnish
+ fo = Faroese
+ fr = French
+ fy-NL = Frisian
+ ga-IE = Irish
+ gl = Galician
+ gn = Guarani
+ gom = Goan Konkani
+ ha = Hausa
+ he = Hebrew
+ hi = Hindi
+ hr = Croatian
+ hsb = Sorbian, Upper
+ ht = Haitian
+ hu = Hungarian
+ hy-AM = Armenian
+ hyw = Armenian Western
+ ia = Interlingua
+ id = Indonesian
+ ie = Interlingue
+ ig = Igbo
+ is = Icelandic
+ it = Italian
+ izh = Izhorian
+ ja = Japanese
+ ka = Georgian
+ kaa = Karakalpak
+ kab = Kabyle
+ kbd = Kabardian
+ ki = Kikuyu
+ kk = Kazakh
+ km = Khmer
+ kmr = Kurmanji Kurdish
+ knn = Konkani (Devanagari)
+ ko = Korean
+ kpv = Komi-Zyrian
+ kw = Cornish
+ ky = Kyrgyz
+ lb = Luxembourgish
+ lg = Luganda
+ lij = Ligurian
+ lt = Lithuanian
+ lv = Latvian
+ mai = Maithili
+ mdf = Moksha
+ mg = Malagasy
+ mhr = Meadow Mari
+ mk = Macedonian
+ ml = Malayalam
+ mn = Mongolian
+ mni = Meetei Lon
+ mos = Mossi
+ mr = Marathi
+ mrj = Hill Mari
+ ms = Malay
+ mt = Maltese
+ my = Burmese
+ myv = Erzya
+ nan-tw = Taiwanese (Minnan)
+ nb-NO = Norwegian Bokmål
+ ne-NP = Nepali
+ nia = Nias
+ nl = Dutch
+ nn-NO = Norwegian Nynorsk
+ nyn = Runyankole
+ oc = Occitan
+ or = Odia
+ pa-IN = Punjabi
+ pap-AW = Papiamento (Aruba)
+ pl = Polish
+ ps = Pashto
+ pt = Portuguese
+ quc = K'iche'
+ quy = Quechua Chanka
+ rm-sursilv = Romansh Sursilvan
+ rm-vallader = Romansh Vallader
+ ro = Romanian
+ ru = Russian
+ rw = Kinyarwanda
+ sah = Sakha
+ sat = Santali (Ol Chiki)
+ sc = Sardinian
+ scn = Sicilian
+ shi = Shilha
+ si = Sinhala
+ sk = Slovak
+ skr = Saraiki
+ sl = Slovenian
+ so = Somali
+ sq = Albanian
+ sr = Serbian
+ sv-SE = Swedish
+ sw = Swahili
+ syr = Syriac
+ ta = Tamil
+ te = Telugu
+ tg = Tajik
+ th = Thai
+ ti = Tigrinya
+ tig = Tigre
+ tk = Turkmen
+ tl = Tagalog
+ tok = Toki Pona
+ tr = Turkish
+ tt = Tatar
+ tw = Twi
+ ty = Tahitian
+ uby = Ubykh
+ udm = Udmurt
+ ug = Uyghur
+ uk = Ukrainian
+ ur = Urdu
+ uz = Uzbek
+ vec = Venetian
+ vi = Vietnamese
+ vot = Votic
+ yi = Yiddish
+ yo = Yoruba
+ yue = Cantonese
+ zh-CN = Chinese (China)
+ zh-HK = Chinese (Hong Kong)
+ zh-TW = Chinese (Taiwan)
test.py ADDED
@@ -0,0 +1,5 @@
+ from datasets import load_dataset
+
+ dataset = load_dataset("./common_voice_9_0", "et", split="test", use_auth_token=True)
+ print(dataset)
+ print(dataset[100])