speech-test committed on
Commit b4d59ba
1 Parent(s): 2deeec5
Files changed (6)
  1. .gitignore +1 -0
  2. README.template.md +209 -0
  3. dataset_script.py +247 -0
  4. generate_datasets.py +96 -0
  5. languages.ftl +161 -0
  6. test.py +5 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ common_voice_*
README.template.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ pretty_name: {{NAME}}
+ annotations_creators:
+ - crowdsourced
+ language_creators:
+ - crowdsourced
+ languages:
+ {{LANGUAGES}}
+ licenses:
+ - cc0-1.0
+ multilinguality:
+ - multilingual
+ size_categories:
+ {{SIZES}}
+ source_datasets:
+ - extended|common_voice
+ task_categories:
+ - speech-processing
+ task_ids:
+ - automatic-speech-recognition
+ paperswithcode_id: common-voice
+ extra_gated_prompt: "By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset."
+ ---
+
+ # Dataset Card for {{NAME}}
+
+ ## Table of Contents
+ - [Dataset Description](#dataset-description)
+   - [Dataset Summary](#dataset-summary)
+   - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+   - [Languages](#languages)
+ - [Dataset Structure](#dataset-structure)
+   - [Data Instances](#data-instances)
+   - [Data Fields](#data-fields)
+   - [Data Splits](#data-splits)
+ - [Dataset Creation](#dataset-creation)
+   - [Curation Rationale](#curation-rationale)
+   - [Source Data](#source-data)
+   - [Annotations](#annotations)
+   - [Personal and Sensitive Information](#personal-and-sensitive-information)
+ - [Considerations for Using the Data](#considerations-for-using-the-data)
+   - [Social Impact of Dataset](#social-impact-of-dataset)
+   - [Discussion of Biases](#discussion-of-biases)
+   - [Other Known Limitations](#other-known-limitations)
+ - [Additional Information](#additional-information)
+   - [Dataset Curators](#dataset-curators)
+   - [Licensing Information](#licensing-information)
+   - [Citation Information](#citation-information)
+   - [Contributions](#contributions)
+
+ ## Dataset Description
+
+ - **Homepage:** https://commonvoice.mozilla.org/en/datasets
+ - **Repository:** https://github.com/common-voice/common-voice
+ - **Paper:** https://arxiv.org/abs/1912.06670
+ - **Leaderboard:** https://paperswithcode.com/dataset/common-voice
+ - **Point of Contact:** [Anton Lozhkov](mailto:[email protected])
+
+ ### Dataset Summary
+
+ The Common Voice dataset consists of unique MP3 recordings, each paired with a corresponding text file.
+ Many of the {{TOTAL_HRS}} recorded hours in the dataset also include demographic metadata like age, sex, and accent
+ that can help improve the accuracy of speech recognition engines.
+
+ The dataset currently consists of {{VAL_HRS}} validated hours in {{NUM_LANGS}} languages, but more voices and languages are always being added.
+ Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.
+
+ ### Supported Tasks and Leaderboards
+
+ The results for models trained on the Common Voice datasets are available via the
+ [Papers with Code Leaderboards](https://paperswithcode.com/dataset/common-voice).
+
+ ### Languages
+
+ ```
+ {{LANGUAGES_HUMAN}}
+ ```
+
+ ## Dataset Structure
+
+ ### Data Instances
+
+ A typical data point comprises the `path` to the audio file and its `sentence`.
+ Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.
+
+ ```python
+ {
+     'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
+     'path': 'et/clips/common_voice_et_18318995.mp3',
+     'audio': {
+         'path': 'et/clips/common_voice_et_18318995.mp3',
+         'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
+         'sampling_rate': 48000
+     },
+     'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
+     'up_votes': 2,
+     'down_votes': 0,
+     'age': 'twenties',
+     'gender': 'male',
+     'accent': '',
+     'locale': 'et',
+     'segment': ''
+ }
+ ```
+
+ ### Data Fields
+
+ `client_id` (`string`): An id for which client (voice) made the recording
+
+ `path` (`string`): The path to the audio file
+
+ `audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
+
+ `sentence` (`string`): The sentence the user was prompted to speak
+
+ `up_votes` (`int64`): How many upvotes the audio file has received from reviewers
+
+ `down_votes` (`int64`): How many downvotes the audio file has received from reviewers
+
+ `age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)
+
+ `gender` (`string`): The gender of the speaker
+
+ `accent` (`string`): Accent of the speaker
+
+ `locale` (`string`): The locale of the speaker
+
+ `segment` (`string`): Usually an empty field
+
+ ### Data Splits
+
+ The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.
+
+ The validated data is data that has been reviewed and received enough upvotes to be considered of high quality.
+
+ The invalidated data is data that has been reviewed and received downvotes indicating that it is of low quality.
+
+ The reported data is data that has been reported for various reasons.
+
+ The other data is data that has not yet been reviewed.
+
+ The dev, test and train splits are drawn from data that has been reviewed and deemed of high quality.
+
+ ## Dataset Creation
+
+ ### Curation Rationale
+
+ [Needs More Information]
+
+ ### Source Data
+
+ #### Initial Data Collection and Normalization
+
+ [Needs More Information]
+
+ #### Who are the source language producers?
+
+ [Needs More Information]
+
+ ### Annotations
+
+ #### Annotation process
+
+ [Needs More Information]
+
+ #### Who are the annotators?
+
+ [Needs More Information]
+
+ ### Personal and Sensitive Information
+
+ The dataset consists of voice recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
+
+ ## Considerations for Using the Data
+
+ ### Social Impact of Dataset
+
+ The dataset consists of voice recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
+
+ ### Discussion of Biases
+
+ [More Information Needed]
+
+ ### Other Known Limitations
+
+ [More Information Needed]
+
+ ## Additional Information
+
+ ### Dataset Curators
+
+ [More Information Needed]
+
+ ### Licensing Information
+
+ Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
+
+ ### Citation Information
+
+ ```
+ @inproceedings{commonvoice:2020,
+   author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+   title = {Common Voice: A Massively-Multilingual Speech Corpus},
+   booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+   pages = {4211--4215},
+   year = 2020
+ }
+ ```
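
The `audio` field note in the card above (decode on access; prefer `dataset[0]["audio"]` over `dataset["audio"][0]`) is easiest to see in code. A minimal sketch, assuming a dataset folder produced locally by `generate_datasets.py` (e.g. `common_voice_9_0`) and a valid Hugging Face auth token; the 16 kHz cast at the end is an illustrative extra, not something this commit does:

```python
from datasets import Audio, load_dataset

# Load one language config from the locally generated dataset folder.
cv = load_dataset("./common_voice_9_0", "et", split="test", use_auth_token=True)

# Index the row first, then the "audio" key: only this one clip gets decoded.
sample = cv[0]["audio"]
print(sample["sampling_rate"], len(sample["array"]))  # 48000, <number of samples>

# Optionally resample by casting the column; decoding still happens lazily per row.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
print(cv[0]["audio"]["sampling_rate"])  # 16000
```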
dataset_script.py ADDED
@@ -0,0 +1,247 @@
+ # coding=utf-8
+ # Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ Common Voice Dataset"""
+
+
+ import csv
+ import os
+ import urllib.parse
+
+ import datasets
+ import requests
+ from datasets.tasks import AutomaticSpeechRecognition
+ from datasets.utils.py_utils import size_str
+ from huggingface_hub import HfApi, HfFolder
+
+ from .languages import LANGUAGES
+ from .release_stats import STATS
+
+ _CITATION = """\
+ @inproceedings{commonvoice:2020,
+   author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
+   title = {Common Voice: A Massively-Multilingual Speech Corpus},
+   booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
+   pages = {4211--4215},
+   year = 2020
+ }
+ """
+
+ _HOMEPAGE = "https://commonvoice.mozilla.org/en/datasets"
+
+ _LICENSE = "https://creativecommons.org/publicdomain/zero/1.0/"
+
+ _API_URL = "https://commonvoice.mozilla.org/api/v1"
+
+
+ class CommonVoiceConfig(datasets.BuilderConfig):
+     """BuilderConfig for CommonVoice."""
+
+     def __init__(self, name, version, **kwargs):
+         self.language = kwargs.pop("language", None)
+         self.release_date = kwargs.pop("release_date", None)
+         self.num_clips = kwargs.pop("num_clips", None)
+         self.num_speakers = kwargs.pop("num_speakers", None)
+         self.validated_hr = kwargs.pop("validated_hr", None)
+         self.total_hr = kwargs.pop("total_hr", None)
+         self.size_bytes = kwargs.pop("size_bytes", None)
+         self.size_human = size_str(self.size_bytes)
+         description = (
+             f"Common Voice speech to text dataset in {self.language} released on {self.release_date}. "
+             f"The dataset comprises {self.validated_hr} hours of validated transcribed speech data "
+             f"out of {self.total_hr} hours in total from {self.num_speakers} speakers. "
+             f"The dataset contains {self.num_clips} audio clips and has a size of {self.size_human}."
+         )
+         super(CommonVoiceConfig, self).__init__(
+             name=name, version=datasets.Version(version), description=description, **kwargs
+         )
+
+
+ class CommonVoice(datasets.GeneratorBasedBuilder):
+     DEFAULT_CONFIG_NAME = "en"
+     DEFAULT_WRITER_BATCH_SIZE = 1000
+
+     BUILDER_CONFIGS = [
+         CommonVoiceConfig(
+             name=lang,
+             version=STATS["version"],
+             language=LANGUAGES[lang],
+             release_date=STATS["date"],
+             num_clips=lang_stats["clips"],
+             num_speakers=lang_stats["users"],
+             validated_hr=float(lang_stats["validHrs"]),
+             total_hr=float(lang_stats["totalHrs"]),
+             size_bytes=int(lang_stats["size"]),
+         )
+         for lang, lang_stats in STATS["locales"].items()
+     ]
+
+     def _info(self):
+         total_languages = len(STATS["locales"])
+         total_valid_hours = STATS["totalValidHrs"]
+         description = (
+             "Common Voice is Mozilla's initiative to help teach machines how real people speak. "
+             f"The dataset currently consists of {total_valid_hours} validated hours of speech "
+             f"in {total_languages} languages, but more voices and languages are always added."
+         )
+         features = datasets.Features(
+             {
+                 "client_id": datasets.Value("string"),
+                 "path": datasets.Value("string"),
+                 "audio": datasets.features.Audio(sampling_rate=48_000),
+                 "sentence": datasets.Value("string"),
+                 "up_votes": datasets.Value("int64"),
+                 "down_votes": datasets.Value("int64"),
+                 "age": datasets.Value("string"),
+                 "gender": datasets.Value("string"),
+                 "accent": datasets.Value("string"),
+                 "locale": datasets.Value("string"),
+                 "segment": datasets.Value("string"),
+             }
+         )
+
+         return datasets.DatasetInfo(
+             description=description,
+             features=features,
+             supervised_keys=None,
+             homepage=_HOMEPAGE,
+             license=_LICENSE,
+             citation=_CITATION,
+             version=self.config.version,
+             # task_templates=[
+             #     AutomaticSpeechRecognition(audio_file_path_column="path", transcription_column="sentence")
+             # ],
+         )
+
+     def _get_bundle_url(self, locale, url_template):
+         # path = encodeURIComponent(path)
+         path = url_template.replace("{locale}", locale)
+         path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
+         # use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
+         # response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
+         response = requests.get(f"{_API_URL}/bucket/dataset/{path}", timeout=10.0).json()
+         return response["url"]
+
+     def _log_download(self, locale, bundle_version, auth_token):
+         if isinstance(auth_token, bool):
+             auth_token = HfFolder().get_token()
+         whoami = HfApi().whoami(auth_token)
+         email = whoami["email"] if "email" in whoami else ""
+         payload = {"email": email, "locale": locale, "dataset": bundle_version}
+         requests.post(f"{_API_URL}/{locale}/downloaders", json=payload).json()
+
+     def _split_generators(self, dl_manager):
+         """Returns SplitGenerators."""
+         hf_auth_token = dl_manager.download_config.use_auth_token
+         if hf_auth_token is None:
+             raise ConnectionError("Please set use_auth_token=True or use_auth_token='<TOKEN>' to download this dataset")
+
+         bundle_url_template = STATS["bundleURLTemplate"]
+         bundle_version = bundle_url_template.split("/")[0]
+         dl_manager.download_config.ignore_url_params = True
+
+         self._log_download(self.config.name, bundle_version, hf_auth_token)
+         archive_path = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
+         local_extracted_archive = dl_manager.extract(archive_path) if not dl_manager.is_streaming else None
+
+         if self.config.version < datasets.Version("5.0.0"):
+             path_to_data = ""
+         else:
+             path_to_data = "/".join([bundle_version, self.config.name])
+         path_to_clips = "/".join([path_to_data, "clips"]) if path_to_data else "clips"
+
+         return [
+             datasets.SplitGenerator(
+                 name=datasets.Split.TRAIN,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "train.tsv"]) if path_to_data else "train.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.TEST,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "test.tsv"]) if path_to_data else "test.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.VALIDATION,
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "dev.tsv"]) if path_to_data else "dev.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name="other",
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join([path_to_data, "other.tsv"]) if path_to_data else "other.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name="invalidated",
+                 gen_kwargs={
+                     "local_extracted_archive": local_extracted_archive,
+                     "archive_iterator": dl_manager.iter_archive(archive_path),
+                     "metadata_filepath": "/".join(
+                         [path_to_data, "invalidated.tsv"]) if path_to_data else "invalidated.tsv",
+                     "path_to_clips": path_to_clips,
+                 },
+             ),
+         ]
+
+     def _generate_examples(self, local_extracted_archive, archive_iterator, metadata_filepath, path_to_clips):
+         """Yields examples."""
+         data_fields = list(self._info().features.keys())
+         metadata = {}
+         metadata_found = False
+         for path, f in archive_iterator:
+             if path == metadata_filepath:
+                 metadata_found = True
+                 lines = (line.decode("utf-8") for line in f)
+                 reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
+                 for row in reader:
+                     # set absolute path for mp3 audio file
+                     if not row["path"].endswith(".mp3"):
+                         row["path"] += ".mp3"
+                     row["path"] = os.path.join(path_to_clips, row["path"])
+                     # accent -> accents in CV 8.0
+                     if "accents" in row:
+                         row["accent"] = row["accents"]
+                         del row["accents"]
+                     # if data is incomplete, fill with empty values
+                     for field in data_fields:
+                         if field not in row:
+                             row[field] = ""
+                     metadata[row["path"]] = row
+             elif path.startswith(path_to_clips):
+                 assert metadata_found, "Found audio clips before the metadata TSV file."
+                 if not metadata:
+                     break
+                 if path in metadata:
+                     result = metadata[path]
+                     # set the audio feature and the path to the extracted file
+                     result["audio"] = {"path": path, "bytes": f.read()}
+                     result["path"] = os.path.join(local_extracted_archive, path) if local_extracted_archive else None
+
+                     yield path, result
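
The builder configs above are driven entirely by the `STATS` dict that `generate_datasets.py` writes into `release_stats.py`. A sketch of the shape the script expects, with the keys taken from how they are read in the code; every value below is a made-up placeholder, not real release data:

```python
# Hypothetical release_stats.py contents (placeholder values only).
STATS = {
    "name": "Common Voice Corpus 9.0",   # filled into {{NAME}} in the README template
    "version": "9.0.0",                  # becomes the datasets.Version of each config
    "date": "2022-04-27",                # release_date in each config description
    "bundleURLTemplate": "cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-{locale}.tar.gz",
    "totalHrs": 20000,                   # replaces {{TOTAL_HRS}}
    "totalValidHrs": 15000,              # replaces {{VAL_HRS}} and feeds _info()
    "locales": {
        "et": {
            "clips": 30000,              # num_clips; also drives size_categories
            "users": 1000,               # num_speakers
            "validHrs": 32.0,            # validated_hr
            "totalHrs": 43.0,            # total_hr
            "size": 1200000000,          # bytes, rendered via size_str()
        },
        # ... one entry per language code in LANGUAGES
    },
}
```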
generate_datasets.py ADDED
@@ -0,0 +1,96 @@
+ import json
+ import os
+ import shutil
+
+ import requests
+
+ RELEASE_STATS_URL = (
+     "https://commonvoice.mozilla.org/dist/releases/{}.json"
+ )
+ VERSIONS = [
+     {"semver": "1.0.0", "name": "common_voice_1_0", "release": "cv-corpus-1"},
+     {"semver": "2.0.0", "name": "common_voice_2_0", "release": "cv-corpus-2"},
+     {"semver": "3.0.0", "name": "common_voice_3_0", "release": "cv-corpus-3"},
+     {"semver": "4.0.0", "name": "common_voice_4_0", "release": "cv-corpus-4-2019-12-10"},
+     {"semver": "5.0.0", "name": "common_voice_5_0", "release": "cv-corpus-5-2020-06-22"},
+     {"semver": "5.1.0", "name": "common_voice_5_1", "release": "cv-corpus-5.1-2020-06-22"},
+     {"semver": "6.0.0", "name": "common_voice_6_0", "release": "cv-corpus-6.0-2020-12-11"},
+     {"semver": "6.1.0", "name": "common_voice_6_1", "release": "cv-corpus-6.1-2020-12-11"},
+     {"semver": "7.0.0", "name": "common_voice_7_0", "release": "cv-corpus-7.0-2021-07-21"},
+     {"semver": "8.0.0", "name": "common_voice_8_0", "release": "cv-corpus-8.0-2022-01-19"},
+     {"semver": "9.0.0", "name": "common_voice_9_0", "release": "cv-corpus-9.0-2022-04-27"},
+ ]
+
+
+ def num_to_size(num: int):
+     if num < 1000:
+         return "n<1K"
+     elif num < 10_000:
+         return "1K<n<10K"
+     elif num < 100_000:
+         return "10K<n<100K"
+     elif num < 1_000_000:
+         return "100K<n<1M"
+     elif num < 10_000_000:
+         return "1M<n<10M"
+     elif num < 100_000_000:
+         return "10M<n<100M"
+     elif num < 1_000_000_000:
+         return "100M<n<1B"
+
+
+ def get_language_names():
+     # source: https://github.com/common-voice/common-voice/blob/release-v1.71.0/web/locales/en/messages.ftl
+     languages = {}
+     with open("languages.ftl") as fin:
+         for line in fin:
+             lang_code, lang_name = line.strip().split(" = ")
+             languages[lang_code] = lang_name
+
+     return languages
+
+
+ def main():
+     language_names = get_language_names()
+
+     for version in VERSIONS:
+         stats_url = RELEASE_STATS_URL.format(version["release"])
+         release_stats = requests.get(stats_url).text
+         release_stats = json.loads(release_stats)
+         release_stats["version"] = version["semver"]
+
+         dataset_path = version["name"]
+         os.makedirs(dataset_path, exist_ok=True)
+         with open(f"{dataset_path}/release_stats.py", "w") as fout:
+             fout.write("STATS = " + str(release_stats))
+
+         with open("README.template.md", "r") as fin:
+             readme = fin.read()
+         readme = readme.replace("{{NAME}}", release_stats["name"])
+
+         locales = sorted(release_stats["locales"].keys())
+         languages = [f"- {loc}" for loc in locales]
+         readme = readme.replace("{{LANGUAGES}}", "\n".join(languages))
+
+         sizes = [f" {loc}:\n - {num_to_size(release_stats['locales'][loc]['clips'])}" for loc in locales]
+         readme = readme.replace("{{SIZES}}", "\n".join(sizes))
+
+         languages_human = sorted([language_names[loc] for loc in locales])
+         readme = readme.replace("{{LANGUAGES_HUMAN}}", ", ".join(languages_human))
+
+         readme = readme.replace("{{TOTAL_HRS}}", str(release_stats["totalHrs"]))
+         readme = readme.replace("{{VAL_HRS}}", str(release_stats["totalValidHrs"]))
+         readme = readme.replace("{{NUM_LANGS}}", str(len(locales)))
+
+         with open(f"{dataset_path}/README.md", "w") as fout:
+             fout.write(readme)
+         with open(f"{dataset_path}/languages.py", "w") as fout:
+             fout.write("LANGUAGES = " + str(language_names))
+
+         shutil.copy("dataset_script.py", f"{dataset_path}/{dataset_path}.py")
+
+
+ if __name__ == "__main__":
+     main()
+
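
The two helpers above can be exercised on their own; a small sketch, assuming it is run from the repository root so that `languages.ftl` is found on the working-directory path (the clip counts are arbitrary examples):

```python
# Quick check of the generate_datasets.py helpers; importing the module does
# not run main() thanks to the __main__ guard.
from generate_datasets import get_language_names, num_to_size

print(num_to_size(29_000))     # "10K<n<100K"
print(num_to_size(1_500_000))  # "1M<n<10M"

names = get_language_names()   # parses "code = Name" lines from languages.ftl
print(names["et"])             # "Estonian"
```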
languages.ftl ADDED
@@ -0,0 +1,161 @@
+ ab = Abkhaz
+ ace = Acehnese
+ ady = Adyghe
+ af = Afrikaans
+ am = Amharic
+ an = Aragonese
+ ar = Arabic
+ arn = Mapudungun
+ as = Assamese
+ ast = Asturian
+ az = Azerbaijani
+ ba = Bashkir
+ bas = Basaa
+ be = Belarusian
+ bg = Bulgarian
+ bn = Bengali
+ br = Breton
+ bs = Bosnian
+ bxr = Buryat
+ ca = Catalan
+ cak = Kaqchikel
+ ckb = Central Kurdish
+ cnh = Hakha Chin
+ co = Corsican
+ cs = Czech
+ cv = Chuvash
+ cy = Welsh
+ da = Danish
+ de = German
+ dsb = Sorbian, Lower
+ dv = Dhivehi
+ el = Greek
+ en = English
+ eo = Esperanto
+ es = Spanish
+ et = Estonian
+ eu = Basque
+ fa = Persian
+ ff = Fulah
+ fi = Finnish
+ fo = Faroese
+ fr = French
+ fy-NL = Frisian
+ ga-IE = Irish
+ gl = Galician
+ gn = Guarani
+ gom = Goan Konkani
+ ha = Hausa
+ he = Hebrew
+ hi = Hindi
+ hr = Croatian
+ hsb = Sorbian, Upper
+ ht = Haitian
+ hu = Hungarian
+ hy-AM = Armenian
+ hyw = Armenian Western
+ ia = Interlingua
+ id = Indonesian
+ ie = Interlingue
+ ig = Igbo
+ is = Icelandic
+ it = Italian
+ izh = Izhorian
+ ja = Japanese
+ ka = Georgian
+ kaa = Karakalpak
+ kab = Kabyle
+ kbd = Kabardian
+ ki = Kikuyu
+ kk = Kazakh
+ km = Khmer
+ kmr = Kurmanji Kurdish
+ knn = Konkani (Devanagari)
+ ko = Korean
+ kpv = Komi-Zyrian
+ kw = Cornish
+ ky = Kyrgyz
+ lb = Luxembourgish
+ lg = Luganda
+ lij = Ligurian
+ lt = Lithuanian
+ lv = Latvian
+ mai = Maithili
+ mdf = Moksha
+ mg = Malagasy
+ mhr = Meadow Mari
+ mk = Macedonian
+ ml = Malayalam
+ mn = Mongolian
+ mni = Meetei Lon
+ mos = Mossi
+ mr = Marathi
+ mrj = Hill Mari
+ ms = Malay
+ mt = Maltese
+ my = Burmese
+ myv = Erzya
+ nan-tw = Taiwanese (Minnan)
+ nb-NO = Norwegian Bokmål
+ ne-NP = Nepali
+ nia = Nias
+ nl = Dutch
+ nn-NO = Norwegian Nynorsk
+ nyn = Runyankole
+ oc = Occitan
+ or = Odia
+ pa-IN = Punjabi
+ pap-AW = Papiamento (Aruba)
+ pl = Polish
+ ps = Pashto
+ pt = Portuguese
+ quc = K'iche'
+ quy = Quechua Chanka
+ rm-sursilv = Romansh Sursilvan
+ rm-vallader = Romansh Vallader
+ ro = Romanian
+ ru = Russian
+ rw = Kinyarwanda
+ sah = Sakha
+ sat = Santali (Ol Chiki)
+ sc = Sardinian
+ scn = Sicilian
+ shi = Shilha
+ si = Sinhala
+ sk = Slovak
+ skr = Saraiki
+ sl = Slovenian
+ so = Somali
+ sq = Albanian
+ sr = Serbian
+ sv-SE = Swedish
+ sw = Swahili
+ syr = Syriac
+ ta = Tamil
+ te = Telugu
+ tg = Tajik
+ th = Thai
+ ti = Tigrinya
+ tig = Tigre
+ tk = Turkmen
+ tl = Tagalog
+ tok = Toki Pona
+ tr = Turkish
+ tt = Tatar
+ tw = Twi
+ ty = Tahitian
+ uby = Ubykh
+ udm = Udmurt
+ ug = Uyghur
+ uk = Ukrainian
+ ur = Urdu
+ uz = Uzbek
+ vec = Venetian
+ vi = Vietnamese
+ vot = Votic
+ yi = Yiddish
+ yo = Yoruba
+ yue = Cantonese
+ zh-CN = Chinese (China)
+ zh-HK = Chinese (Hong Kong)
+ zh-TW = Chinese (Taiwan)
test.py ADDED
@@ -0,0 +1,5 @@
+ from datasets import load_dataset
+
+ dataset = load_dataset("./common_voice_9_0", "et", split="test", use_auth_token=True)
+ print(dataset)
+ print(dataset[100])