Commit b4d59ba
Parent(s): 2deeec5

Upload

Files changed:
- .gitignore +1 -0
- README.template.md +209 -0
- dataset_script.py +247 -0
- generate_datasets.py +96 -0
- languages.ftl +161 -0
- test.py +5 -0
.gitignore
ADDED
@@ -0,0 +1 @@
common_voice_*
README.template.md
ADDED
@@ -0,0 +1,209 @@
---
pretty_name: {{NAME}}
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
languages:
{{LANGUAGES}}
licenses:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
{{SIZES}}
source_datasets:
- extended|common_voice
task_categories:
- speech-processing
task_ids:
- automatic-speech-recognition
paperswithcode_id: common-voice
extra_gated_prompt: "By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset."
---

# Dataset Card for {{NAME}}

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://commonvoice.mozilla.org/en/datasets
- **Repository:** https://github.com/common-voice/common-voice
- **Paper:** https://arxiv.org/abs/1912.06670
- **Leaderboard:** https://paperswithcode.com/dataset/common-voice
- **Point of Contact:** [Anton Lozhkov](mailto:[email protected])

### Dataset Summary

Each entry in the Common Voice dataset consists of a unique MP3 file and a corresponding text file.
Many of the {{TOTAL_HRS}} recorded hours in the dataset also include demographic metadata like age, sex, and accent
that can help improve the accuracy of speech recognition engines.

The dataset currently consists of {{VAL_HRS}} validated hours in {{NUM_LANGS}} languages, but more voices and languages are always being added.
Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.

### Supported Tasks and Leaderboards

The results for models trained on the Common Voice datasets are available via the
[Papers with Code Leaderboards](https://paperswithcode.com/dataset/common-voice).

### Languages

```
{{LANGUAGES_HUMAN}}
```

## Dataset Structure

### Data Instances

A typical data point comprises the `path` to the audio file and its `sentence`.
Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.

```python
{
    'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
    'path': 'et/clips/common_voice_et_18318995.mp3',
    'audio': {
        'path': 'et/clips/common_voice_et_18318995.mp3',
        'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
        'sampling_rate': 48000
    },
    'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
    'up_votes': 2,
    'down_votes': 0,
    'age': 'twenties',
    'gender': 'male',
    'accent': '',
    'locale': 'et',
    'segment': ''
}
```

### Data Fields

`client_id` (`string`): An id for which client (voice) made the recording

`path` (`string`): The path to the audio file

`audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
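For illustration, a minimal sketch of this access pattern, including on-the-fly resampling with `cast_column` (the local script path follows this repository's `test.py`; the 16 kHz target is only an example):

```python
from datasets import Audio, load_dataset

dataset = load_dataset("./common_voice_9_0", "et", split="train", use_auth_token=True)

# Preferred: index the row first, so only this one clip is decoded.
sample = dataset[0]["audio"]
print(sample["sampling_rate"])  # 48000

# Resample lazily by casting the column; decoding then happens at 16 kHz.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
print(dataset[0]["audio"]["sampling_rate"])  # 16000
```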
`sentence` (`string`): The sentence the user was prompted to speak

`up_votes` (`int64`): How many upvotes the audio file has received from reviewers

`down_votes` (`int64`): How many downvotes the audio file has received from reviewers

`age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)

`gender` (`string`): The gender of the speaker

`accent` (`string`): The accent of the speaker

`locale` (`string`): The locale of the speaker

`segment` (`string`): Usually an empty field

### Data Splits

The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.

The validated data has been reviewed and received upvotes confirming that it is of high quality.

The invalidated data has been reviewed and received downvotes indicating that it is of low quality.

The reported data has been reported by users for various reasons.

The other data has not yet been reviewed.

The dev, test and train splits are all drawn from data that has been reviewed and deemed of high quality.
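The loading script in this commit exposes five of these portions as splits: `train`, `validation` and `test` (backed by `train.tsv`, `dev.tsv` and `test.tsv`), plus `other` and `invalidated`. A minimal sketch of loading each one (local script path as in `test.py`):

```python
from datasets import load_dataset

# "validated" and "reported" portions are not exposed as splits by this script.
for split in ["train", "validation", "test", "other", "invalidated"]:
    ds = load_dataset("./common_voice_9_0", "et", split=split, use_auth_token=True)
    print(split, ds.num_rows)
```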
## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)

### Citation Information

```
@inproceedings{commonvoice:2020,
    author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
    title = {Common Voice: A Massively-Multilingual Speech Corpus},
    booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
    pages = {4211--4215},
    year = 2020
}
```
dataset_script.py
ADDED
@@ -0,0 +1,247 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Common Voice Dataset"""


import csv
import os
import urllib

import datasets
import requests
from datasets.tasks import AutomaticSpeechRecognition
from datasets.utils.py_utils import size_str
from huggingface_hub import HfApi, HfFolder

from .languages import LANGUAGES
from .release_stats import STATS

_CITATION = """\
@inproceedings{commonvoice:2020,
    author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
    title = {Common Voice: A Massively-Multilingual Speech Corpus},
    booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
    pages = {4211--4215},
    year = 2020
}
"""

_HOMEPAGE = "https://commonvoice.mozilla.org/en/datasets"

_LICENSE = "https://creativecommons.org/publicdomain/zero/1.0/"

_API_URL = "https://commonvoice.mozilla.org/api/v1"


class CommonVoiceConfig(datasets.BuilderConfig):
    """BuilderConfig for CommonVoice."""

    def __init__(self, name, version, **kwargs):
        self.language = kwargs.pop("language", None)
        self.release_date = kwargs.pop("release_date", None)
        self.num_clips = kwargs.pop("num_clips", None)
        self.num_speakers = kwargs.pop("num_speakers", None)
        self.validated_hr = kwargs.pop("validated_hr", None)
        self.total_hr = kwargs.pop("total_hr", None)
        self.size_bytes = kwargs.pop("size_bytes", None)
        self.size_human = size_str(self.size_bytes)
        description = (
            f"Common Voice speech to text dataset in {self.language} released on {self.release_date}. "
            f"The dataset comprises {self.validated_hr} hours of validated transcribed speech data "
            f"out of {self.total_hr} hours in total from {self.num_speakers} speakers. "
            f"The dataset contains {self.num_clips} audio clips and has a size of {self.size_human}."
        )
        super(CommonVoiceConfig, self).__init__(
            name=name, version=datasets.Version(version), description=description, **kwargs
        )


class CommonVoice(datasets.GeneratorBasedBuilder):
    DEFAULT_CONFIG_NAME = "en"
    DEFAULT_WRITER_BATCH_SIZE = 1000

    BUILDER_CONFIGS = [
        CommonVoiceConfig(
            name=lang,
            version=STATS["version"],
            language=LANGUAGES[lang],
            release_date=STATS["date"],
            num_clips=lang_stats["clips"],
            num_speakers=lang_stats["users"],
            validated_hr=float(lang_stats["validHrs"]),
            total_hr=float(lang_stats["totalHrs"]),
            size_bytes=int(lang_stats["size"]),
        )
        for lang, lang_stats in STATS["locales"].items()
    ]

    def _info(self):
        total_languages = len(STATS["locales"])
        total_valid_hours = STATS["totalValidHrs"]
        description = (
            "Common Voice is Mozilla's initiative to help teach machines how real people speak. "
            f"The dataset currently consists of {total_valid_hours} validated hours of speech "
            f"in {total_languages} languages, but more voices and languages are always added."
        )
        features = datasets.Features(
            {
                "client_id": datasets.Value("string"),
                "path": datasets.Value("string"),
                "audio": datasets.features.Audio(sampling_rate=48_000),
                "sentence": datasets.Value("string"),
                "up_votes": datasets.Value("int64"),
                "down_votes": datasets.Value("int64"),
                "age": datasets.Value("string"),
                "gender": datasets.Value("string"),
                "accent": datasets.Value("string"),
                "locale": datasets.Value("string"),
                "segment": datasets.Value("string"),
            }
        )

        return datasets.DatasetInfo(
            description=description,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
            version=self.config.version,
            # task_templates=[
            #     AutomaticSpeechRecognition(audio_file_path_column="path", transcription_column="sentence")
            # ],
        )

    def _get_bundle_url(self, locale, url_template):
        # path = encodeURIComponent(path)
        path = url_template.replace("{locale}", locale)
        path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
        # use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
        # response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
        response = requests.get(f"{_API_URL}/bucket/dataset/{path}", timeout=10.0).json()
        return response["url"]

    def _log_download(self, locale, bundle_version, auth_token):
        if isinstance(auth_token, bool):
            auth_token = HfFolder().get_token()
        whoami = HfApi().whoami(auth_token)
        email = whoami["email"] if "email" in whoami else ""
        payload = {"email": email, "locale": locale, "dataset": bundle_version}
        requests.post(f"{_API_URL}/{locale}/downloaders", json=payload).json()

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        hf_auth_token = dl_manager.download_config.use_auth_token
        if hf_auth_token is None:
            raise ConnectionError("Please set use_auth_token=True or use_auth_token='<TOKEN>' to download this dataset")

        bundle_url_template = STATS["bundleURLTemplate"]
        bundle_version = bundle_url_template.split("/")[0]
        dl_manager.download_config.ignore_url_params = True

        self._log_download(self.config.name, bundle_version, hf_auth_token)
        archive_path = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
        local_extracted_archive = dl_manager.extract(archive_path) if not dl_manager.is_streaming else None

        if self.config.version < datasets.Version("5.0.0"):
            path_to_data = ""
        else:
            path_to_data = "/".join([bundle_version, self.config.name])
        path_to_clips = "/".join([path_to_data, "clips"]) if path_to_data else "clips"

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "local_extracted_archive": local_extracted_archive,
                    "archive_iterator": dl_manager.iter_archive(archive_path),
                    "metadata_filepath": "/".join([path_to_data, "train.tsv"]) if path_to_data else "train.tsv",
                    "path_to_clips": path_to_clips,
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "local_extracted_archive": local_extracted_archive,
                    "archive_iterator": dl_manager.iter_archive(archive_path),
                    "metadata_filepath": "/".join([path_to_data, "test.tsv"]) if path_to_data else "test.tsv",
                    "path_to_clips": path_to_clips,
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "local_extracted_archive": local_extracted_archive,
                    "archive_iterator": dl_manager.iter_archive(archive_path),
                    "metadata_filepath": "/".join([path_to_data, "dev.tsv"]) if path_to_data else "dev.tsv",
                    "path_to_clips": path_to_clips,
                },
            ),
            datasets.SplitGenerator(
                name="other",
                gen_kwargs={
                    "local_extracted_archive": local_extracted_archive,
                    "archive_iterator": dl_manager.iter_archive(archive_path),
                    "metadata_filepath": "/".join([path_to_data, "other.tsv"]) if path_to_data else "other.tsv",
                    "path_to_clips": path_to_clips,
                },
            ),
            datasets.SplitGenerator(
                name="invalidated",
                gen_kwargs={
                    "local_extracted_archive": local_extracted_archive,
                    "archive_iterator": dl_manager.iter_archive(archive_path),
                    "metadata_filepath": "/".join([path_to_data, "invalidated.tsv"]) if path_to_data else "invalidated.tsv",
                    "path_to_clips": path_to_clips,
                },
            ),
        ]

    def _generate_examples(self, local_extracted_archive, archive_iterator, metadata_filepath, path_to_clips):
        """Yields examples."""
        data_fields = list(self._info().features.keys())
        metadata = {}
        metadata_found = False
        for path, f in archive_iterator:
            if path == metadata_filepath:
                metadata_found = True
                lines = (line.decode("utf-8") for line in f)
                reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
                for row in reader:
                    # set absolute path for mp3 audio file
                    if not row["path"].endswith(".mp3"):
                        row["path"] += ".mp3"
                    row["path"] = os.path.join(path_to_clips, row["path"])
                    # accent -> accents in CV 8.0
                    if "accents" in row:
                        row["accent"] = row["accents"]
                        del row["accents"]
                    # if data is incomplete, fill with empty values
                    for field in data_fields:
                        if field not in row:
                            row[field] = ""
                    metadata[row["path"]] = row
            elif path.startswith(path_to_clips):
                assert metadata_found, "Found audio clips before the metadata TSV file."
                if not metadata:
                    break
                if path in metadata:
                    result = metadata[path]
                    # set the audio feature and the path to the extracted file
                    result["audio"] = {"path": path, "bytes": f.read()}
                    result["path"] = os.path.join(local_extracted_archive, path) if local_extracted_archive else None

                    yield path, result
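For reference, the script reads per-release statistics from a sibling `release_stats.py` module, written out by `generate_datasets.py` below. A trimmed sketch of the `STATS` dict it expects, with the field names taken from the accesses above and every value invented for illustration:

```python
# Hypothetical, trimmed release_stats.py; all numbers here are made up.
STATS = {
    "version": "9.0.0",  # filled in by generate_datasets.py from VERSIONS
    "date": "2022-04-27",
    "name": "Common Voice Corpus 9.0",
    "totalHrs": 20000,
    "totalValidHrs": 15000,
    # the first path segment doubles as the bundle version string
    "bundleURLTemplate": "cv-corpus-9.0-2022-04-27/cv-corpus-9.0-2022-04-27-{locale}.tar.gz",
    "locales": {
        "et": {"clips": 20000, "users": 800, "validHrs": 30.0, "totalHrs": 35.0, "size": 1_000_000_000},
    },
}
```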
generate_datasets.py
ADDED
@@ -0,0 +1,96 @@
import os
import shutil

import json
import requests

RELEASE_STATS_URL = "https://commonvoice.mozilla.org/dist/releases/{}.json"
VERSIONS = [
    {"semver": "1.0.0", "name": "common_voice_1_0", "release": "cv-corpus-1"},
    {"semver": "2.0.0", "name": "common_voice_2_0", "release": "cv-corpus-2"},
    {"semver": "3.0.0", "name": "common_voice_3_0", "release": "cv-corpus-3"},
    {"semver": "4.0.0", "name": "common_voice_4_0", "release": "cv-corpus-4-2019-12-10"},
    {"semver": "5.0.0", "name": "common_voice_5_0", "release": "cv-corpus-5-2020-06-22"},
    {"semver": "5.1.0", "name": "common_voice_5_1", "release": "cv-corpus-5.1-2020-06-22"},
    {"semver": "6.0.0", "name": "common_voice_6_0", "release": "cv-corpus-6.0-2020-12-11"},
    {"semver": "6.1.0", "name": "common_voice_6_1", "release": "cv-corpus-6.1-2020-12-11"},
    {"semver": "7.0.0", "name": "common_voice_7_0", "release": "cv-corpus-7.0-2021-07-21"},
    {"semver": "8.0.0", "name": "common_voice_8_0", "release": "cv-corpus-8.0-2022-01-19"},
    {"semver": "9.0.0", "name": "common_voice_9_0", "release": "cv-corpus-9.0-2022-04-27"},
]


def num_to_size(num: int):
    if num < 1000:
        return "n<1K"
    elif num < 10_000:
        return "1K<n<10K"
    elif num < 100_000:
        return "10K<n<100K"
    elif num < 1_000_000:
        return "100K<n<1M"
    elif num < 10_000_000:
        return "1M<n<10M"
    elif num < 100_000_000:
        return "10M<n<100M"
    elif num < 1_000_000_000:
        return "100M<n<1B"


def get_language_names():
    # source: https://github.com/common-voice/common-voice/blob/release-v1.71.0/web/locales/en/messages.ftl
    languages = {}
    with open("languages.ftl") as fin:
        for line in fin:
            lang_code, lang_name = line.strip().split(" = ")
            languages[lang_code] = lang_name

    return languages


def main():
    language_names = get_language_names()

    for version in VERSIONS:
        stats_url = RELEASE_STATS_URL.format(version["release"])
        release_stats = requests.get(stats_url).text
        release_stats = json.loads(release_stats)
        release_stats["version"] = version["semver"]

        dataset_path = version["name"]
        os.makedirs(dataset_path, exist_ok=True)
        with open(f"{dataset_path}/release_stats.py", "w") as fout:
            fout.write("STATS = " + str(release_stats))

        with open("README.template.md", "r") as fin:
            readme = fin.read()
        readme = readme.replace("{{NAME}}", release_stats["name"])

        locales = sorted(release_stats["locales"].keys())
        languages = [f"- {loc}" for loc in locales]
        readme = readme.replace("{{LANGUAGES}}", "\n".join(languages))

        sizes = [f" {loc}:\n - {num_to_size(release_stats['locales'][loc]['clips'])}" for loc in locales]
        readme = readme.replace("{{SIZES}}", "\n".join(sizes))

        languages_human = sorted([language_names[loc] for loc in locales])
        readme = readme.replace("{{LANGUAGES_HUMAN}}", ", ".join(languages_human))

        readme = readme.replace("{{TOTAL_HRS}}", str(release_stats["totalHrs"]))
        readme = readme.replace("{{VAL_HRS}}", str(release_stats["totalValidHrs"]))
        readme = readme.replace("{{NUM_LANGS}}", str(len(locales)))

        with open(f"{dataset_path}/README.md", "w") as fout:
            fout.write(readme)
        with open(f"{dataset_path}/languages.py", "w") as fout:
            fout.write("LANGUAGES = " + str(language_names))

        shutil.copy("dataset_script.py", f"{dataset_path}/{dataset_path}.py")


if __name__ == "__main__":
    main()
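As a quick sanity check on the bucketing above, `num_to_size` maps per-locale clip counts onto the Hub's size-category labels; a sketch with made-up counts (importing the module assumes `requests` is installed):

```python
from generate_datasets import num_to_size

# Invented clip counts, one per bucket.
assert num_to_size(730) == "n<1K"
assert num_to_size(20_000) == "10K<n<100K"
assert num_to_size(1_500_000) == "1M<n<10M"
```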
languages.ftl
ADDED
@@ -0,0 +1,161 @@
ab = Abkhaz
ace = Acehnese
ady = Adyghe
af = Afrikaans
am = Amharic
an = Aragonese
ar = Arabic
arn = Mapudungun
as = Assamese
ast = Asturian
az = Azerbaijani
ba = Bashkir
bas = Basaa
be = Belarusian
bg = Bulgarian
bn = Bengali
br = Breton
bs = Bosnian
bxr = Buryat
ca = Catalan
cak = Kaqchikel
ckb = Central Kurdish
cnh = Hakha Chin
co = Corsican
cs = Czech
cv = Chuvash
cy = Welsh
da = Danish
de = German
dsb = Sorbian, Lower
dv = Dhivehi
el = Greek
en = English
eo = Esperanto
es = Spanish
et = Estonian
eu = Basque
fa = Persian
ff = Fulah
fi = Finnish
fo = Faroese
fr = French
fy-NL = Frisian
ga-IE = Irish
gl = Galician
gn = Guarani
gom = Goan Konkani
ha = Hausa
he = Hebrew
hi = Hindi
hr = Croatian
hsb = Sorbian, Upper
ht = Haitian
hu = Hungarian
hy-AM = Armenian
hyw = Armenian Western
ia = Interlingua
id = Indonesian
ie = Interlingue
ig = Igbo
is = Icelandic
it = Italian
izh = Izhorian
ja = Japanese
ka = Georgian
kaa = Karakalpak
kab = Kabyle
kbd = Kabardian
ki = Kikuyu
kk = Kazakh
km = Khmer
kmr = Kurmanji Kurdish
knn = Konkani (Devanagari)
ko = Korean
kpv = Komi-Zyrian
kw = Cornish
ky = Kyrgyz
lb = Luxembourgish
lg = Luganda
lij = Ligurian
lt = Lithuanian
lv = Latvian
mai = Maithili
mdf = Moksha
mg = Malagasy
mhr = Meadow Mari
mk = Macedonian
ml = Malayalam
mn = Mongolian
mni = Meetei Lon
mos = Mossi
mr = Marathi
mrj = Hill Mari
ms = Malay
mt = Maltese
my = Burmese
myv = Erzya
nan-tw = Taiwanese (Minnan)
nb-NO = Norwegian Bokmål
ne-NP = Nepali
nia = Nias
nl = Dutch
nn-NO = Norwegian Nynorsk
nyn = Runyankole
oc = Occitan
or = Odia
pa-IN = Punjabi
pap-AW = Papiamento (Aruba)
pl = Polish
ps = Pashto
pt = Portuguese
quc = K'iche'
quy = Quechua Chanka
rm-sursilv = Romansh Sursilvan
rm-vallader = Romansh Vallader
ro = Romanian
ru = Russian
rw = Kinyarwanda
sah = Sakha
sat = Santali (Ol Chiki)
sc = Sardinian
scn = Sicilian
shi = Shilha
si = Sinhala
sk = Slovak
skr = Saraiki
sl = Slovenian
so = Somali
sq = Albanian
sr = Serbian
sv-SE = Swedish
sw = Swahili
syr = Syriac
ta = Tamil
te = Telugu
tg = Tajik
th = Thai
ti = Tigrinya
tig = Tigre
tk = Turkmen
tl = Tagalog
tok = Toki Pona
tr = Turkish
tt = Tatar
tw = Twi
ty = Tahitian
uby = Ubykh
udm = Udmurt
ug = Uyghur
uk = Ukrainian
ur = Urdu
uz = Uzbek
vec = Venetian
vi = Vietnamese
vot = Votic
yi = Yiddish
yo = Yoruba
yue = Cantonese
zh-CN = Chinese (China)
zh-HK = Chinese (Hong Kong)
zh-TW = Chinese (Taiwan)
test.py
ADDED
@@ -0,0 +1,5 @@
from datasets import load_dataset

dataset = load_dataset("./common_voice_9_0", "et", split="test", use_auth_token=True)
print(dataset)
print(dataset[100])
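Since the loading script skips local extraction when streaming (`dl_manager.is_streaming`), a variant of this smoke test can also run in streaming mode without downloading the full archive up front; a sketch:

```python
from datasets import load_dataset

dataset = load_dataset("./common_voice_9_0", "et", split="test", streaming=True, use_auth_token=True)
print(next(iter(dataset)))  # first example, decoded on the fly
```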