Upload folder using huggingface_hub
- .gitattributes +2 -0
- .gitignore +8 -0
- 1_Pooling/config.json +10 -0
- README.md +148 -3
- checkpoints/model_1/eval/binary_classification_evaluation_eval_results.csv +4 -0
- config.json +25 -0
- config_sentence_transformers.json +10 -0
- model.safetensors +3 -0
- modules.json +14 -0
- requirements.txt +5 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- src/data/generate_clean_corpus.sh +2 -0
- src/data/generate_lat_pairs.py +177 -0
- src/data/get_data.sh +4 -0
- src/train_minilm.py +43 -0
- tokenizer.json +3 -0
- tokenizer_config.json +65 -0
- unigram.json +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+unigram.json filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,8 @@
/src/data/extracted
/src/data/kkwiki-latest-pages-articles.xml.bz2
/src/data/kazakh_latin_pairs.jsonl
/src/data/clean_pairs.jsonl
/src/data/kk.txt
/logs/**
/src/test_minilm.py
/MiniDalaLM/README.md
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 384,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md
CHANGED
@@ -1,3 +1,148 @@
---
base_model: paraphrase-multilingual-MiniLM-L12-v2
license: mit
language: kaz
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- kazakh
- low-resource
- cultural-nlp
- multilingual-minilm
pipeline_tag: sentence-similarity
model-index:
- name: MiniDalaLM
  results:
  - task:
      name: Binary Classification
      type: binary-classification
    dataset:
      name: Kazakh Latin Corpus
      type: custom
    metrics:
    - name: Training Loss
      type: loss
      value: 0.0023
    - name: Cosine Accuracy (Evaluation)
      type: accuracy
      value: 0.9999
    - name: Cosine Accuracy Threshold
      type: accuracy
      value: 0.9997
---
# MiniDalaLM - Embedding Extractor for Latin Kazakh 🇰🇿

> 'Dala' means 'steppe' in Kazakh - a nod to where the voice of this model might echo.

**MiniDalaLM** is a fine-tuned version of `paraphrase-multilingual-MiniLM-L12-v2`, trained to **extract embeddings** from Kazakh text written in the officially adopted Latin script from the [2021 alphabet reform](https://astanatimes.com/2021/02/kazakhstan-presents-new-latin-alphabet-plans-gradual-transition-through-2031/). It is meant to serve as a **foundational model** to be improved upon as needed and used alongside its more powerful transliteration-based cousin, [DalaT5](https://huggingface.co/crossroderick/dalat5).

⚠️ Limitations
- May produce unexpected outputs for very short inputs or mixed-script text
- Accuracy may vary across dialects or uncommon characters

---

## 🧠 Purpose

Much like DalaT5, this model wasn’t built for production-grade embedding extraction or for linguistic study alone.

It was born from something else:
- A deep **respect for Kazakh culture**
- A belief that **no language should ever be forgotten**
- A desire to **aid the country's modernisation efforts** through AI

> *I'm not Kazakh, but I believe that there is beauty in helping those that may be in need - with the sole expectation being that it may prove useful to them. So, I help and give away freely.*

---

## 🌍 Жоба туралы / About the Project

### 🏕 Қазақша

**MiniDalaLM** - Қазақстанның ұлттық модернизациялау күш-жігерін қолдауға арналған, қазақша латын деректеріне дәл бапталған трансформатор. Модель ендірілгендер арқылы мәтіндік мүмкіндіктерді шығаруға бағытталған, бұл оны күшті лингвистикалық құралдардың негізі ретінде тамаша етеді.

Бұл жоба:
- AI жүйесінде **аз ұсынылған тілдерге** қолдау көрсетеді
- Қазақтың латыншаланған болашағына **ашық қолжетімділік** ұсынады
- Шетелдік – кішіпейілділікпен, ізденімпаздықпен, терең қамқорлықпен жасаған

---

### 🌐 English

**MiniDalaLM** is a transformer fine-tuned on Kazakh Latin data, designed to support Kazakhstan’s national modernisation efforts. The model focuses on textual feature extraction via embeddings, making it ideal as the backbone of more powerful linguistic tools.

This project:
- Supports **underrepresented languages** in AI
- Offers **open access** to the Latinised future of Kazakh
- Was created by a foreigner - with humility, curiosity, and deep care

---

## 💻 Байқап көріңіз / Try it out

Hugging Face 🤗 Sentence Transformers арқылы тікелей пайдаланыңыз / Use directly via Hugging Face 🤗 Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("crossroderick/minidalalm")

sentences = [
    "Vakkaritstso-Albaneze (ital. 'Vaccarizzo Albanese') — Italiiadağy kommuna, Kalabriia äkımşılık aimağyna qarasty Kozentsa provintsiiasynda ornalasqan.",
    "Qalanyñ tūraqty tūrğyndarynyñ sany 1236 adamdy qūraidy (2008). Halyq tyğyzdyğy 154 adam/km². Alyp jatqan jer aumağy 8 km² şamasynda. Poşta indeksı — 87060.",
    "Eldı mekennıñ qamqorşysy — Madonna di Costantinopoli.",
]

embeddings = model.encode(sentences)

print(embeddings)
```
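
As a small follow-up sketch (my addition, not part of the original card), the embeddings can be compared with the cosine-similarity helper that ships with `sentence-transformers`; it reuses `model`, `sentences`, and `embeddings` from the snippet above:

```python
from sentence_transformers import util

# Pairwise cosine similarities between the three example sentences above
scores = util.cos_sim(embeddings, embeddings)

for i, sentence in enumerate(sentences):
    # Print each sentence (truncated) next to its row of similarity scores
    print(sentence[:40], "->", [round(s, 3) for s in scores[i].tolist()])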

---

## 🙏 Алғыс / Acknowledgements

Тәуелсіз жоба болғанына қарамастан, MiniDalaLM өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, MiniDalaLM makes use of three very important datasets:

- The first ~50 thousand records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
- The first ~55 thousand records of the raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
- The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package

---

## 🤖 Нақты баптау нұсқаулары / Fine-tuning instructions

Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, MiniDalaLM-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune MiniDalaLM yourself, please do the following:

1. `get_data.sh` қабық сценарий файлын "src/data" қалтасында іске қосыңыз / Run the `get_data.sh` shell script in the "src/data" folder
2. Сол қалтадағы `generate_lat_pairs.py` файлын іске қосыңыз / Run the `generate_lat_pairs.py` file in the same folder
3. Қазақ корпус файлын тазалау және деректер жинағын араластыру үшін `generate_clean_corpus.sh` іске қосыңыз / Run `generate_clean_corpus.sh` to clean the Kazakh corpus file and shuffle the dataset

KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін `huggingface-cli` орнатуыңыз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install `huggingface-cli` to authenticate yourself before the download can commence. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli), or see the short authentication sketch below.
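
As a hedged sketch of that authentication step (my addition, not part of the original instructions), you can also log in programmatically via the `huggingface_hub` Python package; the access token is the same one `huggingface-cli login` uses:

```python
from huggingface_hub import login

# Authenticate this environment with a Hugging Face access token so that
# datasets requiring a logged-in account, such as issai/kazparc, can be downloaded
login()  # or login(token="hf_...") to pass the token explicitly
```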

Егер сіз Windows жүйесінде болсаңыз, «get_data.sh» сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл сізге `kazakh_latin_pairs.jsonl` файлындағы бос жолдарды сүзу, сондай-ақ оны араластыру үшін баламалы Windows функциясын табуды талап етеді. Бұған қоса, `wikiextractor` және `sentence_transformers` бумаларын алдын ала орнатуды ұмытпаңыз (нақты нұсқаларды `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and doing the steps in it manually. Likewise, `generate_clean_corpus.sh` will also error out, so you'll need an equivalent Windows approach to filter out blank or empty lines in the `kazakh_latin_pairs.jsonl` file and to shuffle it (a small Python alternative is sketched below). Additionally, be sure to install the `wikiextractor` and `sentence_transformers` packages beforehand (the exact versions can be found in the `requirements.txt` file).
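
For instance, a minimal cross-platform Python sketch of what `generate_clean_corpus.sh` does (drop blank lines, shuffle, write `clean_pairs.jsonl`) might look like the following; the file names match the repository, the rest is illustrative:

```python
import random

# Keep only non-blank lines from the generated pairs, mirroring grep '\S'
with open("kazakh_latin_pairs.jsonl", encoding = "utf-8") as f:
    lines = [line for line in f if line.strip()]

# Shuffle the dataset, mirroring shuf
random.shuffle(lines)

with open("clean_pairs.jsonl", "w", encoding = "utf-8") as f:
    f.writelines(lines)
```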

---

## 📋 Өзгеріс журналы / Changelog

* **MiniDalaLM v1:** 5 мамырда жөнделді және сол күні қолжетімді болды. Қазақ морфологиясына тез бейімделіп, өзінің негізгі үлгісінің табиғатын пайдаланған бастапқы нұсқа / Fine-tuned on May 5 and made available on the same day. Initial version that benefitted from the nature of its base model, quickly adapting to Kazakh morphology


---

## 📚 Несиелер / Credits

Егер сіз MiniDalaLM-ті зерттеуде немесе туынды жұмыстарда қолдансаңыз - біріншіден, рахмет. Екіншіден, егер сіз қаласаңыз, дәйексөз келтіріңіз / If you use MiniDalaLM in research or derivative works - first off, thank you. Secondly, should you be willing, feel free to cite:

```
@misc{pereira_cruz_dalat5_2025,
    author = {Rodrigo Pereira Cruz},
    title = {MiniDalaLM: Feature extraction on Latin Kazakh via embeddings},
    year = 2025,
    url = {https://huggingface.co/crossroderick/minidalalm},
    publisher = {Hugging Face}
}
```
checkpoints/model_1/eval/binary_classification_evaluation_eval_results.csv
ADDED
@@ -0,0 +1,4 @@
epoch,steps,cosine_accuracy,cosine_accuracy_threshold,cosine_f1,cosine_precision,cosine_recall,cosine_f1_threshold,cosine_ap,cosine_mcc
0.12711325791280031,1000,0.9999880826113382,0.9995862245559692,0,0,0,0,0.0,0.0
0.25422651582560063,2000,0.9999880826113382,0.9996930360794067,0,0,0,0,0.0,0.0
0.3813397737384009,3000,0.9999880826113382,0.9997444152832031,0,0,0,0,0.0,0.0
config.json
ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.51.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 250037
}
config_sentence_transformers.json
ADDED
@@ -0,0 +1,10 @@
{
  "__version__": {
    "sentence_transformers": "4.1.0",
    "transformers": "4.51.2",
    "pytorch": "2.5.1+cu124"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:196c0c9ee62fe9083b00d15b46cda9f5cbfb0fdec38b1b392c1d43aaddee1ba2
size 470637416
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
requirements.txt
ADDED
@@ -0,0 +1,5 @@
datasets==3.5.0
sentence_transformers==4.1.0
torch==2.5.1
tqdm==4.67.1
wikiextractor==3.0.7
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 128,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
src/data/generate_clean_corpus.sh
ADDED
@@ -0,0 +1,2 @@
shuf kazakh_latin_pairs.jsonl -o kazakh_latin_pairs.jsonl
grep '\S' kazakh_latin_pairs.jsonl > clean_pairs.jsonl
src/data/generate_lat_pairs.py
ADDED
@@ -0,0 +1,177 @@
import os
import json
import random
from tqdm import tqdm
from itertools import islice
from datasets import load_dataset

from typing import List


# Kazakh Cyrillic character to the Kazakh Latin character mapping from 2021 onwards
cyrillic_to_latin = {
    "А": "A", "а": "a",
    "Ә": "Ä", "ә": "ä",
    "Б": "B", "б": "b",
    "Д": "D", "д": "d",
    "Е": "E", "е": "e",
    "Ф": "F", "ф": "f",
    "Г": "G", "г": "g",
    "Ғ": "Ğ", "ғ": "ğ",
    "Х": "H", "х": "h",  # also Һ, see below
    "Һ": "H", "һ": "h",

    "И": "I", "и": "i",  # used for [и], [й]
    "І": "I", "і": "ı",  # distinct from И in sound, both map to 'I/i'
    "Ж": "J", "ж": "j",

    "К": "K", "к": "k",
    "Қ": "Q", "қ": "q",
    "Л": "L", "л": "l",
    "М": "M", "м": "m",
    "Н": "N", "н": "n",
    "Ң": "Ñ", "ң": "ñ",

    "О": "O", "о": "o",
    "Ө": "Ö", "ө": "ö",

    "П": "P", "п": "p",
    "Р": "R", "р": "r",
    "С": "S", "с": "s",
    "Ш": "Ş", "ш": "ş",
    "Т": "T", "т": "t",

    "У": "U", "у": "u",  # basic 'u' sound, distinct from Ұ
    "Ұ": "Ū", "ұ": "ū",  # back rounded, used frequently
    "Ү": "Ü", "ү": "ü",  # front rounded

    "В": "V", "в": "v",
    "Ы": "Y", "ы": "y",
    "Й": "I", "й": "i",  # same treatment as И
    "Ц": "Ts", "ц": "ts",  # for Russian borrowings
    "Ч": "Ch", "ч": "ch",
    "Щ": "Ş", "щ": "ş",  # typically simplified to 'ş'

    "Э": "E", "э": "e",
    "Ю": "Iu", "ю": "iu",  # borrowed words only
    "Я": "Ia", "я": "ia",

    "Ъ": "", "ъ": "",
    "Ь": "", "ь": "",

    "З": "Z", "з": "z",

    # Additional (not in table but used in borrowings)
    "Ё": "Io", "ё": "io",
}


def convert_to_latin(text: str) -> str:
    """
    Simple function to apply the Cyrillic -> Latin mapping for Kazakh characters.
    """
    return ''.join(cyrillic_to_latin.get(char, char) for char in text)


def create_augmented_pairs(sentences: List) -> List:
    """
    Create Kazakh Latin pairs between original sentences and slightly changed ones.
    """
    pairs = []

    # Randomly change sentences
    for _ in range(len(sentences)):
        s = random.choice(sentences)

        # Create a minor variation
        s_aug = s.replace(".", "").replace(",", "")  # remove punctuation
        s_aug = s_aug.replace("ğa", "ga").replace("ñ", "n")  # light spelling variants
        s_aug = s_aug.capitalize()

        if s != s_aug:
            pairs.append({"texts": [s, s_aug]})

    return pairs


# Process all files in "extracted" dir
# Output file path
output_path = "src/data/kazakh_latin_pairs.jsonl"

# List to hold all Latin sentences
latin_sentences = []

# First step: process the Wikipedia dump
print("Processing the Wikipedia dump of Kazakh articles...")

# Iterate over all folders
for root, _, files in os.walk("src/data/extracted"):
    for fname in tqdm(files, desc = "Files in Wikipedia dump"):
        with open(os.path.join(root, fname), 'r', encoding = "utf-8") as f:
            for line in f:
                try:
                    data = json.loads(line)
                    cyr_text = data["text"].strip()
                    lat_text = convert_to_latin(cyr_text).strip()

                    if lat_text:
                        latin_sentences.append(lat_text)

                except Exception as e:
                    tqdm.write(f"Skipping due to: {e}")

                    continue

print("Done")

# Second step: process the "CC100-Kazakh" dataset
print("Loading 'CC100-Kazakh' dataset...")

with open("src/data/kk.txt", 'r', encoding = "utf-8") as f:
    for line in tqdm(islice(f, 50_000), total = 50_000, desc = "Lines in CC100-Kazakh"):
        try:
            cyr_text = line.strip()
            lat_text = convert_to_latin(cyr_text).strip()

            if lat_text:
                latin_sentences.append(lat_text)

        except Exception as e:
            tqdm.write(f"Skipping due to: {e}")

            continue

# Third step: process 15% of the raw, Kazakh-centred part of the "KazParC" dataset
print("Loading 'KazParC' dataset...")

kazparc = load_dataset("issai/kazparc", "kazparc_raw", split = "train[:15%]")

for entry in tqdm(kazparc, desc = "Entries in KazParC"):
    try:
        if "kk" in entry and isinstance(entry["kk"], str):
            cyr_text = entry["kk"].strip()
            lat_text = convert_to_latin(cyr_text).strip()

            if lat_text:
                latin_sentences.append(lat_text)

    except Exception as e:
        tqdm.write(f"Skipping due to: {e}")

        continue


# Fourth and last step: create Latin sentences with variations
print("Creating Latin pairs...")

augmented_pairs = create_augmented_pairs(latin_sentences)

with open(output_path, 'w', encoding = "utf-8") as f:
    for pair in tqdm(augmented_pairs, desc = "Dataset entries"):
        try:
            f.write(json.dumps(pair, ensure_ascii = False) + "\n")

        except Exception as e:
            tqdm.write(f"Skipping due to: {e}")

            continue
src/data/get_data.sh
ADDED
@@ -0,0 +1,4 @@
wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
wget http://data.statmt.org/cc-100/kk.txt.xz
unxz kk.txt.xz
python3 -m wikiextractor.WikiExtractor kkwiki-latest-pages-articles.xml.bz2 --output extracted --json
src/train_minilm.py
ADDED
@@ -0,0 +1,43 @@
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation


# Path config
base_model = "paraphrase-multilingual-MiniLM-L12-v2"
data_path = "src/data/clean_pairs.jsonl"

# Load the full dataset and convert to input examples
dataset = load_dataset("json", data_files = data_path, split = "train")

# Create input examples
all_samples = [
    InputExample(texts = entry["texts"])
    for entry in dataset
]

# Split into train and eval sets (75/25)
split_idx = int(len(all_samples) * 0.75)
train_samples = all_samples[:split_idx]
eval_samples = all_samples[split_idx:]

# Model and loss
model = SentenceTransformer(base_model)
train_dataloader = DataLoader(train_samples, shuffle = True, batch_size = 32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Evaluation setup
evaluator = evaluation.BinaryClassificationEvaluator.from_input_examples(eval_samples, name = "eval")

# Train with eval
model.fit(
    train_objectives = [(train_dataloader, train_loss)],
    epochs = 0.5,
    warmup_steps = 100,
    evaluator = evaluator,
    evaluation_steps = 1000,
    show_progress_bar = True
)

# Save final model
model.save("MiniDalaLM")
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
size 17082987
tokenizer_config.json
ADDED
@@ -0,0 +1,65 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "max_length": 128,
  "model_max_length": 128,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "<unk>"
}
unigram.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:da145b5e7700ae40f16691ec32a0b1fdc1ee3298db22a31ea55f57a966c4a65d
size 14763260