crossroderick committed (verified)
Commit 83daab2 · 1 Parent(s): d646747

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ unigram.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ /src/data/extracted
+ /src/data/kkwiki-latest-pages-articles.xml.bz2
+ /src/data/kazakh_latin_pairs.jsonl
+ /src/data/clean_pairs.jsonl
+ /src/data/kk.txt
+ /logs/**
+ /src/test_minilm.py
+ /MiniDalaLM/README.md
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 384,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md CHANGED
@@ -1,3 +1,148 @@
- ---
- license: mit
- ---
+ ---
+ base_model: paraphrase-multilingual-MiniLM-L12-v2
+ license: mit
+ language: kaz
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - kazakh
+ - low-resource
+ - cultural-nlp
+ - multilingual-minilm
+ pipeline_tag: sentence-similarity
+ model-index:
+ - name: MiniDalaLM
+   results:
+   - task:
+       name: Binary Classification
+       type: binary-classification
+     dataset:
+       name: Kazakh Latin Corpus
+       type: custom
+     metrics:
+     - name: Training Loss
+       type: loss
+       value: 0.0023
+     - name: Cosine Accuracy (Evaluation)
+       type: accuracy
+       value: 0.9999
+     - name: Cosine Accuracy Threshold
+       type: accuracy
+       value: 0.9997
+ ---
+ # MiniDalaLM - Embedding Extractor for Latin Kazakh 🇰🇿
+
+ > 'Dala' means 'steppe' in Kazakh - a nod to where the voice of this model might echo.
+
+ **MiniDalaLM** is a fine-tuned version of `paraphrase-multilingual-MiniLM-L12-v2`, trained to **extract embeddings** from Kazakh text written in the officially adopted, [2021 alphabet reform-based](https://astanatimes.com/2021/02/kazakhstan-presents-new-latin-alphabet-plans-gradual-transition-through-2031/) Latin script. It is meant to serve as a **foundational model** to be improved upon as needed and used alongside its more powerful transliteration-based cousin, [DalaT5](https://huggingface.co/crossroderick/dalat5).
+
+ **⚠️ Limitations**
+ - May produce unexpected outputs for very short inputs or mixed-script text
+ - Accuracy may vary across dialects or uncommon characters
+
+ ---
+
+ ## 🧠 Purpose
+
+ Much like DalaT5, this model wasn't built solely for production-grade embedding extraction or linguistic study.
+
+ It was born from something else:
+ - A deep **respect for Kazakh culture**
+ - A belief that **no language should ever be forgotten**
+ - A desire to **aid the country's modernisation efforts** through AI
+
+ > *I'm not Kazakh, but I believe that there is beauty in helping those that may be in need - with the sole expectation being that it may prove useful to them. So, I help and give away freely.*
+
+ ---
+
+ ## 🌍 Жоба туралы / About the Project
+
+ ### 🏕 Қазақша
+
+ **MiniDalaLM** - Қазақстанның ұлттық модернизациялау күш-жігерін қолдауға арналған, қазақша латын деректеріне дәл бапталған трансформатор. Модель ендірілгендер арқылы мәтіндік мүмкіндіктерді шығаруға бағытталған, бұл оны күшті лингвистикалық құралдардың негізі ретінде тамаша етеді.
+
+ Бұл жоба:
+ - AI жүйесінде **аз ұсынылған тілдерге** қолдау көрсетеді
+ - Қазақтың латыншаланған болашағына **ашық қолжетімділік** ұсынады
+ - Шетелдік – кішіпейілділікпен, ізденімпаздықпен, терең қамқорлықпен жасаған
+
+ ---
+
+ ### 🌐 English
+
+ **MiniDalaLM** is a transformer fine-tuned on Kazakh Latin data, designed to support Kazakhstan's national modernisation efforts. The model focuses on textual feature extraction via embeddings, making it ideal as the backbone of more powerful linguistic tools.
+
+ This project:
+ - Supports **underrepresented languages** in AI
+ - Offers **open access** to the Latinised future of Kazakh
+ - Was created by a foreigner - with humility, curiosity, and deep care
+
+ ---
+
+ ## 💻 Байқап көріңіз / Try it out
+
+ Құшақтап тұрған бет 🤗 Sentence Transformers арқылы тікелей пайдаланыңыз / Use directly via Hugging Face 🤗 Sentence Transformers:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("crossroderick/minidalalm")
+
+ sentences = [
+     "Vakkaritstso-Albaneze (ital. 'Vaccarizzo Albanese') — Italiiadağy kommuna, Kalabriia äkımşılık aimağyna qarasty Kozentsa provintsiiasynda ornalasqan.",
+     "Qalanyñ tūraqty tūrğyndarynyñ sany 1236 adamdy qūraidy (2008). Halyq tyğyzdyğy 154 adam/km². Alyp jatqan jer aumağy 8 km² şamasynda. Poşta indeksı — 87060.",
+     "Eldı mekennıñ qamqorşysy — Madonna di Costantinopoli.",
+ ]
+
+ embeddings = model.encode(sentences)
+
+ print(embeddings)
+ ```
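+
+ The embeddings can also be compared directly. As a small, optional follow-up to the example above (assuming a recent Sentence Transformers release, which exposes the `similarity` helper), the pairwise cosine similarities of the three sentences can be computed like this:
+
+ ```python
+ # Pairwise cosine similarities between the sentences encoded above;
+ # the model's configured similarity function is cosine, so higher means closer
+ similarities = model.similarity(embeddings, embeddings)
+
+ print(similarities.shape)  # torch.Size([3, 3])
+ print(similarities)
+ ```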
+
+ ---
+
+ ## 🙏 Алғыс / Acknowledgements
+
+ Тәуелсіз жоба болғанына қарамастан, MiniDalaLM өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, MiniDalaLM makes use of three very important datasets:
+
+ - The first ~50 thousand records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
+ - The first ~55 thousand records of the raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
+ - The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package
+
+ ---
+
+ ## 🤖 Нақты баптау нұсқаулары / Fine-tuning instructions
+
+ Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, MiniDalaLM-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune MiniDalaLM yourself, please do the following:
+
+ 1. `get_data.sh` қабық сценарий файлын "src/data" қалтасында іске қосыңыз / Run the `get_data.sh` shell script file in the "src/data" folder
+ 2. Сол қалтадағы `generate_lat_pairs.py` файлын іске қосыңыз / Run the `generate_lat_pairs.py` file in the same folder
+ 3. Қазақ корпус файлын тазалау және деректер жинағын араластыру үшін `generate_clean_corpus.sh` іске қосыңыз / Run `generate_clean_corpus.sh` to clean the Kazakh corpus file and shuffle the dataset
+
+ KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін `huggingface-cli` орнатуыңыз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install `huggingface-cli` to authenticate yourself for the download to commence. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli).
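+
+ Logging in from Python works as well; this is a minimal sketch using the `huggingface_hub` library (installed as a dependency of `datasets`):
+
+ ```python
+ from huggingface_hub import login
+
+ # Prompts for a Hugging Face access token and caches it locally, so that
+ # gated/authenticated datasets such as issai/kazparc can be downloaded
+ login()
+ ```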
+
+ Егер сіз Windows жүйесінде болсаңыз, `get_data.sh` сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл сізге `kazakh_latin_pairs.jsonl` файлындағы бос жолдарды сүзу, сондай-ақ оны араластыру үшін баламалы Windows функциясын табуды талап етеді. Бұған қоса, `wikiextractor` және `sentence_transformers` бумаларын алдын ала орнатуды ұмытпаңыз (нақты нұсқаларды `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and manually doing the steps in there. Likewise, `generate_clean_corpus.sh` will also error out, requiring you to find an equivalent Windows functionality to filter out blank or empty lines in the `kazakh_latin_pairs.jsonl` file, as well as shuffle it. Additionally, be sure to install the `wikiextractor` and `sentence_transformers` packages beforehand (the exact versions can be found in the `requirements.txt` file).
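+
+ For Windows users, a rough, platform-independent Python equivalent of what `generate_clean_corpus.sh` does (drop blank lines and shuffle) could look like the sketch below; it is only an illustration and not part of the repository:
+
+ ```python
+ import random
+
+ # Read the generated pairs and keep only non-blank lines
+ with open("kazakh_latin_pairs.jsonl", "r", encoding="utf-8") as f:
+     lines = [line for line in f if line.strip()]
+
+ # Shuffle the dataset and write out the clean corpus
+ random.shuffle(lines)
+
+ with open("clean_pairs.jsonl", "w", encoding="utf-8") as f:
+     f.writelines(lines)
+ ```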
+
+ ---
+
+ ## 📋 Өзгеріс журналы / Changelog
+
+ * **MiniDalaLM v1:** 5 мамырда жөнделді және сол күні қолжетімді болды. Қазақ морфологиясына тез бейімделіп, өзінің негізгі үлгісінің табиғатын пайдаланған бастапқы нұсқа / Fine-tuned on May 5 and made available on the same day. Initial version that benefitted from the nature of its base model, quickly adapting to Kazakh morphology
+
+ ---
+
+ ## 📚 Несиелер / Credits
+
+ Егер сіз MiniDalaLM-ті зерттеуде немесе туынды жұмыстарда қолдансаңыз - біріншіден, рахмет. Екіншіден, егер сіз қаласаңыз, дәйексөз келтіріңіз / If you use MiniDalaLM in research or derivative works - first off, thank you. Secondly, should you be willing, feel free to cite:
+
+ ```
+ @misc{pereira_cruz_minidalalm_2025,
+     author = {Rodrigo Pereira Cruz},
+     title = {MiniDalaLM: Feature extraction on Latin Kazakh via embeddings},
+     year = 2025,
+     url = {https://huggingface.co/crossroderick/minidalalm},
+     publisher = {Hugging Face}
+ }
+ ```
checkpoints/model_1/eval/binary_classification_evaluation_eval_results.csv ADDED
@@ -0,0 +1,4 @@
+ epoch,steps,cosine_accuracy,cosine_accuracy_threshold,cosine_f1,cosine_precision,cosine_recall,cosine_f1_threshold,cosine_ap,cosine_mcc
+ 0.12711325791280031,1000,0.9999880826113382,0.9995862245559692,0,0,0,0,0.0,0.0
+ 0.25422651582560063,2000,0.9999880826113382,0.9996930360794067,0,0,0,0,0.0,0.0
+ 0.3813397737384009,3000,0.9999880826113382,0.9997444152832031,0,0,0,0,0.0,0.0
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 250037
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.2",
+     "pytorch": "2.5.1+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:196c0c9ee62fe9083b00d15b46cda9f5cbfb0fdec38b1b392c1d43aaddee1ba2
+ size 470637416
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ datasets==3.5.0
+ sentence_transformers==4.1.0
+ torch==2.5.1
+ tqdm==4.67.1
+ wikiextractor==3.0.7
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
src/data/generate_clean_corpus.sh ADDED
@@ -0,0 +1,2 @@
+ shuf kazakh_latin_pairs.jsonl -o kazakh_latin_pairs.jsonl
+ grep '\S' kazakh_latin_pairs.jsonl > clean_pairs.jsonl
src/data/generate_lat_pairs.py ADDED
@@ -0,0 +1,177 @@
+ import os
+ import json
+ import random
+ from tqdm import tqdm
+ from itertools import islice
+ from datasets import load_dataset
+
+ from typing import List
+
+
+ # Kazakh Cyrillic character to the Kazakh Latin character mapping from 2021 onwards
+ cyrillic_to_latin = {
+     "А": "A", "а": "a",
+     "Ә": "Ä", "ә": "ä",
+     "Б": "B", "б": "b",
+     "Д": "D", "д": "d",
+     "Е": "E", "е": "e",
+     "Ф": "F", "ф": "f",
+     "Г": "G", "г": "g",
+     "Ғ": "Ğ", "ғ": "ğ",
+     "Х": "H", "х": "h",  # also Һ, see below
+     "Һ": "H", "һ": "h",
+
+     "И": "I", "и": "i",  # used for [и], [й]
+     "І": "I", "і": "ı",  # distinct from И in sound, both map to 'I/i'
+     "Ж": "J", "ж": "j",
+
+     "К": "K", "к": "k",
+     "Қ": "Q", "қ": "q",
+     "Л": "L", "л": "l",
+     "М": "M", "м": "m",
+     "Н": "N", "н": "n",
+     "Ң": "Ñ", "ң": "ñ",
+
+     "О": "O", "о": "o",
+     "Ө": "Ö", "ө": "ö",
+
+     "П": "P", "п": "p",
+     "Р": "R", "р": "r",
+     "С": "S", "с": "s",
+     "Ш": "Ş", "ш": "ş",
+     "Т": "T", "т": "t",
+
+     "У": "U", "у": "u",  # basic 'u' sound, distinct from Ұ
+     "Ұ": "Ū", "ұ": "ū",  # back rounded, used frequently
+     "Ү": "Ü", "ү": "ü",  # front rounded
+
+     "В": "V", "в": "v",
+     "Ы": "Y", "ы": "y",
+     "Й": "I", "й": "i",  # same treatment as И
+     "Ц": "Ts", "ц": "ts",  # for Russian borrowings
+     "Ч": "Ch", "ч": "ch",
+     "Щ": "Ş", "щ": "ş",  # typically simplified to 'ş'
+
+     "Э": "E", "э": "e",
+     "Ю": "Iu", "ю": "iu",  # borrowed words only
+     "Я": "Ia", "я": "ia",
+
+     "Ъ": "", "ъ": "",
+     "Ь": "", "ь": "",
+
+     "З": "Z", "з": "z",
+
+     # Additional (not in table but used in borrowings)
+     "Ё": "Io", "ё": "io",
+ }
+
+
+ def convert_to_latin(text: str) -> str:
+     """
+     Simple function to apply the Cyrillic -> Latin mapping for Kazakh characters.
+     """
+     return ''.join(cyrillic_to_latin.get(char, char) for char in text)
+
+
+ def create_augmented_pairs(sentences: List) -> List:
+     """
+     Create Kazakh Latin pairs between original sentences and slightly changed ones.
+     """
+     pairs = []
+
+     # Randomly change sentences
+     for _ in range(len(sentences)):
+         s = random.choice(sentences)
+
+         # Create a minor variation
+         s_aug = s.replace(".", "").replace(",", "")  # remove punctuation
+         s_aug = s_aug.replace("ğa", "ga").replace("ñ", "n")  # light spelling variants
+         s_aug = s_aug.capitalize()
+
+         if s != s_aug:
+             pairs.append({"texts": [s, s_aug]})
+
+     return pairs
+
+
+ # Process all files in "extracted" dir
+ # Output file path
+ output_path = "src/data/kazakh_latin_pairs.jsonl"
+
+ # List to hold all Latin sentences
+ latin_sentences = []
+
+ # First step: process the Wikipedia dump
+ print("Processing the Wikipedia dump of Kazakh articles...")
+
+ # Iterate over all folders
+ for root, _, files in os.walk("src/data/extracted"):
+     for fname in tqdm(files, desc = "Files in Wikipedia dump"):
+         with open(os.path.join(root, fname), 'r', encoding = "utf-8") as f:
+             for line in f:
+                 try:
+                     data = json.loads(line)
+                     cyr_text = data["text"].strip()
+                     lat_text = convert_to_latin(cyr_text).strip()
+
+                     if lat_text:
+                         latin_sentences.append(lat_text)
+
+                 except Exception as e:
+                     tqdm.write(f"Skipping due to: {e}")
+
+                     continue
+
+ print("Done")
+
+ # Second step: process the "CC100-Kazakh" dataset
+ print("Loading 'CC100-Kazakh' dataset...")
+
+ with open("src/data/kk.txt", 'r', encoding = "utf-8") as f:
+     for line in tqdm(islice(f, 50_000), total = 50_000, desc = "Lines in CC100-Kazakh"):
+         try:
+             cyr_text = line.strip()
+             lat_text = convert_to_latin(cyr_text).strip()
+
+             if lat_text:
+                 latin_sentences.append(lat_text)
+
+         except Exception as e:
+             tqdm.write(f"Skipping due to: {e}")
+
+             continue
+
+ # Third step: process 15% of the raw, Kazakh-centred part of the "KazParC" dataset
+ print("Loading 'KazParC' dataset...")
+
+ kazparc = load_dataset("issai/kazparc", "kazparc_raw", split = "train[:15%]")
+
+ for entry in tqdm(kazparc, desc = "Entries in KazParC"):
+     try:
+         if "kk" in entry and isinstance(entry["kk"], str):
+             cyr_text = entry["kk"].strip()
+             lat_text = convert_to_latin(cyr_text).strip()
+
+             if lat_text:
+                 latin_sentences.append(lat_text)
+
+     except Exception as e:
+         tqdm.write(f"Skipping due to: {e}")
+
+         continue
+
+
+ # Fourth and last step: create Latin sentences with variations
+ print("Creating Latin pairs...")
+
+ augmented_pairs = create_augmented_pairs(latin_sentences)
+
+ with open(output_path, 'w', encoding = "utf-8") as f:
+     for pair in tqdm(augmented_pairs, desc = "Dataset entries"):
+         try:
+             f.write(json.dumps(pair, ensure_ascii = False) + "\n")
+
+         except Exception as e:
+             tqdm.write(f"Skipping due to: {e}")
+
+             continue
src/data/get_data.sh ADDED
@@ -0,0 +1,4 @@
+ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
+ wget http://data.statmt.org/cc-100/kk.txt.xz
+ unxz kk.txt.xz
+ python3 -m wikiextractor.WikiExtractor kkwiki-latest-pages-articles.xml.bz2 --output extracted --json
src/train_minilm.py ADDED
@@ -0,0 +1,43 @@
+ from datasets import load_dataset
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
+
+
+ # Path config
+ base_model = "paraphrase-multilingual-MiniLM-L12-v2"
+ data_path = "src/data/clean_pairs.jsonl"
+
+ # Load the full dataset and convert to input examples
+ dataset = load_dataset("json", data_files = data_path, split = "train")
+
+ # Create input examples
+ all_samples = [
+     InputExample(texts = entry["texts"])
+     for entry in dataset
+ ]
+
+ # Split into train and eval sets (75/25)
+ split_idx = int(len(all_samples) * 0.75)
+ train_samples = all_samples[:split_idx]
+ eval_samples = all_samples[split_idx:]
+
+ # Model and loss
+ model = SentenceTransformer(base_model)
+ train_dataloader = DataLoader(train_samples, shuffle = True, batch_size = 32)
+ train_loss = losses.MultipleNegativesRankingLoss(model)
+
+ # Evaluation setup
+ evaluator = evaluation.BinaryClassificationEvaluator.from_input_examples(eval_samples, name = "eval")
+
+ # Train with eval
+ model.fit(
+     train_objectives = [(train_dataloader, train_loss)],
+     epochs = 0.5,
+     warmup_steps = 100,
+     evaluator = evaluator,
+     evaluation_steps = 1000,
+     show_progress_bar = True
+ )
+
+ # Save final model
+ model.save("MiniDalaLM")
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 128,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
unigram.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:da145b5e7700ae40f16691ec32a0b1fdc1ee3298db22a31ea55f57a966c4a65d
+ size 14763260