Update README.md
README.md
@@ -24,6 +24,20 @@ The reasons for using the English corpus together are as follows:
Since my role is not that of a working developer but of a solutions architect helping customers with quick PoCs/prototypes, and I was limited by the AWS GPU resources available, I trained with only 5GB of data instead of hundreds of GB of massive data.

### Vocab Expansion

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original phi-2 | 50,295 | BBPE |
| **phi-2-ko** | 66,676 | BBPE. Added Korean vocab and merges |

**Tokenizing "아마존 세이지메이커"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original phi-2 | 25 | `['ì', 'ķ', 'Ħ', 'ë', '§', 'Ī', 'ì', '¡', '´', 'Ġì', 'Ħ', '¸', 'ìĿ', '´', 'ì', '§', 'Ģ', 'ë', '©', 'Ķ', 'ìĿ', '´', 'ì', '»', '¤']` |
| **phi-2-ko** | 6 | `['ìķĦë§Ī', 'ì¡´', 'ĠìĦ¸', 'ìĿ´ì§Ģ', 'ë©ĶìĿ´', '커']` |

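The token-count gap above follows from how byte-level BPE (BBPE) falls back to raw UTF-8 bytes when the vocabulary has no Korean merges: each Hangul syllable is 3 bytes in UTF-8, so a tokenizer without Korean merges emits roughly 3 tokens per syllable. A minimal stdlib sketch of that worst-case fallback (an illustration only, not the actual phi-2 tokenizer):

```python
# Worst-case BBPE fallback: with no learned Korean merges, every UTF-8
# byte becomes its own token. (Illustration only; the real phi-2
# tokenizer has a few generic byte merges, so it emits 25 tokens here.)
text = "아마존 세이지메이커"

byte_tokens = list(text.encode("utf-8"))
# 9 Hangul syllables x 3 bytes each + 1 ASCII space = 28 byte-level tokens
print(len(byte_tokens))  # 28
```

With Korean vocab and merges added, whole syllable blocks collapse into single tokens, which is why phi-2-ko needs only 6 tokens for the same phrase.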
### Continued pre-training

The dataset used for training is as follows.