daekeun-ml committed · commit d90dc49 · verified · parent: 59497c1

Update README.md

Files changed (1): README.md (+14 −0)
README.md CHANGED
@@ -24,6 +24,20 @@ The reasons for using the English corpus together are as follows:
Since my role is not as a working developer but as a solutions architect helping customers with quick PoCs/prototypes, and I was limited by the AWS GPU resources available, I trained on only 5 GB of data instead of hundreds of GB.

+ ### Vocab Expansion
+
+ | Model Name | Vocabulary Size | Description |
+ | --- | --- | --- |
+ | Original phi-2 | 50,295 | BBPE |
+ | **phi-2-ko** | 66,676 | BBPE; added Korean vocab and merges |
+
+ **Tokenizing "아마존 세이지메이커" (Amazon SageMaker)**
+
+ | Model | # of tokens | Tokens |
+ | --- | --- | --- |
+ | Original phi-2 | 25 | `['ì', 'ķ', 'Ħ', 'ë', '§', 'Ī', 'ì', '¡', '´', 'Ġì', 'Ħ', '¸', 'ìĿ', '´', 'ì', '§', 'Ģ', 'ë', '©', 'Ķ', 'ìĿ', '´', 'ì', '»', '¤']` |
+ | **phi-2-ko** | 6 | `['ìķĦë§Ī', 'ì¡´', 'ĠìĦ¸', 'ìĿ´ì§Ģ', 'ë©ĶìĿ´', '커']` |
+
### Continued pre-training

The dataset used for training is as follows.
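The token counts in the added table follow directly from how byte-level BPE (BBPE) handles UTF-8: each Korean syllable is 3 bytes, and GPT-2-style tokenizers (which phi-2 inherits) display raw bytes through a bytes-to-unicode mapping, which is why the tokens render as characters like `ì` and `ķ`. A minimal sketch, reimplementing GPT-2's `bytes_to_unicode` helper for illustration only (this is not phi-2-ko's actual code):

```python
def bytes_to_unicode():
    """GPT-2-style mapping from raw bytes to printable unicode characters.

    Bytes that are already printable keep their codepoint; the rest are
    shifted past 255 so every byte has a visible stand-in.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()

text = "아마존 세이지메이커"
# 10 characters, but 28 UTF-8 bytes (9 syllables x 3 bytes + 1 space),
# so a BBPE tokenizer with no Korean merges pays roughly one token per byte.
print(len(text), len(text.encode("utf-8")))               # 10 28

# '아' (bytes EC 95 84) renders as 'ìķĦ' -- the prefix of phi-2-ko's
# first merged token 'ìķĦë§Ī' ('아마') in the table above.
print("".join(mapping[b] for b in "아".encode("utf-8")))  # ìķĦ
```

This is why the original phi-2 needs roughly one token per byte (25 tokens) while phi-2-ko's added Korean merges cover multi-syllable spans, cutting the same string to 6 tokens.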