daekeun-ml committed · commit d90dc49 · verified · parent: 59497c1

Update README.md

Files changed (1): README.md (+14 −0)
README.md CHANGED
@@ -24,6 +24,20 @@ The reasons for using the English corpus together are as follows:
Since my role is not as a working developer but as a solutions architect helping customers with quick PoCs/prototypes, and I was limited by the AWS GPU resources available, I trained on only 5 GB of data instead of hundreds of GB.

+ ### Vocab Expansion
+
+ | Model Name | Vocabulary Size | Description |
+ | --- | --- | --- |
+ | Original phi-2 | 50,295 | BBPE |
+ | **phi-2-ko** | 66,676 | BBPE; added Korean vocab and merges |
+
+ **Tokenizing "아마존 세이지메이커" (Amazon SageMaker)**
+
+ | Model | # of tokens | Tokens |
+ | --- | --- | --- |
+ | Original phi-2 | 25 | `['ì', 'ķ', 'Ħ', 'ë', '§', 'Ī', 'ì', '¡', '´', 'Ġì', 'Ħ', '¸', 'ìĿ', '´', 'ì', '§', 'Ģ', 'ë', '©', 'Ķ', 'ìĿ', '´', 'ì', '»', '¤']` |
+ | **phi-2-ko** | 6 | `['ìķĦë§Ī', 'ì¡´', 'ĠìĦ¸', 'ìĿ´ì§Ģ', 'ë©ĶìĿ´', '커']` |
+
### Continued pre-training

The dataset used for training is as follows.
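The token counts in the added table follow directly from how byte-level BPE (BBPE) handles UTF-8: each Korean syllable is 3 bytes, and GPT-2-style tokenizers (which phi-2 inherits) display raw bytes through a bytes-to-unicode mapping, which is why the tokens render as characters like `ì` and `ķ`. A minimal sketch, reimplementing GPT-2's `bytes_to_unicode` helper for illustration only (this is not phi-2-ko's actual code):

```python
def bytes_to_unicode():
    """GPT-2-style mapping from raw bytes to printable unicode characters.

    Bytes that are already printable keep their codepoint; the rest are
    shifted past 255 so every byte has a visible stand-in.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()

text = "아마존 세이지메이커"
# 10 characters, but 28 UTF-8 bytes (9 syllables x 3 bytes + 1 space),
# so a BBPE tokenizer with no Korean merges pays roughly one token per byte.
print(len(text), len(text.encode("utf-8")))               # 10 28

# '아' (bytes EC 95 84) renders as 'ìķĦ' -- the prefix of phi-2-ko's
# first merged token 'ìķĦë§Ī' ('아마') in the table above.
print("".join(mapping[b] for b in "아".encode("utf-8")))  # ìķĦ
```

This is why the original phi-2 needs roughly one token per byte (25 tokens) while phi-2-ko's added Korean merges cover multi-syllable spans, cutting the same string to 6 tokens.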