Update README.md
README.md CHANGED
@@ -10,8 +10,66 @@ CSTinyLlama-1.2B is a Czech language model continuously pretrained on 168b train
Training was done on [Karolina](https://www.it4i.cz/en) cluster.

# Loss
Below, we
- (i) demonstrate the convergence speed of the released model (`TINYLLAMA1.2B_cztokenizer64k_align1.7k_tllama1.1B_C2048_lr1e-04_150k`, taken at the 160k step), and
- (ii) justify the contribution of our vocabulary swap method (sketched below). As for our other models (see [Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)), we swap 1.7K tokens in this run and compare the swapped model with a model trained from scratch using the same hyperparameters, `scratch_cztokenizer64k_tllama1.1B_C2048_lr1e-04_150k`.
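
A minimal, hypothetical sketch of the vocabulary swap idea: reuse the source model's embedding rows for tokens that also exist in the new Czech 64k vocabulary, and let the remaining rows keep their fresh initialization before continued pretraining. This is an illustration only, not the exact alignment procedure used for these runs; the two checkpoints below are simply the source TinyLlama and the released model, used here as stand-ins.

```python
import torch
import transformers

SRC = 'TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T'  # source model
TGT = 'BUT-FIT/CSTinyLlama-1.2B'  # stand-in for the Czech 64k tokenizer

src_tok = transformers.AutoTokenizer.from_pretrained(SRC)
tgt_tok = transformers.AutoTokenizer.from_pretrained(TGT, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(SRC)

# Keep a copy of the source embeddings, then grow the table to the new vocabulary size.
src_emb = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(len(tgt_tok))
tgt_emb = model.get_input_embeddings().weight.data

# Copy rows for token strings present in both vocabularies (the "swapped" tokens).
src_vocab, tgt_vocab = src_tok.get_vocab(), tgt_tok.get_vocab()
shared = set(src_vocab) & set(tgt_vocab)
print(f'copying {len(shared)} shared token embeddings')

with torch.no_grad():
    for tok in shared:
        tgt_emb[tgt_vocab[tok]] = src_emb[src_vocab[tok]]
```
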
## Train Cross-Entropy
<img src="figures/tllama_train.png" width="900"/>

## Test Perplexity
<img src="figures/tllama_test.png" width="900"/>

## Training parameters
Parameters not mentioned here are the same as for [TinyLLama-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T).

| **Name**                   | **Value**     | **Note**                                                                                                              |
|----------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Input sequences were concatenated up to `$max_seq_len` and separated by the EOS token (see the packing sketch below). |
| tokenizer_size             | 64k           |                                                                                                                         |
| max_seq_len                | 2048          |                                                                                                                         |
| batch_size                 | 512           |                                                                                                                         |
| learning_rate              | 1.0e-4        |                                                                                                                         |
| optimizer                  | LionW         |                                                                                                                         |
| optimizer_betas            | 0.9/0.95      |                                                                                                                         |
| optimizer_weight_decay     | 0             |                                                                                                                         |
| gradient_clipping_max_norm | 1.0           |                                                                                                                         |
| attn_impl                  | flash2        |                                                                                                                         |
| fsdp                       | SHARD_GRAD_OP | Optimized for A100 40GB GPUs.                                                                                           |
| precision                  | bf16          |                                                                                                                         |
| scheduler                  | cosine        |                                                                                                                         |
| scheduler_warmup           | 100 steps     |                                                                                                                         |
| scheduler_steps            | 200,000       |                                                                                                                         |
| scheduler_alpha            | 0.1           | The LR at the last step is 0.1 × the base LR (see the scheduler sketch below).                                          |
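
`dataset_type = Concat` means training examples are built by packing documents back-to-back rather than padding one document per sequence. A minimal sketch of that packing (an illustration of the idea with a hypothetical `pack_concat` helper, not the actual data pipeline):

```python
from typing import Iterable, Iterator, List

def pack_concat(docs: Iterable[List[int]], eos_token_id: int,
                max_seq_len: int = 2048) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by the EOS token,
    and emit fixed-length chunks of max_seq_len tokens."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_token_id)
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]

# Tiny example: two "documents" packed into sequences of length 8.
for seq in pack_concat([[1, 2, 3], [4, 5, 6, 7, 8, 9]], eos_token_id=0, max_seq_len=8):
    print(seq)  # [1, 2, 3, 0, 4, 5, 6, 7]
```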
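
The cosine schedule decays the learning rate from `learning_rate` down to `scheduler_alpha * learning_rate` over `scheduler_steps`, after the linear warmup. A small sketch of that shape, reconstructed from the table values (an assumed form for illustration, not the training code):

```python
import math

def lr_at_step(step: int, base_lr: float = 1.0e-4, warmup: int = 100,
               total_steps: int = 200_000, alpha: float = 0.1) -> float:
    """Linear warmup, then cosine decay towards alpha * base_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min((step - warmup) / (total_steps - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (alpha + (1.0 - alpha) * cosine)

print(lr_at_step(100))      # ~1.0e-4 right after warmup
print(lr_at_step(200_000))  # 1.0e-5 = 0.1 * base_lr at the last step
```
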
## Usage
```python
import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/CSTinyLlama-1.2B'

# The model uses custom code from the Hub, hence trust_remote_code=True.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Generate in bfloat16; the Czech prompt means "The most famous Czech writer ".
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
```