Update README.md
README.md CHANGED
@@ -10,8 +10,66 @@ CSTinyLlama-1.2B is a Czech language model continuously pretrained on 168b train
Training was done on [Karolina](https://www.it4i.cz/en) cluster.

# Loss
Below, we
- (i) demonstrate the convergence speed of the released model (`TINYLLAMA1.2B_cztokenizer64k_align1.7k_tllama1.1B_C2048_lr1e-04_150k`, taken at the 160k step), and
- (ii) justify the contribution of our vocabulary swap method (sketched below). As for our other models (see [Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)), we swap 1.7K tokens in this run and compare the swapped model with a model trained from scratch using the same hyperparameters, `scratch_cztokenizer64k_tllama1.1B_C2048_lr1e-04_150k`.
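
A minimal, hypothetical sketch of the vocabulary swap idea: reuse the source model's embedding rows for tokens that also exist in the new Czech 64k vocabulary, and let the remaining rows keep their fresh initialization before continued pretraining. This is an illustration only, not the exact alignment procedure used for these runs; the two checkpoints below are simply the source TinyLlama and the released model, used here as stand-ins.

```python
import torch
import transformers

SRC = 'TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T'  # source model
TGT = 'BUT-FIT/CSTinyLlama-1.2B'  # stand-in for the Czech 64k tokenizer

src_tok = transformers.AutoTokenizer.from_pretrained(SRC)
tgt_tok = transformers.AutoTokenizer.from_pretrained(TGT, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(SRC)

# Keep a copy of the source embeddings, then grow the table to the new vocabulary size.
src_emb = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(len(tgt_tok))
tgt_emb = model.get_input_embeddings().weight.data

# Copy rows for token strings present in both vocabularies (the "swapped" tokens).
src_vocab, tgt_vocab = src_tok.get_vocab(), tgt_tok.get_vocab()
shared = set(src_vocab) & set(tgt_vocab)
print(f'copying {len(shared)} shared token embeddings')

with torch.no_grad():
    for tok in shared:
        tgt_emb[tgt_vocab[tok]] = src_emb[src_vocab[tok]]
```
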
## Train Cross-Entropy
<img src="figures/tllama_train.png" width="900"/>

## Test Perplexity
<img src="figures/tllama_test.png" width="900"/>

## Training parameters
Parameters not mentioned here are the same as for [TinyLLama-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T).

| **Name**                   | **Value**     | **Note**                                                                                                              |
|----------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Input sequences were concatenated up to `$max_seq_len` and separated by the EOS token (see the packing sketch below). |
| tokenizer_size             | 64k           |                                                                                                                         |
| max_seq_len                | 2048          |                                                                                                                         |
| batch_size                 | 512           |                                                                                                                         |
| learning_rate              | 1.0e-4        |                                                                                                                         |
| optimizer                  | LionW         |                                                                                                                         |
| optimizer_betas            | 0.9/0.95      |                                                                                                                         |
| optimizer_weight_decay     | 0             |                                                                                                                         |
| gradient_clipping_max_norm | 1.0           |                                                                                                                         |
| attn_impl                  | flash2        |                                                                                                                         |
| fsdp                       | SHARD_GRAD_OP | Optimized for A100 40GB GPUs.                                                                                           |
| precision                  | bf16          |                                                                                                                         |
| scheduler                  | cosine        |                                                                                                                         |
| scheduler_warmup           | 100 steps     |                                                                                                                         |
| scheduler_steps            | 200,000       |                                                                                                                         |
| scheduler_alpha            | 0.1           | The LR at the last step is 0.1 × the base LR (see the scheduler sketch below).                                          |
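
`dataset_type = Concat` means training examples are built by packing documents back-to-back rather than padding one document per sequence. A minimal sketch of that packing (an illustration of the idea with a hypothetical `pack_concat` helper, not the actual data pipeline):

```python
from typing import Iterable, Iterator, List

def pack_concat(docs: Iterable[List[int]], eos_token_id: int,
                max_seq_len: int = 2048) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by the EOS token,
    and emit fixed-length chunks of max_seq_len tokens."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_token_id)
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]

# Tiny example: two "documents" packed into sequences of length 8.
for seq in pack_concat([[1, 2, 3], [4, 5, 6, 7, 8, 9]], eos_token_id=0, max_seq_len=8):
    print(seq)  # [1, 2, 3, 0, 4, 5, 6, 7]
```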
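
The cosine schedule decays the learning rate from `learning_rate` down to `scheduler_alpha * learning_rate` over `scheduler_steps`, after the linear warmup. A small sketch of that shape, reconstructed from the table values (an assumed form for illustration, not the training code):

```python
import math

def lr_at_step(step: int, base_lr: float = 1.0e-4, warmup: int = 100,
               total_steps: int = 200_000, alpha: float = 0.1) -> float:
    """Linear warmup, then cosine decay towards alpha * base_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min((step - warmup) / (total_steps - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (alpha + (1.0 - alpha) * cosine)

print(lr_at_step(100))      # ~1.0e-4 right after warmup
print(lr_at_step(200_000))  # 1.0e-5 = 0.1 * base_lr at the last step
```
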
## Usage
```python
import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/CSTinyLlama-1.2B'

# The model uses custom code from the Hub, hence trust_remote_code=True.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Generate in bfloat16; the Czech prompt means "The most famous Czech writer ".
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
```