Text Generation · Transformers · Safetensors · Czech · llama · text-generation-inference
mfajcik committed · Commit 4c6b15f · verified · 1 Parent(s): caa7831

Update README.md

Files changed (1): README.md (+59 −1)
README.md CHANGED
@@ -10,8 +10,66 @@ CSTinyLlama-1.2B is a Czech language model continuously pretrained on 168b train
  Training was done on [Karolina](https://www.it4i.cz/en) cluster.
  
  # Loss
+ Below we:
+ - (i) demonstrate the convergence speed of the released model (`TINYLLAMA1.2B_cztokenizer64k_align1.7k_tllama1.1B_C2048_lr1e-04_150k`, at the 160k step), and
+ - (ii) justify the contribution of our vocabulary swap method by comparing the swapped model with a model trained from scratch using the same hyperparameters (`scratch_cztokenizer64k_tllama1.1B_C2048_lr1e-04_150k`). In this run we swap 1.7K tokens, similarly to our other models (see [Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)); a sketch of the swap idea follows this list.
+ 
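A minimal sketch of the vocabulary swap idea referenced in item (ii). The training code is not part of this commit; `old_model`, `old_tokenizer`, `new_model` and `new_tokenizer` are hypothetical placeholders for the original English TinyLlama checkpoint/tokenizer and the Czech 64k-tokenizer model being initialized, and the exact alignment used by the authors may differ from this plain string-overlap rule:

```python
import torch

def swap_overlapping_embeddings(old_model, old_tokenizer, new_model, new_tokenizer):
    """Copy input-embedding rows for tokens present in both vocabularies (illustrative only)."""
    old_vocab = old_tokenizer.get_vocab()   # token string -> id in the old vocabulary
    new_vocab = new_tokenizer.get_vocab()   # token string -> id in the new vocabulary

    old_emb = old_model.get_input_embeddings().weight
    new_emb = new_model.get_input_embeddings().weight

    shared = [tok for tok in new_vocab if tok in old_vocab]
    with torch.no_grad():
        for tok in shared:
            new_emb[new_vocab[tok]] = old_emb[old_vocab[tok]]
    return len(shared)  # roughly the 1.7K swapped tokens mentioned above

```

The same copy would typically be applied to the output (LM head) embeddings as well; tokens without a counterpart in the old vocabulary keep their fresh initialization.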
  ## Train Cross-Entropy
  <img src="figures/tllama_train.png" width="900"/>
  
  ## Test Perplexity
- <img src="figures/tllama_test.png" width="900"/>
+ <img src="figures/tllama_test.png" width="900"/>
+ 
+ ## Training parameters
+ Parameters not mentioned here are the same as for [TinyLLama-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T).
+ 
+ | **Name**                   | **Value**     | **Note**                                                                                                      |
+ |----------------------------|---------------|---------------------------------------------------------------------------------------------------------------|
+ | dataset_type               | Concat        | Input sequences were concatenated up to `$max_seq_len` and separated by the EOS token (see the packing sketch below the table). |
+ | tokenizer_size             | 64k           |                                                                                                               |
+ | max_seq_len                | 2048          |                                                                                                               |
+ | batch_size                 | 512           |                                                                                                               |
+ | learning_rate              | 1.0e-4        |                                                                                                               |
+ | optimizer                  | LionW         |                                                                                                               |
+ | optimizer_betas            | 0.9/0.95      |                                                                                                               |
+ | optimizer_weight_decay     | 0             |                                                                                                               |
+ | gradient_clipping_max_norm | 1.0           |                                                                                                               |
+ | attn_impl                  | flash2        |                                                                                                               |
+ | fsdp                       | SHARD_GRAD_OP | Optimized for A100 40GB GPUs.                                                                                 |
+ | precision                  | bf16          |                                                                                                               |
+ | scheduler                  | cosine        | See the schedule sketch below the table.                                                                      |
+ | scheduler_warmup           | 100 steps     |                                                                                                               |
+ | scheduler_steps            | 200,000       |                                                                                                               |
+ | scheduler_alpha            | 0.1           | The LR at the last scheduler step is 0.1 × the peak LR.                                                       |
+ 
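A small sketch of what the `Concat` dataset type in the table refers to; this is an illustration of the idea, not the actual data-loading code. Tokenized documents are joined into one stream with the EOS token as separator and the stream is cut into fixed-length blocks:

```python
def concat_pack(token_docs, eos_id, max_seq_len=2048):
    """Pack tokenized documents into fixed-length training blocks, separated by EOS (illustrative)."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # EOS divides consecutive documents
    # Cut the stream into full blocks; the trailing remainder is dropped in this sketch.
    return [stream[i:i + max_seq_len]
            for i in range(0, len(stream) - max_seq_len + 1, max_seq_len)]
```

For example, documents of 1500 and 600 tokens end up in a single 2048-token block: the first document, an EOS token, and the first 547 tokens of the second.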
+ 
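The scheduler rows (cosine, 100 warmup steps, 200,000 steps, alpha = 0.1) describe a linear warmup followed by a cosine decay that bottoms out at `alpha ×` the peak learning rate. The function below is a sketch under that reading, not the exact scheduler implementation used in training:

```python
import math

def lr_at_step(step, peak_lr=1.0e-4, warmup=100, total_steps=200_000, alpha=0.1):
    """Linear warmup followed by cosine decay to alpha * peak_lr (illustrative)."""
    if step < warmup:
        return peak_lr * step / warmup                       # linear warmup
    progress = min((step - warmup) / (total_steps - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))      # 1 -> 0 over the decay phase
    return peak_lr * (alpha + (1.0 - alpha) * cosine)        # floor at alpha * peak_lr

print(lr_at_step(100))       # peak LR (1.0e-4) at the end of warmup
print(lr_at_step(200_000))   # ≈ 1.0e-5 = 0.1 * peak LR at the last scheduler step
```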
+ ## Usage
+ ```python
+ import torch
+ import transformers
+ from transformers import pipeline
+ 
+ name = 'BUT-FIT/CSTinyLlama-1.2B'
+ 
+ # Load config, model and tokenizer from the Hub; trust_remote_code=True allows
+ # any custom model code shipped with the repository to run.
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
+ model = transformers.AutoModelForCausalLM.from_pretrained(
+     name,
+     config=config,
+     trust_remote_code=True
+ )
+ 
+ tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
+ 
+ pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
+ 
+ # Sample a continuation in bf16; the prompt means "The best-known Czech writer ".
+ with torch.autocast('cuda', dtype=torch.bfloat16):
+     print(
+         pipe('Nejznámějším českým spisovatelem ',
+              max_new_tokens=100,
+              top_p=0.95,
+              repetition_penalty=1.0,
+              do_sample=True,
+              use_cache=True))
+ ```
+ 
+ 
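As a small variant of the usage example above (an assumption on our part, not something stated in the model card), the weights can also be loaded directly in bfloat16 via `torch_dtype` instead of wrapping generation in `torch.autocast`:

```python
import torch
import transformers

name = 'BUT-FIT/CSTinyLlama-1.2B'

# Load weights in bf16, matching the training precision listed in the table above.
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to('cuda:0')
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer('Nejznámějším českým spisovatelem ', return_tensors='pt').to('cuda:0')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```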