---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_simple_objectives
  results: []
---

# distily_bench_gpt2_simple_objectives

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
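
As a quick sanity check, the student can be loaded like any other causal LM checkpoint. The repo id below is an assumption inferred from the model name above; substitute the actual Hub path if it differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id, inferred from the model name above.
model_id = "lapp0/distily_bench_gpt2_simple_objectives"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Distillation compresses a teacher model by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```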

The model achieves the following results on the evaluation set:

- eval_enwikippl: 433.0859
- eval_frwikippl: 2823.5620
- eval_zhwikippl: 4932.8379
- eval_loss: 21.1035
- eval_runtime: 34.4485
- eval_samples_per_second: 58.058
- eval_steps_per_second: 7.257
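
The `*wikippl` metrics are perplexities on English, French, and Chinese Wikipedia text; perplexity is the exponential of the mean token-level cross-entropy. A minimal sketch of that relationship, reusing the model and tokenizer loaded above (the sample text is a stand-in, not Distily's eval data):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    # Perplexity = exp(mean cross-entropy over tokens).
    enc = tokenizer(text, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))
```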

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), activations_weight=0.1, activations_loss_fn=(fn:mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:mse_loss())) (sketched after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
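
The MultiObjective above combines a KL-divergence loss on the student/teacher logits (weight 1) with an MSE loss on hidden activations (weight 0.1); the attention-map loss is disabled (weight 0). Below is a minimal PyTorch sketch of that weighted combination; Distily's actual implementation details (layer mapping, reductions, temperature handling) may differ.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(student_logits, teacher_logits,
                         student_hidden, teacher_hidden,
                         logits_weight=1.0, activations_weight=0.1):
    # KL divergence between the teacher's and student's token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # MSE between corresponding hidden states (assumes matching shapes).
    mse = torch.stack(
        [F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden)]
    ).mean()
    return logits_weight * kl + activations_weight * mse
```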

### Resource Usage

Peak GPU Memory: 8.0893 GB
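
One way such a figure is typically read out with PyTorch (whether Distily uses this exact accounting is an assumption):

```python
import torch

# After the training run, on the training device:
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```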

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 54069.2930 | 57285.3438 | 69.6280 | 34.3114 | 58.29 | 7.286 | 54227.1016 |
| 1000 | 0.0404 | 1149.4497 | 6758.9292 | 22.9270 | 34.3626 | 58.203 | 7.275 | 55191.4258 |
| 2000 | 0.0808 | 848.3209 | 5094.3662 | 22.2020 | 34.3795 | 58.174 | 7.272 | 14284.0166 |
| 3000 | 0.1212 | 700.4797 | 4480.8540 | 21.8288 | 34.371 | 58.189 | 7.274 | 7045.9990 |
| 4000 | 0.1616 | 615.9059 | 3635.8176 | 21.5565 | 34.4355 | 58.08 | 7.26 | 3316.0488 |
| 5000 | 0.2020 | 556.0313 | 3492.5959 | 21.4455 | 34.3262 | 58.265 | 7.283 | 4788.7505 |
| 6000 | 0.2424 | 528.5394 | 3328.1577 | 21.2810 | 34.3681 | 58.193 | 7.274 | 3058.2744 |
| 7000 | 0.2828 | 479.2375 | 2988.6665 | 21.2197 | 34.3863 | 58.163 | 7.27 | 3689.9192 |
| 8000 | 0.3232 | 448.9053 | 2847.9541 | 21.0785 | 34.5149 | 57.946 | 7.243 | 1743.5521 |
| 9000 | 0.3636 | 433.0859 | 2823.5620 | 21.1035 | 34.4485 | 58.058 | 7.257 | 4932.8379 |
| 10000 | 0.4040 | 423.8369 | 2843.9414 | 21.0105 | 34.4298 | 58.089 | 7.261 | 3959.4795 |
| 11000 | 0.4444 | 394.3074 | 2524.8374 | 20.9575 | 34.5178 | 57.941 | 7.243 | 6243.0879 |
| 12000 | 0.4848 | 385.4673 | 2595.5920 | 20.9185 | 34.4535 | 58.049 | 7.256 | 17321.8613 |
| 13000 | 0.5253 | 369.9537 | 2477.9255 | 20.8475 | 34.4953 | 57.979 | 7.247 | 2443.6860 |
| 14000 | 0.5657 | 358.8618 | 2519.8567 | 20.7897 | 34.9016 | 57.304 | 7.163 | 3639.9983 |
| 15000 | 0.6061 | 343.0577 | 2395.4692 | 20.7710 | 34.3143 | 58.285 | 7.286 | 1816.2738 |
| 16000 | 0.6465 | 343.8312 | 2195.5515 | 20.7428 | 34.184 | 58.507 | 7.313 | 14709.8760 |
| 17000 | 0.6869 | 336.7496 | 2234.2798 | 20.7590 | 34.4691 | 58.023 | 7.253 | 6489.5991 |
| 18000 | 0.7273 | 338.3747 | 2191.5310 | 20.6583 | 34.4634 | 58.033 | 7.254 | 2819.0298 |
| 19000 | 0.7677 | 324.3280 | 2071.9238 | 20.6345 | 34.4307 | 58.088 | 7.261 | 3877.8486 |
| 20000 | 0.8081 | 315.1911 | 2056.7864 | 20.5710 | 34.2186 | 58.448 | 7.306 | 3151.9771 |
| 21000 | 0.8485 | 315.4604 | 2161.1489 | 20.5432 | 34.5086 | 57.957 | 7.245 | 3105.1853 |
| 22000 | 0.8889 | 324.6304 | 1950.2999 | 20.6125 | 34.2565 | 58.383 | 7.298 | 2055.8921 |
| 23000 | 0.9293 | 313.9452 | 1958.0153 | 20.5900 | 34.5413 | 57.902 | 7.238 | 4405.8896 |
| 24000 | 0.9697 | 311.3475 | 1918.9283 | 20.5405 | 34.2718 | 58.357 | 7.295 | 11800.9756 |
| 24750 | 1.0 | 303.2348 | 1956.3597 | 20.4700 | 34.3296 | 58.259 | 7.282 | 15104.0020 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0