---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_gpt2_simple_objectives2
    results: []
---

distily_bench_gpt2_simple_objectives2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
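Since the student keeps the gpt2 architecture, the checkpoint loads with the standard Transformers API. A minimal usage sketch; the hub ID below is an assumption inferred from the model name, not confirmed by this card:

```python
# Usage sketch; the hub ID is an assumption inferred from the model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_simple_objectives2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation trains a small model to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```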

It achieves the following results on the evaluation set:

  • eval_enwikippl: 512.2950
  • eval_frwikippl: 3101.1487
  • eval_zhwikippl: 191798.5312
  • eval_loss: 0.1841
  • eval_runtime: 38.3167
  • eval_samples_per_second: 52.197
  • eval_steps_per_second: 6.525
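The enwikippl, frwikippl, and zhwikippl metrics are presumably perplexities on English, French, and Chinese Wikipedia text, and eval_runtime is in seconds. Note that eval_loss is the distillation objective, not cross-entropy, so these perplexities are not exp(eval_loss). A generic sketch of how a causal LM perplexity can be computed on one text sample (the exact data slices and windowing used here are unspecified):

```python
# Perplexity sketch: exp of the mean token-level cross-entropy on one sample.
import torch

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean cross-entropy;
        # the one-token label shift happens inside the model.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```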

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:jsd_loss()), activations_weight=0.2, activations_loss_fn=(fn:soft_mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:soft_mse_loss())) (sketched after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
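The distillation_objective above combines a Jensen-Shannon divergence over the logits (weight 1) with a soft MSE over intermediate activations (weight 0.2); the attention term is disabled (weight 0). Distily's exact loss implementations are not reproduced in this card, so the following is only a minimal PyTorch sketch of that weighting, with a plain MSE over matching hidden states standing in for soft_mse_loss:

```python
# Sketch of the combined objective; not Distily's actual implementation.
import math
import torch
import torch.nn.functional as F

def jsd_loss(student_logits, teacher_logits):
    # Jensen-Shannon divergence between student and teacher token distributions.
    log_p = F.log_softmax(student_logits, dim=-1)
    log_q = F.log_softmax(teacher_logits, dim=-1)
    log_m = torch.logsumexp(torch.stack([log_p, log_q]), dim=0) - math.log(2)
    kl_pm = F.kl_div(log_m, log_p, log_target=True, reduction="batchmean")
    kl_qm = F.kl_div(log_m, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pm + kl_qm)

def multi_objective(student_out, teacher_out, logits_weight=1.0, activations_weight=0.2):
    loss = logits_weight * jsd_loss(student_out.logits, teacher_out.logits)
    # Plain MSE over paired hidden states stands in for soft_mse_loss (assumption).
    pairs = list(zip(student_out.hidden_states, teacher_out.hidden_states))
    act_loss = sum(F.mse_loss(s, t) for s, t in pairs) / len(pairs)
    return loss + activations_weight * act_loss
```

Both outputs are assumed to come from forward passes with output_hidden_states=True, and with student and teacher sharing the same number of layers and hidden size (true here, since the student is also a gpt2 architecture).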

Resource Usage

Peak GPU Memory: 10.3934 GB
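A peak-memory figure like this is typically read from PyTorch's CUDA allocator statistics; a minimal sketch, assuming a single CUDA device and GB meaning 1024^3 bytes:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training loop here ...
peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```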

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55129.1953 | 56939.0469 | 0.5624 | 38.2552 | 52.28 | 6.535 | 54824.1562 |
| 1000 | 0.0404 | 1574.1282 | 9231.6934 | 0.2450 | 38.197 | 52.36 | 6.545 | 323126.3438 |
| 2000 | 0.0808 | 1109.2875 | 6477.1401 | 0.2300 | 38.4705 | 51.988 | 6.498 | 361869.5 |
| 3000 | 0.1212 | 959.5045 | 5362.6807 | 0.2196 | 38.2081 | 52.345 | 6.543 | 331961.1875 |
| 4000 | 0.1616 | 849.7711 | 4466.3486 | 0.2122 | 38.1788 | 52.385 | 6.548 | 229626.5781 |
| 5000 | 0.2020 | 715.5438 | 3672.9194 | 0.2054 | 38.1399 | 52.439 | 6.555 | 179842.4219 |
| 6000 | 0.2424 | 674.3274 | 3799.3496 | 0.1993 | 38.1179 | 52.469 | 6.559 | 222800.1875 |
| 7000 | 0.2828 | 571.2620 | 3382.5713 | 0.1930 | 38.2139 | 52.337 | 6.542 | 174077.0156 |
| 8000 | 0.3232 | 530.7191 | 2989.9294 | 0.1883 | 38.251 | 52.286 | 6.536 | 218381.6875 |
| 9000 | 0.3636 | 512.2950 | 3101.1487 | 0.1841 | 38.3167 | 52.197 | 6.525 | 191798.5312 |
| 10000 | 0.4040 | 465.7365 | 2662.6936 | 0.1801 | 38.3474 | 52.155 | 6.519 | 149415.4531 |
| 11000 | 0.4444 | 435.4128 | 2513.4690 | 0.1768 | 38.3141 | 52.2 | 6.525 | 239843.7656 |
| 12000 | 0.4848 | 418.3436 | 2475.8303 | 0.1744 | 38.4612 | 52.001 | 6.5 | 213309.0469 |
| 13000 | 0.5253 | 386.7266 | 2253.5813 | 0.1722 | 38.3131 | 52.201 | 6.525 | 170715.9219 |
| 14000 | 0.5657 | 387.9898 | 2286.2295 | 0.1699 | 38.4198 | 52.057 | 6.507 | 168226.6875 |
| 15000 | 0.6061 | 381.0330 | 2336.0906 | 0.1681 | 38.3633 | 52.133 | 6.517 | 192619.9219 |
| 16000 | 0.6465 | 358.8618 | 2008.6333 | 0.1662 | 38.5711 | 51.852 | 6.482 | 109902.5547 |
| 17000 | 0.6869 | 354.6786 | 1894.4617 | 0.1651 | 38.3338 | 52.173 | 6.522 | 185501.1719 |
| 18000 | 0.7273 | 351.6073 | 1982.4641 | 0.1639 | 38.3968 | 52.088 | 6.511 | 157025.25 |
| 19000 | 0.7677 | 349.6740 | 2298.5125 | 0.1630 | 38.4227 | 52.053 | 6.507 | 302094.75 |
| 20000 | 0.8081 | 331.0454 | 1852.9810 | 0.1615 | 38.3923 | 52.094 | 6.512 | 188850.5469 |
| 21000 | 0.8485 | 325.8680 | 1841.2605 | 0.1604 | 38.5044 | 51.942 | 6.493 | 98031.1953 |
| 22000 | 0.8889 | 325.0340 | 2070.4631 | 0.1595 | 38.4226 | 52.053 | 6.507 | 161017.3438 |
| 23000 | 0.9293 | 312.5102 | 1947.8259 | 0.1585 | 38.4551 | 52.009 | 6.501 | 129694.3594 |
| 24000 | 0.9697 | 313.1418 | 1909.7499 | 0.1579 | 38.2352 | 52.308 | 6.538 | 171997.4531 |
| 24750 | 1.0 | 319.7268 | 2126.0857 | 0.1576 | 38.5377 | 51.897 | 6.487 | 784044.3125 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0