---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.12b_gpt2
    results: []
---

# distily_bench_obj_cross_v2.12b_gpt2

This student model was distilled from the teacher model gpt2 on an unspecified dataset, using the Distily library.

It achieves the following results on the evaluation set:

- eval_enwikippl: 249.0
- eval_frwikippl: 600.0
- eval_zhwikippl: 186.0
- eval_tinystoriesppl: 220.0
- eval_loss: 0.9819
- eval_runtime: 12.7319
- eval_samples_per_second: 47.126
- eval_steps_per_second: 11.781
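
As a quick consistency check, the reported eval runtime, samples-per-second, and steps-per-second figures imply roughly 600 evaluation samples processed in about 150 steps, i.e. a per-step batch of 4 (matching the eval_batch_size listed under the training hyperparameters below). A minimal sketch of the arithmetic:

```python
# Sanity-check that the reported eval throughput numbers are mutually
# consistent: samples ≈ runtime * samples_per_second, and
# samples_per_second / steps_per_second ≈ eval batch size.

eval_runtime = 12.7319        # seconds, as reported
samples_per_second = 47.126
steps_per_second = 11.781

num_samples = eval_runtime * samples_per_second      # ≈ 600 eval samples
num_steps = eval_runtime * steps_per_second          # ≈ 150 eval steps
implied_batch = samples_per_second / steps_per_second

print(round(num_samples), round(num_steps), round(implied_batch))  # → 600 150 4
```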

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.5
- num_epochs: 1.0
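
The distillation objective above puts all of its weight on a KL-divergence loss over the logits (the hidden-state and attention components have weight 0). Distily's internal implementation is not shown here; the following is a minimal, self-contained sketch of a KL logits loss for a single token position, with hypothetical helper names:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); teacher is p, student is q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def logits_distillation_loss(teacher_logits, student_logits):
    # Weight-1 KL loss on the logits, analogous to the objective above.
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
print(logits_distillation_loss(teacher, student))  # small positive value
```

The loss is zero only when the student's distribution matches the teacher's exactly, and positive otherwise, which is what drives the student's logits toward the teacher's during training.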

## Resource Usage

Peak GPU Memory: 4.1856 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 837518622720.0 | 78065325572096.0 | 19.8108 | 12.6525 | 47.421 | 11.855 | 2667577344.0 | 36009005809664.0 |
| 1500 | 0.1010 | 1472.0 | 8832.0 | 2.5979 | 12.6262 | 47.52 | 11.88 | 1056.0 | 19200.0 |
| 3000 | 0.2020 | 500.0 | 3040.0 | 1.8976 | 12.7775 | 46.958 | 11.739 | 354.0 | 552.0 |
| 4500 | 0.3030 | 312.0 | 1320.0 | 1.5456 | 12.7017 | 47.238 | 11.809 | 249.0 | 260.0 |
| 6000 | 0.4040 | 234.0 | 940.0 | 1.3441 | 12.5854 | 47.674 | 11.919 | 204.0 | 158.0 |
| 7500 | 0.5051 | 190.0 | 656.0 | 1.1277 | 12.5936 | 47.643 | 11.911 | 164.0 | 152.0 |
| 9000 | 0.6061 | 249.0 | 600.0 | 0.9819 | 12.7319 | 47.126 | 11.781 | 220.0 | 186.0 |
| 10500 | 0.7071 | 141.0 | 436.0 | 0.8717 | 12.5874 | 47.667 | 11.917 | 121.0 | 128.0 |
| 12000 | 0.8081 | 193.0 | 482.0 | 0.8292 | 12.6439 | 47.454 | 11.863 | 163.0 | 135.0 |
| 13500 | 0.9091 | 202.0 | 504.0 | 0.8078 | 12.5913 | 47.652 | 11.913 | 176.0 | 136.0 |
| 14850 | 1.0 | 196.0 | 490.0 | 0.8045 | 12.677 | 47.33 | 11.832 | 170.0 | 135.0 |
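
With lr_scheduler_warmup_ratio: 0.5 and the 14,850 total steps shown in the table above, a linear schedule with warmup ramps the learning rate from 0 to the peak of 4e-05 over the first 7,425 steps, then decays it linearly back to 0. A sketch of that schedule (assuming the standard linear warmup-then-decay shape; `linear_schedule_lr` is a hypothetical helper, not part of Distily):

```python
def linear_schedule_lr(step, total_steps=14850, base_lr=4e-05, warmup_ratio=0.5):
    # Linear warmup over the first warmup_ratio of training, then linear
    # decay to zero, matching lr_scheduler_type=linear with warmup_ratio=0.5.
    warmup_steps = int(total_steps * warmup_ratio)  # 7425 steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

for s in (0, 7425, 14850):
    print(s, linear_schedule_lr(s))
```

Note that with a 0.5 warmup ratio the learning rate only reaches its peak at the halfway point of the run, which is consistent with the eval loss still dropping steadily through the second half of the table.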

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0