---
base_model: gpt2
library_name: distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: gpt2_model_card_distily_test
    results: []
---

gpt2_model_card_distily_test

This student model is distilled from the teacher model gpt2 using an unspecified dataset.

The Distily library was used for this distillation.
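
The student shares GPT-2's architecture and tokenizer, so it can be loaded and queried with the standard transformers API. A minimal sketch, assuming the checkpoint is published under the repository id lapp0/gpt2_model_card_distily_test (inferred from this card's model name, not stated explicitly):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id is an assumption based on the model name in this card;
# substitute the actual hub id if it differs.
repo_id = "lapp0/gpt2_model_card_distily_test"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation as a quick sanity check of the student.
inputs = tokenizer("Knowledge distillation compresses a model by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```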

It achieves the following results on the evaluation set (a perplexity sketch follows this list):

  • eval_enwikippl: 3251.3369
  • eval_frwikippl: 12842.3994
  • eval_zhwikippl: 91987.7734
  • eval_loss: 2288.0
  • eval_runtime: 0.0553
  • eval_samples_per_second: 18.087
  • eval_steps_per_second: 18.087
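
The three perplexity metrics (eval_enwikippl, eval_frwikippl, eval_zhwikippl) appear to be perplexity measured on English, French, and Chinese Wikipedia samples respectively. As a rough illustration of how a causal-LM perplexity of this kind can be computed (not Distily's exact evaluation code), one can exponentiate the mean token-level cross-entropy:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Exponentiated mean token cross-entropy; illustrative, not Distily's exact metric."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the shifted LM loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(model, tokenizer, "Paris is the capital of France."))
```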

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: reverse_kl (see the sketch after this list)
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 1
  • eval_batch_size: 2
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • num_epochs: 1.0
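
With loss_fn set to reverse_kl, the distillation objective is the KL divergence taken in the student-to-teacher direction, KL(student || teacher), which tends to be mode-seeking: the student concentrates on the teacher's high-probability tokens rather than spreading mass over the whole distribution. Below is a minimal PyTorch sketch of a reverse-KL logit-matching loss; it illustrates the technique only and is not Distily's implementation (temperature scaling and the hidden-state terms implied by distillation_strategy: logits_activations are omitted):

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher) over the vocabulary, averaged across positions.

    Illustrative sketch; Distily's actual loss may differ in reduction,
    temperature, and additional activation-matching terms.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_probs = student_log_probs.exp()
    # Expectation is taken under the *student* distribution (reverse KL).
    return (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1).mean()

# Toy tensors with GPT-2's vocabulary size: (batch, sequence, vocab).
student_logits = torch.randn(1, 8, 50257, requires_grad=True)
teacher_logits = torch.randn(1, 8, 50257)

loss = reverse_kl_loss(student_logits, teacher_logits.detach())
loss.backward()
```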

Resource Usage

Peak GPU Memory: 1.2452 GB

Model Results

| epoch | eval_enwikippl | eval_frwikippl | eval_loss | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_zhwikippl | step |
|-------|----------------|----------------|-----------|--------------|-------------------------|-----------------------|----------------|------|
| 0      | 58331.5781 | 58190.1172 | 6944.0 | 0.0763 | 13.107 | 13.107 | 54568.5117  | 0   |
| 0.2513 | 3251.3369  | 12842.3994 | 2288.0 | 0.0553 | 18.087 | 18.087 | 91987.7734  | 50  |
| 0.5025 | 2778.4973  | 13039.9355 | 2080.0 | 0.0561 | 17.833 | 17.833 | 100748.5312 | 100 |
| 0.7538 | 2581.9565  | 12580.9199 | 2048.0 | 0.0551 | 18.153 | 18.153 | 110134.0156 | 150 |

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • Pytorch 2.3.0
  • Datasets 2.20.0