End of training

Files changed (6)

README.md +15 -15
logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/completed.flag +0 -0
logs/learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724118938.5f530b1cf724 +3 -0
logs/learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724121341.5f530b1cf724 +3 -0
model.safetensors +1 -1
training_args.bin +1 -1
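
The log directory names in the file list appear to encode the hyperparameters of each run as comma-separated `key=value` pairs. The following is a minimal sketch of how such a name could be turned back into a dictionary; the helper `parse_run_dir` is hypothetical, and the naming convention is inferred from the paths shown here rather than taken from Distily's documentation.

```python
def parse_run_dir(name: str) -> dict:
    """Split a run directory name such as
    'learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5'
    into a {parameter: value} dictionary (hypothetical helper)."""
    params = {}
    for part in name.split(","):
        key, _, raw = part.strip().partition("=")
        try:
            # Integers such as the batch size stay integers.
            params[key] = int(raw)
        except ValueError:
            # Everything else in these names is a float (e.g. 4e-05, 0.5).
            params[key] = float(raw)
    return params


run = parse_run_dir(
    "learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5"
)
print(run)
```

Parsing the names this way makes it easier to group TensorBoard event files by learning rate when several runs share one `logs/` directory.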

README.md CHANGED

@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 111.0
-- eval_frwikippl: 400.0
-- eval_zhwikippl: 122.5
-- eval_tinystoriesppl: 91.0
-- eval_loss: 0.8789
-- eval_runtime: 12.6655
-- eval_samples_per_second: 47.373
-- eval_steps_per_second: 11.843
+- eval_enwikippl: 173.0
+- eval_frwikippl: 624.0
+- eval_zhwikippl: 160.0
+- eval_tinystoriesppl: 145.0
+- eval_loss: 1.1443
+- eval_runtime: 12.6089
+- eval_samples_per_second: 47.585
+- eval_steps_per_second: 11.896
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -48,7 +48,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.0001
+- learning_rate: 4e-05
 - train_batch_size: 8
 - eval_batch_size: 4
 - seed: 42
@@ -64,12 +64,12 @@ Peak GPU Memory: 7.9381 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** |  | 43.75 | 61.75 |  |  |  |  | 11.8125 | 19.125 |
-| 0 | 0 | 828928688128.0 | 52226802319360.0 | 21.0583 | 12.4569 | 48.166 | 12.042 | 5167382528.0 | 20753281974272.0 |
-| 1500 | 0.2020 | 512.0 | 3472.0 | 1.8762 | 12.4942 | 48.022 | 12.006 | 344.0 | 868.0 |
-| 3000 | 0.4040 | 237.0 | 944.0 | 1.4192 | 12.543 | 47.835 | 11.959 | 207.0 | 223.0 |
-| 4500 | 0.6061 | 148.0 | 532.0 | 1.1068 | 12.5192 | 47.926 | 11.982 | 135.0 | 158.0 |
-| 6000 | 0.8081 | 118.0 | 430.0 | 0.9155 | 12.5398 | 47.848 | 11.962 | 98.0 | 122.0 |
-| 7425 | 1.0 | 111.0 | 400.0 | 0.8789 | 12.6655 | 47.373 | 11.843 | 91.0 | 122.5 |
+| 0 | 0 | 1176821039104.0 | 72567767433216.0 | 20.1450 | 12.5909 | 47.653 | 11.913 | 3019898880.0 | 12713103196160.0 |
+| 1500 | 0.2020 | 864.0 | 4992.0 | 2.2099 | 12.5621 | 47.763 | 11.941 | 556.0 | 6784.0 |
+| 3000 | 0.4040 | 370.0 | 1720.0 | 1.6174 | 12.5358 | 47.863 | 11.966 | 270.0 | 286.0 |
+| 4500 | 0.6061 | 216.0 | 808.0 | 1.2965 | 12.5792 | 47.698 | 11.924 | 174.0 | 202.0 |
+| 6000 | 0.8081 | 178.0 | 676.0 | 1.1639 | 12.4818 | 48.07 | 12.017 | 149.0 | 162.0 |
+| 7425 | 1.0 | 173.0 | 624.0 | 1.1443 | 12.6089 | 47.585 | 11.896 | 145.0 | 160.0 |
 ### Framework versions
 - Distily 0.2.0
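
The README diff replaces the final metrics of the `learning_rate: 0.0001` run with those of the `learning_rate: 4e-05` run. As a rough sketch, the relative change in each final-step metric, and the epoch fraction shown in the `epoch` column, can be reproduced from the numbers in the diff. The helper names `relative_change` and `epoch_fraction` are illustrative only and are not part of Distily.

```python
# Final-step (step 7425) metrics copied from the README diff.
old_run = {  # learning_rate: 0.0001
    "enwikippl": 111.0, "frwikippl": 400.0, "zhwikippl": 122.5,
    "tinystoriesppl": 91.0, "loss": 0.8789,
}
new_run = {  # learning_rate: 4e-05
    "enwikippl": 173.0, "frwikippl": 624.0, "zhwikippl": 160.0,
    "tinystoriesppl": 145.0, "loss": 1.1443,
}


def relative_change(old: dict, new: dict) -> dict:
    """Fractional change of each metric going from the old run to the new one."""
    return {name: new[name] / old[name] - 1.0 for name in old}


def epoch_fraction(step: int, total_steps: int = 7425) -> float:
    """Epoch value for a step, rounded to four decimals as in the table."""
    return round(step / total_steps, 4)


changes = relative_change(old_run, new_run)
for name, delta in changes.items():
    print(f"{name}: {delta:+.1%}")
```

Every metric is higher (worse) in the `4e-05` run after the same 7425 steps, so under these settings the lower learning rate leaves the student further from the teacher's evaluation numbers.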

logs/learning_rate=0.0001, per_device_train_batch_size=8, warmup_ratio=0.5/completed.flag ADDED

File without changes

logs/learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724118938.5f530b1cf724 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:25c2702019ac303d4c8e04d3f730902ff9b80cef919f5bad2655fab79704cefb
+size 3512272

logs/learning_rate=4e-05, per_device_train_batch_size=8, warmup_ratio=0.5/events.out.tfevents.1724121341.5f530b1cf724 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ba94d219c2ec7b29469114b62dab9aa8d589c73158b84c77d725873d04c7df36
+size 578

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:44c225267def37ca71584d3beff29b20501933b79fbecef253613f1b35f4a73d
+oid sha256:5947fcd031c5672f831689839d515d6379ade26dcaea708601e75b87f9e4d701
 size 248894656

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:77bf0e293d306f9ced0e580e530e228cdb6b58e30b5f6999d1d162bfa633f029
+oid sha256:3b55a604840cd8f97ad8e84c04212bdd9ee1a4a1ab7072e317003d24156fc046
 size 1017899144
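
The binary files in this commit are stored as Git LFS pointers, so each diff shows only the three pointer fields (`version`, `oid`, `size`) rather than the file contents. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, and it assumes the simple `key value` line layout seen in the pointers in this commit.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper).

    Leading diff markers ('+', '-', ' ') are tolerated so that lines can be
    pasted straight from a diff.
    """
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+- \t").rstrip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(
    "+version https://git-lfs.github.com/spec/v1\n"
    "+oid sha256:ba94d219c2ec7b29469114b62dab9aa8d589c73158b84c77d725873d04c7df36\n"
    "+size 578\n"
)
print(pointer["hash_algo"], pointer["size"])
```

Because only the `oid` changes for `model.safetensors` and `training_args.bin` while `size` stays the same, comparing the parsed digests is enough to confirm that the stored file was replaced.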