taicheng committed on
Commit fa14ad0 · verified · 1 Parent(s): f60eb5f

Model save
README.md ADDED
@@ -0,0 +1,81 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: alignment-handbook/zephyr-7b-sft-full
+ tags:
+ - trl
+ - dpo
+ - generated_from_trainer
+ model-index:
+ - name: zephyr-7b-align-scan-7e-07-0.45-cosine-3.0
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # zephyr-7b-align-scan-7e-07-0.45-cosine-3.0
+
+ This model is a fine-tuned version of [alignment-handbook/zephyr-7b-sft-full](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.9344
+ - Rewards/chosen: -0.3373
+ - Rewards/rejected: -2.0235
+ - Rewards/accuracies: 0.3452
+ - Rewards/margins: 1.6862
+ - Logps/rejected: -85.6249
+ - Logps/chosen: -75.2407
+ - Logits/rejected: -2.6727
+ - Logits/chosen: -2.6886
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 7e-07
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 64
+ - total_eval_batch_size: 32
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 3
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
+ |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
+ | 0.6948        | 0.3484 | 100  | 0.6997          | 0.9816         | 0.4942           | 0.3452             | 0.4873          | -80.0301       | -72.3100     | -2.5413         | -2.5577       |
+ | 0.7373        | 0.6969 | 200  | 0.7720          | 1.2732         | 0.5117           | 0.3294             | 0.7615          | -79.9912       | -71.6619     | -2.5716         | -2.5870       |
+ | 0.4002        | 1.0453 | 300  | 0.8163          | 0.4524         | -0.4497          | 0.3472             | 0.9021          | -82.1276       | -73.4859     | -2.6256         | -2.6409       |
+ | 0.3982        | 1.3937 | 400  | 0.8872          | 1.2165         | 0.0680           | 0.3313             | 1.1485          | -80.9772       | -71.7879     | -2.7106         | -2.7265       |
+ | 0.389         | 1.7422 | 500  | 0.9107          | 0.3181         | -0.9594          | 0.3353             | 1.2775          | -83.2604       | -73.7844     | -2.7188         | -2.7346       |
+ | 0.3707        | 2.0906 | 600  | 0.8992          | 0.6908         | -0.7854          | 0.3472             | 1.4762          | -82.8736       | -72.9561     | -2.6904         | -2.7065       |
+ | 0.3672        | 2.4390 | 700  | 0.9354          | -0.5110        | -2.2396          | 0.3492             | 1.7285          | -86.1051       | -75.6269     | -2.6662         | -2.6823       |
+ | 0.3596        | 2.7875 | 800  | 0.9344          | -0.3373        | -2.0235          | 0.3452             | 1.6862          | -85.6249       | -75.2407     | -2.6727         | -2.6886       |
+
+
+ ### Framework versions
+
+ - Transformers 4.44.2
+ - Pytorch 2.4.0
+ - Datasets 2.21.0
+ - Tokenizers 0.19.1
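For readers of the card's metrics: in DPO, each "Rewards/*" value is the policy-vs-reference log-probability gap scaled by beta, and "Rewards/margins" is the chosen-minus-rejected gap that the loss pushes apart. A minimal sketch follows; the beta value (0.45) is inferred from the run name, and the reference log-probabilities are hypothetical numbers chosen only to illustrate the arithmetic.

```python
import math

# Assumed DPO temperature, inferred from the run name "…-0.45-cosine-3.0".
beta = 0.45

# Hypothetical per-sequence summed log-probabilities (policy vs. reference).
logp_chosen_policy, logp_chosen_ref = -75.24, -74.49
logp_rejected_policy, logp_rejected_ref = -85.62, -81.13

# DPO implicit rewards: beta * (policy logp - reference logp).
reward_chosen = beta * (logp_chosen_policy - logp_chosen_ref)
reward_rejected = beta * (logp_rejected_policy - logp_rejected_ref)
margin = reward_chosen - reward_rejected

# Per-pair DPO loss: -log(sigmoid(margin)); a positive margin means the
# policy prefers the chosen response more than the reference does.
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that, as in the table above, the chosen reward can be negative while the margin stays large and positive: the policy's likelihood of both responses drifts down, but the rejected one drops faster.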
all_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 3.0,
+     "total_flos": 0.0,
+     "train_loss": 0.49774221859604084,
+     "train_runtime": 9884.8529,
+     "train_samples": 18340,
+     "train_samples_per_second": 5.566,
+     "train_steps_per_second": 0.087
+ }
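The throughput figures above are internally consistent, and they also agree with the step count logged in trainer_state.json. A quick check, taking the effective batch size of 64 from the README hyperparameters:

```python
import math

# Figures copied from all_results.json above.
train_samples = 18340
epochs = 3.0
runtime_s = 9884.8529

# Examples processed over wall-clock time.
samples_per_second = train_samples * epochs / runtime_s  # ≈ 5.566

# Optimizer steps: ceil(18340 / 64) = 287 steps per epoch, times 3 epochs.
total_steps = math.ceil(train_samples / 64) * 3  # → 861, matching global_step
```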
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+     "_from_model_config": true,
+     "bos_token_id": 1,
+     "eos_token_id": 2,
+     "transformers_version": "4.44.2"
+ }
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 3.0,
+     "total_flos": 0.0,
+     "train_loss": 0.49774221859604084,
+     "train_runtime": 9884.8529,
+     "train_samples": 18340,
+     "train_samples_per_second": 5.566,
+     "train_steps_per_second": 0.087
+ }
trainer_state.json ADDED
@@ -0,0 +1,1475 @@
+ {
+     "best_metric": null,
+     "best_model_checkpoint": null,
+     "epoch": 3.0,
+     "eval_steps": 100,
+     "global_step": 861,
+     "is_hyper_param_search": false,
+     "is_local_process_zero": true,
+     "is_world_process_zero": true,
+     "log_history": [
+         {
+             "epoch": 0.003484320557491289,
+             "grad_norm": 260.1595222158553,
+             "learning_rate": 8.045977011494253e-09,
+             "logits/chosen": -2.5345611572265625,
+             "logits/rejected": -2.581700563430786,
+             "logps/chosen": -60.002105712890625,
+             "logps/rejected": -99.98374938964844,
+             "loss": 0.6931,
+             "rewards/accuracies": 0.0,
+             "rewards/chosen": 0.0,
+             "rewards/margins": 0.0,
+             "rewards/rejected": 0.0,
+             "step": 1
+         },
+         {
+             "epoch": 0.03484320557491289,
+             "grad_norm": 236.39725577198064,
+             "learning_rate": 8.045977011494252e-08,
+             "logits/chosen": -2.5637922286987305,
+             "logits/rejected": -2.5625600814819336,
+             "logps/chosen": -59.67917251586914,
+             "logps/rejected": -73.37747955322266,
+             "loss": 0.6959,
+             "rewards/accuracies": 0.1875,
+             "rewards/chosen": -0.008780181407928467,
+             "rewards/margins": -0.010324436239898205,
+             "rewards/rejected": 0.0015442547155544162,
+             "step": 10
+         },
+         {
+             "epoch": 0.06968641114982578,
+             "grad_norm": 308.42103352579716,
+             "learning_rate": 1.6091954022988505e-07,
+             "logits/chosen": -2.6053006649017334,
+             "logits/rejected": -2.5642189979553223,
+             "logps/chosen": -104.08708190917969,
+             "logps/rejected": -94.90479278564453,
+             "loss": 0.6931,
+             "rewards/accuracies": 0.34375,
+             "rewards/chosen": 0.017988674342632294,
+             "rewards/margins": 0.022790271788835526,
+             "rewards/rejected": -0.004801597446203232,
+             "step": 20
+         },
+         {
+             "epoch": 0.10452961672473868,
+             "grad_norm": 300.68038725792655,
+             "learning_rate": 2.413793103448276e-07,
+             "logits/chosen": -2.5924947261810303,
+             "logits/rejected": -2.5725739002227783,
+             "logps/chosen": -82.33592987060547,
+             "logps/rejected": -91.51930236816406,
+             "loss": 0.6759,
+             "rewards/accuracies": 0.3187499940395355,
+             "rewards/chosen": 0.0722854882478714,
+             "rewards/margins": 0.06289331614971161,
+             "rewards/rejected": 0.009392165578901768,
+             "step": 30
+         },
+         {
+             "epoch": 0.13937282229965156,
+             "grad_norm": 238.79409703480542,
+             "learning_rate": 3.218390804597701e-07,
+             "logits/chosen": -2.4986560344696045,
+             "logits/rejected": -2.49660062789917,
+             "logps/chosen": -77.81121826171875,
+             "logps/rejected": -73.2593002319336,
+             "loss": 0.654,
+             "rewards/accuracies": 0.28125,
+             "rewards/chosen": 0.04823530465364456,
+             "rewards/margins": 0.16432270407676697,
+             "rewards/rejected": -0.1160874143242836,
+             "step": 40
+         },
+         {
+             "epoch": 0.17421602787456447,
+             "grad_norm": 217.0577300579145,
+             "learning_rate": 4.022988505747126e-07,
+             "logits/chosen": -2.5254292488098145,
+             "logits/rejected": -2.52939772605896,
+             "logps/chosen": -63.20558547973633,
+             "logps/rejected": -75.68934631347656,
+             "loss": 0.6651,
+             "rewards/accuracies": 0.26249998807907104,
+             "rewards/chosen": 0.33010634779930115,
+             "rewards/margins": 0.14375844597816467,
+             "rewards/rejected": 0.18634793162345886,
+             "step": 50
+         },
+         {
+             "epoch": 0.20905923344947736,
+             "grad_norm": 222.80981857126162,
+             "learning_rate": 4.827586206896552e-07,
+             "logits/chosen": -2.4844985008239746,
+             "logits/rejected": -2.478571653366089,
+             "logps/chosen": -70.86429595947266,
+             "logps/rejected": -66.64778137207031,
+             "loss": 0.6498,
+             "rewards/accuracies": 0.3375000059604645,
+             "rewards/chosen": 1.0539968013763428,
+             "rewards/margins": 0.2616458535194397,
+             "rewards/rejected": 0.7923508882522583,
+             "step": 60
+         },
+         {
+             "epoch": 0.24390243902439024,
+             "grad_norm": 243.92358905965492,
+             "learning_rate": 5.632183908045977e-07,
+             "logits/chosen": -2.4932572841644287,
+             "logits/rejected": -2.4880998134613037,
+             "logps/chosen": -61.057044982910156,
+             "logps/rejected": -66.05777740478516,
+             "loss": 0.6558,
+             "rewards/accuracies": 0.32499998807907104,
+             "rewards/chosen": 1.4306201934814453,
+             "rewards/margins": 0.36464259028434753,
+             "rewards/rejected": 1.0659778118133545,
+             "step": 70
+         },
+         {
+             "epoch": 0.2787456445993031,
+             "grad_norm": 250.0266197573928,
+             "learning_rate": 6.436781609195402e-07,
+             "logits/chosen": -2.4400105476379395,
+             "logits/rejected": -2.430251121520996,
+             "logps/chosen": -72.05781555175781,
+             "logps/rejected": -74.76571655273438,
+             "loss": 0.6758,
+             "rewards/accuracies": 0.32499998807907104,
+             "rewards/chosen": 1.623175024986267,
+             "rewards/margins": 0.38602036237716675,
+             "rewards/rejected": 1.2371546030044556,
+             "step": 80
+         },
+         {
+             "epoch": 0.313588850174216,
+             "grad_norm": 256.3370381835667,
+             "learning_rate": 6.999740526496426e-07,
+             "logits/chosen": -2.489266872406006,
+             "logits/rejected": -2.503830909729004,
+             "logps/chosen": -62.3740234375,
+             "logps/rejected": -67.14833068847656,
+             "loss": 0.7197,
+             "rewards/accuracies": 0.28125,
+             "rewards/chosen": 1.6237154006958008,
+             "rewards/margins": 0.28914302587509155,
+             "rewards/rejected": 1.334572196006775,
+             "step": 90
+         },
+         {
+             "epoch": 0.34843205574912894,
+             "grad_norm": 260.27523546187444,
+             "learning_rate": 6.995128734390302e-07,
+             "logits/chosen": -2.478929042816162,
+             "logits/rejected": -2.4793598651885986,
+             "logps/chosen": -71.95938873291016,
+             "logps/rejected": -79.0130615234375,
+             "loss": 0.6948,
+             "rewards/accuracies": 0.32499998807907104,
+             "rewards/chosen": 1.4903005361557007,
+             "rewards/margins": 0.5729378461837769,
+             "rewards/rejected": 0.9173626899719238,
+             "step": 100
+         },
+         {
+             "epoch": 0.34843205574912894,
+             "eval_logits/chosen": -2.5576956272125244,
+             "eval_logits/rejected": -2.541304588317871,
+             "eval_logps/chosen": -72.30996704101562,
+             "eval_logps/rejected": -80.03009033203125,
+             "eval_loss": 0.6997446417808533,
+             "eval_rewards/accuracies": 0.3452380895614624,
+             "eval_rewards/chosen": 0.9815704226493835,
+             "eval_rewards/margins": 0.48734748363494873,
+             "eval_rewards/rejected": 0.4942229390144348,
+             "eval_runtime": 114.1155,
+             "eval_samples_per_second": 17.526,
+             "eval_steps_per_second": 0.552,
+             "step": 100
+         },
+         {
+             "epoch": 0.3832752613240418,
+             "grad_norm": 269.9879110557674,
+             "learning_rate": 6.984759608935431e-07,
+             "logits/chosen": -2.495579957962036,
+             "logits/rejected": -2.460557460784912,
+             "logps/chosen": -72.69947814941406,
+             "logps/rejected": -63.181190490722656,
+             "loss": 0.7231,
+             "rewards/accuracies": 0.26875001192092896,
+             "rewards/chosen": 0.21174518764019012,
+             "rewards/margins": 0.2885180115699768,
+             "rewards/rejected": -0.0767727941274643,
+             "step": 110
+         },
+         {
+             "epoch": 0.4181184668989547,
+             "grad_norm": 215.87968567956005,
+             "learning_rate": 6.96865023062192e-07,
+             "logits/chosen": -2.5388295650482178,
+             "logits/rejected": -2.5087077617645264,
+             "logps/chosen": -77.22286224365234,
+             "logps/rejected": -67.7669448852539,
+             "loss": 0.6929,
+             "rewards/accuracies": 0.3125,
+             "rewards/chosen": 0.10741810500621796,
+             "rewards/margins": 0.6003297567367554,
+             "rewards/rejected": -0.4929116666316986,
+             "step": 120
+         },
+         {
+             "epoch": 0.4529616724738676,
+             "grad_norm": 364.597210128526,
+             "learning_rate": 6.946827135542728e-07,
+             "logits/chosen": -2.575723171234131,
+             "logits/rejected": -2.5577731132507324,
+             "logps/chosen": -83.23577117919922,
+             "logps/rejected": -88.67012023925781,
+             "loss": 0.7704,
+             "rewards/accuracies": 0.34375,
+             "rewards/chosen": 0.2902565896511078,
+             "rewards/margins": 0.8129722476005554,
+             "rewards/rejected": -0.52271568775177,
+             "step": 130
+         },
+         {
+             "epoch": 0.4878048780487805,
+             "grad_norm": 192.41269674377438,
+             "learning_rate": 6.919326271682209e-07,
+             "logits/chosen": -2.478255271911621,
+             "logits/rejected": -2.4691178798675537,
+             "logps/chosen": -78.7398681640625,
+             "logps/rejected": -70.11216735839844,
+             "loss": 0.6961,
+             "rewards/accuracies": 0.3499999940395355,
+             "rewards/chosen": 1.082502007484436,
+             "rewards/margins": 0.7890673875808716,
+             "rewards/rejected": 0.29343467950820923,
+             "step": 140
+         },
+         {
+             "epoch": 0.5226480836236934,
+             "grad_norm": 266.7719593983313,
+             "learning_rate": 6.886192939700987e-07,
+             "logits/chosen": -2.5632805824279785,
+             "logits/rejected": -2.5232510566711426,
+             "logps/chosen": -76.60926818847656,
+             "logps/rejected": -78.16521453857422,
+             "loss": 0.7487,
+             "rewards/accuracies": 0.2874999940395355,
+             "rewards/chosen": 1.2937796115875244,
+             "rewards/margins": 0.6702733635902405,
+             "rewards/rejected": 0.6235060691833496,
+             "step": 150
+         },
+         {
+             "epoch": 0.5574912891986062,
+             "grad_norm": 236.73335221722851,
+             "learning_rate": 6.84748171831466e-07,
+             "logits/chosen": -2.5542993545532227,
+             "logits/rejected": -2.5730948448181152,
+             "logps/chosen": -61.49628829956055,
+             "logps/rejected": -69.87091064453125,
+             "loss": 0.7401,
+             "rewards/accuracies": 0.2750000059604645,
+             "rewards/chosen": 1.3628534078598022,
+             "rewards/margins": 0.4169517159461975,
+             "rewards/rejected": 0.9459015727043152,
+             "step": 160
+         },
+         {
+             "epoch": 0.5923344947735192,
+             "grad_norm": 280.18013018164186,
+             "learning_rate": 6.803256374389282e-07,
+             "logits/chosen": -2.571448564529419,
+             "logits/rejected": -2.558177947998047,
+             "logps/chosen": -65.29753112792969,
+             "logps/rejected": -73.9312744140625,
+             "loss": 0.7499,
+             "rewards/accuracies": 0.28125,
+             "rewards/chosen": 1.65143620967865,
+             "rewards/margins": 0.579193651676178,
+             "rewards/rejected": 1.0722427368164062,
+             "step": 170
+         },
+         {
+             "epoch": 0.627177700348432,
+             "grad_norm": 257.9946756555396,
+             "learning_rate": 6.753589757901721e-07,
+             "logits/chosen": -2.601243495941162,
+             "logits/rejected": -2.5908031463623047,
+             "logps/chosen": -87.05482482910156,
+             "logps/rejected": -82.89573669433594,
+             "loss": 0.7906,
+             "rewards/accuracies": 0.3375000059604645,
+             "rewards/chosen": 2.16105318069458,
+             "rewards/margins": 0.5344269871711731,
+             "rewards/rejected": 1.6266262531280518,
+             "step": 180
+         },
+         {
+             "epoch": 0.662020905923345,
+             "grad_norm": 200.6147691015439,
+             "learning_rate": 6.69856368193789e-07,
+             "logits/chosen": -2.6006789207458496,
+             "logits/rejected": -2.5925745964050293,
+             "logps/chosen": -67.43627166748047,
+             "logps/rejected": -78.66804504394531,
+             "loss": 0.7368,
+             "rewards/accuracies": 0.28125,
+             "rewards/chosen": 1.626844048500061,
+             "rewards/margins": 0.3397817015647888,
+             "rewards/rejected": 1.2870622873306274,
+             "step": 190
+         },
+         {
+             "epoch": 0.6968641114982579,
+             "grad_norm": 346.9455585144427,
+             "learning_rate": 6.63826878792655e-07,
+             "logits/chosen": -2.6272549629211426,
+             "logits/rejected": -2.633775234222412,
+             "logps/chosen": -86.80072021484375,
+             "logps/rejected": -90.06497192382812,
+             "loss": 0.7373,
+             "rewards/accuracies": 0.3499999940395355,
+             "rewards/chosen": 1.7350423336029053,
+             "rewards/margins": 0.8414648175239563,
+             "rewards/rejected": 0.8935775756835938,
+             "step": 200
+         },
+         {
+             "epoch": 0.6968641114982579,
+             "eval_logits/chosen": -2.5869662761688232,
+             "eval_logits/rejected": -2.571585178375244,
+             "eval_logps/chosen": -71.66189575195312,
+             "eval_logps/rejected": -79.99118041992188,
+             "eval_loss": 0.7720171213150024,
+             "eval_rewards/accuracies": 0.329365074634552,
+             "eval_rewards/chosen": 1.2732017040252686,
+             "eval_rewards/margins": 0.7614731192588806,
+             "eval_rewards/rejected": 0.5117285847663879,
+             "eval_runtime": 113.8896,
+             "eval_samples_per_second": 17.561,
+             "eval_steps_per_second": 0.553,
+             "step": 200
+         },
+         {
+             "epoch": 0.7317073170731707,
+             "grad_norm": 371.1201689930978,
+             "learning_rate": 6.572804396330676e-07,
+             "logits/chosen": -2.590230941772461,
+             "logits/rejected": -2.5670275688171387,
+             "logps/chosen": -66.80162811279297,
+             "logps/rejected": -63.04413604736328,
+             "loss": 0.7579,
+             "rewards/accuracies": 0.3499999940395355,
+             "rewards/chosen": 1.132849931716919,
+             "rewards/margins": 0.9707193374633789,
+             "rewards/rejected": 0.16213065385818481,
+             "step": 210
+         },
+         {
+             "epoch": 0.7665505226480837,
+             "grad_norm": 233.2613327041566,
+             "learning_rate": 6.502278343042315e-07,
+             "logits/chosen": -2.6396641731262207,
+             "logits/rejected": -2.6218152046203613,
+             "logps/chosen": -70.56974029541016,
+             "logps/rejected": -69.75597381591797,
+             "loss": 0.7735,
+             "rewards/accuracies": 0.24375000596046448,
+             "rewards/chosen": 1.2997958660125732,
+             "rewards/margins": 0.4415515959262848,
+             "rewards/rejected": 0.8582441210746765,
+             "step": 220
+         },
+         {
+             "epoch": 0.8013937282229965,
+             "grad_norm": 326.1156990352116,
+             "learning_rate": 6.42680680175045e-07,
+             "logits/chosen": -2.6760098934173584,
+             "logits/rejected": -2.65986967086792,
+             "logps/chosen": -86.16526794433594,
+             "logps/rejected": -87.11499786376953,
+             "loss": 0.8168,
+             "rewards/accuracies": 0.3687500059604645,
+             "rewards/chosen": 1.7259118556976318,
+             "rewards/margins": 1.2606786489486694,
+             "rewards/rejected": 0.4652332663536072,
+             "step": 230
+         },
+         {
+             "epoch": 0.8362369337979094,
+             "grad_norm": 267.49129367815516,
+             "learning_rate": 6.346514092574479e-07,
+             "logits/chosen": -2.6791954040527344,
+             "logits/rejected": -2.6473562717437744,
+             "logps/chosen": -81.90107727050781,
+             "logps/rejected": -76.54811096191406,
+             "loss": 0.8387,
+             "rewards/accuracies": 0.3375000059604645,
+             "rewards/chosen": 2.1782379150390625,
+             "rewards/margins": 0.5636407136917114,
+             "rewards/rejected": 1.6145970821380615,
+             "step": 240
+         },
+         {
+             "epoch": 0.8710801393728222,
+             "grad_norm": 359.51061878716314,
+             "learning_rate": 6.26153247727851e-07,
+             "logits/chosen": -2.676499843597412,
+             "logits/rejected": -2.6430983543395996,
+             "logps/chosen": -90.66349792480469,
+             "logps/rejected": -86.98454284667969,
+             "loss": 0.6777,
+             "rewards/accuracies": 0.39375001192092896,
+             "rewards/chosen": 2.1536006927490234,
+             "rewards/margins": 0.8208580017089844,
+             "rewards/rejected": 1.3327425718307495,
+             "step": 250
+         },
+         {
+             "epoch": 0.9059233449477352,
+             "grad_norm": 219.1541113628816,
+             "learning_rate": 6.172001941403836e-07,
+             "logits/chosen": -2.585671901702881,
+             "logits/rejected": -2.599630832672119,
+             "logps/chosen": -56.668235778808594,
+             "logps/rejected": -64.54667663574219,
+             "loss": 0.8203,
+             "rewards/accuracies": 0.2874999940395355,
+             "rewards/chosen": 1.0715070962905884,
+             "rewards/margins": 0.584784209728241,
+             "rewards/rejected": 0.4867229461669922,
+             "step": 260
+         },
+         {
+             "epoch": 0.9407665505226481,
+             "grad_norm": 447.78049155610177,
+             "learning_rate": 6.078069963678453e-07,
+             "logits/chosen": -2.667015314102173,
+             "logits/rejected": -2.6670918464660645,
+             "logps/chosen": -67.22764587402344,
+             "logps/rejected": -82.75209045410156,
+             "loss": 0.7889,
+             "rewards/accuracies": 0.33125001192092896,
+             "rewards/chosen": 0.7436366081237793,
+             "rewards/margins": 0.842780590057373,
+             "rewards/rejected": -0.09914406388998032,
+             "step": 270
+         },
+         {
+             "epoch": 0.975609756097561,
+             "grad_norm": 255.49301673895127,
+             "learning_rate": 5.979891273083455e-07,
+             "logits/chosen": -2.560213804244995,
+             "logits/rejected": -2.5412333011627197,
+             "logps/chosen": -65.57135009765625,
+             "logps/rejected": -70.57018280029297,
+             "loss": 0.6995,
+             "rewards/accuracies": 0.32499998807907104,
+             "rewards/chosen": 1.0183632373809814,
+             "rewards/margins": 0.8994625806808472,
+             "rewards/rejected": 0.11890069395303726,
+             "step": 280
+         },
+         {
+             "epoch": 1.0104529616724738,
+             "grad_norm": 62.440035652192904,
+             "learning_rate": 5.8776275939765e-07,
+             "logits/chosen": -2.583773612976074,
+             "logits/rejected": -2.558880567550659,
+             "logps/chosen": -67.36951446533203,
+             "logps/rejected": -66.36976623535156,
+             "loss": 0.6156,
+             "rewards/accuracies": 0.375,
+             "rewards/chosen": 2.1141774654388428,
+             "rewards/margins": 2.7693214416503906,
+             "rewards/rejected": -0.6551438570022583,
+             "step": 290
+         },
+         {
+             "epoch": 1.0452961672473868,
+             "grad_norm": 22.68193419500007,
+             "learning_rate": 5.771447379692167e-07,
+             "logits/chosen": -2.6168670654296875,
+             "logits/rejected": -2.603750228881836,
+             "logps/chosen": -58.1750373840332,
+             "logps/rejected": -80.46726989746094,
+             "loss": 0.4002,
+             "rewards/accuracies": 0.42500001192092896,
+             "rewards/chosen": 3.7682430744171143,
+             "rewards/margins": 8.851958274841309,
+             "rewards/rejected": -5.083714962005615,
+             "step": 300
+         },
+         {
+             "epoch": 1.0452961672473868,
+             "eval_logits/chosen": -2.6408894062042236,
+             "eval_logits/rejected": -2.6256282329559326,
+             "eval_logps/chosen": -73.48590087890625,
+             "eval_logps/rejected": -82.12760162353516,
+             "eval_loss": 0.8162885904312134,
+             "eval_rewards/accuracies": 0.3472222089767456,
+             "eval_rewards/chosen": 0.45239681005477905,
+             "eval_rewards/margins": 0.9020630717277527,
+             "eval_rewards/rejected": -0.44966623187065125,
+             "eval_runtime": 113.8253,
+             "eval_samples_per_second": 17.571,
+             "eval_steps_per_second": 0.553,
+             "step": 300
+         },
+         {
+             "epoch": 1.0801393728222997,
+             "grad_norm": 5.506688225969992,
+             "learning_rate": 5.66152553505804e-07,
+             "logits/chosen": -2.588723659515381,
+             "logits/rejected": -2.5905752182006836,
+             "logps/chosen": -58.51898193359375,
+             "logps/rejected": -90.38175964355469,
+             "loss": 0.4013,
+             "rewards/accuracies": 0.4375,
+             "rewards/chosen": 4.1029510498046875,
+             "rewards/margins": 10.913629531860352,
+             "rewards/rejected": -6.810678958892822,
+             "step": 310
+         },
+         {
+             "epoch": 1.1149825783972125,
+             "grad_norm": 84.44098198698735,
+             "learning_rate": 5.548043128283609e-07,
+             "logits/chosen": -2.6132149696350098,
+             "logits/rejected": -2.600966691970825,
+             "logps/chosen": -64.97319030761719,
+             "logps/rejected": -90.65545654296875,
+             "loss": 0.3888,
+             "rewards/accuracies": 0.46875,
+             "rewards/chosen": 4.044018745422363,
+             "rewards/margins": 9.642068862915039,
+             "rewards/rejected": -5.598048686981201,
+             "step": 320
+         },
+         {
+             "epoch": 1.1498257839721253,
+             "grad_norm": 54.66825088919817,
+             "learning_rate": 5.43118709269656e-07,
+             "logits/chosen": -2.5924811363220215,
+             "logits/rejected": -2.5696775913238525,
+             "logps/chosen": -71.87019348144531,
+             "logps/rejected": -88.80474853515625,
+             "loss": 0.3669,
+             "rewards/accuracies": 0.53125,
+             "rewards/chosen": 5.513075828552246,
+             "rewards/margins": 10.093175888061523,
+             "rewards/rejected": -4.5801005363464355,
+             "step": 330
+         },
+         {
+             "epoch": 1.1846689895470384,
+             "grad_norm": 16.225429361874546,
+             "learning_rate": 5.311149918817793e-07,
+             "logits/chosen": -2.58941650390625,
+             "logits/rejected": -2.597695827484131,
+             "logps/chosen": -71.19819641113281,
+             "logps/rejected": -106.7779769897461,
+             "loss": 0.3766,
+             "rewards/accuracies": 0.5562499761581421,
+             "rewards/chosen": 5.647311687469482,
+             "rewards/margins": 10.82011890411377,
+             "rewards/rejected": -5.172807216644287,
+             "step": 340
+         },
+         {
+             "epoch": 1.2195121951219512,
+             "grad_norm": 78.05238402288815,
+             "learning_rate": 5.188129337282367e-07,
+             "logits/chosen": -2.643777370452881,
+             "logits/rejected": -2.613312005996704,
+             "logps/chosen": -56.92789840698242,
+             "logps/rejected": -74.79656982421875,
+             "loss": 0.3934,
+             "rewards/accuracies": 0.48750001192092896,
+             "rewards/chosen": 5.251105308532715,
+             "rewards/margins": 9.353957176208496,
+             "rewards/rejected": -4.102850914001465,
+             "step": 350
+         },
+         {
+             "epoch": 1.254355400696864,
+             "grad_norm": 58.627931535437625,
+             "learning_rate": 5.062327993128697e-07,
+             "logits/chosen": -2.6389856338500977,
+             "logits/rejected": -2.612027406692505,
+             "logps/chosen": -61.96246337890625,
+             "logps/rejected": -73.02973175048828,
+             "loss": 0.3632,
+             "rewards/accuracies": 0.44999998807907104,
+             "rewards/chosen": 4.449435710906982,
+             "rewards/margins": 7.763936519622803,
+             "rewards/rejected": -3.3145008087158203,
+             "step": 360
+         },
+         {
+             "epoch": 1.289198606271777,
+             "grad_norm": 103.01146780644565,
+             "learning_rate": 4.933953111992535e-07,
+             "logits/chosen": -2.6231448650360107,
+             "logits/rejected": -2.6383228302001953,
+             "logps/chosen": -60.911376953125,
+             "logps/rejected": -85.25511932373047,
+             "loss": 0.3915,
+             "rewards/accuracies": 0.4625000059604645,
+             "rewards/chosen": 4.569601535797119,
+             "rewards/margins": 9.484132766723633,
+             "rewards/rejected": -4.9145307540893555,
+             "step": 370
+         },
+         {
+             "epoch": 1.32404181184669,
+             "grad_norm": 98.7337260301499,
+             "learning_rate": 4.803216158755572e-07,
+             "logits/chosen": -2.643336772918701,
+             "logits/rejected": -2.638636589050293,
+             "logps/chosen": -74.83843994140625,
+             "logps/rejected": -102.59355163574219,
+             "loss": 0.3849,
+             "rewards/accuracies": 0.543749988079071,
+             "rewards/chosen": 5.416236877441406,
+             "rewards/margins": 12.673513412475586,
+             "rewards/rejected": -7.2572784423828125,
+             "step": 380
+         },
+         {
+             "epoch": 1.3588850174216027,
+             "grad_norm": 88.65160189656861,
+             "learning_rate": 4.6703324892109645e-07,
+             "logits/chosen": -2.7196712493896484,
+             "logits/rejected": -2.7034525871276855,
+             "logps/chosen": -58.760398864746094,
+             "logps/rejected": -83.69285583496094,
+             "loss": 0.3793,
+             "rewards/accuracies": 0.4437499940395355,
+             "rewards/chosen": 3.705737352371216,
+             "rewards/margins": 9.30022144317627,
+             "rewards/rejected": -5.594483375549316,
+             "step": 390
+         },
+         {
+             "epoch": 1.3937282229965158,
+             "grad_norm": 24.91794224364189,
+             "learning_rate": 4.535520995319585e-07,
+             "logits/chosen": -2.7114458084106445,
+             "logits/rejected": -2.6843841075897217,
+             "logps/chosen": -75.79566955566406,
+             "logps/rejected": -109.73543548583984,
+             "loss": 0.3982,
+             "rewards/accuracies": 0.5062500238418579,
+             "rewards/chosen": 4.6581196784973145,
+             "rewards/margins": 9.757375717163086,
+             "rewards/rejected": -5.099255084991455,
+             "step": 400
+         },
+         {
+             "epoch": 1.3937282229965158,
+             "eval_logits/chosen": -2.726470470428467,
+             "eval_logits/rejected": -2.7106189727783203,
+             "eval_logps/chosen": -71.78788757324219,
+             "eval_logps/rejected": -80.97720336914062,
+             "eval_loss": 0.887217104434967,
+             "eval_rewards/accuracies": 0.3313491940498352,
+             "eval_rewards/chosen": 1.2165056467056274,
+             "eval_rewards/margins": 1.1484907865524292,
+             "eval_rewards/rejected": 0.06801486760377884,
+             "eval_runtime": 113.9417,
+             "eval_samples_per_second": 17.553,
+             "eval_steps_per_second": 0.553,
+             "step": 400
+         },
+         {
+             "epoch": 1.4285714285714286,
+             "grad_norm": 2.035285199174074,
+             "learning_rate": 4.3990037446413313e-07,
+             "logits/chosen": -2.70420503616333,
+             "logits/rejected": -2.6959073543548584,
+             "logps/chosen": -72.02146911621094,
+             "logps/rejected": -90.1993637084961,
+             "loss": 0.5091,
+             "rewards/accuracies": 0.48750001192092896,
+             "rewards/chosen": 4.766329288482666,
+             "rewards/margins": 9.186702728271484,
+             "rewards/rejected": -4.420374393463135,
+             "step": 410
+         },
+         {
+             "epoch": 1.4634146341463414,
+             "grad_norm": 106.25796220752933,
+             "learning_rate": 4.2610056145354496e-07,
+             "logits/chosen": -2.7422072887420654,
+             "logits/rejected": -2.740999937057495,
+             "logps/chosen": -67.51850891113281,
+             "logps/rejected": -96.81071472167969,
+             "loss": 0.4118,
+             "rewards/accuracies": 0.4375,
+             "rewards/chosen": 3.858595371246338,
+             "rewards/margins": 9.231226921081543,
+             "rewards/rejected": -5.372632026672363,
+             "step": 420
+         },
+         {
+             "epoch": 1.4982578397212545,
+             "grad_norm": 7.263239275981174,
+             "learning_rate": 4.1217539217324226e-07,
+             "logits/chosen": -2.714473247528076,
+             "logits/rejected": -2.699317455291748,
+             "logps/chosen": -59.63612747192383,
+             "logps/rejected": -85.27086639404297,
+             "loss": 0.4066,
+             "rewards/accuracies": 0.45625001192092896,
+             "rewards/chosen": 4.061794281005859,
+             "rewards/margins": 9.813599586486816,
+             "rewards/rejected": -5.751805782318115,
+             "step": 430
+         },
+         {
+             "epoch": 1.533101045296167,
+             "grad_norm": 132.98971170304543,
+             "learning_rate": 3.9814780478876267e-07,
+             "logits/chosen": -2.725816249847412,
+             "logits/rejected": -2.73162841796875,
+             "logps/chosen": -59.733802795410156,
+             "logps/rejected": -84.18232727050781,
+             "loss": 0.4048,
+             "rewards/accuracies": 0.41874998807907104,
+             "rewards/chosen": 4.610183238983154,
+             "rewards/margins": 9.417313575744629,
+             "rewards/rejected": -4.807129859924316,
+             "step": 440
+         },
+         {
+             "epoch": 1.5679442508710801,
+             "grad_norm": 28.002507272860324,
+             "learning_rate": 3.8404090617335413e-07,
+             "logits/chosen": -2.7826929092407227,
755
+ "logits/rejected": -2.749206304550171,
756
+ "logps/chosen": -80.09606170654297,
757
+ "logps/rejected": -101.27384948730469,
758
+ "loss": 0.4032,
759
+ "rewards/accuracies": 0.512499988079071,
760
+ "rewards/chosen": 4.835663795471191,
761
+ "rewards/margins": 11.725584030151367,
762
+ "rewards/rejected": -6.88992166519165,
763
+ "step": 450
764
+ },
765
+ {
766
+ "epoch": 1.6027874564459932,
767
+ "grad_norm": 37.52261766419059,
768
+ "learning_rate": 3.698779338452938e-07,
769
+ "logits/chosen": -2.769883632659912,
770
+ "logits/rejected": -2.7476718425750732,
771
+ "logps/chosen": -66.14527130126953,
772
+ "logps/rejected": -93.9597396850586,
773
+ "loss": 0.4069,
774
+ "rewards/accuracies": 0.5062500238418579,
775
+ "rewards/chosen": 4.515407085418701,
776
+ "rewards/margins": 12.145170211791992,
777
+ "rewards/rejected": -7.629762172698975,
778
+ "step": 460
779
+ },
780
+ {
781
+ "epoch": 1.6376306620209058,
782
+ "grad_norm": 6.272455332024739,
783
+ "learning_rate": 3.556822176900017e-07,
784
+ "logits/chosen": -2.739861011505127,
785
+ "logits/rejected": -2.731259822845459,
786
+ "logps/chosen": -54.19404983520508,
787
+ "logps/rejected": -83.7978286743164,
788
+ "loss": 0.4057,
789
+ "rewards/accuracies": 0.4000000059604645,
790
+ "rewards/chosen": 3.198758840560913,
791
+ "rewards/margins": 8.655891418457031,
792
+ "rewards/rejected": -5.4571332931518555,
793
+ "step": 470
794
+ },
795
+ {
796
+ "epoch": 1.6724738675958188,
797
+ "grad_norm": 46.64342674711966,
798
+ "learning_rate": 3.414771415300036e-07,
799
+ "logits/chosen": -2.737431049346924,
800
+ "logits/rejected": -2.7219510078430176,
801
+ "logps/chosen": -47.62141799926758,
802
+ "logps/rejected": -61.3114128112793,
803
+ "loss": 0.4139,
804
+ "rewards/accuracies": 0.3499999940395355,
805
+ "rewards/chosen": 2.8980441093444824,
806
+ "rewards/margins": 7.096662998199463,
807
+ "rewards/rejected": -4.198617935180664,
808
+ "step": 480
809
+ },
810
+ {
811
+ "epoch": 1.7073170731707317,
812
+ "grad_norm": 28.98635525863854,
813
+ "learning_rate": 3.2728610460604674e-07,
814
+ "logits/chosen": -2.712428569793701,
815
+ "logits/rejected": -2.704003095626831,
816
+ "logps/chosen": -66.01129913330078,
817
+ "logps/rejected": -81.80970764160156,
818
+ "loss": 0.4894,
819
+ "rewards/accuracies": 0.38749998807907104,
820
+ "rewards/chosen": 3.2447237968444824,
821
+ "rewards/margins": 8.53111743927002,
822
+ "rewards/rejected": -5.286392688751221,
823
+ "step": 490
824
+ },
825
+ {
826
+ "epoch": 1.7421602787456445,
827
+ "grad_norm": 80.97214190913306,
828
+ "learning_rate": 3.131324830328163e-07,
829
+ "logits/chosen": -2.6181178092956543,
830
+ "logits/rejected": -2.6126463413238525,
831
+ "logps/chosen": -67.113037109375,
832
+ "logps/rejected": -96.601318359375,
833
+ "loss": 0.389,
834
+ "rewards/accuracies": 0.45625001192092896,
835
+ "rewards/chosen": 4.128693580627441,
836
+ "rewards/margins": 9.92380142211914,
837
+ "rewards/rejected": -5.795108795166016,
838
+ "step": 500
839
+ },
840
+ {
841
+ "epoch": 1.7421602787456445,
842
+ "eval_logits/chosen": -2.7346041202545166,
843
+ "eval_logits/rejected": -2.7188162803649902,
844
+ "eval_logps/chosen": -73.7844009399414,
845
+ "eval_logps/rejected": -83.26041412353516,
846
+ "eval_loss": 0.910707414150238,
847
+ "eval_rewards/accuracies": 0.335317462682724,
848
+ "eval_rewards/chosen": 0.318076491355896,
849
+ "eval_rewards/margins": 1.277505874633789,
850
+ "eval_rewards/rejected": -0.9594294428825378,
851
+ "eval_runtime": 113.9969,
852
+ "eval_samples_per_second": 17.544,
853
+ "eval_steps_per_second": 0.553,
854
+ "step": 500
855
+ },
856
+ {
857
+ "epoch": 1.7770034843205575,
858
+ "grad_norm": 51.127719184203606,
859
+ "learning_rate": 2.9903959129274836e-07,
860
+ "logits/chosen": -2.6881165504455566,
861
+ "logits/rejected": -2.669621467590332,
862
+ "logps/chosen": -61.860618591308594,
863
+ "logps/rejected": -81.27119445800781,
864
+ "loss": 0.3864,
865
+ "rewards/accuracies": 0.4625000059604645,
866
+ "rewards/chosen": 4.003110885620117,
867
+ "rewards/margins": 9.262310028076172,
868
+ "rewards/rejected": -5.259199619293213,
869
+ "step": 510
870
+ },
871
+ {
872
+ "epoch": 1.8118466898954704,
873
+ "grad_norm": 41.46570799336418,
874
+ "learning_rate": 2.850306438313643e-07,
875
+ "logits/chosen": -2.680405855178833,
876
+ "logits/rejected": -2.675224781036377,
877
+ "logps/chosen": -64.68285369873047,
878
+ "logps/rejected": -88.90663146972656,
879
+ "loss": 0.4,
880
+ "rewards/accuracies": 0.45625001192092896,
881
+ "rewards/chosen": 4.314764499664307,
882
+ "rewards/margins": 9.576990127563477,
883
+ "rewards/rejected": -5.262225151062012,
884
+ "step": 520
885
+ },
886
+ {
887
+ "epoch": 1.8466898954703832,
888
+ "grad_norm": 33.602492864431376,
889
+ "learning_rate": 2.711287168173922e-07,
890
+ "logits/chosen": -2.647599697113037,
891
+ "logits/rejected": -2.642518997192383,
892
+ "logps/chosen": -61.43975830078125,
893
+ "logps/rejected": -83.95713806152344,
894
+ "loss": 0.4019,
895
+ "rewards/accuracies": 0.4625000059604645,
896
+ "rewards/chosen": 4.453371047973633,
897
+ "rewards/margins": 10.12047290802002,
898
+ "rewards/rejected": -5.667101860046387,
899
+ "step": 530
900
+ },
901
+ {
902
+ "epoch": 1.8815331010452963,
903
+ "grad_norm": 24.453834226726745,
904
+ "learning_rate": 2.573567101306622e-07,
905
+ "logits/chosen": -2.65397310256958,
906
+ "logits/rejected": -2.66703462600708,
907
+ "logps/chosen": -52.19524002075195,
908
+ "logps/rejected": -87.27598571777344,
909
+ "loss": 0.4031,
910
+ "rewards/accuracies": 0.45625001192092896,
911
+ "rewards/chosen": 5.190995693206787,
912
+ "rewards/margins": 11.461599349975586,
913
+ "rewards/rejected": -6.270604133605957,
914
+ "step": 540
915
+ },
916
+ {
917
+ "epoch": 1.916376306620209,
918
+ "grad_norm": 2.684221407433218,
919
+ "learning_rate": 2.4373730964039504e-07,
920
+ "logits/chosen": -2.627750873565674,
921
+ "logits/rejected": -2.610384464263916,
922
+ "logps/chosen": -77.6567611694336,
923
+ "logps/rejected": -100.83102416992188,
924
+ "loss": 0.3791,
925
+ "rewards/accuracies": 0.5,
926
+ "rewards/chosen": 5.930109977722168,
927
+ "rewards/margins": 12.205314636230469,
928
+ "rewards/rejected": -6.275204658508301,
929
+ "step": 550
930
+ },
931
+ {
932
+ "epoch": 1.951219512195122,
933
+ "grad_norm": 79.1320535853655,
934
+ "learning_rate": 2.3029294983601597e-07,
935
+ "logits/chosen": -2.643815517425537,
936
+ "logits/rejected": -2.647261142730713,
937
+ "logps/chosen": -54.161102294921875,
938
+ "logps/rejected": -79.52574157714844,
939
+ "loss": 0.419,
940
+ "rewards/accuracies": 0.4375,
941
+ "rewards/chosen": 4.301507949829102,
942
+ "rewards/margins": 8.454828262329102,
943
+ "rewards/rejected": -4.153320789337158,
944
+ "step": 560
945
+ },
946
+ {
947
+ "epoch": 1.986062717770035,
948
+ "grad_norm": 29.398825057181867,
949
+ "learning_rate": 2.1704577687205507e-07,
950
+ "logits/chosen": -2.730459690093994,
951
+ "logits/rejected": -2.7015600204467773,
952
+ "logps/chosen": -56.092933654785156,
953
+ "logps/rejected": -70.2969970703125,
954
+ "loss": 0.4011,
955
+ "rewards/accuracies": 0.4437499940395355,
956
+ "rewards/chosen": 4.104133605957031,
957
+ "rewards/margins": 7.584607124328613,
958
+ "rewards/rejected": -3.4804725646972656,
959
+ "step": 570
960
+ },
961
+ {
962
+ "epoch": 2.0209059233449476,
963
+ "grad_norm": 69.01444268439244,
964
+ "learning_rate": 2.040176120880048e-07,
965
+ "logits/chosen": -2.742163896560669,
966
+ "logits/rejected": -2.739891529083252,
967
+ "logps/chosen": -66.87298583984375,
968
+ "logps/rejected": -92.02427673339844,
969
+ "loss": 0.3602,
970
+ "rewards/accuracies": 0.5062500238418579,
971
+ "rewards/chosen": 5.334141731262207,
972
+ "rewards/margins": 10.981359481811523,
973
+ "rewards/rejected": -5.647217750549316,
974
+ "step": 580
975
+ },
976
+ {
977
+ "epoch": 2.0557491289198606,
978
+ "grad_norm": 0.14017497005010535,
979
+ "learning_rate": 1.9122991606322655e-07,
980
+ "logits/chosen": -2.6563334465026855,
981
+ "logits/rejected": -2.6057848930358887,
982
+ "logps/chosen": -80.1333236694336,
983
+ "logps/rejected": -92.28086853027344,
984
+ "loss": 0.3413,
985
+ "rewards/accuracies": 0.5375000238418579,
986
+ "rewards/chosen": 6.459235191345215,
987
+ "rewards/margins": 12.748309135437012,
988
+ "rewards/rejected": -6.289073944091797,
989
+ "step": 590
990
+ },
991
+ {
992
+ "epoch": 2.0905923344947737,
993
+ "grad_norm": 0.4665024187650401,
994
+ "learning_rate": 1.7870375326612014e-07,
995
+ "logits/chosen": -2.6802449226379395,
996
+ "logits/rejected": -2.6527862548828125,
997
+ "logps/chosen": -52.02629470825195,
998
+ "logps/rejected": -71.58445739746094,
999
+ "loss": 0.3707,
1000
+ "rewards/accuracies": 0.45625001192092896,
1001
+ "rewards/chosen": 4.42466926574707,
1002
+ "rewards/margins": 9.960657119750977,
1003
+ "rewards/rejected": -5.535989284515381,
1004
+ "step": 600
1005
+ },
1006
+ {
1007
+ "epoch": 2.0905923344947737,
1008
+ "eval_logits/chosen": -2.706523895263672,
1009
+ "eval_logits/rejected": -2.6904332637786865,
1010
+ "eval_logps/chosen": -72.95612335205078,
1011
+ "eval_logps/rejected": -82.87360382080078,
1012
+ "eval_loss": 0.8992136716842651,
1013
+ "eval_rewards/accuracies": 0.3472222089767456,
1014
+ "eval_rewards/chosen": 0.6907990574836731,
1015
+ "eval_rewards/margins": 1.4761611223220825,
1016
+ "eval_rewards/rejected": -0.7853620052337646,
1017
+ "eval_runtime": 114.0517,
1018
+ "eval_samples_per_second": 17.536,
1019
+ "eval_steps_per_second": 0.552,
1020
+ "step": 600
1021
+ },
1022
+ {
1023
+ "epoch": 2.1254355400696863,
1024
+ "grad_norm": 2.1413870154178674,
1025
+ "learning_rate": 1.6645975735578165e-07,
1026
+ "logits/chosen": -2.7286293506622314,
1027
+ "logits/rejected": -2.6968209743499756,
1028
+ "logps/chosen": -73.44158172607422,
1029
+ "logps/rejected": -84.2601547241211,
1030
+ "loss": 0.3631,
1031
+ "rewards/accuracies": 0.48124998807907104,
1032
+ "rewards/chosen": 4.519391059875488,
1033
+ "rewards/margins": 10.450955390930176,
1034
+ "rewards/rejected": -5.931563377380371,
1035
+ "step": 610
1036
+ },
1037
+ {
1038
+ "epoch": 2.1602787456445993,
1039
+ "grad_norm": 7.439328512600098,
1040
+ "learning_rate": 1.5451809719331295e-07,
1041
+ "logits/chosen": -2.6922688484191895,
1042
+ "logits/rejected": -2.64046049118042,
1043
+ "logps/chosen": -66.28971099853516,
1044
+ "logps/rejected": -97.03089904785156,
1045
+ "loss": 0.3667,
1046
+ "rewards/accuracies": 0.518750011920929,
1047
+ "rewards/chosen": 5.458328723907471,
1048
+ "rewards/margins": 14.20256233215332,
1049
+ "rewards/rejected": -8.744232177734375,
1050
+ "step": 620
1051
+ },
1052
+ {
1053
+ "epoch": 2.1951219512195124,
1054
+ "grad_norm": 1.6583151417624387,
1055
+ "learning_rate": 1.4289844361876528e-07,
1056
+ "logits/chosen": -2.6175191402435303,
1057
+ "logits/rejected": -2.6281955242156982,
1058
+ "logps/chosen": -62.218841552734375,
1059
+ "logps/rejected": -105.92720031738281,
1060
+ "loss": 0.3438,
1061
+ "rewards/accuracies": 0.5249999761581421,
1062
+ "rewards/chosen": 4.9540114402771,
1063
+ "rewards/margins": 13.651411056518555,
1064
+ "rewards/rejected": -8.69740104675293,
1065
+ "step": 630
1066
+ },
1067
+ {
1068
+ "epoch": 2.229965156794425,
1069
+ "grad_norm": 0.8793732938037636,
1070
+ "learning_rate": 1.3161993704844647e-07,
1071
+ "logits/chosen": -2.642472505569458,
1072
+ "logits/rejected": -2.6352057456970215,
1073
+ "logps/chosen": -59.35064697265625,
1074
+ "logps/rejected": -101.8673324584961,
1075
+ "loss": 0.3678,
1076
+ "rewards/accuracies": 0.46875,
1077
+ "rewards/chosen": 4.605126857757568,
1078
+ "rewards/margins": 13.10174560546875,
1079
+ "rewards/rejected": -8.49661922454834,
1080
+ "step": 640
1081
+ },
1082
+ {
1083
+ "epoch": 2.264808362369338,
1084
+ "grad_norm": 10.705002096620795,
1085
+ "learning_rate": 1.2070115594596576e-07,
1086
+ "logits/chosen": -2.6452431678771973,
1087
+ "logits/rejected": -2.619326114654541,
1088
+ "logps/chosen": -69.06687927246094,
1089
+ "logps/rejected": -92.87254333496094,
1090
+ "loss": 0.3908,
1091
+ "rewards/accuracies": 0.42500001192092896,
1092
+ "rewards/chosen": 4.869100093841553,
1093
+ "rewards/margins": 11.985551834106445,
1094
+ "rewards/rejected": -7.116452217102051,
1095
+ "step": 650
1096
+ },
1097
+ {
1098
+ "epoch": 2.2996515679442506,
1099
+ "grad_norm": 1.1342333232981456,
1100
+ "learning_rate": 1.1016008621895228e-07,
1101
+ "logits/chosen": -2.6762545108795166,
1102
+ "logits/rejected": -2.6614186763763428,
1103
+ "logps/chosen": -68.25106811523438,
1104
+ "logps/rejected": -91.7679214477539,
1105
+ "loss": 0.3573,
1106
+ "rewards/accuracies": 0.4437499940395355,
1107
+ "rewards/chosen": 4.843748569488525,
1108
+ "rewards/margins": 12.167219161987305,
1109
+ "rewards/rejected": -7.323469638824463,
1110
+ "step": 660
1111
+ },
1112
+ {
1113
+ "epoch": 2.3344947735191637,
1114
+ "grad_norm": 0.6331495120745031,
1115
+ "learning_rate": 1.000140915918589e-07,
1116
+ "logits/chosen": -2.6580915451049805,
1117
+ "logits/rejected": -2.6534488201141357,
1118
+ "logps/chosen": -65.1015396118164,
1119
+ "logps/rejected": -91.0811538696289,
1120
+ "loss": 0.3679,
1121
+ "rewards/accuracies": 0.4625000059604645,
1122
+ "rewards/chosen": 3.4471499919891357,
1123
+ "rewards/margins": 11.26557731628418,
1124
+ "rewards/rejected": -7.818428039550781,
1125
+ "step": 670
1126
+ },
1127
+ {
1128
+ "epoch": 2.3693379790940767,
1129
+ "grad_norm": 2.2726954763261324,
1130
+ "learning_rate": 9.027988500365347e-08,
1131
+ "logits/chosen": -2.693312406539917,
1132
+ "logits/rejected": -2.680427312850952,
1133
+ "logps/chosen": -64.97016906738281,
1134
+ "logps/rejected": -90.56646728515625,
1135
+ "loss": 0.3555,
1136
+ "rewards/accuracies": 0.48750001192092896,
1137
+ "rewards/chosen": 3.795478343963623,
1138
+ "rewards/margins": 12.02349853515625,
1139
+ "rewards/rejected": -8.228020668029785,
1140
+ "step": 680
1141
+ },
1142
+ {
1143
+ "epoch": 2.40418118466899,
1144
+ "grad_norm": 0.22531368042473615,
1145
+ "learning_rate": 8.097350107751374e-08,
1146
+ "logits/chosen": -2.6593518257141113,
1147
+ "logits/rejected": -2.629815101623535,
1148
+ "logps/chosen": -72.8058853149414,
1149
+ "logps/rejected": -104.3930435180664,
1150
+ "loss": 0.3723,
1151
+ "rewards/accuracies": 0.4749999940395355,
1152
+ "rewards/chosen": 4.151662349700928,
1153
+ "rewards/margins": 11.557449340820312,
1154
+ "rewards/rejected": -7.405786991119385,
1155
+ "step": 690
1156
+ },
1157
+ {
1158
+ "epoch": 2.4390243902439024,
1159
+ "grad_norm": 2.7004181516244823,
1160
+ "learning_rate": 7.211026970787468e-08,
1161
+ "logits/chosen": -2.6526448726654053,
1162
+ "logits/rejected": -2.6184093952178955,
1163
+ "logps/chosen": -78.18067932128906,
1164
+ "logps/rejected": -97.9962158203125,
1165
+ "loss": 0.3672,
1166
+ "rewards/accuracies": 0.53125,
1167
+ "rewards/chosen": 4.823997497558594,
1168
+ "rewards/margins": 12.184492111206055,
1169
+ "rewards/rejected": -7.360495090484619,
1170
+ "step": 700
1171
+ },
1172
+ {
1173
+ "epoch": 2.4390243902439024,
1174
+ "eval_logits/chosen": -2.6822805404663086,
1175
+ "eval_logits/rejected": -2.666159152984619,
1176
+ "eval_logps/chosen": -75.6268539428711,
1177
+ "eval_logps/rejected": -86.10513305664062,
1178
+ "eval_loss": 0.9353695511817932,
1179
+ "eval_rewards/accuracies": 0.3492063581943512,
1180
+ "eval_rewards/chosen": -0.5110280513763428,
1181
+ "eval_rewards/margins": 1.7285233736038208,
1182
+ "eval_rewards/rejected": -2.239551305770874,
1183
+ "eval_runtime": 113.8285,
1184
+ "eval_samples_per_second": 17.57,
1185
+ "eval_steps_per_second": 0.553,
1186
+ "step": 700
1187
+ },
1188
+ {
1189
+ "epoch": 2.4738675958188154,
1190
+ "grad_norm": 0.0879833853193303,
1191
+ "learning_rate": 6.370479080833579e-08,
1192
+ "logits/chosen": -2.700568675994873,
1193
+ "logits/rejected": -2.6803741455078125,
1194
+ "logps/chosen": -84.60392761230469,
1195
+ "logps/rejected": -123.07450103759766,
1196
+ "loss": 0.3643,
1197
+ "rewards/accuracies": 0.5249999761581421,
1198
+ "rewards/chosen": 4.764341831207275,
1199
+ "rewards/margins": 15.03064250946045,
1200
+ "rewards/rejected": -10.266302108764648,
1201
+ "step": 710
1202
+ },
1203
+ {
1204
+ "epoch": 2.508710801393728,
1205
+ "grad_norm": 1.1049586982136799,
1206
+ "learning_rate": 5.5770910262027175e-08,
1207
+ "logits/chosen": -2.6708908081054688,
1208
+ "logits/rejected": -2.638007164001465,
1209
+ "logps/chosen": -60.929107666015625,
1210
+ "logps/rejected": -78.13368225097656,
1211
+ "loss": 0.4049,
1212
+ "rewards/accuracies": 0.45625001192092896,
1213
+ "rewards/chosen": 4.554830551147461,
1214
+ "rewards/margins": 11.321012496948242,
1215
+ "rewards/rejected": -6.766181945800781,
1216
+ "step": 720
1217
+ },
1218
+ {
1219
+ "epoch": 2.543554006968641,
1220
+ "grad_norm": 0.17464151203641626,
1221
+ "learning_rate": 4.832169711404716e-08,
1222
+ "logits/chosen": -2.702913284301758,
1223
+ "logits/rejected": -2.682276725769043,
1224
+ "logps/chosen": -61.146484375,
1225
+ "logps/rejected": -76.15191650390625,
1226
+ "loss": 0.384,
1227
+ "rewards/accuracies": 0.34375,
1228
+ "rewards/chosen": 4.145053863525391,
1229
+ "rewards/margins": 10.881394386291504,
1230
+ "rewards/rejected": -6.7363386154174805,
1231
+ "step": 730
1232
+ },
1233
+ {
1234
+ "epoch": 2.578397212543554,
1235
+ "grad_norm": 1.4963244248499354,
1236
+ "learning_rate": 4.1369422043543185e-08,
1237
+ "logits/chosen": -2.6026740074157715,
1238
+ "logits/rejected": -2.5997722148895264,
1239
+ "logps/chosen": -58.41170120239258,
1240
+ "logps/rejected": -99.71952056884766,
1241
+ "loss": 0.3645,
1242
+ "rewards/accuracies": 0.4375,
1243
+ "rewards/chosen": 3.420952320098877,
1244
+ "rewards/margins": 12.743513107299805,
1245
+ "rewards/rejected": -9.322561264038086,
1246
+ "step": 740
1247
+ },
1248
+ {
1249
+ "epoch": 2.6132404181184667,
1250
+ "grad_norm": 0.17470713028005413,
1251
+ "learning_rate": 3.492553715089692e-08,
1252
+ "logits/chosen": -2.6855924129486084,
1253
+ "logits/rejected": -2.6744086742401123,
1254
+ "logps/chosen": -59.6028938293457,
1255
+ "logps/rejected": -80.73404693603516,
1256
+ "loss": 0.3419,
1257
+ "rewards/accuracies": 0.4937500059604645,
1258
+ "rewards/chosen": 3.875894069671631,
1259
+ "rewards/margins": 9.938748359680176,
1260
+ "rewards/rejected": -6.062855243682861,
1261
+ "step": 750
1262
+ },
1263
+ {
1264
+ "epoch": 2.64808362369338,
1265
+ "grad_norm": 0.7912656232401193,
1266
+ "learning_rate": 2.9000657093309096e-08,
1267
+ "logits/chosen": -2.6053149700164795,
1268
+ "logits/rejected": -2.594897985458374,
1269
+ "logps/chosen": -45.781890869140625,
1270
+ "logps/rejected": -70.77694702148438,
1271
+ "loss": 0.3866,
1272
+ "rewards/accuracies": 0.38749998807907104,
1273
+ "rewards/chosen": 3.3890724182128906,
1274
+ "rewards/margins": 9.678762435913086,
1275
+ "rewards/rejected": -6.289690017700195,
1276
+ "step": 760
1277
+ },
1278
+ {
1279
+ "epoch": 2.682926829268293,
1280
+ "grad_norm": 0.03850043403510083,
1281
+ "learning_rate": 2.3604541599858524e-08,
1282
+ "logits/chosen": -2.7165355682373047,
1283
+ "logits/rejected": -2.698086738586426,
1284
+ "logps/chosen": -67.92710876464844,
1285
+ "logps/rejected": -97.56449890136719,
1286
+ "loss": 0.3756,
1287
+ "rewards/accuracies": 0.4625000059604645,
1288
+ "rewards/chosen": 3.9307703971862793,
1289
+ "rewards/margins": 12.25821590423584,
1290
+ "rewards/rejected": -8.327445030212402,
1291
+ "step": 770
1292
+ },
1293
+ {
1294
+ "epoch": 2.7177700348432055,
1295
+ "grad_norm": 0.23519403290223553,
1296
+ "learning_rate": 1.8746079394836706e-08,
1297
+ "logits/chosen": -2.656900405883789,
1298
+ "logits/rejected": -2.6522555351257324,
1299
+ "logps/chosen": -68.84635925292969,
1300
+ "logps/rejected": -102.07672119140625,
1301
+ "loss": 0.3447,
1302
+ "rewards/accuracies": 0.512499988079071,
1303
+ "rewards/chosen": 5.431530475616455,
1304
+ "rewards/margins": 14.358503341674805,
1305
+ "rewards/rejected": -8.926973342895508,
1306
+ "step": 780
1307
+ },
1308
+ {
1309
+ "epoch": 2.7526132404181185,
1310
+ "grad_norm": 0.605439153331925,
1311
+ "learning_rate": 1.4433273555842e-08,
1312
+ "logits/chosen": -2.6824231147766113,
1313
+ "logits/rejected": -2.6712381839752197,
1314
+ "logps/chosen": -57.7734260559082,
1315
+ "logps/rejected": -90.67354583740234,
1316
+ "loss": 0.3745,
1317
+ "rewards/accuracies": 0.46875,
1318
+ "rewards/chosen": 4.336173057556152,
1319
+ "rewards/margins": 11.832128524780273,
1320
+ "rewards/rejected": -7.495955467224121,
1321
+ "step": 790
1322
+ },
1323
+ {
1324
+ "epoch": 2.7874564459930316,
1325
+ "grad_norm": 8.838304462962064,
1326
+ "learning_rate": 1.0673228330749007e-08,
1327
+ "logits/chosen": -2.649266242980957,
1328
+ "logits/rejected": -2.6187663078308105,
1329
+ "logps/chosen": -66.30155944824219,
1330
+ "logps/rejected": -96.27810668945312,
1331
+ "loss": 0.3596,
1332
+ "rewards/accuracies": 0.46875,
1333
+ "rewards/chosen": 6.211874961853027,
1334
+ "rewards/margins": 15.416938781738281,
1335
+ "rewards/rejected": -9.205063819885254,
1336
+ "step": 800
1337
+ },
1338
+ {
1339
+ "epoch": 2.7874564459930316,
1340
+ "eval_logits/chosen": -2.6886420249938965,
1341
+ "eval_logits/rejected": -2.672652006149292,
1342
+ "eval_logps/chosen": -75.24068450927734,
1343
+ "eval_logps/rejected": -85.62494659423828,
1344
+ "eval_loss": 0.9344265460968018,
1345
+ "eval_rewards/accuracies": 0.3452380895614624,
1346
+ "eval_rewards/chosen": -0.33725666999816895,
1347
+ "eval_rewards/margins": 1.686206340789795,
1348
+ "eval_rewards/rejected": -2.023463249206543,
1349
+ "eval_runtime": 114.2576,
1350
+ "eval_samples_per_second": 17.504,
1351
+ "eval_steps_per_second": 0.551,
1352
+ "step": 800
1353
+ },
1354
+ {
1355
+ "epoch": 2.822299651567944,
1356
+ "grad_norm": 0.5583388154267291,
1357
+ "learning_rate": 7.472137435272619e-09,
1358
+ "logits/chosen": -2.6408095359802246,
1359
+ "logits/rejected": -2.6051411628723145,
1360
+ "logps/chosen": -79.69121551513672,
1361
+ "logps/rejected": -97.32070922851562,
1362
+ "loss": 0.3607,
1363
+ "rewards/accuracies": 0.5249999761581421,
1364
+ "rewards/chosen": 4.549583435058594,
1365
+ "rewards/margins": 12.599637985229492,
1366
+ "rewards/rejected": -8.050054550170898,
1367
+ "step": 810
1368
+ },
1369
+ {
1370
+ "epoch": 2.857142857142857,
1371
+ "grad_norm": 2.894525046100427,
1372
+ "learning_rate": 4.835273850400123e-09,
1373
+ "logits/chosen": -2.661485195159912,
1374
+ "logits/rejected": -2.6633734703063965,
1375
+ "logps/chosen": -79.87845611572266,
1376
+ "logps/rejected": -116.92414855957031,
1377
+ "loss": 0.3792,
1378
+ "rewards/accuracies": 0.512499988079071,
1379
+ "rewards/chosen": 5.126074314117432,
1380
+ "rewards/margins": 13.770315170288086,
1381
+ "rewards/rejected": -8.644243240356445,
1382
+ "step": 820
1383
+ },
1384
+ {
1385
+ "epoch": 2.89198606271777,
1386
+ "grad_norm": 0.19653863059909565,
1387
+ "learning_rate": 2.766981136500024e-09,
1388
+ "logits/chosen": -2.5719292163848877,
1389
+ "logits/rejected": -2.577578067779541,
1390
+ "logps/chosen": -54.612770080566406,
1391
+ "logps/rejected": -91.96238708496094,
1392
+ "loss": 0.3571,
1393
+ "rewards/accuracies": 0.44999998807907104,
1394
+ "rewards/chosen": 3.002854585647583,
1395
+ "rewards/margins": 11.87660026550293,
1396
+ "rewards/rejected": -8.873745918273926,
1397
+ "step": 830
1398
+ },
1399
+ {
1400
+ "epoch": 2.926829268292683,
1401
+ "grad_norm": 0.16246150387256475,
1402
+ "learning_rate": 1.2706662784136513e-09,
1403
+ "logits/chosen": -2.6564536094665527,
1404
+ "logits/rejected": -2.619932174682617,
1405
+ "logps/chosen": -71.76846313476562,
1406
+ "logps/rejected": -101.76387023925781,
1407
+ "loss": 0.3726,
1408
+ "rewards/accuracies": 0.4937500059604645,
1409
+ "rewards/chosen": 4.6055145263671875,
1410
+ "rewards/margins": 12.795547485351562,
1411
+ "rewards/rejected": -8.190032958984375,
1412
+ "step": 840
1413
+ },
1414
+ {
1415
+ "epoch": 2.961672473867596,
1416
+ "grad_norm": 0.90921168828385,
1417
+ "learning_rate": 3.4879407331657175e-10,
1418
+ "logits/chosen": -2.68048357963562,
1419
+ "logits/rejected": -2.6529481410980225,
1420
+ "logps/chosen": -80.40882873535156,
1421
+ "logps/rejected": -104.63221740722656,
1422
+ "loss": 0.3615,
1423
+ "rewards/accuracies": 0.518750011920929,
1424
+ "rewards/chosen": 5.170698165893555,
1425
+ "rewards/margins": 13.889617919921875,
1426
+ "rewards/rejected": -8.718920707702637,
1427
+ "step": 850
1428
+ },
1429
+ {
1430
+ "epoch": 2.996515679442509,
1431
+ "grad_norm": 1.296285561761296,
1432
+ "learning_rate": 2.8830705936344624e-12,
1433
+ "logits/chosen": -2.618508815765381,
1434
+ "logits/rejected": -2.619737386703491,
1435
+ "logps/chosen": -62.67692947387695,
1436
+ "logps/rejected": -99.06525421142578,
1437
+ "loss": 0.3391,
1438
+ "rewards/accuracies": 0.4625000059604645,
1439
+ "rewards/chosen": 5.471669673919678,
1440
+ "rewards/margins": 15.48302936553955,
1441
+ "rewards/rejected": -10.011358261108398,
1442
+ "step": 860
1443
+ },
1444
+ {
1445
+ "epoch": 3.0,
1446
+ "step": 861,
1447
+ "total_flos": 0.0,
1448
+ "train_loss": 0.49774221859604084,
1449
+ "train_runtime": 9884.8529,
1450
+ "train_samples_per_second": 5.566,
1451
+ "train_steps_per_second": 0.087
1452
+ }
1453
+ ],
1454
+ "logging_steps": 10,
1455
+ "max_steps": 861,
1456
+ "num_input_tokens_seen": 0,
1457
+ "num_train_epochs": 3,
1458
+ "save_steps": 100,
1459
+ "stateful_callbacks": {
1460
+ "TrainerControl": {
1461
+ "args": {
1462
+ "should_epoch_stop": false,
1463
+ "should_evaluate": false,
1464
+ "should_log": false,
1465
+ "should_save": true,
1466
+ "should_training_stop": true
1467
+ },
1468
+ "attributes": {}
1469
+ }
1470
+ },
1471
+ "total_flos": 0.0,
1472
+ "train_batch_size": 8,
1473
+ "trial_name": null,
1474
+ "trial_params": null
1475
+ }