nthakur
/

Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2

alignment-handbook

Generated from Trainer

Model card Files Files and versions

Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2 / README.md

nthakur's picture

End of training

288c652 verified about 1 year ago

|

history blame contribute delete

3.87 kB

	---
	license: apache-2.0
	library_name: peft
	tags:
	- alignment-handbook
	- trl
	- dpo
	- generated_from_trainer
	base_model: mistralai/Mistral-7B-Instruct-v0.2
	datasets:
	- nthakur/multilingual-ultrafeedback-binarized-dpo-v0.1
	- nthakur/multilingual-distilabel-intel-orca-dpo-pairs-v0.1
	- nthakur/multilingual-truthy-dpo-pairs-v0.1
	- nthakur/GSM8KInstruct-Parallel-instruct-dpo-v0.1
	model-index:
	- name: Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2
	results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2

	This model is a fine-tuned version of [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the nthakur/multilingual-ultrafeedback-binarized-dpo-v0.1, the nthakur/multilingual-distilabel-intel-orca-dpo-pairs-v0.1, the nthakur/multilingual-truthy-dpo-pairs-v0.1 and the nthakur/GSM8KInstruct-Parallel-instruct-dpo-v0.1 datasets.
	It achieves the following results on the evaluation set:
	- Loss: 0.1324
	- Rewards/chosen: -2.6738
	- Rewards/rejected: -12.2394
	- Rewards/accuracies: 0.9377
	- Rewards/margins: 9.5656
	- Logps/rejected: -1515.8665
	- Logps/chosen: -607.0774
	- Logits/rejected: 0.4952
	- Logits/chosen: 0.3030

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	More information needed

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0002
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 3
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 24
	- total_eval_batch_size: 12
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 1

	### Training results

	\| Training Loss \| Epoch \| Step \| Validation Loss \| Rewards/chosen \| Rewards/rejected \| Rewards/accuracies \| Rewards/margins \| Logps/rejected \| Logps/chosen \| Logits/rejected \| Logits/chosen \|
	\|:-------------:\|:------:\|:----:\|:---------------:\|:--------------:\|:----------------:\|:------------------:\|:---------------:\|:--------------:\|:------------:\|:---------------:\|:-------------:\|
	\| 0.2695 \| 0.1361 \| 500 \| 0.2653 \| -0.4399 \| -4.5379 \| 0.8680 \| 4.0981 \| -745.7153 \| -383.6803 \| -1.3998 \| -1.5327 \|
	\| 0.4349 \| 0.2723 \| 1000 \| 0.3152 \| -2.6018 \| -7.1212 \| 0.8515 \| 4.5195 \| -1004.0471 \| -599.8698 \| 4.1724 \| 4.7868 \|
	\| 0.531 \| 0.4084 \| 1500 \| 0.4873 \| -2.4253 \| -8.0681 \| 0.7855 \| 5.6428 \| -1098.7278 \| -582.2241 \| -1.5195 \| -1.6538 \|
	\| 0.1681 \| 0.5446 \| 2000 \| 0.2003 \| -3.9555 \| -13.1169 \| 0.9089 \| 9.1613 \| -1603.6106 \| -735.2488 \| -0.1888 \| -0.3742 \|
	\| 0.1778 \| 0.6807 \| 2500 \| 0.2004 \| -3.4745 \| -11.9768 \| 0.9242 \| 8.5023 \| -1489.6012 \| -687.1464 \| -0.7118 \| -0.9608 \|
	\| 0.1342 \| 0.8169 \| 3000 \| 0.1452 \| -3.0928 \| -12.8477 \| 0.9340 \| 9.7549 \| -1576.6960 \| -648.9738 \| 0.6727 \| 0.5428 \|
	\| 0.1252 \| 0.9530 \| 3500 \| 0.1328 \| -2.7014 \| -12.3976 \| 0.9383 \| 9.6962 \| -1531.6849 \| -609.8344 \| 0.5002 \| 0.3026 \|


	### Framework versions

	- PEFT 0.7.1
	- Transformers 4.41.2
	- Pytorch 2.3.0+cu121
	- Datasets 2.20.0
	- Tokenizers 0.19.1