sentence-transformers-from-synthetic-data

davanstrien's Collections


updated Jun 21, 2024

Example of using distilabel to generate synthetic triplet data for fine-tuning a Sentence Transformers model

bigcode/self-oss-instruct-sc2-exec-filter-50k

Viewer • Updated Nov 4, 2024 • 50.7k • 265 • 104
Note: Input dataset for generating the synthetic data. We use the `instruction` column as the starting point.
davanstrien/similarity-dataset-sc2-8b

Viewer • Updated May 30, 2024 • 2.32k • 90 • 6

Note: This dataset was generated by our pipeline. The `instruction` column from the input dataset becomes the anchor, and a positive and a negative example are generated for each anchor. The result is a triplet dataset that we can use to train a Sentence Transformers model. You can find the code used here: https://github.com/davanstrien/awesome-synthetic-datasets
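A minimal sketch of the mapping described in this note, not the actual pipeline (which uses distilabel and lives in the linked repository). The field names `anchor`, `positive`, and `negative`, the helper `build_triplets`, and the stand-in generator `fake_generate_pair` are assumptions made for illustration. `train_on_triplets` shows one way such records might be fed to Sentence Transformers (v3 or later), with the base model left as a parameter because the source does not name it.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_triplets(
    rows: Iterable[Dict[str, str]],
    generate_pair: Callable[[str], Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Turn rows with an `instruction` field into anchor/positive/negative records."""
    triplets = []
    for row in rows:
        anchor = row["instruction"].strip()
        if not anchor:
            continue  # skip rows with an empty instruction
        positive, negative = generate_pair(anchor)
        triplets.append({"anchor": anchor, "positive": positive, "negative": negative})
    return triplets


def fake_generate_pair(anchor: str) -> Tuple[str, str]:
    # Hypothetical placeholder for the LLM call that writes the positive
    # and negative sentences in the real pipeline.
    return f"Rephrased: {anchor}", "Write a function that reverses a string."


def train_on_triplets(triplets: List[Dict[str, str]], base_model: str):
    # Imports are local so the helpers above work without these packages installed.
    from datasets import Dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        losses,
    )

    model = SentenceTransformer(base_model)
    # Column order (anchor, positive, negative) follows the dict key order above.
    train_dataset = Dataset.from_list(triplets)
    # One loss that accepts triplets; the source does not say which loss was used.
    loss = losses.MultipleNegativesRankingLoss(model)
    trainer = SentenceTransformerTrainer(
        model=model, train_dataset=train_dataset, loss=loss
    )
    trainer.train()
    return model


rows = [
    {"instruction": "Write a Python function that sums a list."},
    {"instruction": "   "},
]
triplets = build_triplets(rows, fake_generate_pair)
```

Keeping the generation step behind a callable means the same mapping can be exercised with a stub, as here, before swapping in a real model call.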
davanstrien/code-prompt-similarity-model

Sentence Similarity • 0.1B • Updated May 29, 2024 • 4 • 6
Note: A Sentence Transformers model fine-tuned on the dataset above. Even minimal fine-tuning gives a noticeable bump in performance.
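The note does not say which metric the improvement was measured with. One common check for triplet data is triplet accuracy: the share of triplets where the anchor is closer to the positive than to the negative. The sketch below is only an illustration of that idea. `toy_embed` is a hypothetical letter-count embedding used so the example runs without a model, and the `anchor`/`positive`/`negative` keys are assumed field names.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_accuracy(
    embed: Callable[[str], Sequence[float]],
    triplets: List[Dict[str, str]],
) -> float:
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    if not triplets:
        return 0.0
    correct = 0
    for t in triplets:
        a = embed(t["anchor"])
        p = embed(t["positive"])
        n = embed(t["negative"])
        if cosine(a, p) > cosine(a, n):
            correct += 1
    return correct / len(triplets)


def toy_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real encoder: counts of each letter a-z.
    text = text.lower()
    return [float(text.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


example = [
    {"anchor": "sum a list", "positive": "sum a list of numbers", "negative": "zzzz qqqq"}
]
score = triplet_accuracy(toy_embed, example)
```

With a real model, `embed` could wrap `SentenceTransformer.encode`, and running the same function on the base and fine-tuned models gives a like-for-like comparison. Sentence Transformers also ships a `TripletEvaluator` in `sentence_transformers.evaluation` for the same purpose.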
davanstrien/abstract-wiki

Viewer • Updated Jun 11, 2024 • 5k • 44 • 2
