We release the scripts to evaluate our model's performance [here](https://github
## Training

Our code reranker performs LLM-based listwise reranking, an approach that has gained prominence for its ability to score multiple passages simultaneously. To build listwise training data, we selected 50,000 `<query, positive, negatives>` tuples from our high-quality dataset [CoRNStack](https://gangiswag.github.io/cornstack/), filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack does not contain the ranked orderings required to train listwise rerankers, we use orderings produced by [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) for each example as ranking supervision. We initialize our reranker from [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) and fine-tune it with a language modeling objective that minimizes the prediction error of the next token in the sequence.
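
To make the setup concrete, here is a minimal sketch of listwise reranking at inference time with a causal LM, loosely following the RankGPT-style prompting common in this line of work. The checkpoint path, prompt template, and `[i] > [j]` output format are illustrative assumptions for this sketch, not the released model's actual interface.

```python
# Illustrative listwise reranking with a causal LM (RankGPT-style prompting).
# NOTE: "path/to/code-reranker", the prompt template, and the "[2] > [1]" output
# format are assumptions for this sketch, not the released model's actual interface.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/code-reranker"  # placeholder for the released reranker checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def listwise_rerank(query: str, passages: list[str]) -> list[int]:
    """Score all candidates in one pass and return indices in ranked order."""
    # Present every candidate at once, each tagged with a numeric identifier,
    # and ask the model to emit a ranked ordering such as "[2] > [1] > [3]".
    listing = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Rank the {len(passages)} code passages below by relevance to the query.\n"
        f"Query: {query}\n{listing}\n"
        "Output identifiers from most to least relevant, e.g. [2] > [1] > [3].\n"
        "Ranking:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    order = [
        int(m) - 1
        for m in re.findall(r"\[(\d+)\]", completion)
        if 0 < int(m) <= len(passages)
    ]
    # Deduplicate while preserving order, then append anything the model omitted.
    order = list(dict.fromkeys(order))
    return order + [i for i in range(len(passages)) if i not in order]
```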
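
And a sketch of the corresponding fine-tuning step: the teacher-provided ordering is serialized as a target string, and the loss is ordinary next-token cross-entropy with the prompt positions masked out, matching the language modeling objective described above. The `[3] > [1] > [2]` target serialization is again an assumed format.

```python
# Illustrative fine-tuning loss: next-token cross-entropy over the target ranking
# string only (prompt positions masked with -100). The "[3] > [1] > [2]" target
# serialization is an assumed format for this sketch.
import torch

def ranking_lm_loss(model, tokenizer, prompt: str, teacher_order: list[int]) -> torch.Tensor:
    # Serialize the teacher-provided ordering as the generation target.
    target = " > ".join(f"[{i + 1}]" for i in teacher_order) + tokenizer.eos_token
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1).to(model.device)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only the ranking tokens contribute to the loss
    return model(input_ids=input_ids, labels=labels).loss
```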

## Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
```