---
language:
- ko
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- transformers
---

## PwC-Embedding-expr

We trained the **PwC-Embedding-expr** model on top of the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) embedding model.  
To improve performance on Korean, we applied our curated augmentation to STS datasets and fine-tuned the E5 model with a carefully balanced mixing ratio across datasets.

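The snippet below is a minimal usage sketch with `sentence-transformers`. The repo ID is a placeholder, and the instruction-prefix convention carried over from multilingual-e5-large-instruct is an assumption rather than a confirmed detail of this fine-tune.

```python
from sentence_transformers import SentenceTransformer

# NOTE: the repo ID below is a placeholder; point it at the published checkpoint.
model = SentenceTransformer("PwC-Embedding-expr")

# multilingual-e5-large-instruct expects instruction-prefixed queries;
# whether this fine-tune keeps that convention is an assumption.
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "한국의 수도는 어디인가요?")]
passages = ["대한민국의 수도는 서울이다."]

query_emb = model.encode(queries, normalize_embeddings=True)      # shape: (1, 1024)
passage_emb = model.encode(passages, normalize_embeddings=True)   # shape: (1, 1024)

# Cosine similarity via dot product on normalized embeddings.
print(query_emb @ passage_emb.T)
```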

### To-do
- [ ] MTEB Leaderboard  
- [ ] Technical Report


## MTEB
PwC-Embedding-expr was evaluated on the Korean subset of MTEB.  
A leaderboard link will be added once the results are published.

| Task             | PwC-Embedding-expr | multilingual-e5-large | Max Result |
|------------------|--------------------|-----------------------|------------|
| KLUE-STS         | 0.88               | 0.83                  | 0.90       |
| KLUE-TC          | 0.73               | 0.61                  | 0.73       |
| Ko-StrategyQA    | 0.80               | 0.80                  | 0.83       |
| KorSTS           | 0.84               | 0.81                  | 0.98       |
| MIRACL-Reranking | 0.72               | 0.65                  | 0.72       |
| MIRACL-Retrieval | 0.65               | 0.59                  | 0.72       |
| **Average**      | **0.77**           | 0.71                  | 0.81       |


## Model
- Base Model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Model Size: 0.56B parameters
- Embedding Dimension: 1024
- Max Input Tokens: 514


## Requirements
The model works with the dependencies included in the latest release of the [`mteb`](https://github.com/embeddings-benchmark/mteb) package.

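A hedged evaluation sketch using the `mteb` package follows. The task identifiers mirror the table above, but the exact MTEB task names and the repo ID are assumptions; adjust them to the published checkpoint and benchmark version.

```python
import mteb
from sentence_transformers import SentenceTransformer

# NOTE: the repo ID below is a placeholder; point it at the published checkpoint.
model = SentenceTransformer("PwC-Embedding-expr")

# Korean tasks from the table above; exact MTEB identifiers may differ.
tasks = mteb.get_tasks(tasks=["KLUE-STS", "KorSTS", "Ko-StrategyQA"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/PwC-Embedding-expr")
```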

## Citation

TBD (technical report expected September 2025)