---
language:
- ko
- en
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
base_model:
- microsoft/Multilingual-MiniLM-L12-H384
---

# Frony Embed V1 (tiny)

This is an efficient embedding model designed specifically for the Korean language.

It has been trained on a diverse set of data sources, including AI Hub, to ensure robust performance across a wide range of retrieval tasks.

The model demonstrates strong retrieval capabilities across:<br>

* Korean–Korean
* Korean–English
* English–Korean

To support resource-constrained environments, the model is also compatible with Matryoshka Embeddings, enabling retrieval even at reduced dimensions **(e.g., half of the original size)** without significant performance loss; see the Matryoshka example in the Usage section below.

All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base Model:** microsoft/Multilingual-MiniLM-L12-H384
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 384 / 192 dimensions
- **Similarity Function:** Cosine Similarity
- **Languages:** ko, en
- **License:** apache-2.0

### Datasets

This model was trained on data from many sources, including **AI Hub**.<br>
In total, 100,000 query–document pairs were used for training.<br>

### Training Details

The overall training process was conducted with reference to **snowflake-arctic-2.0**.<br>
Training was divided into two stages: Pre-training and Post-training.

* In the pre-training stage, the model was trained with in-batch negatives.
* In the post-training stage, we used the multilingual-e5-large model to identify hard negatives: specifically, the top 4 candidates with a similarity score below a **99% threshold**. A minimal sketch of this mining step is shown below.
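
The exact mining code is not released, so the following is only a hypothetical sketch of such a step. It assumes the threshold means "below 99% of the query–positive score" (one common reading of this kind of false-negative filter); the function name `mine_hard_negatives` and the `corpus` variable are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of hard-negative mining with multilingual-e5-large;
# not the authors' actual training code.
import numpy as np
from sentence_transformers import SentenceTransformer

miner = SentenceTransformer("intfloat/multilingual-e5-large")

def mine_hard_negatives(query, positive, corpus, top_k=4, threshold=0.99):
    """Return up to top_k passages that score high against the query while staying
    below threshold * (query-positive score), filtering out likely false negatives."""
    q_emb = miner.encode("query: " + query, normalize_embeddings=True)
    pos_emb = miner.encode("passage: " + positive, normalize_embeddings=True)
    doc_embs = miner.encode(["passage: " + d for d in corpus], normalize_embeddings=True)

    pos_score = float(np.dot(q_emb, pos_emb))
    scores = doc_embs @ q_emb  # cosine similarities (embeddings are L2-normalized)

    ranked = np.argsort(-scores)  # most similar candidates first
    negatives = [corpus[i] for i in ranked if scores[i] < threshold * pos_score]
    return negatives[:top_k]
```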

Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
The types of data augmentation applied are as follows:

| Augmentation* | Description |
|-----------|-----------|
| Pair concatenation | Multi-query & Multi-passage |
| Language transfer | Korean to English on query & passage |
| Style transfer | Plain sentences to Markdown description |

*\*Augmentation was carried out using Gemma-3-12B*

### Evaluation

The evaluation consists of five dataset groups.
Three groups are subsets extracted from AI Hub datasets.
One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
The final group is a concatenation of all four aforementioned groups, providing a comprehensive mixed set.<br>
The following table presents the average retrieval performance across these five dataset groups.

| Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
|--------------|-----------|-----------|-----------|------------|------------|-------------|
| frony-embed-medium | Open | 337M | 0.6649 | 0.8040 | 0.8458 | 0.8876 |
| frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
| frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
| frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
| frony-embed-tiny | **Open** | 21M* | 0.5084 | **0.6757** | 0.7278 | 0.7845 |
| frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
| bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
| multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
| snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
| jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
| upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
| openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |

*\*Transformer blocks only*
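
Accuracy@k here is the standard retrieval accuracy: the fraction of queries whose gold passage appears among the top k retrieved passages. Below is a minimal sketch of that computation; the data and variable names are illustrative, not taken from the actual evaluation code.

```python
import numpy as np

def accuracy_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top-k retrieved results."""
    hits = [gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids)]
    return float(np.mean(hits))

# Toy example: two queries; the gold passage is ranked 1st for the first query
# and is outside the top 3 for the second, so Accuracy@3 = 0.5.
ranked_ids = [["p3", "p7", "p1"], ["p9", "p2", "p5"]]
gold_ids = ["p3", "p4"]
print(accuracy_at_k(ranked_ids, gold_ids, k=3))  # 0.5
```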

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")

# Run inference
# '<Q>' is the special token prepended to queries.
queries = [
    '<Q>안녕하세요',
]
query_embeddings = model.encode(queries)

# '<P>' is the special token prepended to passages.
passages = [
    '<P>반갑습니다',
]
passage_embeddings = model.encode(passages)

# Matryoshka Embeddings: keep the first half of the dimensions (192 of 384),
# then re-normalize so cosine similarity remains meaningful.
queries = [
    '<Q>안녕하세요',
]
embeddings = model.encode(queries, normalize_embeddings=False, convert_to_tensor=True)[:, :192]
embeddings = F.normalize(embeddings, p=2, dim=-1)
```
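
To turn embeddings into retrieval scores, you can compute cosine similarity with the `util.cos_sim` helper from Sentence Transformers. A minimal, self-contained sketch (the query and passages are toy examples):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")

# Encode one query and two candidate passages (toy examples).
query_embeddings = model.encode(['<Q>안녕하세요'])
passage_embeddings = model.encode(['<P>반갑습니다', '<P>오늘 회의는 3시에 시작합니다'])

# Cosine similarity matrix of shape [n_queries, n_passages];
# higher scores indicate more relevant passages.
scores = util.cos_sim(query_embeddings, passage_embeddings)
print(scores)
```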

## Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project.
You can also reach us by email at [email protected].