---
license: apache-2.0
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
tags:
- language detection
- German
- English
- French
- Spanish
- GEFS
- Language detector
datasets:
- papluca/language-identification
language:
- de
- en
- fr
- es
---

# German, English, French and Spanish Language Detector

ImranzamanML/GEFS-language-detector is a fine-tuned version of the base model [xlm-roberta-base](https://huggingface.co/xlm-roberta-base), trained on papluca's [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.

The model achieves an F1 score close to 100% on the test set, between 0.9995 and 1.0000 per language (see Testing Results).

## Supported languages
The model currently supports four languages; more languages will be added in the future.

The model supports the following languages:
- German (de)
- English (en)
- Spanish (es)
- French (fr)

## Use a pipeline as a high-level helper

```python
from transformers import pipeline

# One sample sentence per supported language (de, en, es, fr).
text = [
    "Mir gefällt die Art und Weise, Sprachen zu erkennen",
    "I like the way to detect languages",
    "Me gusta la forma de detectar idiomas",
    "J'aime la façon de détecter les langues",
]
pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")
# top_k=1 keeps only the highest-scoring label for each input.
lang_detect = pipe(text, top_k=1)
print("The detected languages are", lang_detect)
```
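
The pipeline returns dictionaries with `label` and `score` keys rather than plain language codes. The sketch below shows one way to flatten that output into a list of labels. It assumes a recent `transformers` release, where passing `top_k` explicitly yields a list of dictionaries per input and omitting it yields a single dictionary per input. The helper name `top_labels` and the sample data are hypothetical, and the exact label strings depend on the checkpoint's configuration.

```python
def top_labels(results):
    """Flatten text-classification pipeline output to one label per input."""
    labels = []
    for item in results:
        # With an explicit top_k each input maps to a list of dicts;
        # without it, each input maps to a single dict. Handle both shapes.
        best = item[0] if isinstance(item, list) else item
        labels.append(best["label"])
    return labels


# Hypothetical output for two inputs; labels and scores are placeholders,
# not real predictions from the model.
example = [
    [{"label": "de", "score": 0.99}],
    [{"label": "en", "score": 0.99}],
]
print(top_labels(example))  # ['de', 'en']
```

In practice the argument would be the value returned by the pipeline call, for example `top_labels(pipe(text, top_k=1))`.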

## Load model directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")

```
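
Loading the tokenizer and model is only the first step; to get a prediction, the text still has to be tokenized, passed through the model, and the logits mapped back to labels. The following is a minimal sketch of that flow, assuming PyTorch is installed. The function name `classify` is hypothetical, and the label strings come from the checkpoint's `id2label` mapping, which is not verified here.

```python
MODEL_ID = "ImranzamanML/GEFS-language-detector"


def classify(texts, model_id=MODEL_ID):
    """Return one (label, probability) pair per input string."""
    # Imported here so the heavy dependencies load only when the function runs.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()  # make sure dropout is disabled for inference

    # Pad and truncate so texts of different lengths fit into one batch.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert logits to probabilities and pick the best class per input.
    probs = torch.softmax(logits, dim=-1)
    scores, ids = probs.max(dim=-1)
    # id2label is read from the checkpoint's config (assumed to hold language codes).
    return [
        (model.config.id2label[i], s)
        for i, s in zip(ids.tolist(), scores.tolist())
    ]
```

Calling `classify(["Guten Morgen", "Good morning"])` downloads the checkpoint on first use and returns a list of `(label, probability)` tuples, one per input.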

## Model Training

    Epoch   Training Loss   Validation Loss
    1       0.002600        0.000148
    2       0.001000        0.000015
    3       0.000000        0.000011
    4       0.001800        0.000009
    5       0.002700        0.000016
    6       0.001600        0.000012
    7       0.001300        0.000009
    8       0.001200        0.000008
    9       0.000900        0.000007
    10      0.000900        0.000007


## Testing Results

    Language   Precision   Recall   F1       Accuracy
    de         0.9997      0.9998   0.9998   0.9999
    en         1.0000      1.0000   1.0000   1.0000
    fr         0.9995      0.9996   0.9996   0.9996
    es         0.9994      0.9996   0.9995   0.9996
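
F1 is the harmonic mean of precision and recall, so the F1 column can be cross-checked against the other two. The sketch below recomputes it from the table values; the helper name `f1_score` is hypothetical, and a tolerance of 1e-4 is assumed because the reported figures are rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "de": (0.9997, 0.9998, 0.9998),
    "en": (1.0000, 1.0000, 1.0000),
    "fr": (0.9995, 0.9996, 0.9996),
    "es": (0.9994, 0.9996, 0.9995),
}

for lang, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    # Inputs are rounded to 4 decimals, so compare with a small tolerance.
    consistent = abs(recomputed - f1) < 1e-4
    print(f"{lang}: recomputed F1 = {recomputed:.6f}, consistent = {consistent}")
```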

## About Author

- **Name**: Muhammad Imran Zaman
- **Company**: [Theum AG](https://theum.com/en/index.htm?t=)
- **Professional Links**:
    - Kaggle: [Profile](https://www.kaggle.com/muhammadimran112233)
    - LinkedIn: [Profile](https://www.linkedin.com/in/muhammad-imran-zaman)
    - Google Scholar: [Profile](https://scholar.google.com/citations?user=ulVFpy8AAAAJ&hl=en)
    - YouTube: [Channel](https://www.youtube.com/@consolioo)
    - GitHub: [Profile](https://github.com/Imran-ml)