---
library_name: transformers
license: mit
datasets:
- ai4bharat/IndicVoices
language:
- hi
- gu
- mr
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Open-Sarika

Open-Sarika is a speech recognition and translation model for three Indian languages: Hindi, Gujarati, and Marathi. It transcribes speech in these languages and translates between them. It is an open-source implementation inspired by Sarvam AI's Sarika model.

## Model Details

### Model Description

- **Model type:** Speech Recognition and Translation (based on Whisper architecture)
- **Language(s):** Hindi (hi), Gujarati (gu), Marathi (mr)
- **License:** MIT
- **Base Model:** openai/whisper-large-v3

## Uses

### Direct Use

The model can be used for:
1. Transcribing speech in Hindi, Gujarati, and Marathi
2. Translating speech between these languages

Here's a simple example to get started:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "theharshithh/open-sarika-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
# Clear any forced decoder prompt stored in the model config
model.config.forced_decoder_ids = None

# Load the audio and resample it to 16 kHz, the rate Whisper expects
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

# Generate transcription
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs)
transcription = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(transcription)
```
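
By default the Whisper feature extractor pads or truncates each input to a 30-second window, so the quick-start example above only covers roughly the first 30 seconds of a recording. The sketch below shows one way to handle longer files and to pin the spoken language explicitly. The helper names (`chunk_audio`, `generation_kwargs`, `transcribe_long`) are hypothetical and not part of this repository. It also assumes this checkpoint's generation config accepts the `language` and `task` arguments of `generate`, which has not been verified here.

```python
SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz audio
WINDOW_SECONDS = 30    # Whisper's fixed input window
SUPPORTED_LANGUAGES = {"hi", "gu", "mr"}  # languages listed in this model card


def chunk_audio(audio, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Split a 1-D sequence of samples into consecutive windows.

    The last window may be shorter than the others.
    """
    step = sample_rate * window_seconds
    return [audio[start:start + step] for start in range(0, len(audio), step)]


def generation_kwargs(language, task="transcribe"):
    """Build keyword arguments for generate(), rejecting unlisted languages."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {"language": language, "task": task}


def transcribe_long(audio_path, language, model_id="theharshithh/open-sarika-v1"):
    """Transcribe a file of any length, one 30-second window at a time."""
    # Heavy imports stay local so the helpers above have no dependencies.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE)
    pieces = []
    for chunk in chunk_audio(audio):
        inputs = processor(chunk, sampling_rate=SAMPLE_RATE, return_tensors="pt").to(device)
        with torch.no_grad():
            # Assumption: the checkpoint supports the language/task arguments.
            output_ids = model.generate(**inputs, **generation_kwargs(language))
        text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        pieces.append(text.strip())
    return " ".join(pieces)
```

Calling `transcribe_long("your_audio.wav", "hi")` would return a single string for the whole file. Fixed-size windows can split a word at a chunk boundary, so overlapping windows or silence-based splitting may give cleaner results on real recordings.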

### Training Data

The model was trained on a variety of datasets, including:
- Project Vaani dataset: A large-scale Indian language collection project by the Indian Institute of Science (IISc) in collaboration with ARTPARK, funded by Google
- High-quality speech recordings in Hindi, Gujarati, and Marathi from AI4Bharat
- Real-world speech data from various sources

### Hardware Requirements

- Minimum RAM: 8GB
- GPU: Recommended for faster inference
- Storage: Model size is approximately 1.5GB
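
When a GPU is available, loading the weights in half precision is a common way to reduce memory use and speed up inference. The following is a minimal sketch under that assumption, not a configuration tested with this checkpoint. The helpers `pick_dtype_name` and `load_model` are hypothetical names introduced for illustration.

```python
def pick_dtype_name(device):
    """Return the torch dtype name to use: float16 on GPU, float32 on CPU."""
    return "float16" if device == "cuda" else "float32"


def load_model(model_id="theharshithh/open-sarika-v1"):
    """Load the model on the best available device with a matching dtype."""
    # Imports are local so pick_dtype_name can be used without torch installed.
    import torch
    from transformers import WhisperForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = getattr(torch, pick_dtype_name(device))
    model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype)
    return model.to(device), device, dtype
```

With a half-precision model, the input features must be cast to the same dtype before generation, for example `inputs.input_features.to(device, dtype=dtype)`.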

## Model Card Contact

For issues and feedback, please create an issue on the model's repository: https://huggingface.co/theharshithh/open-sarika-v1

## GitHub

GitHub repository: https://github.com/theharshithh/open-sarika