Upload folder using huggingface_hub
README.md
CHANGED
@@ -19,7 +19,7 @@ base_model: facebook/hf-seamless-m4t-medium
|
|
19 |
|
20 |
This is a **SeamlessLanguagePairs** model that processes audio and text inputs with both translation awareness and language pair embeddings to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment, taking into account both whether the subtitle is translated and the specific language pair involved.
|
21 |
|
22 |
-
The model extends the SeamlessM4T architecture with both translation features and language pair embeddings, providing the most granular control for multilingual
|
23 |
|
24 |
### Key Features
|
25 |
|
@@ -137,6 +137,14 @@ print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
|
|
137 |
- **Output**: Single regression value (TTE in seconds)
|
138 |
- **Task**: Subtitle editing time prediction
|
139 |
|
140 |
## Data Format
|
141 |
|
142 |
Your input data should be a list of dictionaries with:
|
@@ -193,12 +201,12 @@ data = [
|
|
193 |
|
194 |
The model was trained with the following specifications:
|
195 |
|
196 |
-
- **Dataset**: Multimodal audio-subtitle pairs with translation and language pair annotations
|
197 |
- **Train/Test Split**: 80/20 with random seed 42
|
198 |
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
|
199 |
- **Text Processing**: Max 256 tokens
|
200 |
- **Translation Feature**: Binary flag indicating original vs translated content
|
201 |
-
- **Language Pairs**: 21
|
202 |
- **Normalization**: None (raw TTE values in seconds)
|
203 |
- **Caching**: Audio segments cached and compressed for efficiency
|
204 |
|
|
|
19 |
|
20 |
This is a **SeamlessLanguagePairs** model that processes audio and text inputs with both translation awareness and language pair embeddings to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment, taking into account both whether the subtitle is translated and the specific language pair involved.
|
21 |
|
22 |
+
The model extends the SeamlessM4T architecture with both translation features and language pair embeddings, providing the most granular control for multilingual scenarios across **5 languages: English, French, Spanish, Italian, and German** with **21 different translation pairs** between them (e.g., EN→FR, ES→DE, IT→EN).
|
23 |
|
24 |
### Key Features
|
25 |
|
|
|
137 |
- **Output**: Single regression value (TTE in seconds)
|
138 |
- **Task**: Subtitle editing time prediction
|
139 |
|
140 |
+
## Supported Language Pairs
|
141 |
+
|
142 |
+
The model supports 21 specific translation pairs between 5 languages:
|
143 |
+
|
144 |
+
**Languages**: English (EN), French (FR), Spanish (ES), Italian (IT), German (DE)
|
145 |
+
|
146 |
+
**Translation Pairs**: Pairs are directional, so EN→FR and FR→EN are distinct pairs (other examples: ES→IT, DE→ES). The model identifies each translation direction by a language pair ID from 0 to 20; ID 21 is reserved for "other" pairs.
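As a rough illustration of the ID scheme described above, the following sketch maps a directional pair to an ID and falls back to 21 for anything it does not recognise. The README does not list which ID belongs to which pair, so the table `EXAMPLE_PAIR_IDS`, the IDs in it, and the helper name `language_pair_id` are all hypothetical; only the fallback value 21 comes from the text.

```python
# ID 21 is reserved for "other" pairs (stated in the README).
OTHER_PAIR_ID = 21

# Hypothetical, partial table: the real ID assignment is not documented here.
# Only the example pairs named in the README are included, with made-up IDs.
EXAMPLE_PAIR_IDS = {
    ("EN", "FR"): 0,
    ("FR", "EN"): 1,
    ("ES", "IT"): 2,
    ("DE", "ES"): 3,
}


def language_pair_id(source_lang, target_lang, table=EXAMPLE_PAIR_IDS):
    """Return the ID of a directional pair, or OTHER_PAIR_ID if it is unknown."""
    key = (source_lang.upper(), target_lang.upper())
    return table.get(key, OTHER_PAIR_ID)


if __name__ == "__main__":
    print(language_pair_id("en", "fr"))  # known pair in the example table
    print(language_pair_id("fr", "en"))  # the reverse direction is a separate pair
    print(language_pair_id("ja", "en"))  # unknown pair falls back to "other"
```

Keying the table on the ordered tuple `(source, target)` keeps the two directions of a pair separate, which matches the directional IDs the README describes.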
|
147 |
+
|
148 |
## Data Format
|
149 |
|
150 |
Your input data should be a list of dictionaries with:
|
|
|
201 |
|
202 |
The model was trained with the following specifications:
|
203 |
|
204 |
+
- **Dataset**: Multimodal audio-subtitle pairs with translation and language pair annotations (5 languages: EN, FR, ES, IT, DE with 21 pairs)
|
205 |
- **Train/Test Split**: 80/20 with random seed 42
|
206 |
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
|
207 |
- **Text Processing**: Max 256 tokens
|
208 |
- **Translation Feature**: Binary flag indicating original vs translated content
|
209 |
+
- **Language Pairs**: 21 translation pairs from 5 languages (EN, FR, ES, IT, DE) plus an "other" category
|
210 |
- **Normalization**: None (raw TTE values in seconds)
|
211 |
- **Caching**: Audio segments cached and compressed for efficiency
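The limits in the list above can be sketched in plain Python as follows. This is only an approximation under stated assumptions: the README gives the numbers (16 kHz, 8.0 seconds, 256 tokens, 80/20 split, seed 42) but not the code, so the helper names, the choice to truncate from the start, and the shuffle-then-cut split are all hypothetical.

```python
import random

SAMPLE_RATE_HZ = 16_000    # from the README: 16kHz sampling
MAX_AUDIO_SECONDS = 8.0    # from the README: max 8.0 seconds
MAX_TEXT_TOKENS = 256      # from the README: max 256 tokens


def clip_audio(samples, sample_rate=SAMPLE_RATE_HZ, max_seconds=MAX_AUDIO_SECONDS):
    """Keep at most max_seconds of audio, starting at sample 0 (assumed: no offset)."""
    max_samples = int(sample_rate * max_seconds)
    return samples[:max_samples]


def clip_tokens(token_ids, max_tokens=MAX_TEXT_TOKENS):
    """Truncate a token ID sequence to max_tokens (assumed: cut from the end)."""
    return token_ids[:max_tokens]


def train_test_split_80_20(items, seed=42):
    """Shuffle with a fixed seed, then cut at 80%. The actual procedure is an assumption."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    audio = [0.0] * 200_000              # 12.5 seconds of silence at 16 kHz
    print(len(clip_audio(audio)))        # number of samples kept
    train, test = train_test_split_80_20(range(10))
    print(len(train), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.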
|
212 |
|