giuseppe-tanzi committed on
Commit 0bc59cd · verified · 1 parent: a580541

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +232 -0
  2. config.json +12 -0
  3. data_collator.py +88 -0
  4. example_usage.py +49 -0
  5. pytorch_model.bin +3 -0
  6. requirements.txt +8 -0
README.md ADDED
@@ -0,0 +1,232 @@
+ ---
+ language:
+ - multilingual
+ tags:
+ - audio
+ - text
+ - multimodal
+ - seamless
+ - subtitle-editing-time-prediction
+ - translation-aware
+ - language-pairs
+ license: apache-2.0
+ library_name: transformers
+ base_model: facebook/hf-seamless-m4t-medium
+ ---
+
+ # videoloc/seamless-langpairs
+
+ ## Model Description
+
+ This is a **SeamlessLanguagePairs** model that processes audio and text inputs with both translation awareness and language pair embeddings to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) an editor would need to refine that subtitle segment, taking into account both whether the subtitle is translated and the specific language pair involved.
+
+ The model extends the SeamlessM4T architecture with both translation features and language pair embeddings, providing the most granular control of the three released variants for multilingual video localization, with support for 21 language pairs.
+
+ ### Key Features
+
+ - **Language Pair Embeddings**: Fine-grained control for 21 language pairs plus "other"
+ - **Translation-Aware Processing**: Distinguishes between original and translated content
+ - **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
+ - **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
+ - **Enhanced Architecture**: Adds both translation and language pair embeddings
+ - **TTE Prediction**: Predicts editing time required for subtitle segments
+ - **Direct Output**: Raw time values in seconds for immediate use
+
+ ## Model Architecture
+
+ The model extends the basic SeamlessM4T architecture with both translation and language pair awareness (a sketch of steps 3-6 follows this list):
+
+ 1. **Audio Processing**:
+    - SeamlessM4T speech encoder (frozen) processes raw audio input
+    - Audio projection layer maps speech encoder output to 1024 dimensions
+    - Mean pooling over sequence length to get a fixed-size audio embedding
+
+ 2. **Text Processing**:
+    - SeamlessM4T text encoder (frozen) processes tokenized text input
+    - Text projection layer maps text encoder output to 1024 dimensions
+    - Mean pooling over sequence length to get a fixed-size text embedding
+
+ 3. **Translation Feature Processing**:
+    - Binary translation flag (0/1) indicating original vs. translated content
+    - Translation projection layer maps the binary input to 32 dimensions
+    - Learned embedding helps the model distinguish translation effects
+
+ 4. **Language Pair Processing**:
+    - Categorical language pair ID (0-20) for specific language combinations
+    - Language pair embedding layer maps IDs to 64-dimensional vectors
+    - Captures language-specific temporal alignment patterns
+
+ 5. **Feature Fusion**:
+    - Audio, text, translation, and language pair embeddings are concatenated (2144 total dimensions)
+    - Simple concatenation without complex cross-modal interactions
+
+ 6. **Regression Head**:
+    - Multi-layer perceptron: 2144 → 1024 → 512 → 256 → 1
+    - ReLU activations and dropout for regularization
+    - Single output for TTE prediction (regression, in seconds)
+
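+ To make the fusion concrete, here is a minimal PyTorch sketch of steps 3-6. It is illustrative only: the class and argument names are hypothetical, and the frozen SeamlessM4T encoders and projections from steps 1-2 are assumed to have already produced the pooled 1024-dimensional embeddings.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class FusionRegressionHead(nn.Module):
+     """Illustrative sketch of the fusion and regression stages (names hypothetical)."""
+     def __init__(self, hidden_size=1024, translation_dim=32,
+                  lang_pair_dim=64, num_language_pairs=21, dropout=0.1):
+         super().__init__()
+         self.translation_proj = nn.Linear(1, translation_dim)  # step 3
+         self.lang_pair_embedding = nn.Embedding(num_language_pairs, lang_pair_dim)  # step 4
+         fusion_dim = 2 * hidden_size + translation_dim + lang_pair_dim  # 2144
+         self.regressor = nn.Sequential(  # step 6: 2144 -> 1024 -> 512 -> 256 -> 1
+             nn.Linear(fusion_dim, 1024), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(256, 1),
+         )
+
+     def forward(self, audio_emb, text_emb, is_translation, language_pair_id):
+         # audio_emb, text_emb: mean-pooled encoder outputs, shape (batch, 1024)
+         trans_emb = self.translation_proj(is_translation.unsqueeze(-1))  # (batch, 32)
+         pair_emb = self.lang_pair_embedding(language_pair_id)            # (batch, 64)
+         fused = torch.cat([audio_emb, text_emb, trans_emb, pair_emb], dim=-1)  # step 5
+         return self.regressor(fused).squeeze(-1)  # TTE in seconds
+ ```
+
+ For a batch of two segments, `FusionRegressionHead()(torch.randn(2, 1024), torch.randn(2, 1024), torch.tensor([0., 1.]), torch.tensor([3, 7]))` returns two TTE estimates.
+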
69
+
70
+ ### Installation
71
+ ```bash
72
+ pip install transformers torch torchaudio huggingface_hub
73
+ ```
74
+
75
+ ### Basic Usage
76
+ ```python
77
+ from transformers import AutoModel, AutoConfig
78
+ from huggingface_hub import hf_hub_download
79
+ import torch
80
+ import numpy as np
81
+ import importlib.util
82
+
83
+ # Load model
84
+ model = AutoModel.from_pretrained("videoloc/seamless-langpairs")
85
+ config = AutoConfig.from_pretrained("videoloc/seamless-langpairs")
86
+
87
+ # Load the data collator (included in this repo)
88
+ collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
89
+ spec = importlib.util.spec_from_file_location("data_collator", collator_file)
90
+ collator_module = importlib.util.module_from_spec(spec)
91
+ spec.loader.exec_module(collator_module)
92
+
93
+ # Initialize data collator
94
+ data_collator = collator_module.DataCollatorSimpleSeamless(
95
+ processor="facebook/hf-seamless-m4t-medium",
96
+ max_audio_length_sec=8.0,
97
+ max_text_length=256
98
+ )
99
+
100
+ # Prepare your data with translation and language pair information
101
+ your_data = [
102
+ {
103
+ 'raw_audio': np.random.randn(16000 * 5), # 5 seconds at 16kHz
104
+ 'raw_text': "Your subtitle text here",
105
+ 'is_translation': 1, # 1 for translated content, 0 for original
106
+ 'language_pair_id': 5, # 0-20 for specific language pairs
107
+ }
108
+ ]
109
+
110
+ # Process and run inference
111
+ batch = data_collator(your_data)
112
+ model.eval()
113
+ with torch.no_grad():
114
+ outputs = model(**batch)
115
+ tte_prediction = outputs.logits.item()
116
+
117
+ print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
118
+ ```
+
+ ## Model Details
+
+ - **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
+ - **Audio Encoder**: Frozen SeamlessM4T speech encoder
+ - **Text Encoder**: Frozen SeamlessM4T text encoder
+ - **Hidden Size**: 1024
+ - **Translation Embedding**: 32 dimensions
+ - **Language Pair Embedding**: 64 dimensions
+ - **Number of Language Pairs**: 21 (plus "other")
+ - **Audio Input**: 16kHz, max 8.0 seconds
+ - **Text Input**: Max 256 tokens
+ - **Translation Input**: Binary flag (0/1)
+ - **Language Pair Input**: Categorical ID (0-20)
+ - **Output**: Single regression value (TTE in seconds)
+ - **Task**: Subtitle editing time prediction
+
+ ## Data Format
+
+ Your input data should be a list of dictionaries with:
+ - `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
+ - `raw_text`: String of subtitle text
+ - `is_translation`: Binary flag (1 for translated, 0 for original content)
+ - `language_pair_id`: Integer ID (0-20) for the specific language pair
+ - `labels`: Target TTE value in seconds (optional, for training)
+
+ Example:
+ ```python
+ data = [
+     {
+         'raw_audio': audio_samples,    # shape: (num_samples,) at 16kHz
+         'raw_text': "Subtitle text content",
+         'is_translation': 1,    # 1 = translated, 0 = original
+         'language_pair_id': 5,  # 0-20 for language pairs
+         'labels': 2.5           # optional TTE target value in seconds
+     }
+ ]
+ ```
+
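+ For reference, the collated batch exposes exactly the tensors the model's `forward` consumes (see the included `data_collator.py`); a quick way to inspect them:
+
+ ```python
+ batch = data_collator(data)
+ # Keys: input_features, audio_attention_mask, input_ids,
+ # text_attention_mask, is_translation, language_pair_id, labels (when provided)
+ print({k: tuple(v.shape) for k, v in batch.items() if v is not None})
+ ```
+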
+ ## Performance Metrics
+
+ - **Best Eval RMSE**: 33.34 seconds
+
+ ## Training Details
+
+ - **Base Model**: facebook/hf-seamless-m4t-medium
+ - **Model Type**: seamless_language_pairs
+ - **Epochs**: 10
+ - **Batch Size (Train)**: 32
+ - **Batch Size (Eval)**: 64
+ - **Learning Rate**: 1.2e-4
+ - **LR Scheduler**: cosine_with_restarts
+ - **Warmup Ratio**: 0.05
+ - **Weight Decay**: 0.001
+ - **Optimizer**: AdamW (torch)
+ - **Max Grad Norm**: 1.0
+ - **FP16**: True
+ - **Early Stopping Patience**: 5
+ - **Audio Max Length**: 8.0 seconds
+ - **Text Max Length**: 256 tokens
+ - **Sample Rate**: 16kHz
+ - **Translation Feature**: Binary flag (0/1)
+ - **Language Pairs**: 21 pairs + other
+ - **Language Pair Embedding**: 64 dimensions
+ - **Normalization**: None (raw values)
+ - **Dataset Split**: 80/20 train/test
+ - **Random Seed**: 42
+ - **Metric**: RMSE (lower is better)
+
+ ## Training Configuration
+
+ The model was trained with the following specifications (a rough `TrainingArguments` sketch follows this list):
+
+ - **Dataset**: Multimodal audio-subtitle pairs with translation and language pair annotations
+ - **Train/Test Split**: 80/20 with random seed 42
+ - **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
+ - **Text Processing**: Max 256 tokens
+ - **Translation Feature**: Binary flag indicating original vs. translated content
+ - **Language Pairs**: 21 most frequent language pairs plus "other" category
+ - **Normalization**: None (raw TTE values in seconds)
+ - **Caching**: Audio segments cached and compressed for efficiency
+
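+ As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` as sketched below. This is an approximation, not the original training script: the `output_dir`, evaluation cadence, and metric key are assumptions.
+
+ ```python
+ from transformers import TrainingArguments, EarlyStoppingCallback
+
+ training_args = TrainingArguments(
+     output_dir="./seamless-langpairs",   # assumed
+     num_train_epochs=10,
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=64,
+     learning_rate=1.2e-4,
+     lr_scheduler_type="cosine_with_restarts",
+     warmup_ratio=0.05,
+     weight_decay=0.001,
+     optim="adamw_torch",
+     max_grad_norm=1.0,
+     fp16=True,
+     seed=42,
+     eval_strategy="epoch",               # assumed cadence
+     load_best_model_at_end=True,
+     metric_for_best_model="rmse",        # assumed metric key
+     greater_is_better=False,
+ )
+ early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
+ ```
+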
+ ## Language Pairs Supported
+
+ The model supports embeddings for 21 language pairs (IDs 0-20). The exact ID-to-pair mapping is defined by the training data, but typically includes popular combinations like the following (a hypothetical mapping sketch follows this list):
+ - English ↔ Spanish, French, German, Italian, Portuguese
+ - Cross-European language pairs
+ - English ↔ Asian languages (Chinese, Japanese, Korean)
+ - Other high-frequency translation pairs in the training dataset
+
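+ The mapping itself is not distributed with the model. Purely to illustrate the expected structure, a hypothetical mapping (the pairs and IDs below are invented, not the real ones):
+
+ ```python
+ # Hypothetical mapping for illustration only -- the real IDs are defined
+ # by the training data and are not shipped with this model.
+ LANGUAGE_PAIR_IDS = {
+     "en-es": 0,
+     "en-fr": 1,
+     "en-de": 2,
+     # ... remaining high-frequency pairs ...
+     "other": 20,  # catch-all bucket for infrequent pairs
+ }
+
+ def encode_language_pair(source: str, target: str) -> int:
+     """Map a (source, target) pair to its embedding ID, falling back to 'other'."""
+     return LANGUAGE_PAIR_IDS.get(f"{source}-{target}", LANGUAGE_PAIR_IDS["other"])
+ ```
+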
+ ## Usage Notes
+
+ - This is the **most advanced** variant, with both translation and language pair features
+ - For simpler models, see `seamless-basic` (audio+text only) or `seamless-translation` (with translation flag)
+ - Model expects 16kHz mono audio input; resample beforehand if needed (see the sketch after this list) - the included data collator assumes this rate rather than resampling
+ - Both the translation flag and the language pair ID significantly impact predictions
+ - Language pair embeddings capture language-specific temporal patterns
+ - No feature normalization applied - outputs are raw TTE predictions in seconds
+ - Optimized for fine-grained subtitle editing time estimation tasks
+
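+ A minimal preprocessing sketch for the resampling note above, using `torchaudio` (the file path is a placeholder):
+
+ ```python
+ import torchaudio
+
+ waveform, orig_sr = torchaudio.load("segment.wav")  # placeholder path
+ if orig_sr != 16000:
+     waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)
+ raw_audio = waveform.mean(dim=0).numpy()  # downmix to mono, shape (num_samples,)
+ ```
+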
+ ## Limitations
+
+ - Maximum audio length: 8.0 seconds
+ - Maximum text length: 256 tokens
+ - Requires both translation and language pair annotations in training data
+ - Language pair embeddings are dataset-specific (top 21 pairs from training)
+ - Designed for TTE prediction, not general audio-text matching
+ - Performance may vary on out-of-domain content and unseen language pairs
+ - Requires specific data preprocessing (use the included data collator)
+
+ ## Related Models
+
+ - **seamless-basic**: Basic audio+text model without translation or language features
+ - **seamless-translation**: Includes translation awareness but no language pair embeddings
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "architectures": [
+     "HFSeamlessLanguagePairs"
+   ],
+   "dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "model_type": "seamless_language_pairs",
+   "num_language_pairs": 21,
+   "seamless_model_name": "facebook/hf-seamless-m4t-medium",
+   "torch_dtype": "float32",
+   "transformers_version": "4.50.2"
+ }
data_collator.py ADDED
@@ -0,0 +1,88 @@
+ import torch
+ import numpy as np
+ from transformers import AutoProcessor
+ from typing import Dict, List, Union
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ class DataCollatorSimpleSeamless:
+     def __init__(
+         self,
+         processor: str,
+         sample_rate: int = 16000,
+         max_audio_length_sec: float = 8.0,
+         max_text_length: int = 256,
+         normalization_type: str = "none"
+     ):
+         """Initialize the data collator.
+
+         Args:
+             processor: The processor to use.
+             sample_rate: Audio sample rate.
+             max_audio_length_sec: Maximum audio length in seconds.
+             max_text_length: Maximum text length.
+             normalization_type: Type of normalization to apply to labels. Options: "log1p", "none"
+         """
+         logger.info(f"Loading processor: {processor}")
+         self.processor = AutoProcessor.from_pretrained(processor)
+
+         self.sample_rate = sample_rate
+         self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
+         self.max_text_length = max_text_length
+         self.normalization_type = normalization_type
+
+     def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
+         """Process a batch of raw features into model inputs."""
+         # Extract raw data
+         raw_audios = [item['raw_audio'] for item in batch]
+         raw_texts = [item['raw_text'] for item in batch]
+
+         raw_audios = [torch.tensor(audio) for audio in raw_audios]
+
+         audio_inputs = self.processor(
+             audios=raw_audios,
+             sampling_rate=self.sample_rate,
+             return_tensors="pt",
+             padding="longest",
+             truncation=True,
+             max_length=self.max_audio_sample_length,
+         )
+
+         text_inputs = self.processor(
+             text=raw_texts,
+             return_tensors="pt",
+             padding="longest",
+             truncation=True,
+             max_length=self.max_text_length,
+         )
+
+         # Extract translation features
+         is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)
+
+         # Extract language pair features
+         language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)
+
+         if 'labels' in batch[0]:
+             labels = [item['labels'] for item in batch]
+             labels = torch.tensor(labels, dtype=torch.float32)
+
+             # Apply normalization based on type
+             if self.normalization_type == "log1p":
+                 # Labels are log1p-transformed; invert predictions with torch.expm1
+                 labels = torch.log1p(labels)
+             elif self.normalization_type == "none":
+                 pass
+             else:
+                 raise ValueError(f"Unknown normalization type: {self.normalization_type}")
+         else:
+             labels = None
+
+         return {
+             'input_features': audio_inputs['input_features'],
+             'audio_attention_mask': audio_inputs.get('attention_mask'),
+             'input_ids': text_inputs['input_ids'],
+             'text_attention_mask': text_inputs['attention_mask'],
+             'is_translation': is_translation,
+             'language_pair_id': language_pair_id,
+             **({'labels': labels} if labels is not None else {})
+         }
example_usage.py ADDED
@@ -0,0 +1,49 @@
+ #!/usr/bin/env python3
+ # Example usage for videoloc/seamless-langpairs
+
+ from transformers import AutoModel, AutoConfig
+ from huggingface_hub import hf_hub_download
+ import torch
+ import numpy as np
+ import importlib.util
+
+ def load_model_and_collator():
+     model = AutoModel.from_pretrained("videoloc/seamless-langpairs")
+     config = AutoConfig.from_pretrained("videoloc/seamless-langpairs")
+
+     # Load data collator
+     collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
+     spec = importlib.util.spec_from_file_location("data_collator", collator_file)
+     collator_module = importlib.util.module_from_spec(spec)
+     spec.loader.exec_module(collator_module)
+
+     data_collator = collator_module.DataCollatorSimpleSeamless(
+         processor="facebook/hf-seamless-m4t-medium",
+         max_audio_length_sec=8.0,
+         max_text_length=256
+     )
+
+     return model, data_collator
+
+ def example_inference():
+     model, collator = load_model_and_collator()
+
+     # Example data with translation and language pair awareness
+     data = [{
+         'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
+         'raw_text': "Example subtitle text for temporal alignment",
+         'is_translation': 1,    # 1 for translated content, 0 for original
+         'language_pair_id': 5,  # 0-20 for specific language pairs
+     }]
+
+     batch = collator(data)
+     model.eval()
+     with torch.no_grad():
+         outputs = model(**batch)
+         tte_prediction = outputs.logits.item()
+
+     print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
+     return tte_prediction
+
+ if __name__ == "__main__":
+     example_inference()
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e3037a762e659d5e3acaf60ecdd58a76aea92fc01b50f1cb70fb200b802e2a6
+ size 4858339608
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ transformers>=4.50.2
+ torch>=2.6.0
+ torchaudio>=2.6.0
+ huggingface_hub>=0.33.0
+ numpy>=2.2.3
+ sentencepiece>=0.2.0
+ accelerate>=1.5.2
+ soundfile>=0.13.1