giuseppe-tanzi committed
Commit 8525e7c · verified · 1 Parent(s): 0620c95

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +191 -0
  2. config.json +11 -0
  3. data_collator.py +88 -0
  4. example_usage.py +47 -0
  5. pytorch_model.bin +3 -0
  6. requirements.txt +8 -0
README.md ADDED
@@ -0,0 +1,191 @@
---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
pipeline_tag: audio-regression
---

# videoloc/seamless-basic

## Model Description

This is a **SeamlessBasic** model that processes audio and text inputs to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations.

### Key Features

- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Efficient Architecture**: Supports gradient checkpointing for memory-efficient fine-tuning
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model consists of the following components (an illustrative sketch follows the list):

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length to get a fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length to get a fixed-size text embedding

3. **Feature Fusion**:
   - Audio and text embeddings are concatenated (2048 total dimensions)
   - No additional cross-modal attention or complex fusion mechanisms

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)

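The forward pass is simple enough to sketch in a few lines. The following is a minimal illustration of the description above, not the shipped implementation — class, attribute, and argument names are assumptions:

```python
import torch
import torch.nn as nn

class SeamlessBasicSketch(nn.Module):
    """Illustrative restatement of the architecture above (not the repo's actual class)."""

    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module,
                 hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        # Frozen SeamlessM4T encoders (assumed to expose .config.hidden_size
        # and to return an output with .last_hidden_state)
        self.speech_encoder = speech_encoder.requires_grad_(False)
        self.text_encoder = text_encoder.requires_grad_(False)
        # Per-modality projections into a shared 1024-d space
        self.audio_proj = nn.Linear(speech_encoder.config.hidden_size, hidden_size)
        self.text_proj = nn.Linear(text_encoder.config.hidden_size, hidden_size)
        # Regression head: 2048 -> 1024 -> 512 -> 256 -> 1
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_size, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, input_features: torch.Tensor, input_ids: torch.Tensor,
                **kwargs) -> torch.Tensor:
        # Encode each modality, project, then mean-pool over the sequence dimension
        audio_emb = self.audio_proj(self.speech_encoder(input_features).last_hidden_state).mean(dim=1)
        text_emb = self.text_proj(self.text_encoder(input_ids).last_hidden_state).mean(dim=1)
        # Concatenation-only fusion followed by the MLP regression head
        return self.head(torch.cat([audio_emb, text_emb], dim=-1))  # (batch, 1), TTE in seconds
```
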
## Quick Start

### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model
model = AutoModel.from_pretrained("videoloc/seamless-basic")
config = AutoConfig.from_pretrained("videoloc/seamless-basic")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256,
    # normalization_type="none" is the default
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: no translation features needed for the basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```

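The `importlib` steps above are just a way to import `data_collator.py` directly from the downloaded repo snapshot, since the collator ships as a plain Python file rather than a pip-installable package; any equivalent import mechanism works.
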
## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16kHz, max 8.0 seconds
- **Text Input**: Max 256 tokens
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE values in seconds (optional, for training)

Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```

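If your segments come from audio files rather than in-memory arrays, a sketch along these lines (using `torchaudio`; the file path is a placeholder) produces the expected 16 kHz mono input:

```python
import torchaudio

# Load a clip and convert it to the 16 kHz mono float array the collator expects.
waveform, sr = torchaudio.load("segment.wav")  # placeholder path; returns (channels, num_samples)
waveform = waveform.mean(dim=0)                # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

data = [{'raw_audio': waveform.numpy(), 'raw_text': "Subtitle text content"}]
```
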
## Performance Metrics

- **Best Eval RMSE**: 33.34 (in seconds, since labels are raw TTE values with no normalization)

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)
- **Audio Caching**: Enabled with compression
- **Workers**: 8

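The training script itself is not part of this repo; for reference, a `transformers` Trainer configuration mirroring the hyperparameters above would look roughly like this (the output directory, model, datasets, and collator variables are placeholders):

```python
import numpy as np
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # RMSE on raw TTE values in seconds (lower is better)
    return {"rmse": float(np.sqrt(np.mean((preds.squeeze() - labels) ** 2)))}

args = TrainingArguments(
    output_dir="seamless-basic-tte",   # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
    dataloader_num_workers=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rmse",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                 # placeholders for the model, datasets,
    args=args,                   # and collator defined elsewhere
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```
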
## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency

## Usage Notes

- This is the **basic** variant - processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- Model expects 16kHz audio input; resample beforehand (the data collator assumes inputs are already 16kHz)
- Text is processed with the SeamlessM4T text encoder
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

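One caveat on normalization: the bundled collator also supports `normalization_type="log1p"` in addition to the default `"none"`. This checkpoint was trained without normalization, but if you retrain with `log1p`, predictions come out in log space and must be inverted at inference; a minimal sketch, assuming `outputs` from the Quick Start example:

```python
import torch

# Labels were transformed with log1p(tte) during training,
# so map predictions back to seconds with the inverse, expm1.
tte_seconds = torch.expm1(outputs.logits)
```
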
## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use the included data collator)

## Related Models

- **seamless-translation**: Adds translation awareness features
- **seamless-langpairs**: Includes language pair embeddings for multilingual scenarios

config.json ADDED
@@ -0,0 +1,11 @@
{
  "architectures": [
    "HFSeamlessBasic"
  ],
  "dropout_prob": 0.1,
  "hidden_size": 1024,
  "model_type": "seamless_basic",
  "seamless_model_name": "facebook/hf-seamless-m4t-medium",
  "torch_dtype": "float32",
  "transformers_version": "4.50.2"
}
data_collator.py ADDED
@@ -0,0 +1,88 @@
import torch
import numpy as np
from transformers import AutoProcessor
from typing import Dict, List, Union
import logging

logger = logging.getLogger(__name__)


class DataCollatorSimpleSeamless:
    def __init__(
        self,
        processor: str,
        sample_rate: int = 16000,
        max_audio_length_sec: float = 8.0,
        max_text_length: int = 256,
        normalization_type: str = "none",
    ):
        """Initialize the data collator.

        Args:
            processor: Name or path of the Hugging Face processor to load.
            sample_rate: Audio sample rate in Hz (inputs are assumed to match).
            max_audio_length_sec: Maximum audio length in seconds.
            max_text_length: Maximum text length in tokens.
            normalization_type: Normalization applied to labels. Options: "log1p", "none".
        """
        logger.info(f"Loading processor: {processor}")
        self.processor = AutoProcessor.from_pretrained(processor)

        self.sample_rate = sample_rate
        self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
        self.max_text_length = max_text_length
        self.normalization_type = normalization_type

    def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
        """Process a batch of raw features into model inputs."""
        # Extract raw data
        raw_audios = [item['raw_audio'] for item in batch]
        raw_texts = [item['raw_text'] for item in batch]

        raw_audios = [torch.tensor(audio) for audio in raw_audios]

        audio_inputs = self.processor(
            audios=raw_audios,
            sampling_rate=self.sample_rate,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_audio_sample_length,
        )

        text_inputs = self.processor(
            text=raw_texts,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_text_length,
        )

        # Translation flag (defaults to 0; unused by the basic model)
        is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)

        # Language pair id (defaults to 0; unused by the basic model)
        language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)

        if 'labels' in batch[0]:
            labels = [item['labels'] for item in batch]
            labels = torch.tensor(labels, dtype=torch.float32)

            # Apply normalization based on type
            if self.normalization_type == "log1p":
                labels = torch.log1p(labels)
            elif self.normalization_type == "none":
                pass
            else:
                raise ValueError(f"Unknown normalization type: {self.normalization_type}")
        else:
            labels = None

        return {
            'input_features': audio_inputs['input_features'],
            'audio_attention_mask': audio_inputs.get('attention_mask'),
            'input_ids': text_inputs['input_ids'],
            'text_attention_mask': text_inputs['attention_mask'],
            'is_translation': is_translation,
            'language_pair_id': language_pair_id,
            **({'labels': labels} if labels is not None else {}),
        }
example_usage.py ADDED
@@ -0,0 +1,47 @@
#!/usr/bin/env python3
# Example usage for videoloc/seamless-basic

from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

def load_model_and_collator():
    model = AutoModel.from_pretrained("videoloc/seamless-basic")
    config = AutoConfig.from_pretrained("videoloc/seamless-basic")

    # Load data collator
    collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
    spec = importlib.util.spec_from_file_location("data_collator", collator_file)
    collator_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(collator_module)

    data_collator = collator_module.DataCollatorSimpleSeamless(
        processor="facebook/hf-seamless-m4t-medium",
        max_audio_length_sec=8.0,
        max_text_length=256,
    )

    return model, data_collator

def example_inference():
    model, collator = load_model_and_collator()

    # Example data: audio segment + subtitle text to predict editing time
    data = [{
        'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
        'raw_text': "Hello, welcome to our presentation today.",
    }]

    batch = collator(data)
    model.eval()
    with torch.no_grad():
        outputs = model(**batch)
        tte_prediction = outputs.logits.item()

    print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
    return tte_prediction

if __name__ == "__main__":
    example_inference()
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:88d20bd96bdcb428c064083bb2e2eef54b770f03ccf8d3d60a1bb464e51c2b92
size 4857939849
requirements.txt ADDED
@@ -0,0 +1,8 @@
transformers>=4.50.2
torch>=2.6.0
torchaudio>=2.6.0
huggingface_hub>=0.33.0
numpy>=2.2.3
sentencepiece>=0.2.0
accelerate>=1.5.2
soundfile>=0.13.1