---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
license: cc-by-nc-4.0
---

# videoloc/seamless-basic

## Model Description

**SeamlessBasic** takes an audio segment and its corresponding subtitle text and predicts the **Time To Edit (TTE)** for that segment: the time, in seconds, an editor would need to correct the subtitle.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations across 5 languages: **English, French, Spanish, Italian, and German**.

### Key Features

- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model consists of the following components:

1. **Audio Processing**: 
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input  
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size text embedding

3. **Feature Fusion**:
   - Audio and text embeddings are concatenated (2048 total dimensions)
   - No additional cross-modal attention or complex fusion mechanisms

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)
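
The projection, pooling, fusion, and regression steps above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the code shipped in `modeling_seamless_basic.py`: the class name `FusionRegressionHead`, the dropout rate of 0.1, the encoder output width of 1024, and the use of a padding-aware mean are all assumptions. The frozen SeamlessM4T encoders are left out, so the inputs here stand in for their hidden states.

```python
import torch
import torch.nn as nn


class FusionRegressionHead(nn.Module):
    """Hypothetical late-fusion head: project, mean-pool, concatenate, regress."""

    def __init__(self, audio_dim=1024, text_dim=1024, proj_dim=1024, dropout=0.1):
        super().__init__()
        # Projection layers map each encoder's output to 1024 dimensions.
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.text_proj = nn.Linear(text_dim, proj_dim)
        # MLP 2048 -> 1024 -> 512 -> 256 -> 1 with ReLU and dropout.
        self.regressor = nn.Sequential(
            nn.Linear(2 * proj_dim, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    @staticmethod
    def masked_mean(hidden, mask):
        # Assumption: padded positions are excluded from the mean.
        mask = mask.unsqueeze(-1).to(hidden.dtype)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, audio_hidden, audio_mask, text_hidden, text_mask):
        audio_emb = self.masked_mean(self.audio_proj(audio_hidden), audio_mask)
        text_emb = self.masked_mean(self.text_proj(text_hidden), text_mask)
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 2048)
        return self.regressor(fused).squeeze(-1)  # (batch,) TTE in seconds


if __name__ == "__main__":
    head = FusionRegressionHead().eval()
    audio = torch.randn(2, 7, 1024)  # stand-in for speech encoder states
    text = torch.randn(2, 5, 1024)   # stand-in for text encoder states
    with torch.no_grad():
        tte = head(audio, torch.ones(2, 7), text, torch.ones(2, 5))
    print(tte.shape)
```

Because the two embeddings are only concatenated, neither modality attends to the other; any interaction between audio and text has to be learned by the MLP.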

## Quick Start

### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5).astype(np.float32),  # 5 seconds of placeholder noise at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: No translation features needed for basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()
    
print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```

## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder  
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16kHz
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text  
- `labels`: Target TTE values in seconds (optional, for training)

Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```
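
A small helper can make it harder to pass malformed records to the collator. The sketch below is not part of this repository: `make_record` is a hypothetical name, and the handling of multi-channel input (averaging a `(channels, samples)` array down to mono) and the cast to `float32` are assumptions about what a caller might want, not documented requirements of the collator. It does not resample; the audio is assumed to be at 16kHz already.

```python
import numpy as np


def make_record(audio, text, label=None):
    """Build one input dictionary in the format described above (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumption: layout is (channels, samples); average channels to mono.
        audio = audio.mean(axis=0)
    if audio.ndim != 1:
        raise ValueError(f"expected 1-D audio, got shape {audio.shape}")

    record = {"raw_audio": audio, "raw_text": str(text).strip()}
    if label is not None:
        # TTE target in seconds; only needed for training.
        record["labels"] = float(label)
    return record


if __name__ == "__main__":
    stereo = np.zeros((2, 16000 * 3))  # 3 seconds of silent stereo audio
    record = make_record(stereo, "  Subtitle text content ", label=2.5)
    print(record["raw_audio"].shape, record["raw_text"], record["labels"])
```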

## Performance Metrics

- **Best Eval RMSE**: 33.34 (in seconds, since TTE targets are not normalized)
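
For reference, RMSE is the square root of the mean squared difference between predicted and annotated TTE values, so it is expressed in the same unit as the targets. A minimal sketch of the standard formula follows; the function name `rmse` and the sample values are illustrative, not taken from the evaluation code or data.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    predicted_tte = [12.0, 40.0, 7.5]  # made-up predictions, in seconds
    annotated_tte = [10.0, 55.0, 7.5]  # made-up annotations, in seconds
    print(f"RMSE: {rmse(predicted_tte, annotated_tte):.2f} seconds")
```

Squaring the errors means a few segments with very long editing times can dominate the score.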

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)
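
The hyperparameters above map naturally onto keyword arguments of `transformers.TrainingArguments`. The dictionary below is a sketch of that mapping, not the training script used for this model: treating the listed batch sizes as per-device values is an assumption, and `output_dir` is a hypothetical path. Early stopping with patience 5 would be configured separately, for example with `transformers.EarlyStoppingCallback(early_stopping_patience=5)`.

```python
# Keyword arguments mirroring the list above, intended for
# transformers.TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "./seamless-basic-run",   # hypothetical path
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,      # assumption: listed size is per device
    "per_device_eval_batch_size": 64,       # assumption: listed size is per device
    "learning_rate": 1.2e-4,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.05,
    "weight_decay": 0.001,
    "optim": "adamw_torch",
    "max_grad_norm": 1.0,
    "fp16": True,                           # requires a CUDA-capable device
    "seed": 42,
}

if __name__ == "__main__":
    for name, value in training_kwargs.items():
        print(f"{name} = {value}")
```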

## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency
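
Two of the preprocessing choices above, the 8.0-second audio cap with no offset and the seeded 80/20 split, are sketched below with NumPy. Both functions are hypothetical illustrations: the original pipeline may shuffle, stratify, or truncate differently, so the same seed will not necessarily reproduce the original split.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz
MAX_AUDIO_SEC = 8.0    # seconds


def truncate_audio(audio, max_sec=MAX_AUDIO_SEC, sample_rate=SAMPLE_RATE):
    """Keep at most max_sec seconds, counted from the first sample (no offset)."""
    max_samples = int(max_sec * sample_rate)
    return audio[:max_samples]


def split_indices(n_items, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]


if __name__ == "__main__":
    clip = np.zeros(SAMPLE_RATE * 10, dtype=np.float32)  # 10-second clip
    print(len(truncate_audio(clip)) / SAMPLE_RATE, "seconds kept")
    train_idx, test_idx = split_indices(100)
    print(len(train_idx), "train /", len(test_idx), "test")
```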

## Usage Notes

- This is the **basic** variant - processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- Model expects 16kHz audio input (automatically resampled by data collator)
- Text is processed with SeamlessM4T text encoder
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)

## Related Models

- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features
- **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions