---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
license: cc-by-nc-4.0
---

# videoloc/seamless-basic

## Model Description

This is a **SeamlessBasic** model that takes audio and text as input and predicts the **Time To Edit (TTE)** of subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) an editor would need to edit or refine that segment.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset of audio-subtitle pairs annotated with editing times in 5 languages: **English, French, Spanish, Italian, and German**.

### Key Features

- **Multimodal Processing**: Simultaneously processes audio (16 kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model consists of the following components:

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size text embedding

3. **Feature Fusion**:
   - Audio and text embeddings are concatenated (2048 total dimensions)
   - No additional cross-modal attention or complex fusion mechanisms

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)

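The data flow described above can be traced with a small shape walkthrough. This is a minimal NumPy sketch, not the repository's implementation: the weights are random, the frozen encoders are replaced by random stand-in tensors, and the encoder output width (`ENC_DIM`), the absence of an activation on the output layer, and unmasked mean pooling are all assumptions. Dropout is left out because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

ENC_DIM = 1024   # assumed encoder output width (not stated in this card)
PROJ_DIM = 1024  # projection width, from the component list above


def linear(in_dim, out_dim):
    """Return randomly initialised (weight, bias) for a dense layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.02, np.zeros(out_dim)


def embed(encoder_states, proj):
    """Project encoder states to PROJ_DIM, then mean-pool over the sequence axis."""
    weight, bias = proj
    projected = encoder_states @ weight + bias   # (batch, seq_len, PROJ_DIM)
    return projected.mean(axis=1)                # (batch, PROJ_DIM)


def regression_head(fused, layers):
    """MLP 2048 -> 1024 -> 512 -> 256 -> 1 with ReLU between layers."""
    hidden = fused
    for i, (weight, bias) in enumerate(layers):
        hidden = hidden @ weight + bias
        if i < len(layers) - 1:          # assumption: no activation on the output
            hidden = np.maximum(hidden, 0.0)
    return hidden.squeeze(-1)            # (batch,)


audio_proj = linear(ENC_DIM, PROJ_DIM)
text_proj = linear(ENC_DIM, PROJ_DIM)
head = [linear(2048, 1024), linear(1024, 512), linear(512, 256), linear(256, 1)]

# Stand-ins for frozen encoder outputs: 2 examples, 50 audio frames, 12 tokens.
audio_states = rng.standard_normal((2, 50, ENC_DIM))
text_states = rng.standard_normal((2, 12, ENC_DIM))

# Fusion is plain concatenation of the two pooled embeddings.
fused = np.concatenate(
    [embed(audio_states, audio_proj), embed(text_states, text_proj)], axis=-1
)
tte = regression_head(fused, head)
print(fused.shape, tte.shape)  # (2, 2048) (2,)
```

Because pooling happens before fusion, the two modalities can have different sequence lengths and still yield a fixed 2048-dimensional vector per example.
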
## Quick Start

### Installation

```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage

```python
import importlib.util

import numpy as np
import torch
from huggingface_hub import hf_hub_download

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: No translation features needed for basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    # .item() only works because the batch holds a single example
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```

## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16 kHz
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with:

- `raw_audio`: NumPy array of audio samples (16 kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE values in seconds (optional, for training)

Example:

```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```

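If your audio does not already arrive as a one-dimensional 16 kHz array, a small wrapper can put it into the format above. The sketch below is illustrative only: `make_example` is a hypothetical helper, not part of this repository, and it assumes multi-channel audio is laid out as `(channels, num_samples)`. It downmixes to mono, casts to `float32`, and trims to the 8-second maximum used in training; the included collator may already handle some of these steps itself.

```python
import numpy as np

SAMPLE_RATE = 16_000   # from this card
MAX_AUDIO_SEC = 8.0    # from this card


def make_example(audio, text, label=None):
    """Build one input dict with the keys described above (hypothetical helper)."""
    samples = np.asarray(audio, dtype=np.float32)
    if samples.ndim == 2:                       # (channels, num_samples) -> mono
        samples = samples.mean(axis=0)
    if samples.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {samples.shape}")
    samples = samples[: int(SAMPLE_RATE * MAX_AUDIO_SEC)]

    example = {"raw_audio": samples, "raw_text": str(text)}
    if label is not None:                       # labels are optional at inference
        example["labels"] = float(label)
    return example


stereo = np.zeros((2, SAMPLE_RATE * 10))        # 10 s of silent stereo audio
example = make_example(stereo, "Subtitle text content", label=2.5)
print(example["raw_audio"].shape, example["labels"])  # (128000,) 2.5
```
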
## Performance Metrics

- **Best Eval RMSE**: 33.34 seconds (targets are raw, unnormalized TTE values)

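For reference, RMSE is the square root of the mean squared difference between predicted and annotated TTE, so it is expressed in the same unit as the targets (seconds here, since the card reports raw TTE values). A minimal sketch of the metric follows; the numbers are toy values, not taken from the evaluation set.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy predictions and targets in seconds.
print(rmse([10.0, 20.0, 30.0], [12.0, 20.0, 26.0]))
```
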
## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16 kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)

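These hyperparameters map naturally onto the keyword arguments of `transformers.TrainingArguments`. The sketch below assumes the model was trained with the Hugging Face `Trainer`, which this card does not state explicitly; the per-device interpretation of the batch sizes, the `"rmse"` metric key, and the `output_dir` value are likewise assumptions.

```python
# Keyword names follow transformers.TrainingArguments; values come from the list above.
training_kwargs = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,   # assumes the listed size is per device
    "per_device_eval_batch_size": 64,
    "learning_rate": 1.2e-4,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.05,
    "weight_decay": 0.001,
    "optim": "adamw_torch",
    "max_grad_norm": 1.0,
    "fp16": True,
    "seed": 42,
    "metric_for_best_model": "rmse",     # assumed metric key
    "greater_is_better": False,          # RMSE: lower is better
    "load_best_model_at_end": True,      # required by EarlyStoppingCallback
}
EARLY_STOPPING_PATIENCE = 5

# With transformers installed, these could be used as follows
# (`eval_strategy` is called `evaluation_strategy` in older releases):
#   from transformers import TrainingArguments, EarlyStoppingCallback
#   args = TrainingArguments(output_dir="out", eval_strategy="epoch",
#                            save_strategy="epoch", **training_kwargs)
#   callbacks = [EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)]
```
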
## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16 kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency

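A seeded 80/20 split can be reproduced in a few lines. The card does not say how the split was actually drawn (for example, whether it was stratified by language), so the following is only a sketch of one straightforward approach; `seeded_split` is a hypothetical helper, and it will not recreate the exact partition used for training.

```python
import random


def seeded_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of `items` with a fixed seed and split it into train/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = seeded_split(range(100))
print(len(train), len(test))  # 80 20
```

Using a dedicated `random.Random(seed)` instance keeps the split deterministic without touching the global random state.
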
## Usage Notes

- This is the **basic** variant: it processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- The model expects 16 kHz audio input (automatically resampled by the data collator)
- Text is processed with the SeamlessM4T text encoder
- No feature normalization is applied; the model outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)

## Related Models

- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features
- **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions