---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
license: cc-by-nc-4.0
---
# videoloc/seamless-basic
## Model Description
This is a **SeamlessBasic** model that processes audio and text inputs to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment.
The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations across 5 languages: **English, French, Spanish, Italian, and German**.
### Key Features
- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use
## Model Architecture
The model consists of the following components (a minimal PyTorch sketch follows the list):
1. **Audio Processing**:
- SeamlessM4T speech encoder (frozen) processes raw audio input
- Audio projection layer maps speech encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size audio embedding
2. **Text Processing**:
- SeamlessM4T text encoder (frozen) processes tokenized text input
- Text projection layer maps text encoder output to 1024 dimensions
- Mean pooling over sequence length to get fixed-size text embedding
3. **Feature Fusion**:
- Audio and text embeddings are concatenated (2048 total dimensions)
- No additional cross-modal attention or complex fusion mechanisms
4. **Regression Head**:
- Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
- ReLU activations and dropout for regularization
- Single output for TTE prediction (regression, in seconds)
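For illustration, the description above maps onto a compact PyTorch module. This is a minimal sketch, not the repository's `modeling_seamless_basic.py`: the layer names, the dropout rate, and the encoder attribute names are assumptions.
```python
import torch
import torch.nn as nn

class SeamlessBasicSketch(nn.Module):
    """Illustrative reconstruction of the architecture described above."""

    def __init__(self, speech_encoder, text_encoder, hidden_size=1024, dropout=0.1):
        super().__init__()
        # Frozen SeamlessM4T encoders (assumed to expose .config.hidden_size)
        self.speech_encoder = speech_encoder
        self.text_encoder = text_encoder
        for encoder in (self.speech_encoder, self.text_encoder):
            for param in encoder.parameters():
                param.requires_grad = False
        # Projections into a shared 1024-dim space
        self.audio_proj = nn.Linear(speech_encoder.config.hidden_size, hidden_size)
        self.text_proj = nn.Linear(text_encoder.config.hidden_size, hidden_size)
        # Regression head: 2048 -> 1024 -> 512 -> 256 -> 1
        dims = [2 * hidden_size, 1024, 512, 256]
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
        layers.append(nn.Linear(dims[-1], 1))
        self.head = nn.Sequential(*layers)

    def forward(self, input_features, input_ids, attention_mask=None):
        audio_states = self.speech_encoder(input_features).last_hidden_state
        text_states = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        audio_emb = self.audio_proj(audio_states).mean(dim=1)  # mean pool over time
        text_emb = self.text_proj(text_states).mean(dim=1)     # mean pool over tokens
        fused = torch.cat([audio_emb, text_emb], dim=-1)       # (batch, 2048)
        return self.head(fused).squeeze(-1)                    # TTE in seconds
```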
## Quick Start
### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```
### Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util
# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)
# Now load the model using the custom classes
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic", config=config)
# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)
# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
processor="facebook/hf-seamless-m4t-medium",
max_audio_length_sec=8.0,
max_text_length=256
)
# Prepare your data
your_data = [
{
'raw_audio': np.random.randn(16000 * 5), # 5 seconds at 16kHz
'raw_text': "Your subtitle text here",
# Note: No translation features needed for basic model
}
]
# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
outputs = model(**batch)
tte_prediction = outputs.logits.item()
print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```
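The example above feeds random noise. To score a real clip, you can load and resample it with `torchaudio`; a minimal sketch (the file path is hypothetical, and the collator can also resample for you):
```python
import torchaudio

# Load a local audio file (hypothetical path) and downmix to mono
waveform, sr = torchaudio.load("subtitle_segment.wav")
waveform = waveform.mean(dim=0)

# The model expects 16kHz; resampling up front keeps inputs explicit,
# though the data collator can also handle it
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

your_data = [{
    'raw_audio': waveform.numpy(),
    'raw_text': "Your subtitle text here",
}]
```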
## Model Details
- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16kHz
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction
## Data Format
Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE values in seconds (optional, for training)
Example:
```python
data = [
{
'raw_audio': audio_samples, # shape: (num_samples,) at 16kHz
'raw_text': "Subtitle text content",
'labels': 2.5 # optional TTE target value in seconds
}
]
```
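When labels are available, predictions can be scored directly. A minimal evaluation sketch, assuming `model` and `data_collator` are set up as in the Quick Start and the model returns a transformers-style output with `.logits`:
```python
import torch

# Collect ground-truth TTE values and batch the examples
labels = torch.tensor([ex['labels'] for ex in data], dtype=torch.float32)
batch = data_collator(data)

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.squeeze(-1)

# RMSE in seconds, matching the reported evaluation metric
rmse = torch.sqrt(torch.mean((preds - labels) ** 2))
print(f"RMSE: {rmse.item():.2f} seconds")
```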
## Performance Metrics
- **Best Eval RMSE**: 33.34 (TTE values are unnormalized, so RMSE is in seconds)
## Training Details
- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)
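For reference, these hyperparameters map onto a standard `transformers.TrainingArguments` setup. This is a sketch, not the exact training script: the output directory and the epoch-level eval/save strategies are assumptions, and early stopping is expressed via the usual `Trainer` callback.
```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./seamless-basic-tte",   # hypothetical path
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
    eval_strategy="epoch",               # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rmse",
    greater_is_better=False,
)

# Early stopping with patience 5, passed to the Trainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```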
## Training Configuration
The model was trained with the following specifications:
- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency
## Usage Notes
- This is the **basic** variant - processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- Model expects 16kHz audio input (automatically resampled by data collator)
- Text is processed with SeamlessM4T text encoder
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks
## Limitations
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)
## Related Models
- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features
- **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Adds cross-modal attention between audio and text representations