---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
- translation-aware
- language-pairs
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
---

# videoloc/seamless-langpairs

## Model Description

This is a **SeamlessLanguagePairs** model that processes audio and text inputs with both translation awareness and language pair embeddings to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment, taking into account both whether the subtitle is translated and the specific language pair involved.

The model extends the SeamlessM4T architecture with both translation features and language pair embeddings, providing the most granular control for multilingual scenarios across **5 languages: English, French, Spanish, Italian, and German** with **21 different translation pairs** between them (e.g., EN→FR, ES→DE, IT→EN, etc.).

### Key Features

- **Language Pair Embeddings**: Fine-grained control for 21 language pairs plus an "other" category
- **Translation-Aware Processing**: Distinguishes between original and translated content
- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **Enhanced Architecture**: Adds both translation and language pair embeddings
- **TTE Prediction**: Predicts the editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model extends the basic SeamlessM4T architecture with both translation and language pair awareness (a minimal sketch of the fusion and regression head follows this list):

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length yields a fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length yields a fixed-size text embedding

3. **Translation Feature Processing**:
   - Binary translation flag (0/1) indicating original vs. translated content
   - Translation projection layer maps the binary input to 32 dimensions
   - The learned embedding helps the model distinguish translation effects

4. **Language Pair Processing**:
   - Categorical language pair ID (0-20 for specific pairs, 21 for "other")
   - Language pair embedding layer maps IDs to 64-dimensional vectors
   - Captures language-specific temporal alignment patterns

5. **Feature Fusion**:
   - Audio, text, translation, and language pair embeddings are concatenated (1024 + 1024 + 32 + 64 = 2144 dimensions)
   - Simple concatenation without complex cross-modal interactions

6. **Regression Head**:
   - Multi-layer perceptron: 2144 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)
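
For orientation, the fusion and regression head can be pictured as the following PyTorch sketch. This is an illustration only: the module names, the dropout rate, and the 22-entry embedding table (21 pairs plus "other") are assumptions made here; the authoritative implementation is `modeling_seamless_langpairs.py` in this repo.

```python
import torch
import torch.nn as nn

class FusionRegressionSketch(nn.Module):
    """Illustrative sketch of the fusion + regression head described above.
    Names and dropout rate are assumptions, not the repo's actual code."""

    def __init__(self, hidden=1024, trans_dim=32, pair_dim=64,
                 num_pairs=22, dropout=0.1):  # 22 = 21 pairs + "other" (assumption)
        super().__init__()
        self.translation_proj = nn.Linear(1, trans_dim)           # binary flag -> 32 dims
        self.pair_embedding = nn.Embedding(num_pairs, pair_dim)   # pair ID -> 64 dims
        dims = [2 * hidden + trans_dim + pair_dim, 1024, 512, 256]  # 2144 -> 1024 -> 512 -> 256
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
        layers.append(nn.Linear(dims[-1], 1))                     # final TTE regressor
        self.head = nn.Sequential(*layers)

    def forward(self, audio_emb, text_emb, is_translation, language_pair_id):
        # audio_emb, text_emb: (batch, 1024) mean-pooled projections of the
        # frozen SeamlessM4T encoder outputs
        t = self.translation_proj(is_translation.float().unsqueeze(-1))
        p = self.pair_embedding(language_pair_id)
        fused = torch.cat([audio_emb, text_emb, t, p], dim=-1)    # (batch, 2144)
        return self.head(fused).squeeze(-1)                       # TTE in seconds

# Smoke test with random embeddings
head = FusionRegressionSketch()
print(head(torch.randn(2, 1024), torch.randn(2, 1024),
           torch.tensor([0, 1]), torch.tensor([3, 21])).shape)  # torch.Size([2])
```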

## Quick Start

### Installation

```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage

```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="modeling_seamless_langpairs.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_langpairs", model_file)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessLanguagePairsConfig.from_pretrained("videoloc/seamless-langpairs")
model = modeling_module.HFSeamlessLanguagePairs.from_pretrained("videoloc/seamless-langpairs")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize the data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data with translation and language pair information
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        'is_translation': 1,    # 1 for translated content, 0 for original
        'language_pair_id': 5,  # 0-20 for specific language pairs
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```
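
The collator accepts a list of segments, so several subtitle segments can be scored in one pass. A sketch continuing from the snippet above (the audio arrays, texts, and IDs here are placeholders):

```python
# Batched inference: score several segments at once (placeholder data).
segments = [
    {'raw_audio': np.random.randn(16000 * 3), 'raw_text': "First subtitle line",
     'is_translation': 0, 'language_pair_id': 2},
    {'raw_audio': np.random.randn(16000 * 6), 'raw_text': "Second subtitle line",
     'is_translation': 1, 'language_pair_id': 5},
]

batch = data_collator(segments)
with torch.no_grad():
    preds = model(**batch).logits.squeeze(-1)  # one TTE estimate per segment

for seg, tte in zip(segments, preds.tolist()):
    print(f"{seg['raw_text']!r}: predicted TTE {tte:.2f} s")
```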

## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Translation Embedding**: 32 dimensions
- **Language Pair Embedding**: 64 dimensions
- **Number of Language Pairs**: 21 (plus "other")
- **Audio Input**: 16kHz
- **Translation Input**: Binary flag (0/1)
- **Language Pair Input**: Categorical ID (0-20 for specific pairs, 21 for "other")
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Supported Language Pairs

The model supports 21 specific translation pairs between 5 languages:

**Languages**: English (EN), French (FR), Spanish (ES), Italian (IT), German (DE)

**Translation Pairs**: The pairs are directional combinations of the 5 languages (e.g., EN→FR, FR→EN, ES→IT, DE→ES). The model uses language pair IDs 0-20 to identify specific translation directions, with ID 21 reserved for "other" pairs.

## Data Format

Your input data should be a list of dictionaries with:

- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `is_translation`: Binary flag (1 for translated, 0 for original content)
- `language_pair_id`: Integer ID (0-20 for specific pairs, 21 for "other")
- `labels`: Target TTE value in seconds (optional, for training)

Example:

```python
data = [
    {
        'raw_audio': audio_samples,       # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'is_translation': 1,              # 1 = translated, 0 = original
        'language_pair_id': 5,            # 0-20 for language pairs
        'labels': 2.5                     # optional TTE target value in seconds
    }
]
```
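
The data collator resamples audio automatically (see Usage Notes), but if you prepare `raw_audio` arrays yourself, a sketch along these lines (using torchaudio; the file path is hypothetical) produces a suitable input:

```python
import torchaudio

# Hypothetical input file; optional preprocessing since the collator
# also resamples automatically.
waveform, sr = torchaudio.load("segment.wav")
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
raw_audio = waveform.squeeze(0).numpy()            # shape: (num_samples,) at 16kHz
```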

## Performance Metrics

- **Best Eval RMSE**: 33.34
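
Because TTE targets are unnormalized (see Training Details), this RMSE is in seconds. For reference, the metric is the usual root-mean-square error:

```python
import numpy as np

def rmse(preds, targets):
    """Root-mean-square error; both arrays hold TTE values in seconds."""
    preds, targets = np.asarray(preds), np.asarray(targets)
    return float(np.sqrt(np.mean((preds - targets) ** 2)))
```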

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Model Type**: seamless_lang_pairs
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Translation Feature**: Binary flag (0/1)
- **Language Pairs**: 21 pairs + "other"
- **Language Pair Embedding**: 64 dimensions
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)
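
As a rough reconstruction (not the original training script), these settings map onto `transformers.TrainingArguments` as follows; the eval/save strategies and output path are assumptions needed for early stopping to work:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="seamless-langpairs",          # hypothetical path
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
    eval_strategy="epoch",                    # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,              # assumption: required for early stopping
    metric_for_best_model="rmse",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```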

## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with translation and language pair annotations (5 languages: EN, FR, ES, IT, DE; 21 pairs)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Translation Feature**: Binary flag indicating original vs. translated content
- **Language Pairs**: 21 translation pairs from 5 languages (EN, FR, ES, IT, DE) plus an "other" category
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency
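
The 80/20 split with seed 42 corresponds to something like the following with the `datasets` library; the `examples` list here is a stand-in in the Data Format shown earlier:

```python
from datasets import Dataset
import numpy as np

# Stand-in examples in the Data Format shown above
examples = [
    {'raw_audio': np.random.randn(16000).tolist(), 'raw_text': f"Line {i}",
     'is_translation': i % 2, 'language_pair_id': i % 21, 'labels': 2.5}
    for i in range(10)
]

dataset = Dataset.from_list(examples)
splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80/20, seed 42
train_ds, eval_ds = splits["train"], splits["test"]
```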

## Usage Notes

- This is the **most advanced** variant, with both translation and language pair features
- For simpler models, see `seamless-basic` (audio+text only) or `seamless-translation` (with translation flag)
- The model expects 16kHz audio input (automatically resampled by the data collator)
- Both the translation flag and the language pair ID significantly impact predictions
- Language pair embeddings capture language-specific temporal patterns
- No feature normalization is applied; outputs are raw TTE predictions in seconds
- Optimized for fine-grained subtitle editing time estimation tasks

## Limitations

- Requires both translation and language pair annotations in training data
- Language pair embeddings are dataset-specific (the top 21 pairs from training)
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content and unseen language pairs
- Requires specific data preprocessing (use the included data collator)

## Related Models

- **[seamless-basic](https://huggingface.co/videoloc/seamless-basic)**: Basic audio+text model without translation or language features
- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Includes translation awareness but no language pair embeddings
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Uses cross-modal attention mechanisms for more sophisticated audio-text interactions