---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
license: cc-by-nc-4.0
---

# videoloc/seamless-basic

## Model Description

**SeamlessBasic** takes an audio segment and its corresponding subtitle text as input and predicts the **Time To Edit (TTE)** for that segment: the time, in seconds, that would be required to edit or refine the subtitle.

The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations across 5 languages: **English, French, Spanish, Italian, and German**.

### Key Features

- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **TTE Prediction**: Predicts editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model consists of the following components:

1. **Audio Processing**: 
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input  
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over sequence length to get fixed-size text embedding

3. **Feature Fusion**:
   - Audio and text embeddings are concatenated (2048 total dimensions)
   - No additional cross-modal attention or complex fusion mechanisms

4. **Regression Head**:
   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)
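
The data flow described above can be traced at the shape level with a small NumPy sketch. This is not the released implementation: random arrays stand in for the projected encoder outputs, the weights are randomly initialised, biases and dropout are left out, and the padding masks are an assumption (the card only states "mean pooling over sequence length"). The names `masked_mean_pool` and `mlp_head` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_mean_pool(hidden, mask):
    """Average hidden states over the sequence axis, ignoring padded positions."""
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # (batch, 1), avoids division by zero
    return summed / counts

def mlp_head(x, layer_sizes=(2048, 1024, 512, 256, 1)):
    """Random-weight MLP with ReLU between layers (no bias, no dropout)."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    for i, (fan_in, fan_out) in enumerate(pairs):
        weights = rng.normal(0.0, fan_in ** -0.5, size=(fan_in, fan_out))
        x = x @ weights
        if i < len(pairs) - 1:                       # no activation after the final layer
            x = np.maximum(x, 0.0)
    return x

batch, audio_len, text_len, dim = 2, 50, 12, 1024

# Stand-ins for the projected speech and text encoder outputs.
audio_hidden = rng.normal(size=(batch, audio_len, dim))
text_hidden = rng.normal(size=(batch, text_len, dim))

# Assumed padding masks: the second example is shorter than the first.
audio_mask = np.ones((batch, audio_len))
audio_mask[1, 30:] = 0
text_mask = np.ones((batch, text_len))
text_mask[1, 8:] = 0

audio_emb = masked_mean_pool(audio_hidden, audio_mask)   # (2, 1024)
text_emb = masked_mean_pool(text_hidden, text_mask)      # (2, 1024)
fused = np.concatenate([audio_emb, text_emb], axis=-1)   # (2, 2048)
tte = mlp_head(fused)                                    # (2, 1)

print(fused.shape, tte.shape)
```

Because fusion is plain concatenation, each modality contributes a fixed 1024-dimensional summary and any interaction between audio and text has to be learned by the regression head.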

## Quick Start

### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py")
spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Load the configuration and the weights through the custom classes
config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic")
model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic", config=config)

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data (random noise stands in for a real recording here)
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5).astype(np.float32),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: No translation features needed for basic model
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    # .item() works here because the batch holds a single segment
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
```

## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder  
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Audio Input**: 16kHz
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with the following keys:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text  
- `labels`: Target TTE value in seconds (optional, only needed for training)

Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```
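
Before handing items to the data collator, it can help to check them against the format above. The sketch below is a hypothetical helper, not part of this repository: `check_item` and its return keys are illustrative names, the non-negative label check is an assumption (a time cannot be negative), and the 8-second limit simply mirrors `max_audio_length_sec` from the Quick Start. How the collator treats longer audio is not described in this card, so the helper only flags it.

```python
import numpy as np

SAMPLE_RATE = 16_000      # Hz, as stated in this card
MAX_AUDIO_SEC = 8.0       # mirrors max_audio_length_sec in the Quick Start

def check_item(item):
    """Hypothetical helper: validate one input dict and report its audio duration."""
    audio = np.asarray(item["raw_audio"], dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError(f"raw_audio must be 1-D (mono), got shape {audio.shape}")
    if not isinstance(item["raw_text"], str) or not item["raw_text"].strip():
        raise ValueError("raw_text must be a non-empty string")
    # Assumption: a TTE target is a duration, so it should not be negative.
    if "labels" in item and float(item["labels"]) < 0:
        raise ValueError("labels (TTE in seconds) should not be negative")
    duration = len(audio) / SAMPLE_RATE
    return {"duration_sec": duration, "exceeds_max": duration > MAX_AUDIO_SEC}

example = {
    "raw_audio": np.zeros(SAMPLE_RATE * 5),   # 5 seconds of silence
    "raw_text": "Subtitle text content",
    "labels": 2.5,
}
print(check_item(example))
```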

## Performance Metrics

- **Best Eval RMSE**: 33.34 (in the same unit as the targets, i.e. seconds, since no normalization is applied)
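
For reference, RMSE is the square root of the mean squared difference between predicted and annotated TTE values. The sketch below shows the computation in plain Python; the numbers are made up for illustration and are not taken from the evaluation set.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equally long")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (seconds).
predicted = [12.0, 40.0, 3.5]
actual = [10.0, 55.0, 3.5]
print(f"RMSE: {rmse(predicted, actual):.2f} seconds")
```

Since the errors are squared before averaging, a few segments with very long editing times can dominate the score.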

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)
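
To make the learning-rate settings concrete, the sketch below reproduces the shape of a linear warmup followed by cosine decay with hard restarts, which is what the `cosine_with_restarts` scheduler name selects in `transformers` (`get_cosine_with_hard_restarts_schedule_with_warmup`). The peak rate and warmup ratio come from the list above; the total step count and the number of cycles are not stated in this card, so `total_steps=1000` and `num_cycles=1` are assumptions, and `lr_at_step` is a hypothetical name.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1.2e-4, warmup_ratio=0.05, num_cycles=1):
    """Learning rate at a given step: linear warmup, then cosine decay with hard restarts."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle restarts the cosine from its maximum.
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return peak_lr * max(0.0, cosine)

total_steps = 1000  # assumed for illustration
for step in (0, 25, 50, 525, 999):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With a single cycle the schedule reduces to ordinary cosine decay after warmup; larger `num_cycles` values would reset the rate to its peak at the start of each cycle.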

## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency
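
A seeded 80/20 split of the kind listed above can be sketched as follows. The card does not describe the actual splitting procedure, so this is only an illustration using the standard library; `split_train_test` and the example segment names are hypothetical, and the exact items that land in each split would differ from the original training run.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_indices]
    test = [x for i, x in enumerate(items) if i in test_indices]
    return train, test

segments = [f"segment_{i}" for i in range(10)]   # placeholder data
train, test = split_train_test(segments)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible: calling the function again with the same inputs returns the same partition.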

## Usage Notes

- This is the **basic** variant: it processes only audio and text
- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
- The model expects 16kHz audio input (automatically resampled by the data collator)
- Text is processed with the SeamlessM4T text encoder
- No feature normalization is applied; the model outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

## Limitations

- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use the included data collator)

## Related Models

- **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features
- **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios
- **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Advanced cross-modal attention mechanisms for sophisticated audio-text interactions