giuseppe-tanzi committed on
Commit 0bc59cd · verified · 1 parent: a580541

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +232 -0
  2. config.json +12 -0
  3. data_collator.py +88 -0
  4. example_usage.py +49 -0
  5. pytorch_model.bin +3 -0
  6. requirements.txt +8 -0
README.md ADDED
@@ -0,0 +1,232 @@
+ ---
+ language:
+ - multilingual
+ tags:
+ - audio
+ - text
+ - multimodal
+ - seamless
+ - subtitle-editing-time-prediction
+ - translation-aware
+ - language-pairs
+ license: apache-2.0
+ library_name: transformers
+ base_model: facebook/hf-seamless-m4t-medium
+ ---
+
+ # videoloc/seamless-langpairs
+
+ ## Model Description
+
+ This is a **SeamlessLanguagePairs** model that processes audio and text inputs with both translation awareness and language pair embeddings to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) an editor would need to refine that subtitle segment, taking into account both whether the subtitle is translated and the specific language pair involved.
+
+ The model extends the SeamlessM4T architecture with both translation features and language pair embeddings, providing the most granular control of the three released variants for multilingual video localization, with support for 21 language pairs.
+
+ ### Key Features
+
+ - **Language Pair Embeddings**: Fine-grained control for 21 language pairs plus "other"
+ - **Translation-Aware Processing**: Distinguishes between original and translated content
+ - **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
+ - **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
+ - **Enhanced Architecture**: Adds both translation and language pair embeddings
+ - **TTE Prediction**: Predicts editing time required for subtitle segments
+ - **Direct Output**: Raw time values in seconds for immediate use
+
+ ## Model Architecture
+
+ The model extends the basic SeamlessM4T architecture with both translation and language pair awareness (a sketch of steps 3-6 follows this list):
+
+ 1. **Audio Processing**:
+    - SeamlessM4T speech encoder (frozen) processes raw audio input
+    - Audio projection layer maps speech encoder output to 1024 dimensions
+    - Mean pooling over sequence length to get a fixed-size audio embedding
+
+ 2. **Text Processing**:
+    - SeamlessM4T text encoder (frozen) processes tokenized text input
+    - Text projection layer maps text encoder output to 1024 dimensions
+    - Mean pooling over sequence length to get a fixed-size text embedding
+
+ 3. **Translation Feature Processing**:
+    - Binary translation flag (0/1) indicating original vs. translated content
+    - Translation projection layer maps the binary input to 32 dimensions
+    - Learned embedding helps the model distinguish translation effects
+
+ 4. **Language Pair Processing**:
+    - Categorical language pair ID (0-20) for specific language combinations
+    - Language pair embedding layer maps IDs to 64-dimensional vectors
+    - Captures language-specific temporal alignment patterns
+
+ 5. **Feature Fusion**:
+    - Audio, text, translation, and language pair embeddings are concatenated (2144 total dimensions)
+    - Simple concatenation without complex cross-modal interactions
+
+ 6. **Regression Head**:
+    - Multi-layer perceptron: 2144 → 1024 → 512 → 256 → 1
+    - ReLU activations and dropout for regularization
+    - Single output for TTE prediction (regression, in seconds)
+
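+ To make the fusion concrete, here is a minimal PyTorch sketch of steps 3-6. It is illustrative only: the class and argument names are hypothetical, and the frozen SeamlessM4T encoders and projections from steps 1-2 are assumed to have already produced the pooled 1024-dimensional embeddings.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class FusionRegressionHead(nn.Module):
+     """Illustrative sketch of the fusion and regression stages (names hypothetical)."""
+     def __init__(self, hidden_size=1024, translation_dim=32,
+                  lang_pair_dim=64, num_language_pairs=21, dropout=0.1):
+         super().__init__()
+         self.translation_proj = nn.Linear(1, translation_dim)  # step 3
+         self.lang_pair_embedding = nn.Embedding(num_language_pairs, lang_pair_dim)  # step 4
+         fusion_dim = 2 * hidden_size + translation_dim + lang_pair_dim  # 2144
+         self.regressor = nn.Sequential(  # step 6: 2144 -> 1024 -> 512 -> 256 -> 1
+             nn.Linear(fusion_dim, 1024), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
+             nn.Linear(256, 1),
+         )
+
+     def forward(self, audio_emb, text_emb, is_translation, language_pair_id):
+         # audio_emb, text_emb: mean-pooled encoder outputs, shape (batch, 1024)
+         trans_emb = self.translation_proj(is_translation.unsqueeze(-1))  # (batch, 32)
+         pair_emb = self.lang_pair_embedding(language_pair_id)            # (batch, 64)
+         fused = torch.cat([audio_emb, text_emb, trans_emb, pair_emb], dim=-1)  # step 5
+         return self.regressor(fused).squeeze(-1)  # TTE in seconds
+ ```
+
+ For a batch of two segments, `FusionRegressionHead()(torch.randn(2, 1024), torch.randn(2, 1024), torch.tensor([0., 1.]), torch.tensor([3, 7]))` returns two TTE estimates.
+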
69
+
70
+ ### Installation
71
+ ```bash
72
+ pip install transformers torch torchaudio huggingface_hub
73
+ ```
74
+
75
+ ### Basic Usage
76
+ ```python
77
+ from transformers import AutoModel, AutoConfig
78
+ from huggingface_hub import hf_hub_download
79
+ import torch
80
+ import numpy as np
81
+ import importlib.util
82
+
83
+ # Load model
84
+ model = AutoModel.from_pretrained("videoloc/seamless-langpairs")
85
+ config = AutoConfig.from_pretrained("videoloc/seamless-langpairs")
86
+
87
+ # Load the data collator (included in this repo)
88
+ collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
89
+ spec = importlib.util.spec_from_file_location("data_collator", collator_file)
90
+ collator_module = importlib.util.module_from_spec(spec)
91
+ spec.loader.exec_module(collator_module)
92
+
93
+ # Initialize data collator
94
+ data_collator = collator_module.DataCollatorSimpleSeamless(
95
+ processor="facebook/hf-seamless-m4t-medium",
96
+ max_audio_length_sec=8.0,
97
+ max_text_length=256
98
+ )
99
+
100
+ # Prepare your data with translation and language pair information
101
+ your_data = [
102
+ {
103
+ 'raw_audio': np.random.randn(16000 * 5), # 5 seconds at 16kHz
104
+ 'raw_text': "Your subtitle text here",
105
+ 'is_translation': 1, # 1 for translated content, 0 for original
106
+ 'language_pair_id': 5, # 0-20 for specific language pairs
107
+ }
108
+ ]
109
+
110
+ # Process and run inference
111
+ batch = data_collator(your_data)
112
+ model.eval()
113
+ with torch.no_grad():
114
+ outputs = model(**batch)
115
+ tte_prediction = outputs.logits.item()
116
+
117
+ print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
118
+ ```
+
+ ## Model Details
+
+ - **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
+ - **Audio Encoder**: Frozen SeamlessM4T speech encoder
+ - **Text Encoder**: Frozen SeamlessM4T text encoder
+ - **Hidden Size**: 1024
+ - **Translation Embedding**: 32 dimensions
+ - **Language Pair Embedding**: 64 dimensions
+ - **Number of Language Pairs**: 21 (plus "other")
+ - **Audio Input**: 16kHz, max 8.0 seconds
+ - **Text Input**: Max 256 tokens
+ - **Translation Input**: Binary flag (0/1)
+ - **Language Pair Input**: Categorical ID (0-20)
+ - **Output**: Single regression value (TTE in seconds)
+ - **Task**: Subtitle editing time prediction
+
+ ## Data Format
+
+ Your input data should be a list of dictionaries with:
+ - `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
+ - `raw_text`: String of subtitle text
+ - `is_translation`: Binary flag (1 for translated, 0 for original content)
+ - `language_pair_id`: Integer ID (0-20) for the specific language pair
+ - `labels`: Target TTE value in seconds (optional, for training)
+
+ Example:
+ ```python
+ data = [
+     {
+         'raw_audio': audio_samples,    # shape: (num_samples,) at 16kHz
+         'raw_text': "Subtitle text content",
+         'is_translation': 1,    # 1 = translated, 0 = original
+         'language_pair_id': 5,  # 0-20 for language pairs
+         'labels': 2.5           # optional TTE target value in seconds
+     }
+ ]
+ ```
+
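+ For reference, the collated batch exposes exactly the tensors the model's `forward` consumes (see the included `data_collator.py`); a quick way to inspect them:
+
+ ```python
+ batch = data_collator(data)
+ # Keys: input_features, audio_attention_mask, input_ids,
+ # text_attention_mask, is_translation, language_pair_id, labels (when provided)
+ print({k: tuple(v.shape) for k, v in batch.items() if v is not None})
+ ```
+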
+ ## Performance Metrics
+
+ - **Best Eval RMSE**: 33.34 seconds
+
+ ## Training Details
+
+ - **Base Model**: facebook/hf-seamless-m4t-medium
+ - **Model Type**: seamless_language_pairs
+ - **Epochs**: 10
+ - **Batch Size (Train)**: 32
+ - **Batch Size (Eval)**: 64
+ - **Learning Rate**: 1.2e-4
+ - **LR Scheduler**: cosine_with_restarts
+ - **Warmup Ratio**: 0.05
+ - **Weight Decay**: 0.001
+ - **Optimizer**: AdamW (torch)
+ - **Max Grad Norm**: 1.0
+ - **FP16**: True
+ - **Early Stopping Patience**: 5
+ - **Audio Max Length**: 8.0 seconds
+ - **Text Max Length**: 256 tokens
+ - **Sample Rate**: 16kHz
+ - **Translation Feature**: Binary flag (0/1)
+ - **Language Pairs**: 21 pairs + other
+ - **Language Pair Embedding**: 64 dimensions
+ - **Normalization**: None (raw values)
+ - **Dataset Split**: 80/20 train/test
+ - **Random Seed**: 42
+ - **Metric**: RMSE (lower is better)
+
+ ## Training Configuration
+
+ The model was trained with the following specifications (a rough `TrainingArguments` sketch follows this list):
+
+ - **Dataset**: Multimodal audio-subtitle pairs with translation and language pair annotations
+ - **Train/Test Split**: 80/20 with random seed 42
+ - **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
+ - **Text Processing**: Max 256 tokens
+ - **Translation Feature**: Binary flag indicating original vs. translated content
+ - **Language Pairs**: 21 most frequent language pairs plus "other" category
+ - **Normalization**: None (raw TTE values in seconds)
+ - **Caching**: Audio segments cached and compressed for efficiency
+
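+ As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` as sketched below. This is an approximation, not the original training script: the `output_dir`, evaluation cadence, and metric key are assumptions.
+
+ ```python
+ from transformers import TrainingArguments, EarlyStoppingCallback
+
+ training_args = TrainingArguments(
+     output_dir="./seamless-langpairs",   # assumed
+     num_train_epochs=10,
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=64,
+     learning_rate=1.2e-4,
+     lr_scheduler_type="cosine_with_restarts",
+     warmup_ratio=0.05,
+     weight_decay=0.001,
+     optim="adamw_torch",
+     max_grad_norm=1.0,
+     fp16=True,
+     seed=42,
+     eval_strategy="epoch",               # assumed cadence
+     load_best_model_at_end=True,
+     metric_for_best_model="rmse",        # assumed metric key
+     greater_is_better=False,
+ )
+ early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
+ ```
+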
+ ## Language Pairs Supported
+
+ The model supports embeddings for 21 language pairs (IDs 0-20). The exact ID-to-pair mapping is defined by the training data, but typically includes popular combinations like the following (a hypothetical mapping sketch follows this list):
+ - English ↔ Spanish, French, German, Italian, Portuguese
+ - Cross-European language pairs
+ - English ↔ Asian languages (Chinese, Japanese, Korean)
+ - Other high-frequency translation pairs in the training dataset
+
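+ The mapping itself is not distributed with the model. Purely to illustrate the expected structure, a hypothetical mapping (the pairs and IDs below are invented, not the real ones):
+
+ ```python
+ # Hypothetical mapping for illustration only -- the real IDs are defined
+ # by the training data and are not shipped with this model.
+ LANGUAGE_PAIR_IDS = {
+     "en-es": 0,
+     "en-fr": 1,
+     "en-de": 2,
+     # ... remaining high-frequency pairs ...
+     "other": 20,  # catch-all bucket for infrequent pairs
+ }
+
+ def encode_language_pair(source: str, target: str) -> int:
+     """Map a (source, target) pair to its embedding ID, falling back to 'other'."""
+     return LANGUAGE_PAIR_IDS.get(f"{source}-{target}", LANGUAGE_PAIR_IDS["other"])
+ ```
+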
+ ## Usage Notes
+
+ - This is the **most advanced** variant, with both translation and language pair features
+ - For simpler models, see `seamless-basic` (audio+text only) or `seamless-translation` (with translation flag)
+ - Model expects 16kHz mono audio input; resample beforehand if needed (see the sketch after this list) - the included data collator assumes this rate rather than resampling
+ - Both the translation flag and the language pair ID significantly impact predictions
+ - Language pair embeddings capture language-specific temporal patterns
+ - No feature normalization applied - outputs are raw TTE predictions in seconds
+ - Optimized for fine-grained subtitle editing time estimation tasks
+
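+ A minimal preprocessing sketch for the resampling note above, using `torchaudio` (the file path is a placeholder):
+
+ ```python
+ import torchaudio
+
+ waveform, orig_sr = torchaudio.load("segment.wav")  # placeholder path
+ if orig_sr != 16000:
+     waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)
+ raw_audio = waveform.mean(dim=0).numpy()  # downmix to mono, shape (num_samples,)
+ ```
+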
+ ## Limitations
+
+ - Maximum audio length: 8.0 seconds
+ - Maximum text length: 256 tokens
+ - Requires both translation and language pair annotations in training data
+ - Language pair embeddings are dataset-specific (top 21 pairs from training)
+ - Designed for TTE prediction, not general audio-text matching
+ - Performance may vary on out-of-domain content and unseen language pairs
+ - Requires specific data preprocessing (use the included data collator)
+
+ ## Related Models
+
+ - **seamless-basic**: Basic audio+text model without translation or language features
+ - **seamless-translation**: Includes translation awareness but no language pair embeddings
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "architectures": [
+     "HFSeamlessLanguagePairs"
+   ],
+   "dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "model_type": "seamless_language_pairs",
+   "num_language_pairs": 21,
+   "seamless_model_name": "facebook/hf-seamless-m4t-medium",
+   "torch_dtype": "float32",
+   "transformers_version": "4.50.2"
+ }
data_collator.py ADDED
@@ -0,0 +1,88 @@
+ import torch
+ import numpy as np
+ from transformers import AutoProcessor
+ from typing import Dict, List, Union
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ class DataCollatorSimpleSeamless:
+     def __init__(
+         self,
+         processor: str,
+         sample_rate: int = 16000,
+         max_audio_length_sec: float = 8.0,
+         max_text_length: int = 256,
+         normalization_type: str = "none"
+     ):
+         """Initialize the data collator.
+
+         Args:
+             processor: The processor to use.
+             sample_rate: Audio sample rate.
+             max_audio_length_sec: Maximum audio length in seconds.
+             max_text_length: Maximum text length.
+             normalization_type: Type of normalization to apply to labels. Options: "log1p", "none"
+         """
+         logger.info(f"Loading processor: {processor}")
+         self.processor = AutoProcessor.from_pretrained(processor)
+
+         self.sample_rate = sample_rate
+         self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
+         self.max_text_length = max_text_length
+         self.normalization_type = normalization_type
+
+     def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
+         """Process a batch of raw features into model inputs."""
+         # Extract raw data
+         raw_audios = [item['raw_audio'] for item in batch]
+         raw_texts = [item['raw_text'] for item in batch]
+
+         raw_audios = [torch.tensor(audio) for audio in raw_audios]
+
+         audio_inputs = self.processor(
+             audios=raw_audios,
+             sampling_rate=self.sample_rate,
+             return_tensors="pt",
+             padding="longest",
+             truncation=True,
+             max_length=self.max_audio_sample_length,
+         )
+
+         text_inputs = self.processor(
+             text=raw_texts,
+             return_tensors="pt",
+             padding="longest",
+             truncation=True,
+             max_length=self.max_text_length,
+         )
+
+         # Extract translation features
+         is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)
+
+         # Extract language pair features
+         language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)
+
+         if 'labels' in batch[0]:
+             labels = [item['labels'] for item in batch]
+             labels = torch.tensor(labels, dtype=torch.float32)
+
+             # Apply normalization based on type
+             if self.normalization_type == "log1p":
+                 # Labels are log1p-transformed; invert predictions with torch.expm1
+                 labels = torch.log1p(labels)
+             elif self.normalization_type == "none":
+                 pass
+             else:
+                 raise ValueError(f"Unknown normalization type: {self.normalization_type}")
+         else:
+             labels = None
+
+         return {
+             'input_features': audio_inputs['input_features'],
+             'audio_attention_mask': audio_inputs.get('attention_mask'),
+             'input_ids': text_inputs['input_ids'],
+             'text_attention_mask': text_inputs['attention_mask'],
+             'is_translation': is_translation,
+             'language_pair_id': language_pair_id,
+             **({'labels': labels} if labels is not None else {})
+         }
example_usage.py ADDED
@@ -0,0 +1,49 @@
+ #!/usr/bin/env python3
+ # Example usage for videoloc/seamless-langpairs
+
+ from transformers import AutoModel, AutoConfig
+ from huggingface_hub import hf_hub_download
+ import torch
+ import numpy as np
+ import importlib.util
+
+ def load_model_and_collator():
+     model = AutoModel.from_pretrained("videoloc/seamless-langpairs")
+     config = AutoConfig.from_pretrained("videoloc/seamless-langpairs")
+
+     # Load data collator
+     collator_file = hf_hub_download(repo_id="videoloc/seamless-langpairs", filename="data_collator.py")
+     spec = importlib.util.spec_from_file_location("data_collator", collator_file)
+     collator_module = importlib.util.module_from_spec(spec)
+     spec.loader.exec_module(collator_module)
+
+     data_collator = collator_module.DataCollatorSimpleSeamless(
+         processor="facebook/hf-seamless-m4t-medium",
+         max_audio_length_sec=8.0,
+         max_text_length=256
+     )
+
+     return model, data_collator
+
+ def example_inference():
+     model, collator = load_model_and_collator()
+
+     # Example data with translation and language pair awareness
+     data = [{
+         'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
+         'raw_text': "Example subtitle text for temporal alignment",
+         'is_translation': 1,    # 1 for translated content, 0 for original
+         'language_pair_id': 5,  # 0-20 for specific language pairs
+     }]
+
+     batch = collator(data)
+     model.eval()
+     with torch.no_grad():
+         outputs = model(**batch)
+         tte_prediction = outputs.logits.item()
+
+     print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
+     return tte_prediction
+
+ if __name__ == "__main__":
+     example_inference()
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e3037a762e659d5e3acaf60ecdd58a76aea92fc01b50f1cb70fb200b802e2a6
+ size 4858339608
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ transformers>=4.50.2
+ torch>=2.6.0
+ torchaudio>=2.6.0
+ huggingface_hub>=0.33.0
+ numpy>=2.2.3
+ sentencepiece>=0.2.0
+ accelerate>=1.5.2
+ soundfile>=0.13.1