Upload folder using huggingface_hub

Files changed (6)

README.md +191 -0
config.json +11 -0
data_collator.py +88 -0
example_usage.py +47 -0
pytorch_model.bin +3 -0
requirements.txt +8 -0

README.md ADDED

	@@ -0,0 +1,191 @@

+---
+language:
+- multilingual
+tags:
+- audio
+- text
+- multimodal
+- seamless
+- subtitle-editing-time-prediction
+library_name: transformers
+pipeline_tag: audio-regression
+---
+# videoloc/seamless-basic
+## Model Description
+This is a **SeamlessBasic** model that processes audio and text inputs to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) would be required to edit/refine that subtitle segment.
+The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations.
+### Key Features
+- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
+- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
+- **TTE Prediction**: Predicts editing time required for subtitle segments
+- **Efficient Architecture**: Lightweight fusion and regression head on top of frozen encoders, with gradient checkpointing support for memory-efficient training
+- **Direct Output**: Raw time values in seconds for immediate use
+## Model Architecture
+The model consists of the following components:
+1. **Audio Processing**:
+   - SeamlessM4T speech encoder (frozen) processes raw audio input
+   - Audio projection layer maps speech encoder output to 1024 dimensions
+   - Mean pooling over sequence length to get fixed-size audio embedding
+2. **Text Processing**:
+   - SeamlessM4T text encoder (frozen) processes tokenized text input
+   - Text projection layer maps text encoder output to 1024 dimensions
+   - Mean pooling over sequence length to get fixed-size text embedding
+3. **Feature Fusion**:
+   - Audio and text embeddings are concatenated (2048 total dimensions)
+   - No additional cross-modal attention or complex fusion mechanisms
+4. **Regression Head**:
+   - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1
+   - ReLU activations and dropout for regularization
+   - Single output for TTE prediction (regression, in seconds)
+## Quick Start
+### Installation
+```bash
+pip install transformers torch torchaudio huggingface_hub numpy sentencepiece
+```
+### Basic Usage
+```python
+from transformers import AutoModel, AutoConfig
+from huggingface_hub import hf_hub_download
+import torch
+import numpy as np
+import importlib.util
+# Load model
+model = AutoModel.from_pretrained("videoloc/seamless-basic")
+config = AutoConfig.from_pretrained("videoloc/seamless-basic")
+# Load the data collator (included in this repo)
+collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
+spec = importlib.util.spec_from_file_location("data_collator", collator_file)
+collator_module = importlib.util.module_from_spec(spec)
+spec.loader.exec_module(collator_module)
+# Initialize data collator
+data_collator = collator_module.DataCollatorSimpleSeamless(
+    processor="facebook/hf-seamless-m4t-medium",
+    max_audio_length_sec=8.0,
+    max_text_length=256,
+    # normalization_type="none" is the default
+)
+# Prepare your data
+your_data = [
+    {
+        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
+        'raw_text': "Your subtitle text here",
+        # Note: No translation features needed for basic model
+    }
+]
+# Process and run inference
+batch = data_collator(your_data)
+model.eval()
+with torch.no_grad():
+    outputs = model(**batch)
+    tte_prediction = outputs.logits.item()
+print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds")
+```
+## Model Details
+- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
+- **Audio Encoder**: Frozen SeamlessM4T speech encoder
+- **Text Encoder**: Frozen SeamlessM4T text encoder
+- **Hidden Size**: 1024
+- **Audio Input**: 16kHz, max 8.0 seconds
+- **Text Input**: Max 256 tokens
+- **Output**: Single regression value (TTE in seconds)
+- **Task**: Subtitle editing time prediction
+## Data Format
+Your input data should be a list of dictionaries with:
+- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
+- `raw_text`: String of subtitle text
+- `labels`: Target TTE values in seconds (optional, for training)
+Example:
+```python
+data = [
+    {
+        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
+        'raw_text': "Subtitle text content",
+        'labels': 2.5  # optional TTE target value in seconds
+    }
+]
+```
+## Performance Metrics
+- **Best Eval RMSE**: 33.34 seconds (labels were not normalized, so the error is in raw TTE seconds)
+## Training Details
+- **Base Model**: facebook/hf-seamless-m4t-medium
+- **Epochs**: 10
+- **Batch Size (Train)**: 32
+- **Batch Size (Eval)**: 64
+- **Learning Rate**: 1.2e-4
+- **LR Scheduler**: cosine_with_restarts
+- **Warmup Ratio**: 0.05
+- **Weight Decay**: 0.001
+- **Optimizer**: AdamW (torch)
+- **Max Grad Norm**: 1.0
+- **FP16**: True
+- **Early Stopping Patience**: 5
+- **Audio Max Length**: 8.0 seconds
+- **Text Max Length**: 256 tokens
+- **Sample Rate**: 16kHz
+- **Normalization**: None (raw values)
+- **Dataset Split**: 80/20 train/test
+- **Random Seed**: 42
+- **Metric**: RMSE (lower is better)
+- **Audio Caching**: Enabled with compression
+- **Workers**: 8
+## Training Configuration
+The model was trained with the following specifications:
+- **Dataset**: Multimodal audio-subtitle pairs with TTE annotations
+- **Train/Test Split**: 80/20 with random seed 42
+- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
+- **Text Processing**: Max 256 tokens
+- **Normalization**: None (raw TTE values in seconds)
+- **Caching**: Audio segments cached and compressed for efficiency
+## Usage Notes
+- This is the **basic** variant - processes only audio and text
+- For translation-aware models, see `seamless-translation` and `seamless-langpairs`
+- Model expects 16kHz audio input; the included data collator does not resample, so resample audio to 16kHz before passing it in
+- Text is processed with SeamlessM4T text encoder
+- No feature normalization applied - outputs raw TTE predictions in seconds
+- Optimized for subtitle editing time estimation tasks
+## Limitations
+- Designed for TTE prediction, not general audio-text matching
+- Performance may vary on out-of-domain content or different editing workflows
+- Requires specific data preprocessing (use included data collator)
+## Related Models
+- **seamless-translation**: Adds translation awareness features
+- **seamless-langpairs**: Includes language pair embeddings for multilingual scenarios
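
Note on README.md: the Model Architecture section describes projection, mean pooling, concatenation, and an MLP head. The following is a minimal PyTorch sketch of only that fusion and head, assuming the encoder hidden states are already computed. The class name `FusionRegressionSketch`, the helper `masked_mean`, the masking of padded positions, and the placement of dropout are illustrative assumptions, not the repository's `HFSeamlessBasic` implementation.

```python
import torch
import torch.nn as nn


def masked_mean(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over the sequence axis, ignoring padded positions."""
    mask = mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


class FusionRegressionSketch(nn.Module):
    """Hypothetical stand-in for the projection, fusion, and regression head."""

    def __init__(self, audio_dim: int, text_dim: int,
                 hidden_size: int = 1024, dropout_prob: float = 0.1):
        super().__init__()
        # Project each modality to the shared hidden size (1024 in config.json).
        self.audio_proj = nn.Linear(audio_dim, hidden_size)
        self.text_proj = nn.Linear(text_dim, hidden_size)
        # MLP from the README: 2048 -> 1024 -> 512 -> 256 -> 1.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_size, 1024), nn.ReLU(), nn.Dropout(dropout_prob),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout_prob),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout_prob),
            nn.Linear(256, 1),
        )

    def forward(self, audio_hidden, audio_mask, text_hidden, text_mask):
        audio_emb = masked_mean(self.audio_proj(audio_hidden), audio_mask)
        text_emb = masked_mean(self.text_proj(text_hidden), text_mask)
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 2 * hidden_size)
        return self.head(fused).squeeze(-1)               # (batch,) TTE in seconds
```

Because the two embeddings are simply concatenated, the head sees no interaction between modalities until its first linear layer, which matches the README's statement that no cross-modal attention is used.
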

config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "architectures": [
+    "HFSeamlessBasic"
+  ],
+  "dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "model_type": "seamless_basic",
+  "seamless_model_name": "facebook/hf-seamless-m4t-medium",
+  "torch_dtype": "float32",
+  "transformers_version": "4.50.2"
+}
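
Note on config.json: the fields above line up with the dimensions quoted in the README (projection to 1024, fused width 2048). A small standard-library sketch of that relationship follows; the helper name `summarize_config` is hypothetical, and treating the fused width as `2 * hidden_size` is an inference from the README, not something the config states.

```python
import json

# Copy of the config.json contents shown in this commit.
CONFIG_TEXT = """
{
  "architectures": ["HFSeamlessBasic"],
  "dropout_prob": 0.1,
  "hidden_size": 1024,
  "model_type": "seamless_basic",
  "seamless_model_name": "facebook/hf-seamless-m4t-medium",
  "torch_dtype": "float32",
  "transformers_version": "4.50.2"
}
"""


def summarize_config(text: str) -> dict:
    """Pull out the fields that determine the head's input width."""
    cfg = json.loads(text)
    hidden = cfg["hidden_size"]
    return {
        "backbone": cfg["seamless_model_name"],
        "projection_dim": hidden,
        "fused_dim": 2 * hidden,  # assumed: audio and text embeddings concatenated
        "dropout_prob": cfg["dropout_prob"],
    }


summary = summarize_config(CONFIG_TEXT)
print(summary)
```
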

data_collator.py ADDED

	@@ -0,0 +1,88 @@

+import torch
+import numpy as np
+from transformers import AutoProcessor
+from typing import Dict, List, Union
+import logging
+logger = logging.getLogger(__name__)
+class DataCollatorSimpleSeamless:
+    def __init__(
+        self,
+        processor: str,
+        sample_rate: int = 16000,
+        max_audio_length_sec: float = 8.0,
+        max_text_length: int = 256,
+        normalization_type: str = "none"
+    ):
+        """Initialize the data collator.
+        Args:
+            processor: Name or path of the processor checkpoint to load with AutoProcessor.
+            sample_rate: Audio sample rate.
+            max_audio_length_sec: Maximum audio length in seconds.
+            max_text_length: Maximum text length.
+            normalization_type: Type of normalization to apply to labels. Options: "log1p", "none"
+        """
+        logger.info(f"Loading processor: {processor}")
+        self.processor = AutoProcessor.from_pretrained(processor)
+        self.sample_rate = sample_rate
+        self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
+        self.max_text_length = max_text_length
+        self.normalization_type = normalization_type
+    def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
+        """Process a batch of raw features into model inputs."""
+        # Extract raw data
+        raw_audios = [item['raw_audio'] for item in batch]
+        raw_texts = [item['raw_text'] for item in batch]
+        raw_audios = [torch.tensor(audio) for audio in raw_audios]
+        audio_inputs = self.processor(
+            audios=raw_audios,
+            sampling_rate=self.sample_rate,
+            return_tensors="pt",
+            padding="longest",
+            truncation=True,
+            max_length=self.max_audio_sample_length,
+        )
+        text_inputs = self.processor(
+            text=raw_texts,
+            return_tensors="pt",
+            padding="longest",
+            truncation=True,
+            max_length=self.max_text_length,
+        )
+        # Translation flag (not needed for the basic model; defaults to 0 when absent)
+        is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)
+        # Language pair ID (not needed for the basic model; defaults to 0 when absent)
+        language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)
+        if 'labels' in batch[0]:
+            labels = [item['labels'] for item in batch]
+            labels = torch.tensor(labels, dtype=torch.float32)
+            # Apply normalization based on type
+            if self.normalization_type == "log1p":
+                labels = torch.log1p(labels)
+            elif self.normalization_type == "none":
+                pass
+            else:
+                raise ValueError(f"Unknown normalization type: {self.normalization_type}")
+        else:
+            labels = None
+        return {
+            'input_features': audio_inputs['input_features'],
+            'audio_attention_mask': audio_inputs.get('attention_mask'),
+            'input_ids': text_inputs['input_ids'],
+            'text_attention_mask': text_inputs['attention_mask'],
+            'is_translation': is_translation,
+            'language_pair_id': language_pair_id,
+            **({'labels': labels} if labels is not None else {})
+        }
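
Note on data_collator.py: the collator applies `log1p` to labels when `normalization_type="log1p"`. A model trained in that mode would presumably predict in log space, so predictions would need the inverse transform before being read as seconds. The sketch below mirrors the collator's branching with the standard library; `normalize_label` and `denormalize_prediction` are hypothetical helpers, and the `expm1` post-processing step is an assumption, since this commit contains no such code. This model was trained with `"none"`, so no inverse is needed for it.

```python
import math


def normalize_label(tte_seconds: float, normalization_type: str = "none") -> float:
    """Mirror of the collator's label normalization for a single value."""
    if normalization_type == "log1p":
        return math.log1p(tte_seconds)
    if normalization_type == "none":
        return tte_seconds
    raise ValueError(f"Unknown normalization type: {normalization_type}")


def denormalize_prediction(value: float, normalization_type: str = "none") -> float:
    """Assumed inverse: map a model output back to seconds."""
    if normalization_type == "log1p":
        return math.expm1(value)
    if normalization_type == "none":
        return value
    raise ValueError(f"Unknown normalization type: {normalization_type}")


# Round trip: 2.5 seconds survives normalization followed by the inverse.
restored = denormalize_prediction(normalize_label(2.5, "log1p"), "log1p")
print(f"{restored:.6f}")
```
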

example_usage.py ADDED

	@@ -0,0 +1,47 @@

+#!/usr/bin/env python3
+# Example usage for videoloc/seamless-basic
+from transformers import AutoModel, AutoConfig
+from huggingface_hub import hf_hub_download
+import torch
+import numpy as np
+import importlib.util
+def load_model_and_collator():
+    model = AutoModel.from_pretrained("videoloc/seamless-basic")
+    config = AutoConfig.from_pretrained("videoloc/seamless-basic")
+    # Load data collator
+    collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py")
+    spec = importlib.util.spec_from_file_location("data_collator", collator_file)
+    collator_module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(collator_module)
+    data_collator = collator_module.DataCollatorSimpleSeamless(
+        processor="facebook/hf-seamless-m4t-medium",
+        max_audio_length_sec=8.0,
+        max_text_length=256
+    )
+    return model, data_collator
+def example_inference():
+    model, collator = load_model_and_collator()
+    # Example data: audio segment + subtitle text to predict editing time
+    data = [{
+        'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
+        'raw_text': "Hello, welcome to our presentation today.",
+    }]
+    batch = collator(data)
+    model.eval()
+    with torch.no_grad():
+        outputs = model(**batch)
+        tte_prediction = outputs.logits.item()
+    print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
+    return tte_prediction
+if __name__ == "__main__":
+    example_inference()
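
Note on example_usage.py: the script prints a single prediction, while the README reports quality as RMSE. A minimal standard-library sketch of that metric follows, for comparing a list of predictions against reference TTE labels. The function name `rmse` and the sample values are illustrative only and are not taken from the training code or its data.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between predicted and reference TTE values."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, in seconds.
predicted = [0.0, 0.0]
reference = [3.0, 4.0]
print(f"RMSE: {rmse(predicted, reference):.4f} seconds")
```

Since the labels are unnormalized, an RMSE computed this way is in seconds and is dominated by segments with large editing times.
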

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88d20bd96bdcb428c064083bb2e2eef54b770f03ccf8d3d60a1bb464e51c2b92
+size 4857939849

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+transformers>=4.50.2
+torch>=2.6.0
+torchaudio>=2.6.0
+huggingface_hub>=0.33.0
+numpy>=2.2.3
+sentencepiece>=0.2.0
+accelerate>=1.5.2
+soundfile>=0.13.1