👨‍🍳 SAUTE: Speaker-Aware Utterance Embedding Unit
SAUTE is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines EDU-level utterance embeddings, speaker-sensitive memory, and efficient linear attention to encode rich conversational context with minimal overhead.
🧠 Overview
SAUTE is tailored for:
- 🗣️ Multi-turn conversations
- 👥 Multi-speaker interactions
- 🧵 Long-range dialog dependencies
It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
🧱 Architecture
🔍 SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.
- EDU-Level Encoder: Mean-pooled BERT outputs per utterance.
- Speaker Memory: Outer-product-based accumulation per speaker.
- Contextualization Layer: Integrates memory summaries with current token representations.
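The three components above can be sketched in a few lines of PyTorch. This is a minimal illustration of the outer-product speaker memory and a linear-attention-style read-out, not the released implementation: the function names, the normalization by `d`, and the residual injection are assumptions for the sketch.

```python
import torch

def speaker_memory_summaries(edu_embeddings, speaker_ids, num_speakers):
    """Accumulate an outer-product memory per speaker from EDU embeddings.

    edu_embeddings: (num_utterances, d) mean-pooled utterance vectors
    speaker_ids:    (num_utterances,) integer speaker index per utterance
    Returns a (num_speakers, d, d) memory tensor.
    """
    d = edu_embeddings.size(-1)
    memory = torch.zeros(num_speakers, d, d)
    for emb, spk in zip(edu_embeddings, speaker_ids):
        memory[spk] += torch.outer(emb, emb)  # outer-product accumulation
    return memory

def contextualize(token_states, token_speaker_ids, memory):
    """Inject each token's speaker-memory summary via a linear read-out."""
    d = token_states.size(-1)
    mem = memory[token_speaker_ids]                 # (num_tokens, d, d)
    read = torch.einsum("td,tde->te", token_states, mem) / d
    return token_states + read                      # residual injection
```

Because the memory is a fixed-size `(d, d)` matrix per speaker, the read-out costs O(tokens · d²) regardless of dialogue length, avoiding the quadratic token-to-token attention cost.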
🚀 Key Features
- 🧠 Speaker-Aware Memory: Structured per-speaker representation of dialogue context.
- ⚡ Linear Attention: Efficient and scalable to long dialogues.
- 🧩 Pretrained Transformer Compatible: Can plug into frozen or fine-tuned BERT models.
- 🪶 Lightweight: adds only ~4M parameters, fewer than a 2-layer transformer, while delivering strong MLM accuracy gains.
📈 Performance (on SODA, Masked Language Modeling)
| Model | Avg MLM Acc (%) | Best MLM Acc (%) |
|---|---|---|
| BERT-base (frozen) | 33.45 | 45.89 |
| + 1-layer Transformer | 68.20 | 76.69 |
| + 2-layer Transformer | 71.81 | 79.54 |
| + 1-layer SAUTE (Ours) | 72.05 | 80.40 |
| + 3-layer Transformer | 73.50 | 80.84 |
| + 3-layer SAUTE (Ours) | 75.65 | 85.55 |
At each depth, SAUTE outperforms the transformer baseline, and 3-layer SAUTE achieves the best accuracy overall while using fewer parameters than the multi-layer transformers.
📚 Citation / Paper
📄 SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)
🔧 How to Use
```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (example): one entry per utterance, paired with its speaker
dialogue = ["Hi, how are you?", "Great, thanks!"]
speaker_names = ["Alice", "Bob"]
encoded = tokenizer(dialogue, padding=True, return_tensors="pt")
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]

# Forward pass with speaker-aware contextualization
outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    speaker_names=speaker_names
)
```

The input preparation above is illustrative; consult the model repository for the exact tensor shapes the model expects.