Hecto: FFNN + GRU Mixture-of-Experts for AG News

Hecto is a lightweight, interpretable Mixture-of-Experts (MoE) architecture that combines two experts:

  • A feedforward expert for static feature abstraction, and
  • A GRU expert for sequential reasoning.

A sparse, learnable Top-1 router conditioned on the [CLS] token embedding selects one expert per input.


Model Architecture

  • Base Encoder: DistilBERT (distilbert-base-uncased)
  • Experts:
    • Expert 0: 2-layer FFNN (256 → 128 → 4, Tanh activation)
    • Expert 1: GRU (256 → 128 → 4)
  • Gating:
    • Top-1 sparse routing
    • Temperature-controlled softmax (τ = 1.5)
    • Entropy and load-balancing regularization
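
The components listed above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the released Hecto class: the class names (FFNNExpert, GRUExpert, Top1MoE) are hypothetical, and the sketch assumes the encoder output has already been brought to 256 dimensions, since the card does not say how DistilBERT's 768-dimensional hidden states are mapped to the experts' 256-dimensional input. For simplicity it evaluates both experts and then keeps the selected one, and it ignores padding in the GRU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNNExpert(nn.Module):
    """2-layer feedforward expert (256 -> 128 -> 4, Tanh). Hypothetical name."""

    def __init__(self, in_dim=256, hidden=128, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_classes)
        )

    def forward(self, seq):            # seq: (batch, seq_len, in_dim)
        return self.net(seq[:, 0])     # uses only the first ([CLS]) position


class GRUExpert(nn.Module):
    """GRU expert (256 -> 128 -> 4). Hypothetical name."""

    def __init__(self, in_dim=256, hidden=128, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, seq):
        _, h_n = self.gru(seq)         # h_n: (1, batch, hidden)
        return self.out(h_n[-1])       # classify from the final hidden state


class Top1MoE(nn.Module):
    """Top-1 router over the two experts with a temperature softmax."""

    def __init__(self, in_dim=256, tau=1.5):
        super().__init__()
        self.experts = nn.ModuleList([FFNNExpert(in_dim), GRUExpert(in_dim)])
        self.router = nn.Linear(in_dim, len(self.experts))
        self.tau = tau

    def forward(self, seq):
        # Gate on the [CLS] position; tau > 1 flattens the distribution.
        gate_probs = F.softmax(self.router(seq[:, 0]) / self.tau, dim=-1)
        choice = gate_probs.argmax(dim=-1)                        # (batch,)
        # Dense evaluation for clarity: run both experts, keep the chosen one.
        all_logits = torch.stack([e(seq) for e in self.experts], dim=1)
        rows = torch.arange(seq.size(0), device=seq.device)
        logits = all_logits[rows, choice]                         # (batch, 4)
        return logits, choice, gate_probs
```

Calling Top1MoE()(torch.randn(3, 7, 256)) returns class logits of shape (3, 4), the chosen expert index per sample, and the gate probabilities of shape (3, 2). A compute-sparse implementation would run each expert only on the samples routed to it.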

Training Setup

| Detail | Value |
|---|---|
| Dataset | AG News (5k sampled) |
| Loss Function | Cross-Entropy + Entropy + Diversity |
| Optimizer | AdamW |
| Epochs | 5 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Seeds Used | [0, 1, 2] (averaged) |
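
One plausible reading of the "Cross-Entropy + Entropy + Diversity" objective is sketched below. The card does not give the exact form, sign, or weight of the two regularizers, so everything beyond the cross-entropy term is an assumption: the entropy term here penalizes uncertain routing, the diversity term is a squared distance between mean expert usage and a uniform split, and lambda_ent and lambda_div are hypothetical weights.

```python
import torch
import torch.nn.functional as F


def gate_regularizers(gate_probs, eps=1e-8):
    """Return (entropy, load_balance) for gate_probs of shape (batch, n_experts)."""
    # Mean per-sample entropy of the routing distribution.
    entropy = -(gate_probs * torch.log(gate_probs + eps)).sum(dim=-1).mean()
    # Squared distance between average expert usage and a uniform split.
    mean_usage = gate_probs.mean(dim=0)
    uniform = torch.full_like(mean_usage, 1.0 / gate_probs.size(-1))
    load_balance = ((mean_usage - uniform) ** 2).sum()
    return entropy, load_balance


def total_loss(logits, labels, gate_probs, lambda_ent=0.01, lambda_div=0.01):
    """Cross-entropy plus weighted gate regularizers (weights are assumptions)."""
    ce = F.cross_entropy(logits, labels)
    entropy, load_balance = gate_regularizers(gate_probs)
    return ce + lambda_ent * entropy + lambda_div * load_balance
```

With this formulation the two regularizers pull in different directions: the entropy term rewards confident per-sample routing, while the load-balancing term discourages sending every sample to the same expert.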

Performance (Average over 3 seeds)

| Metric | Value |
|---|---|
| Accuracy | 90.02% |
| F1 Score | 89.91% |
| Inference Time | 0.0083 sec/sample |
| Expert Usage | FFNN = 20.1%, GRU = 79.9% |

The router sends most samples (79.9%) to the GRU expert, especially for classes such as "Sports" and "Sci/Tech". This suggests that the model relies more on sequential reasoning than on static feature abstraction across AG News categories.
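
Usage figures like these can be derived from the router's per-sample choices. The helper below is a minimal sketch, assuming you have collected the chosen expert index (0 = FFNN, 1 = GRU, following the expert numbering above) and the class label for each evaluated sample; the function names are hypothetical and not part of the released code.

```python
from collections import Counter, defaultdict

EXPERT_NAMES = ["FFNN", "GRU"]  # index 0 and 1, as in the architecture list


def expert_usage(choices):
    """Fraction of samples routed to each expert."""
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    return {name: counts.get(i, 0) / len(choices)
            for i, name in enumerate(EXPERT_NAMES)}


def usage_by_class(choices, labels):
    """Expert usage broken down by class label."""
    per_class = defaultdict(list)
    for choice, label in zip(choices, labels):
        per_class[label].append(choice)
    return {label: expert_usage(cs) for label, cs in per_class.items()}
```

For example, expert_usage([1, 1, 0, 1, 1]) gives a 0.2 share for FFNN and 0.8 for GRU, and usage_by_class shows whether a class such as "Sports" is routed differently from the rest.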


Files Included

  • pytorch_model.bin: Model weights
  • config.json: Custom MoE architecture config
  • tokenizer_config.json, tokenizer.json, vocab.txt, special_tokens_map.json: Tokenizer files (DistilBERT)

Example Usage

import torch
from torch.nn.functional import softmax
from transformers import AutoTokenizer

from your_model_file import Hecto  # Replace with your local Hecto class definition

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ruhzi/hecto-ffnn-gru")

# Reconstruct the model architecture, then load the weights on CPU
model = Hecto("ff", "gru", frozen=False)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Tokenize input
inputs = tokenizer("NASA launches new satellite to study space weather.", return_tensors="pt")

# Run inference; the second return value is not used here
with torch.no_grad():
    logits, _, gate_probs = model(**inputs)
    probs = softmax(logits, dim=-1)

# The batch holds a single sentence, so argmax over the class dimension yields one index
print("Predicted class:", probs.argmax(dim=-1).item())
print("Gate routing probabilities:", gate_probs)
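
To turn the printed index into a readable result, the sketch below maps it to a class name and reports the selected expert. It assumes the model keeps the standard AG News label order (World, Sports, Business, Sci/Tech) and the expert numbering from the architecture section; neither mapping is confirmed by the checkpoint itself, and describe_prediction is a hypothetical helper that takes plain Python lists such as probs[0].tolist() and gate_probs[0].tolist().

```python
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]  # assumed label order


def describe_prediction(probs, gate_probs, expert_names=("FFNN", "GRU")):
    """Summarize one prediction from per-class and per-expert probabilities."""
    class_id = max(range(len(probs)), key=probs.__getitem__)
    expert_id = max(range(len(gate_probs)), key=gate_probs.__getitem__)
    return {
        "label": AG_NEWS_LABELS[class_id],
        "confidence": probs[class_id],
        "expert": expert_names[expert_id],
    }


# Example with made-up probabilities:
print(describe_prediction([0.05, 0.05, 0.10, 0.80], [0.3, 0.7]))
```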

Note: Hecto is a custom model class, not part of transformers, so it must be defined or imported in your environment before the weights can be loaded.
Shipping the class definition in a modeling_hecto.py file alongside the weights would make the model easier to reuse.

Citation

If you use this model or architecture in your research, please cite:

@article{pandey2025hecto,
  title = {Hecto: Modular Sparse Experts for Adaptive and Interpretable Reasoning},
  author = {Pandey, Sanskar and Chopra, Ruhaan and Bhat, Saad Murtaza and Abhyudaya, Ark},
  journal = {arXiv preprint arXiv:2506.22919},
  year = {2025},
  month = {June},
  note = {Version 1 submitted June 28, 2025; version 2 updated July 1, 2025}
}
