AetherMind-KD-Student

A Robust and Efficient Knowledge-Distilled Model for Natural Language Inference (NLI)

Repository: samerzaher80/AetherMind-KD-Student
License: MIT


📘 Overview

AetherMind-KD-Student is a 184M-parameter Natural Language Inference (NLI) model distilled from a DeBERTa-v3 teacher using a multi-stage, adversarial-aware knowledge distillation pipeline.
The model is designed to provide:

  • High accuracy on standard NLI benchmarks
  • Strong robustness on adversarial datasets
  • Excellent zero-shot generalization to unseen datasets
  • High inference efficiency on consumer GPUs

This makes it suitable for research and practical applications that require fast and reliable sentence-level reasoning.


🧠 Key Features

✔ Knowledge Distillation from a Large DeBERTa-v3 Teacher

  • Teacher: DeBERTa-v3-based NLI model
  • Student: 184M-parameter transformer
  • Combined objective:
    • 70% KLDivLoss on teacher soft logits
    • 30% CrossEntropyLoss on gold labels
  • Temperature scaling (T ≈ 3.0) for softened targets
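
A minimal PyTorch sketch of this combined objective (illustrative only; the T² scaling on the KD term follows the standard Hinton-style formulation and is an assumption, not taken from the actual training code):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    # KL divergence between temperature-softened teacher and student distributions.
    # The T**2 factor (assumed) keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Supervised cross-entropy on the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce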

✔ Multi-Stage Curriculum

Teacher supervision was applied over a curriculum of NLI datasets:

  1. SNLI – core NLI patterns
  2. MNLI – multi-domain robustness
  3. ANLI R1–R3 – adversarial reasoning

✔ Training Enhancements

  • BalancedBatchSampler to keep entailment/neutral/contradiction distributions balanced per batch (a minimal sketch follows this list)
  • Emphasis on contradiction and neutral classes via loss weighting and sampling
  • Careful scheduling and early stopping based on validation performance
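
A minimal sketch of such a balanced sampler (illustrative only; the implementation actually used during training may differ):

import random
from collections import defaultdict
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yields index batches with an equal number of examples per class."""

    def __init__(self, labels, batch_size, num_classes=3):
        self.by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_class[y].append(idx)
        self.per_class = batch_size // num_classes
        self.num_batches = min(len(v) for v in self.by_class.values()) // self.per_class

    def __iter__(self):
        # Shuffle each class pool, then draw an equal slice from every class per batch.
        pools = {y: random.sample(v, len(v)) for y, v in self.by_class.items()}
        for b in range(self.num_batches):
            batch = []
            for pool in pools.values():
                batch.extend(pool[b * self.per_class:(b + 1) * self.per_class])
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return self.num_batches

# Usage: DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, batch_size=32))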

📚 Datasets

✅ Used During Training / Distillation

Dataset      | Role
-------------|-----
SNLI         | Base NLI training (entailment, neutral, contradiction)
MNLI         | Multi-genre generalization (matched + mismatched)
ANLI (R1–R3) | Adversarial robustness and hard examples
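
All three corpora are available through the Hugging Face datasets library; a quick way to pull them (the dataset IDs below are assumed to still resolve under these names; some have since moved under organization namespaces on the Hub):

from datasets import load_dataset

snli = load_dataset("snli")        # examples with label == -1 carry no gold label and are typically filtered
mnli = load_dataset("multi_nli")   # matched + mismatched validation splits
anli = load_dataset("anli")        # splits: train_r1..train_r3, dev_r1..dev_r3, test_r1..test_r3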

🚫 Not Used in Training (Zero-Shot Evaluation Only)

The following datasets were not used during training or distillation. All results on them are pure zero-shot:

Dataset        | Type                            | Notes
---------------|---------------------------------|------
RTE (GLUE)     | Textual entailment              | Zero-shot generalization
HANS           | Heuristic / syntactic bias test | Zero-shot
SciTail        | Science-domain entailment       | Evaluated in binary setting
XNLI (English) | Cross-lingual NLI test          | Zero-shot on English split

🏗 Model Architecture

The model follows a compact transformer architecture:

  • 12 Transformer encoder layers
  • Hidden size: 768
  • 12 attention heads
  • Intermediate feed-forward size as in BERT/DeBERTa-base-style models
  • Final classification head with 3 output logits:
    • 0 = entailment
    • 1 = neutral
    • 2 = contradiction

Total parameters: 184,424,451
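
The count can be checked directly against the released checkpoint:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("samerzaher80/AetherMind-KD-Student")
print(sum(p.numel() for p in model.parameters()))  # 184424451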

The design target is to match or exceed the performance of larger teacher models while remaining efficient enough for real-time inference on a single consumer GPU.


🔥 Knowledge Distillation Strategy

Objective

The total loss is a weighted combination:

  • Knowledge Distillation Loss (KLDivLoss)
    • Encourages student logits to match the teacher’s softened output distribution
  • Supervised Loss (CrossEntropy)
    • Encourages correct prediction of the gold label

Formally:

L_total = 0.7 · L_KD + 0.3 · L_CE

where L_KD uses temperature-scaled teacher logits.
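
Concretely, with student logits z_s, teacher logits z_t, and temperature T, the KD term follows the standard temperature-scaled formulation (the T² factor, which keeps gradient magnitudes comparable across temperatures, is assumed here):

L_KD = T² · KL( softmax(z_t / T) ‖ softmax(z_s / T) )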

Additional Techniques

  • Balanced batches w.r.t. class labels
  • Emphasis on contradiction / neutral examples during later stages
  • Adversarial samples from ANLI to harden reasoning under distribution shifts

📊 Evaluation Results

1️⃣ Core NLI Benchmarks

Dataset           | Split      | Accuracy | Macro-F1
------------------|------------|----------|---------
MNLI (matched)    | validation | 90.47%   | 90.42%
MNLI (mismatched) | validation | 90.12%   | 90.07%
SNLI              | test       | ~88–89%  | ~88–89%

2️⃣ Adversarial NLI (ANLI)

Dataset | Split   | Accuracy | Macro-F1
--------|---------|----------|---------
ANLI R1 | test_r1 | 73.60%   | 73.61%
ANLI R2 | test_r2 | 57.70%   | 57.60%
ANLI R3 | test_r3 | 53.67%   | 53.68%

These scores indicate strong robustness, especially considering the model’s size.


3️⃣ Zero-Shot Generalization

These datasets were never seen during training. All scores are zero-shot.

RTE (GLUE)

  • Accuracy: 86.28%
  • Macro-F1: 86.20%

HANS

  • Accuracy: 77.74%
  • Macro-F1: 76.60%

The strong performance on HANS suggests reduced dependence on shallow lexical heuristics.

SciTail (Binary Setting)

SciTail is natively a two-class dataset (entailment vs. neutral). For evaluation, the model's 3-way predictions are mapped to a binary decision:

  • Entailment → entailment
  • Neutral + contradiction → non-entailment
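
One straightforward implementation of this mapping collapses the 3-way argmax (a sketch; summing the neutral and contradiction probabilities before comparing against entailment is an equally valid variant, and the exact evaluation script may differ):

import torch

def to_binary(logits: torch.Tensor) -> torch.Tensor:
    # 3-way head: 0 = entailment, 1 = neutral, 2 = contradiction
    three_way = logits.argmax(dim=-1)
    # Collapse neutral and contradiction into a single non-entailment class:
    # 0 = entailment, 1 = non-entailment
    return (three_way != 0).long()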

Split | Accuracy | Macro-F1
------|----------|---------
Train | 82.37%   | 80.99%
Dev   | 78.83%   | 78.81%

XNLI (English, zero-shot)

  • Accuracy: 90.92%
  • Macro-F1: 90.94%

This demonstrates strong cross-domain and cross-benchmark generalization, even without explicit multilingual or XNLI-specific training.

4️⃣ Summary of All Results

Task                       | Dataset           | Split      | Accuracy | Macro-F1
---------------------------|-------------------|------------|----------|---------
Natural Language Inference | MNLI (matched)    | validation | 90.47%   | 90.42%
Natural Language Inference | MNLI (mismatched) | validation | 90.12%   | 90.07%
Natural Language Inference | SNLI              | test       | ~88–89%  | ~88–89%
Adversarial NLI            | ANLI R1           | test_r1    | 73.60%   | 73.61%
Adversarial NLI            | ANLI R2           | test_r2    | 57.70%   | 57.60%
Adversarial NLI            | ANLI R3           | test_r3    | 53.67%   | 53.68%
Zero-shot                  | RTE (GLUE)        | validation | 86.28%   | 86.20%
Zero-shot                  | HANS              | validation | 77.74%   | 76.60%
Zero-shot (binary)         | SciTail           | dev        | 78.83%   | 78.81%
Zero-shot                  | XNLI (English)    | test       | 90.92%   | 90.94%

⚡ Efficiency

Metric           | Value
-----------------|------
Total parameters | 184,424,451
Inference speed  | ≈ 308.51 samples/second
Hardware         | RTX 3050 (8 GB), CUDA 11.8

These numbers make the model a good choice for production environments and large-scale batch inference.
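
Throughput figures like these can be approximated with a simple batched loop (a sketch; batch size, sequence length, padding, and precision all affect the number, and the batch size of 32 below is an assumption):

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "samerzaher80/AetherMind-KD-Student"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).cuda().eval()

pairs = [("A cat is sleeping on the sofa.", "An animal is resting indoors.")] * 1024
start = time.time()
with torch.no_grad():
    for i in range(0, len(pairs), 32):
        premises = [p for p, _ in pairs[i:i + 32]]
        hypotheses = [h for _, h in pairs[i:i + 32]]
        enc = tokenizer(premises, hypotheses, padding=True, truncation=True,
                        return_tensors="pt").to("cuda")
        model(**enc)
torch.cuda.synchronize()
print(f"{len(pairs) / (time.time() - start):.1f} samples/second")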


🧪 Intended Use

Recommended Uses

  • Research on NLI, robustness, and knowledge distillation
  • As a drop-in NLI component for:
    • Scientific text understanding
    • Claim verification prototypes
    • General English reasoning tasks
  • Zero-shot probing on new NLI-style benchmarks

Not Recommended For

  • Safety-critical applications (medical diagnosis, legal decisions, etc.) without human experts in the loop
  • High-stakes multilingual use cases (model is trained and validated on English only)
  • Long-document reasoning beyond typical transformer context length

⚠ Limitations

  • Performance on ANLI R3 remains challenging, consistent with broader model behavior in the literature
  • No dedicated multilingual training (XNLI non-English languages not evaluated)
  • No explicit calibration of probabilities (users may wish to post-calibrate logits, e.g. with the temperature-scaling sketch below)
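
If calibrated probabilities matter downstream, post-hoc temperature scaling on held-out validation logits is a simple remedy (a sketch of the standard recipe from Guo et al., 2017; not part of this repository):

import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=200):
    # logits: [N, 3] detached validation logits; labels: [N] gold labels.
    T = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.detach()  # divide future logits by T before the softmax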

🔮 Future Work

Planned and possible future enhancements include:

  • Adversarial fine-tuning specifically for ANLI R3
  • Cross-lingual extensions using full XNLI
  • Domain adapters for biomedical and clinical NLI (e.g., MedNLI)
  • Integration in larger cognitive reasoning systems with memory and tool-use (outside the scope of this model card)

📦 Files in This Repository

  • config.json – model configuration
  • model.safetensors – model weights
  • tokenizer.json – tokenizer model
  • tokenizer_config.json – tokenizer configuration
  • special_tokens_map.json – special tokens metadata
  • spm.model – SentencePiece model (if applicable)
  • added_tokens.json – additional tokens (if any)
  • training_args.bin – training arguments (optional, for reproducibility)
  • trainer_state.json – trainer state (optional, for reproducibility)

💻 Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting indoors."

# Encode the (premise, hypothesis) pair as a single sequence-pair input.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probs = logits.softmax(dim=-1)       # class probabilities (uncalibrated)
pred = logits.argmax(dim=-1).item()

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
print(id2label[pred])

📜 Citation

If you use this model in your research, please cite:

@misc{aethermind2025kdstudent,
  title        = {AetherMind-KD-Student: A Robust and Efficient Knowledge-Distilled NLI Model},
  author       = {Sameer S. Najm},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  note         = {\url{https://huggingface.co/samerzaher80/AetherMind-KD-Student}}
}

👤 Author

Sameer S. Najm
AI Researcher & Founder, Sam IT Solutions – Iraq


🪪 License

This model is released under the MIT License.
