AetherMind-KD-Student

A Robust and Efficient Knowledge-Distilled Model for Natural Language Inference (NLI)

Repository: samerzaher80/AetherMind-KD-Student
License: MIT


📘 Overview

AetherMind-KD-Student is a 184M-parameter Natural Language Inference (NLI) model distilled from a DeBERTa-v3 teacher using a multi-stage, adversarial-aware knowledge distillation pipeline.
The model is designed to provide:

  • High accuracy on standard NLI benchmarks
  • Strong robustness on adversarial datasets
  • Excellent zero-shot generalization to unseen datasets
  • High inference efficiency on consumer GPUs

This makes it suitable for research and practical applications that require fast and reliable sentence-level reasoning.


🧠 Key Features

✔ Knowledge Distillation from a Large DeBERTa-v3 Teacher

  • Teacher: DeBERTa-v3-based NLI model
  • Student: 184M-parameter transformer
  • Combined objective:
    • 70% KLDivLoss on teacher soft logits
    • 30% CrossEntropyLoss on gold labels
  • Temperature scaling (T ≈ 3.0) for softened targets
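
A minimal PyTorch sketch of this combined objective (illustrative only; the T² scaling on the KD term follows the standard Hinton-style formulation and is an assumption, not taken from the actual training code):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    # KL divergence between temperature-softened teacher and student distributions.
    # The T**2 factor (assumed) keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Supervised cross-entropy on the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce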

✔ Multi-Stage Curriculum

Teacher supervision was applied over a curriculum of NLI datasets:

  1. SNLI – core NLI patterns
  2. MNLI – multi-domain robustness
  3. ANLI R1–R3 – adversarial reasoning

✔ Training Enhancements

  • BalancedBatchSampler to keep entailment/neutral/contradiction distributions balanced per batch (a minimal sketch follows this list)
  • Emphasis on contradiction and neutral classes via loss weighting and sampling
  • Careful scheduling and early stopping based on validation performance
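
A minimal sketch of such a balanced sampler (illustrative only; the implementation actually used during training may differ):

import random
from collections import defaultdict
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yields index batches with an equal number of examples per class."""

    def __init__(self, labels, batch_size, num_classes=3):
        self.by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_class[y].append(idx)
        self.per_class = batch_size // num_classes
        self.num_batches = min(len(v) for v in self.by_class.values()) // self.per_class

    def __iter__(self):
        # Shuffle each class pool, then draw an equal slice from every class per batch.
        pools = {y: random.sample(v, len(v)) for y, v in self.by_class.items()}
        for b in range(self.num_batches):
            batch = []
            for pool in pools.values():
                batch.extend(pool[b * self.per_class:(b + 1) * self.per_class])
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return self.num_batches

# Usage: DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, batch_size=32))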

📚 Datasets

✅ Used During Training / Distillation

Dataset      | Role
-------------|-----
SNLI         | Base NLI training (entailment, neutral, contradiction)
MNLI         | Multi-genre generalization (matched + mismatched)
ANLI (R1–R3) | Adversarial robustness and hard examples
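
All three corpora are available through the Hugging Face datasets library; a quick way to pull them (the dataset IDs below are assumed to still resolve under these names; some have since moved under organization namespaces on the Hub):

from datasets import load_dataset

snli = load_dataset("snli")        # examples with label == -1 carry no gold label and are typically filtered
mnli = load_dataset("multi_nli")   # matched + mismatched validation splits
anli = load_dataset("anli")        # splits: train_r1..train_r3, dev_r1..dev_r3, test_r1..test_r3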

🚫 Not Used in Training (Zero-Shot Evaluation Only)

The following datasets were not used during training or distillation. All results on them are pure zero-shot:

Dataset        | Type                            | Notes
---------------|---------------------------------|------
RTE (GLUE)     | Textual entailment              | Zero-shot generalization
HANS           | Heuristic / syntactic bias test | Zero-shot
SciTail        | Science-domain entailment       | Evaluated in binary setting
XNLI (English) | Cross-lingual NLI test          | Zero-shot on English split

🏗 Model Architecture

The model follows a compact transformer architecture:

  • 12 Transformer encoder layers
  • Hidden size: 768
  • 12 attention heads
  • Intermediate feed-forward size as in BERT/DeBERTa-base-style models
  • Final classification head with 3 output logits:
    • 0 = entailment
    • 1 = neutral
    • 2 = contradiction

Total parameters: 184,424,451
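
The count can be checked directly against the released checkpoint:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("samerzaher80/AetherMind-KD-Student")
print(sum(p.numel() for p in model.parameters()))  # 184424451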

The design target is to match or exceed the performance of larger teacher models while remaining efficient enough for real-time inference on a single consumer GPU.


🔥 Knowledge Distillation Strategy

Objective

The total loss is a weighted combination:

  • Knowledge Distillation Loss (KLDivLoss)
    • Encourages student logits to match the teacher’s softened output distribution
  • Supervised Loss (CrossEntropy)
    • Encourages correct prediction of the gold label

Formally:

L_total = 0.7 · L_KD + 0.3 · L_CE

where L_KD uses temperature-scaled teacher logits.
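
Concretely, with student logits z_s, teacher logits z_t, and temperature T, the KD term follows the standard temperature-scaled formulation (the T² factor, which keeps gradient magnitudes comparable across temperatures, is assumed here):

L_KD = T² · KL( softmax(z_t / T) ‖ softmax(z_s / T) )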

Additional Techniques

  • Balanced batches w.r.t. class labels
  • Emphasis on contradiction / neutral examples during later stages
  • Adversarial samples from ANLI to harden reasoning under distribution shifts

📊 Evaluation Results

1️⃣ Core NLI Benchmarks

Dataset           | Split      | Accuracy | Macro-F1
------------------|------------|----------|---------
MNLI (matched)    | validation | 90.47%   | 90.42%
MNLI (mismatched) | validation | 90.12%   | 90.07%
SNLI              | test       | ~88–89%  | ~88–89%

2️⃣ Adversarial NLI (ANLI)

Dataset | Split   | Accuracy | Macro-F1
--------|---------|----------|---------
ANLI R1 | test_r1 | 73.60%   | 73.61%
ANLI R2 | test_r2 | 57.70%   | 57.60%
ANLI R3 | test_r3 | 53.67%   | 53.68%

These scores indicate strong robustness, especially considering the model’s size.


3️⃣ Zero-Shot Generalization

These datasets were never seen during training. All scores are zero-shot.

RTE (GLUE)

  • Accuracy: 86.28%
  • Macro-F1: 86.20%

HANS

  • Accuracy: 77.74%
  • Macro-F1: 76.60%

The strong performance on HANS suggests reduced dependence on shallow lexical heuristics.

SciTail (Binary Setting)

SciTail is natively a two-class dataset (entailment vs. neutral). For evaluation, the model's 3-way predictions are mapped to a binary decision:

  • Entailment → entailment
  • Neutral + contradiction → non-entailment
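
One straightforward implementation of this mapping collapses the 3-way argmax (a sketch; summing the neutral and contradiction probabilities before comparing against entailment is an equally valid variant, and the exact evaluation script may differ):

import torch

def to_binary(logits: torch.Tensor) -> torch.Tensor:
    # 3-way head: 0 = entailment, 1 = neutral, 2 = contradiction
    three_way = logits.argmax(dim=-1)
    # Collapse neutral and contradiction into a single non-entailment class:
    # 0 = entailment, 1 = non-entailment
    return (three_way != 0).long()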

Split | Accuracy | Macro-F1
------|----------|---------
Train | 82.37%   | 80.99%
Dev   | 78.83%   | 78.81%

XNLI (English, zero-shot)

  • Accuracy: 90.92%
  • Macro-F1: 90.94%

This demonstrates strong cross-domain and cross-benchmark generalization, even without explicit multilingual or XNLI-specific training.

4️⃣ Summary of All Results

Task                       | Dataset           | Split      | Accuracy | Macro-F1
---------------------------|-------------------|------------|----------|---------
Natural Language Inference | MNLI (matched)    | validation | 90.47%   | 90.42%
Natural Language Inference | MNLI (mismatched) | validation | 90.12%   | 90.07%
Natural Language Inference | SNLI              | test       | ~88–89%  | ~88–89%
Adversarial NLI            | ANLI R1           | test_r1    | 73.60%   | 73.61%
Adversarial NLI            | ANLI R2           | test_r2    | 57.70%   | 57.60%
Adversarial NLI            | ANLI R3           | test_r3    | 53.67%   | 53.68%
Zero-shot                  | RTE (GLUE)        | validation | 86.28%   | 86.20%
Zero-shot                  | HANS              | validation | 77.74%   | 76.60%
Zero-shot (binary)         | SciTail           | dev        | 78.83%   | 78.81%
Zero-shot                  | XNLI (English)    | test       | 90.92%   | 90.94%

⚡ Efficiency

Metric           | Value
-----------------|------
Total parameters | 184,424,451
Inference speed  | ≈ 308.51 samples/second
Hardware         | RTX 3050 (8 GB), CUDA 11.8

These numbers make the model a good choice for production environments and large-scale batch inference.
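
Throughput figures like these can be approximated with a simple batched loop (a sketch; batch size, sequence length, padding, and precision all affect the number, and the batch size of 32 below is an assumption):

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "samerzaher80/AetherMind-KD-Student"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).cuda().eval()

pairs = [("A cat is sleeping on the sofa.", "An animal is resting indoors.")] * 1024
start = time.time()
with torch.no_grad():
    for i in range(0, len(pairs), 32):
        premises = [p for p, _ in pairs[i:i + 32]]
        hypotheses = [h for _, h in pairs[i:i + 32]]
        enc = tokenizer(premises, hypotheses, padding=True, truncation=True,
                        return_tensors="pt").to("cuda")
        model(**enc)
torch.cuda.synchronize()
print(f"{len(pairs) / (time.time() - start):.1f} samples/second")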


🧪 Intended Use

Recommended Uses

  • Research on NLI, robustness, and knowledge distillation
  • As a drop-in NLI component for:
    • Scientific text understanding
    • Claim verification prototypes
    • General English reasoning tasks
  • Zero-shot probing on new NLI-style benchmarks

Not Recommended For

  • Safety-critical applications (medical diagnosis, legal decisions, etc.) without human experts in the loop
  • High-stakes multilingual use cases (model is trained and validated on English only)
  • Long-document reasoning beyond typical transformer context length

⚠ Limitations

  • Performance on ANLI R3 remains challenging, consistent with broader model behavior in the literature
  • No dedicated multilingual training (XNLI non-English languages not evaluated)
  • No explicit calibration of probabilities (users may wish to post-calibrate logits, e.g. with the temperature-scaling sketch below)
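
If calibrated probabilities matter downstream, post-hoc temperature scaling on held-out validation logits is a simple remedy (a sketch of the standard recipe from Guo et al., 2017; not part of this repository):

import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=200):
    # logits: [N, 3] detached validation logits; labels: [N] gold labels.
    T = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.detach()  # divide future logits by T before the softmax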

🔮 Future Work

Planned and possible future enhancements include:

  • Adversarial fine-tuning specifically for ANLI R3
  • Cross-lingual extensions using full XNLI
  • Domain adapters for biomedical and clinical NLI (e.g., MedNLI)
  • Integration in larger cognitive reasoning systems with memory and tool-use (outside the scope of this model card)

📦 Files in This Repository

  • config.json – model configuration
  • model.safetensors – model weights
  • tokenizer.json – tokenizer model
  • tokenizer_config.json – tokenizer configuration
  • special_tokens_map.json – special tokens metadata
  • spm.model – SentencePiece model (if applicable)
  • added_tokens.json – additional tokens (if any)
  • training_args.bin – training arguments (optional, for reproducibility)
  • trainer_state.json – trainer state (optional, for reproducibility)

💻 Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting indoors."

# Encode the (premise, hypothesis) pair as a single sequence-pair input.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probs = logits.softmax(dim=-1)       # class probabilities (uncalibrated)
pred = logits.argmax(dim=-1).item()

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
print(id2label[pred])

📜 Citation

If you use this model in your research, please cite:

@misc{aethermind2025kdstudent,
  title        = {AetherMind-KD-Student: A Robust and Efficient Knowledge-Distilled NLI Model},
  author       = {Sameer S. Najm},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  note         = {\url{https://huggingface.co/samerzaher80/AetherMind-KD-Student}}
}

👤 Author

Sameer S. Najm
AI Researcher & Founder, Sam IT Solutions – Iraq


🪪 License

This model is released under the MIT License.
