LeeSek/binary-dockerfile-model


LeeSek committed on Jun 3

Commit 7532aef · verified · 1 parent: 686fe96

Upload README.md

Files changed (1)

README.md +113 -0


	@@ -0,0 +1,113 @@

+# 🧱 Dockerfile Quality Classifier – Binary Model
+This model predicts whether a given Dockerfile is:
+- ✅ **GOOD** – clean and follows best practices (violates none of the top Hadolint rules)
+- ❌ **BAD** – violates at least one of the top Hadolint rules
+It is the first step in a full ML-based Dockerfile linter.
+---
+## 🧠 Model Overview
+- **Architecture:** Fine-tuned `microsoft/codebert-base`
+- **Task:** Binary classification (`good` vs `bad`)
+- **Input:** Full Dockerfile content as plain text
+- **Output:** `[prob_good, prob_bad]`, the softmax probabilities for the two classes
+- **Max input length:** 512 tokens
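The output pair can be read as follows. This is a minimal pure-Python sketch of how two raw logits become `[prob_good, prob_bad]`. The helper names `softmax` and `to_label` and the example logits are illustrative assumptions, not part of the model's API; the index order (0 = good, 1 = bad) follows the output description above.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_label(probs):
    """Map [prob_good, prob_bad] to a label; index 0 is assumed to be 'good'."""
    return "GOOD" if probs[0] >= probs[1] else "BAD"

# Illustrative logits only; real values come from the model.
probs = softmax([2.0, -1.0])
print(to_label(probs), [round(p, 3) for p in probs])
```

With these made-up logits the sketch prints `GOOD [0.953, 0.047]`.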
+---
+## 📚 Training Details
+- **Data source:** Real-world and synthetic Dockerfiles
+- **Labels:** Derived from the top 30 [Hadolint](https://github.com/hadolint/hadolint) rules
+- **Bad examples:** Violate at least one of those rules
+- **Good examples:** Violate none of them
+- **Dataset balance:** 50/50
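The labelling rule above can be sketched in a few lines. This is a hedged illustration, not the author's pipeline: it assumes findings in the shape Hadolint emits with `--format json` (a list of objects, each with a `code` field), and `TRACKED_RULES` is a hypothetical placeholder because the 30 rules actually used are not listed here.

```python
import json

# Hypothetical subset; the model card does not list the 30 tracked rules.
TRACKED_RULES = {"DL3006", "DL3007", "DL3008"}

def label_from_findings(findings, tracked=TRACKED_RULES):
    """Return 'bad' if any finding's rule code is tracked, else 'good'."""
    codes = {finding["code"] for finding in findings}
    return "bad" if codes & tracked else "good"

# Example findings shaped like `hadolint --format json` output (assumed).
sample = json.loads(
    '[{"line": 1, "code": "DL3006", "level": "warning", '
    '"message": "Always tag the version of an image explicitly"}]'
)
print(label_from_findings(sample))  # one tracked rule violated
print(label_from_findings([]))      # no findings at all
```

A file whose findings all fall outside the tracked set is still labelled `good` under this rule, which is why the choice of tracked rules matters.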
+---
+## 🧪 Evaluation Results
+Evaluation on a held-out test set of 1,650 Dockerfiles (150 good, 1,500 bad):
+| Class | Precision | Recall | F1-score | Support |
+|-------|-----------|--------|----------|---------|
+| good  | 0.96      | 0.91   | 0.93     | 150     |
+| bad   | 0.99      | 1.00   | 0.99     | 1500    |
+| **Accuracy** |       |        | **0.99** | 1650    |
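As a sanity check, the F1 scores and overall accuracy can be re-derived from the precision, recall, and support columns. The sketch below uses only the rounded figures from the table, so it can agree with the reported values only to two decimals; the variable names are illustrative.

```python
# Rounded per-class figures copied from the table above.
rows = {
    "good": {"precision": 0.96, "recall": 0.91, "support": 150},
    "bad": {"precision": 0.99, "recall": 1.00, "support": 1500},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(r["precision"], r["recall"]) for name, r in rows.items()}

# Accuracy is the support-weighted average of per-class recall.
total = sum(r["support"] for r in rows.values())
accuracy = sum(r["recall"] * r["support"] for r in rows.values()) / total

# Macro F1 weights both classes equally, which matters here because the
# test set has ten times more "bad" than "good" files.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print({name: round(score, 2) for name, score in f1_scores.items()})
print(round(accuracy, 2), round(macro_f1, 2))
```

With these inputs the F1 scores round to 0.93 and 0.99 and the accuracy to 0.99, matching the table; the macro-averaged F1 comes to about 0.96.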
+---
+## 🚀 Quick Start
+### 🧪 Step 1 — Create a test script
+Save this as `test_binary_predict.py`:
+```python
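+import sys
+from pathlib import Path
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Read the Dockerfile whose path is given as the first command-line argument.
+path = Path(sys.argv[1])
+text = path.read_text(encoding="utf-8")
+# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
+tokenizer = AutoTokenizer.from_pretrained("LeeSek/binary-dockerfile-model")
+model = AutoModelForSequenceClassification.from_pretrained("LeeSek/binary-dockerfile-model")
+model.eval()
+# Tokenize, truncating anything beyond the model's 512-token limit.
+inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
+# Run inference without gradient tracking and turn the logits into probabilities.
+with torch.no_grad():
+    logits = model(**inputs).logits
+    probs = torch.nn.functional.softmax(logits, dim=1).squeeze()
+# Index 0 is "good" and index 1 is "bad", matching the documented output order.
+label = "GOOD" if torch.argmax(probs).item() == 0 else "BAD"
+print(f"Prediction: {label} — Probabilities: good={probs[0]:.3f}, bad={probs[1]:.3f}")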
+```
+---
+### 📄 Step 2 — Create a test Dockerfile
+Save the following as `Dockerfile`:
+```dockerfile
+FROM node:18
+WORKDIR /app
+COPY . .
+RUN npm install
+CMD ["node", "index.js"]
+```
+---
+### ▶️ Step 3 — Run the prediction
+```bash
+python test_binary_predict.py Dockerfile
+```
+Expected output:
+```
+Prediction: GOOD — Probabilities: good=0.998, bad=0.002
+```
+---
+## 📘 License
+MIT
+---
+## 🙌 Credits
+- Model powered by [Hugging Face Transformers](https://huggingface.co/transformers)
+- Base model and tokenizer: CodeBERT (`microsoft/codebert-base`)
+- Rule definitions: [Hadolint](https://github.com/hadolint/hadolint)