rabbitfishai
/

docmap-opensource-leads-classifier-deberta-v3-base

@@ -1,24 +1,124 @@
 ---
-license: apache-2.0
-pipeline_tag: text-classification
 language:
 - en
 tags:
-- leads
 - healthcare
-- classifier
-- intent
-- multi-label
-- deberta
-base_model:
-- microsoft/deberta-large-mnli
 ---
-# DocMap Leads Classifier (DeBERTa-v3-base)
-Predicts:
-- intent (single label)
-- symptoms (multi-label)
-- specialties (multi-label)
-Load the model and heads using your service code; apply sigmoid thresholds (e.g., 0.4–0.5) for multi-label outputs.

 ---
+license: mit
 language:
 - en
 tags:
+- text-classification
+- multi-label-classification
+- intent-classification
 - healthcare
+- social-media
+- uk
+library_name: transformers
+pipeline_tag: text-classification
 ---
+## DocMap Leads Classifier (UK healthcare, social media)
+Multi-head encoder classifier built on `microsoft/deberta-v3-base` to identify:
+- Intent (single label)
+- Symptoms (multi-label)
+- Specialties (multi-label)
+Trained on DocMap leads_v1 JSONL (Supabase), using simple weak labels derived from keywords/zero-shot prompts. See `label_config.json` for the canonical label spaces.
+### What’s in this repo
+- `model.safetensors`: classifier heads + encoder weights
+- `label_config.json`: lists of `intents`, `symptoms`, `specialties`
+- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `spm.model`
+- `README.md`, `.gitattributes`
+- (Checkpoints may be removed for smaller repo size)
+### Intended use
+- Lead identification from public social media text in a UK healthcare context.
+- Outputs: intent label and multi-label sets for symptoms/specialties.
+Not a medical device. Do not use for diagnosis; for triage or marketing pre-filtering only.
+### Training
+- Base: `microsoft/deberta-v3-base`
+- Epochs: 3, batch size: 16, lr: 2e-5, max_len: 256
+- Split: 10% validation
+- Threshold sweep on validation suggested `0.3` as default for multi-label heads.
+### Quick start (Python)
+This model uses a lightweight custom head. Load with the snippet below (no HF widget).
+```python
+import json, torch
+import torch.nn as nn
+from transformers import AutoTokenizer, AutoModel
+repo_id = "YOUR_USER_OR_ORG/docmap-leads-classifier-v1"  # change to your repo
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load labels
+from huggingface_hub import hf_hub_download
+cfg_path = hf_hub_download(repo_id, "label_config.json")
+with open(cfg_path, "r") as f:
+    cfg = json.load(f)
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+class LeadsClassifier(nn.Module):
+    def __init__(self, base_model_name, num_intents, num_symptoms, num_specialties):
+        super().__init__()
+        self.encoder = AutoModel.from_pretrained(base_model_name)
+        hidden = self.encoder.config.hidden_size
+        self.dropout = nn.Dropout(0.1)
+        self.intent_head = nn.Linear(hidden, num_intents)
+        self.sym_head = nn.Linear(hidden, num_symptoms)
+        self.spec_head = nn.Linear(hidden, num_specialties)
+    def forward(self, input_ids=None, attention_mask=None):
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        cls = self.dropout(out.last_hidden_state[:, 0, :])
+        return {
+            "intent_logits": self.intent_head(cls),
+            "sym_logits": self.sym_head(cls),
+            "spec_logits": self.spec_head(cls),
+        }
+model = LeadsClassifier(
+    base_model_name="microsoft/deberta-v3-base",
+    num_intents=len(cfg["intents"]),
+    num_symptoms=len(cfg["symptoms"]),
+    num_specialties=len(cfg["specialties"]),
+).to(device)
+# Load weights
+sd_path = hf_hub_download(repo_id, "model.safetensors")
+model.load_state_dict(torch.load(sd_path, map_location=device, weights_only=True))
+model.eval()
+def predict(texts, thr=0.3):
+    batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt").to(device)
+    with torch.no_grad():
+        out = model(**batch)
+        intent = torch.softmax(out["intent_logits"], dim=-1).argmax(dim=-1).tolist()
+        sym_prob = torch.sigmoid(out["sym_logits"])
+        spec_prob = torch.sigmoid(out["spec_logits"])
+    intents = [cfg["intents"][i] for i in intent]
+    symptoms = [[cfg["symptoms"][j] for j, p in enumerate(row) if p >= thr] for row in sym_prob.tolist()]
+    specialties = [[cfg["specialties"][j] for j, p in enumerate(row) if p >= thr] for row in spec_prob.tolist()]
+    return [{"intent": i, "symptoms": s, "specialties": sp} for i, s, sp in zip(intents, symptoms, specialties)]
+print(predict(["Any advice on fever? Based in Glasgow, started 3 days ago."], thr=0.3))
+```
+### Inference thresholds
+- Default multi-label threshold: **0.3** (from validation sweep).
+- Tune per use-case; 0.5 is stricter, 0.2 more sensitive.
+### Limitations and risks
+- Weakly supervised labels; potential label noise and leakage.
+- Social media domain; may not generalize to clinical text.
+- Not for medical diagnosis or emergency advice.
+### License
+- MIT (inherits base model’s MIT license).
+### Citation
+Please cite the base model and this repository if you use it in research or production.