---
language:
- en
- bem
- ny
tags:
- multi-task
- sentiment-analysis
- topic-classification
- language-identification
- multilingual
- transformer
- zambia
- lusaka
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis
- Kelvinmbewe/mbert_LusakaLang_Topic
model-index:
- name: LusakaLang-MultiTask
  results:
  - task:
      type: text-classification
      name: Language Identification
    dataset:
      name: LusakaLang Language Data
      type: lusakalang
      split: test
    metrics:
    - type: accuracy
      value: 0.97
      name: accuracy
    - type: f1
      value: 0.96
      name: f1_macro
    - type: accuracy
      value: 0.9322
      name: accuracy
    - type: f1
      value: 0.9216
      name: f1_macro
    - type: f1
      value: 0.8649
      name: f1_negative
    - type: f1
      value: 0.95
      name: f1_neutral
    - type: f1
      value: 0.95
      name: f1_positive
    - type: accuracy
      value: 0.91
      name: accuracy
    - type: f1
      value: 0.9
      name: f1_macro
---

## **LusakaLang MultiTask Model**

This model is a unified transformer architecture built on top of `bert-base-multilingual-cased`, designed to perform three tasks simultaneously:

1. Language Identification
2. Sentiment Analysis
3. Topic Classification

The system integrates three fine-tuned LusakaLang checkpoints:

- mbert_Lusaka_Language_Analysis
- mbert_LusakaLang_Sentiment_Analysis
- mbert_LusakaLang_Topic

All tasks share a single mBERT encoder, supported by three independent classifier heads. This architecture improves computational efficiency, reduces memory overhead, and promotes consistent, harmonized predictions across all tasks.

---

## **Why This Model Matters**

Zambian communication is inherently multilingual, fluid, and deeply shaped by context. A single message may blend English, Bemba, Nyanja, and local slang, with frequent code-switching, often expressed through culturally grounded idioms and subtle emotional cues.
This model is designed specifically for that environment, where meaning depends not only on the words used but on how languages interact within a single utterance. It can:

- identify the dominant language, or detect when multiple languages are used together;
- interpret sentiment even when it is conveyed indirectly or through culturally specific phrasing;
- classify text into practical topics such as driver behaviour, payment issues, app performance, customer support, and ride availability.

By capturing these nuances, the model provides a more accurate and context-aware understanding of real Zambian communication.

---

## **How to Use This Model**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F


class LusakaLangMultiTask:
    """
    LusakaLang MultiTask Model:
    - Language Identification
    - Sentiment Analysis
    - Topic Classification
    """

    def __init__(self, path="Kelvinmbewe/LusakaLang-MultiTask",
                 lang_temp=1.0, sent_temp=1.0, topic_temp=1.0):
        # Load tokenizer and models
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.lang_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_Lusaka_Language_Analysis"
        )
        self.sent_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis"
        )
        self.topic_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_LusakaLang_Topic"
        )

        # ID2Label mappings
        self.lang_id2label = self.lang_model.config.id2label
        self.sent_id2label = self.sent_model.config.id2label
        self.topic_id2label = self.topic_model.config.id2label

        # Temperature scaling
        self.lang_temp = lang_temp
        self.sent_temp = sent_temp
        self.topic_temp = topic_temp

        # Device setup
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        for model in (self.lang_model, self.sent_model, self.topic_model):
            model.to(self.device)
            model.eval()  # disable dropout for inference

    def predict_batch(self, texts, conf_threshold=0.5, batch_size=16):
        """
        Predict language, sentiment, and topic for a list of texts.
        Returns a list of dicts:
        [{"text": ..., "language": ..., "sentiment": ..., "topic": ...}, ...]
        """
        results = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch_texts,
                return_tensors="pt",
                truncation=True,
                padding=True
            ).to(self.device)

            with torch.no_grad():
                # Language
                lang_logits = self.lang_model(**inputs).logits / self.lang_temp
                lang_probs = F.softmax(lang_logits, dim=-1)
                lang_conf, lang_idx = torch.max(lang_probs, dim=-1)

                # Sentiment
                sent_logits = self.sent_model(**inputs).logits / self.sent_temp
                sent_probs = F.softmax(sent_logits, dim=-1)
                sent_conf, sent_idx = torch.max(sent_probs, dim=-1)

                # Topic
                topic_logits = self.topic_model(**inputs).logits / self.topic_temp
                topic_probs = F.softmax(topic_logits, dim=-1)
                topic_conf, topic_idx = torch.max(topic_probs, dim=-1)

            for j, text in enumerate(batch_texts):
                results.append({
                    "text": text,
                    "language": (self.lang_id2label[lang_idx[j].item()]
                                 if lang_conf[j].item() >= conf_threshold
                                 else "unknown"),
                    "language_conf": round(lang_conf[j].item(), 3),
                    "sentiment": self.sent_id2label[sent_idx[j].item()],
                    "sentiment_conf": round(sent_conf[j].item(), 3),
                    "topic": (self.topic_id2label[topic_idx[j].item()]
                              if topic_conf[j].item() >= conf_threshold
                              else "unknown"),
                    "topic_conf": round(topic_conf[j].item(), 3)
                })
        return results


# ================= Example Usage =================
llm = LusakaLangMultiTask(lang_temp=1.2, sent_temp=0.93, topic_temp=1.5)

samples = [
    "Driver was rude, shouting all the way",
    "Payment failed, money deducted but no ride",
    "Support did not reply to my complaint",
    "Umudriver alisala sana, alelanda ifintu ifipusa",
]

predictions = llm.predict_batch(samples, conf_threshold=0.5)

for p in predictions:
    print(f"TEXT: {p['text']}")
    print(f"  Language : {p['language']} (conf={p['language_conf']})")
    print(f"  Sentiment: {p['sentiment']} (conf={p['sentiment_conf']})")
    print(f"  Topic    : {p['topic']} (conf={p['topic_conf']})\n")
```

## Sample Output

```text
TEXT: Driver was rude, shouting all the way
  Language : English (conf=0.999)
  Sentiment: Negative (conf=0.968)
  Topic    : Driver Behaviour (conf=0.807)

TEXT: Payment failed, money deducted but no ride
  Language : English (conf=0.996)
  Sentiment: Neutral (conf=0.708)
  Topic    : Payment Issue (conf=0.634)

TEXT: Support did not reply to my complaint
  Language : English (conf=1.0)
  Sentiment: Negative (conf=0.960)
  Topic    : Customer Support (conf=0.984)

TEXT: Umudriver alisala sana, alelanda ifintu ifipusa
  Language : Bemba (conf=0.958)
  Sentiment: Negative (conf=0.874)
  Topic    : Driver Behaviour (conf=0.812)
```

## Sentiment by Topic

| Topic            | Negative | Neutral | Positive |
|------------------|----------|---------|----------|
| driver_behaviour | 82%      | 14%     | 4%       |
| payment_issues   | 76%      | 20%     | 4%       |
| app_issues       | 68%      | 25%     | 7%       |
| support_issues   | 74%      | 21%     | 5%       |
| others           | 29%      | 56%     | 15%      |

## Language vs Topic

| Language | Driver | Payment | App | Support | Others |
|----------|--------|---------|-----|---------|--------|
| english  | 38%    | 24%     | 15% | 10%     | 13%    |
| bemba    | 41%    | 18%     | 12% | 14%     | 15%    |
| nyanja   | 36%    | 19%     | 17% | 13%     | 15%    |
| mixed    | 22%    | 11%     | 18% | 16%     | 33%    |

## Sample Executive Summary

Out of 10,000 customer complaints:

- 61% are negative
- 34% relate to driver behaviour
- 21% involve payment issues
- 17% of texts were classified as mixed language
- The topic model shows lower confidence than the sentiment model
- The "others" category remains relatively high (18%)

## Training Architecture

```text
=============================  Training Architecture  =============================

πŸ“₯ Input              β†’  🧠 Core Engine               β†’  πŸ“ˆ Output
-----------------------------------------------------------------------------------
Text (Any Language)   β†’  Tokenizer πŸ”€                 β†’  Language 🌍
                      β†’  Shared mBERT Encoder 🧠      β†’  Bemba / Nyanja /
                      β†’  CLS Vector 🎯                   English / Mixed
-----------------------------------------------------------------------------------
User Feedback πŸ’¬      β†’  Tokenizer πŸ”€                 β†’  Sentiment ❀️
                      β†’  Shared Encoder 🧠            β†’  Negative / Neutral /
                      β†’  CLS Vector 🎯                   Positive
-----------------------------------------------------------------------------------
Ride Context πŸš—       β†’  Tokenizer πŸ”€                 β†’  Topic πŸ—‚οΈ
                      β†’  Shared Encoder 🧠            β†’  Driver / Payment /
                      β†’  CLS Vector 🎯                   Support / App / Availability
-----------------------------------------------------------------------------------
```
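The shared-encoder design described above (one mBERT encoder, three independent classifier heads reading the same [CLS] vector) can be sketched as a small PyTorch module. This is an illustrative sketch only, not the released implementation: the class name `SharedEncoderMultiTask`, the hidden size of 768 (mBERT's default), and the label counts (4 languages, 3 sentiments, 5 topics) are assumptions for demonstration.

```python
# Illustrative sketch, NOT the released implementation: one shared encoder
# feeds three independent linear heads. Hidden size (768) and label counts
# (4 languages, 3 sentiments, 5 topics) are assumed for demonstration.
import torch
import torch.nn as nn


class SharedEncoderMultiTask(nn.Module):
    def __init__(self, encoder, hidden_size=768,
                 n_langs=4, n_sents=3, n_topics=5):
        super().__init__()
        # e.g. AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.encoder = encoder
        self.lang_head = nn.Linear(hidden_size, n_langs)
        self.sent_head = nn.Linear(hidden_size, n_sents)
        self.topic_head = nn.Linear(hidden_size, n_topics)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] vector shared by all heads
        return {
            "language": self.lang_head(cls),
            "sentiment": self.sent_head(cls),
            "topic": self.topic_head(cls),
        }
```

Because the encoder's forward pass runs once per batch and only the three small linear heads differ, this layout is what gives the efficiency and memory savings the card describes.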
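The `lang_temp`, `sent_temp`, and `topic_temp` parameters in the usage example apply temperature scaling: logits are divided by a temperature before the softmax, so a temperature above 1 flattens the distribution (lower peak confidence) and a temperature below 1 sharpens it. A minimal stand-alone illustration of the effect, in pure Python with made-up logits:

```python
# Temperature scaling: divide logits by a temperature before the softmax.
# temp > 1 flattens the distribution; temp < 1 sharpens it. Logits are made up.
import math

def softmax_with_temperature(logits, temp=1.0):
    scaled = [x / temp for x in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]
p_sharp = softmax_with_temperature(logits, temp=0.93)  # like sent_temp above
p_flat = softmax_with_temperature(logits, temp=1.5)    # like topic_temp above
print(max(p_sharp) > max(p_flat))  # True: higher temperature lowers top confidence
```

This is why, in the example usage, the topic head (temperature 1.5) reports systematically lower confidences than the sentiment head (temperature 0.93).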