---
language:
  - en
  - bem
  - ny
tags:
  - multi-task
  - sentiment-analysis
  - topic-classification
  - language-identification
  - multilingual
  - transformer
  - zambia
  - lusaka
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
  - Kelvinmbewe/mbert_Lusaka_Language_Analysis
  - Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis
  - Kelvinmbewe/mbert_LusakaLang_Topic
model-index:
  - name: LusakaLang-MultiTask
    results:
      - task:
          type: text-classification
          name: Language Identification
        dataset:
          name: LusakaLang Language Data
          type: lusakalang
          split: test
        metrics:
          - type: accuracy
            value: 0.97
            name: accuracy
          - type: f1
            value: 0.96
            name: f1_macro
      - task:
          type: text-classification
          name: Sentiment Analysis
        dataset:
          name: LusakaLang Sentiment Data
          type: lusakalang
          split: test
        metrics:
          - type: accuracy
            value: 0.9322
            name: accuracy
          - type: f1
            value: 0.9216
            name: f1_macro
          - type: f1
            value: 0.8649
            name: f1_negative
          - type: f1
            value: 0.95
            name: f1_neutral
          - type: f1
            value: 0.95
            name: f1_positive
      - task:
          type: text-classification
          name: Topic Classification
        dataset:
          name: LusakaLang Topic Data
          type: lusakalang
          split: test
        metrics:
          - type: accuracy
            value: 0.91
            name: accuracy
          - type: f1
            value: 0.9
            name: f1_macro
---

LusakaLang MultiTask Model

This model is a unified transformer architecture built on top of bert-base-multilingual-cased, designed to perform three tasks simultaneously:

  1. Language Identification
  2. Sentiment Analysis
  3. Topic Classification

The system integrates three fine‑tuned LusakaLang checkpoints:

  • mbert_Lusaka_Language_Analysis
  • mbert_LusakaLang_Sentiment_Analysis
  • mbert_LusakaLang_Topic

All three tasks share a single mBERT encoder, supported by three independent classifier heads. This design improves computational efficiency, reduces memory overhead, and promotes consistent predictions across tasks.
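The shared-encoder design can be sketched as follows. This is a minimal illustration, not the released implementation (the usage code below loads three separate checkpoints); the head sizes and label sets are assumptions based on the labels shown elsewhere in this card, and a tiny randomly initialised BERT stands in for bert-base-multilingual-cased so the sketch runs without a download.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class SharedEncoderMultiTask(nn.Module):
    """One encoder, three classifier heads reading the shared [CLS] vector."""
    def __init__(self, encoder, hidden_size, n_lang=4, n_sent=3, n_topic=5):
        super().__init__()
        self.encoder = encoder
        self.lang_head = nn.Linear(hidden_size, n_lang)    # e.g. English/Bemba/Nyanja/Mixed
        self.sent_head = nn.Linear(hidden_size, n_sent)    # negative/neutral/positive
        self.topic_head = nn.Linear(hidden_size, n_topic)  # driver/payment/app/support/availability

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]  # shared [CLS] representation feeds every head
        return self.lang_head(cls), self.sent_head(cls), self.topic_head(cls)

# Tiny random-weight encoder as a stand-in; in practice this would be
# AutoModel.from_pretrained("bert-base-multilingual-cased").
cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                 num_attention_heads=2, intermediate_size=64)
model = SharedEncoderMultiTask(BertModel(cfg), hidden_size=32)

ids = torch.randint(0, 100, (2, 8))        # batch of 2 token sequences
lang, sent, topic = model(ids)
print(lang.shape, sent.shape, topic.shape)
# torch.Size([2, 4]) torch.Size([2, 3]) torch.Size([2, 5])
```

Because all heads read the same encoder output, one forward pass serves all three tasks.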


Why This Model Matters

Zambian communication is inherently multilingual, fluid, and deeply shaped by context. A single message may blend English, Bemba, Nyanja, local slang, and frequent code‑switching, often expressed through culturally grounded idioms and subtle emotional cues. This model is designed specifically for that environment, where meaning depends not only on the words used but on how languages interact within a single utterance.

It excels at identifying the dominant language or detecting when multiple languages are being used together, interpreting sentiment even when it is conveyed indirectly or through culturally specific phrasing, and classifying text into practical topics such as driver behaviour, payment issues, app performance, customer support, and ride availability. By capturing these nuances, the model provides a more accurate and context‑aware understanding of real Zambian communication.


How to Use This Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

class LusakaLangMultiTask:
    """
    LusakaLang MultiTask Model:
    - Language Identification
    - Sentiment Analysis
    - Topic Classification
    """

    def __init__(self, path="Kelvinmbewe/LusakaLang-MultiTask",
                 lang_temp=1.0, sent_temp=1.0, topic_temp=1.0):
        # Load tokenizer and models
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.lang_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_Lusaka_Language_Analysis"
        )
        self.sent_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis"
        )
        self.topic_model = AutoModelForSequenceClassification.from_pretrained(
            "Kelvinmbewe/mbert_LusakaLang_Topic"
        )

        # ID2Label mappings
        self.lang_id2label = self.lang_model.config.id2label
        self.sent_id2label = self.sent_model.config.id2label
        self.topic_id2label = self.topic_model.config.id2label

        # Temperature scaling
        self.lang_temp = lang_temp
        self.sent_temp = sent_temp
        self.topic_temp = topic_temp

        # Device setup
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.lang_model.to(self.device)
        self.sent_model.to(self.device)
        self.topic_model.to(self.device)

    def predict_batch(self, texts, conf_threshold=0.5, batch_size=16):
        """
        Predict language, sentiment, and topic for a list of texts.
        Returns a list of dicts: [{"text": ..., "language":..., "sentiment":..., "topic":...}, ...]
        """
        results = []

        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]

            inputs = self.tokenizer(
                batch_texts,
                return_tensors="pt",
                truncation=True,
                padding=True
            ).to(self.device)

            with torch.no_grad():
                # Language
                lang_logits = self.lang_model(**inputs).logits / self.lang_temp
                lang_probs = F.softmax(lang_logits, dim=-1)
                lang_conf, lang_idx = torch.max(lang_probs, dim=-1)

                # Sentiment
                sent_logits = self.sent_model(**inputs).logits / self.sent_temp
                sent_probs = F.softmax(sent_logits, dim=-1)
                sent_conf, sent_idx = torch.max(sent_probs, dim=-1)

                # Topic
                topic_logits = self.topic_model(**inputs).logits / self.topic_temp
                topic_probs = F.softmax(topic_logits, dim=-1)
                topic_conf, topic_idx = torch.max(topic_probs, dim=-1)

            for j, text in enumerate(batch_texts):
                results.append({
                    "text": text,
                    "language": self.lang_id2label[lang_idx[j].item()]
                                if lang_conf[j].item() >= conf_threshold else "unknown",
                    "language_conf": round(lang_conf[j].item(), 3),
                    "sentiment": self.sent_id2label[sent_idx[j].item()],
                    "sentiment_conf": round(sent_conf[j].item(), 3),
                    "topic": self.topic_id2label[topic_idx[j].item()]
                                if topic_conf[j].item() >= conf_threshold else "unknown",
                    "topic_conf": round(topic_conf[j].item(), 3)
                })
        return results


# ================= Example Usage =================

llm = LusakaLangMultiTask(lang_temp=1.2, sent_temp=0.93, topic_temp=1.5)

samples = [
    "Driver was rude, shouting all the way",
    "Payment failed, money deducted but no ride",
    "Support did not reply to my complaint",
    "Umudriver alisala sana, alelanda ifintu ifipusa",
]

predictions = llm.predict_batch(samples, conf_threshold=0.5)

for p in predictions:
    print(f"TEXT: {p['text']}")
    print(f"  Language : {p['language']}  (conf={p['language_conf']})")
    print(f"  Sentiment: {p['sentiment']} (conf={p['sentiment_conf']})")
    print(f"  Topic    : {p['topic']}     (conf={p['topic_conf']})\n")
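The temperature arguments used above (lang_temp=1.2, sent_temp=0.93, topic_temp=1.5) divide each head's logits before the softmax: a temperature below 1 sharpens the distribution (higher top confidence), while one above 1 flattens it, which helps when a head is over- or under-confident. A quick illustration with made-up logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # hypothetical class logits
for t in (0.93, 1.0, 1.5):
    # Lower temperature concentrates probability mass on the top class;
    # higher temperature spreads it out. Probabilities always sum to 1.
    print(t, F.softmax(logits / t, dim=-1).tolist())
```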

Sample Output

TEXT: Driver was rude, shouting all the way
  Language : English  (conf=0.999)
  Sentiment: Negative (conf=0.968)
  Topic    : Driver Behaviour (conf=0.807)

TEXT: Payment failed, money deducted but no ride
  Language : English  (conf=0.996)
  Sentiment: Neutral  (conf=0.708)
  Topic    : Payment Issue (conf=0.634)

TEXT: Support did not reply to my complaint
  Language : English  (conf=1.0)
  Sentiment: Negative (conf=0.960)
  Topic    : Customer Support (conf=0.984)

TEXT: Umudriver alisala sana, alelanda ifintu ifipusa
  Language : Bemba    (conf=0.958)
  Sentiment: Negative (conf=0.874)
  Topic    : Driver Behaviour (conf=0.812)

Sentiment by Topic

                   negative   neutral   positive
driver_behaviour     82%        14%        4%
payment_issues       76%        20%        4%
app_issues           68%        25%        7%
support_issues       74%        21%        5%
others               29%        56%       15%

Language vs Topic

                   driver   payment   app   support   others
english            38%      24%     15%     10%      13%
bemba              41%      18%     12%     14%      15%
nyanja             36%      19%     17%     13%      15%
mixed              22%      11%     18%     16%      33%
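Tables like the two above can be derived directly from predict_batch output by cross-tabulating the predicted labels. A minimal sketch using made-up predictions (the real evaluation data is not included here):

```python
import pandas as pd

# Made-up records in the shape returned by predict_batch
preds = [
    {"topic": "driver_behaviour", "sentiment": "negative"},
    {"topic": "driver_behaviour", "sentiment": "negative"},
    {"topic": "payment_issues",  "sentiment": "negative"},
    {"topic": "payment_issues",  "sentiment": "neutral"},
]
df = pd.DataFrame(preds)

# Row-normalised percentages: share of each sentiment within each topic
table = pd.crosstab(df["topic"], df["sentiment"], normalize="index").mul(100).round(1)
print(table)
```

Swapping the "sentiment" column for "language" yields the language-vs-topic breakdown in the same way.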

Sample Executive Summary

Out of 10,000 customer complaints:

  • 61% are negative
  • 34% relate to driver behaviour
  • 21% involve payment issues
  • 17% of texts were classified as mixed language
  • The topic model shows lower confidence than the sentiment model
  • The "Others" category remains relatively high (18%)

Training Architecture

📥 Input                →  🧠 Core Engine              →            📈 Output
------------------------------------------------------------------------------------
Text (Any Language)     →   Tokenizer 🔤                       →     Language 🌍
                        →   Shared mBERT Encoder 🧠            →     Bemba / Nyanja /
                        →   CLS Vector 🎯                      →     English / Mixed
------------------------------------------------------------------------------------
User Feedback 💬        →   Tokenizer 🔤                       →     Sentiment ❤️
                        →   Shared Encoder 🧠                  →     Negative / Neutral /
                        →   CLS Vector 🎯                      →     Positive
------------------------------------------------------------------------------------
Ride Context 🚗         →   Tokenizer 🔤                       →     Topic 🗂️
                        →   Shared Encoder 🧠                  →     Driver / Payment /
                        →   CLS Vector 🎯                      →     Support / App / Availability
------------------------------------------------------------------------------------