---
license: apache-2.0
language:
- en
- de
---

# 🛡️ MLP Cybersecurity Classifier

This repository hosts a lightweight `scikit-learn`-based MLP classifier trained to distinguish cybersecurity-related content from other text, using sentence-transformer embeddings. It supports English and German input texts.

## 📊 Training Data

The model was trained on a multilingual dataset of cybersecurity and non-cybersecurity news articles. The dataset is publicly available on Zenodo:

🔗 [https://zenodo.org/records/16417939](https://zenodo.org/records/16417939)

## 📦 Model Details

- **Architecture**: `MLPClassifier` with hidden layers `(128, 64)`
- **Embedding model**: [`intfloat/multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large)
- **Input**: Cleaned article (removed stopwords) or report text
- **Output**: Binary label (e.g., `Cybersecurity`, `Not Cybersecurity`)
- **Languages**: English, German

## 🔧 Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download

# Load your cleaned dataset
df = pd.read_csv("your_dataset.csv")  # Requires 'clean_text' and 'label' columns

# Load the sentence transformer
embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"],
    df["label"],
    test_size=0.05,
    stratify=df["label"],
    random_state=42,
)

# Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Generate sentence embeddings
X_train_emb = embedder.encode(X_train.tolist(), convert_to_numpy=True, show_progress_bar=True)
X_test_emb = embedder.encode(X_test.tolist(), convert_to_numpy=True, show_progress_bar=True)

# Load the trained classifier
model_path = hf_hub_download(
    repo_id="selfconstruct3d/cybersec_classifier",
    filename="cybersec_classifier.pkl",
)
model = joblib.load(model_path)

# Predict
y_pred = model.predict(X_test_emb)
```
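
The usage code stops once `y_pred` is computed. The following is a minimal sketch of how those predictions might be scored against the held-out labels. The helper `score_predictions` and the toy label lists are hypothetical and not part of this repository. The sketch assumes the predictions and the reference labels share one representation (both strings or both encoded integers), since this README does not state which form the classifier returns.

```python
from collections import Counter


def score_predictions(y_true, y_pred, positive_label):
    """Return accuracy, precision, recall, and F1 for one positive label.

    Hypothetical helper: assumes y_true and y_pred use the same label
    representation (both strings or both encoded integers).
    """
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty prediction set")

    # Tally the four confusion-matrix cells for the chosen positive label.
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive_label:
            counts["tp" if truth == positive_label else "fp"] += 1
        else:
            counts["fn" if truth == positive_label else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels standing in for the held-out labels and the model's predictions.
toy_true = ["Cybersecurity", "Cybersecurity", "Not Cybersecurity", "Not Cybersecurity"]
toy_pred = ["Cybersecurity", "Not Cybersecurity", "Not Cybersecurity", "Not Cybersecurity"]
toy_scores = score_predictions(toy_true, toy_pred, positive_label="Cybersecurity")
print(toy_scores)
```

With the variables from the usage code, the call would be `score_predictions(y_test, y_pred, "Cybersecurity")` if the classifier returns string labels, or `score_predictions(y_test_enc, y_pred, label_encoder.transform(["Cybersecurity"])[0])` if it returns encoded integers. The label name `Cybersecurity` is taken from the example output listed under Model Details and may differ in your dataset. `sklearn.metrics.classification_report` provides a similar per-class summary if you prefer not to write the helper yourself.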