---
{}
---

# Model Card for CIS5190FinalProj/RandomForest

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

This model classifies a news headline as coming from either NBC News or Fox News.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu
- **Task:** Binary classification (NBC News vs. Fox News)
- **Preprocessing:** TF-IDF vectorization applied to the headline text
  - `stop_words="english"`
  - `max_features=1000`
- **Model type:** Random Forest
- **Framework:** scikit-learn

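The preprocessing and model settings listed above can be sketched as a short training script. This is a minimal illustration, not the authors' training code: the headlines and their labels are invented placeholders, and the Random Forest uses scikit-learn defaults with a fixed `random_state`, since the card does not state the forest's hyperparameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, invented for illustration only; the labels are
# arbitrary placeholders and not real attributions to either outlet.
headlines = [
    "Senate passes budget bill after late-night vote",
    "Storm brings heavy rain to the coast",
    "Markets rally as inflation cools",
    "Governor signs new education law",
]
labels = ["NBC", "Fox", "NBC", "Fox"]

# TF-IDF settings taken from the model description
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(headlines)

# Default hyperparameters are an assumption; the seed only makes runs repeatable
clf = RandomForestClassifier(random_state=0)
clf.fit(X, labels)

# New headlines must go through the same fitted vectorizer before prediction
predictions = clf.predict(vectorizer.transform(["Senate vote delayed by storm"]))
print(predictions)
```

The fitted vectorizer and the classifier have to be saved and reused together, because the forest only understands the feature columns produced by that specific vocabulary.
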
#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Accuracy score; the evaluation script below also prints per-class precision, recall, and F1 through `classification_report`.

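As a small worked example of the accuracy metric, the sketch below computes it by hand (correct predictions divided by total predictions) and compares the result with scikit-learn's `accuracy_score`. The label lists are hypothetical values made up for illustration; they are not results from this model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels, invented for illustration
y_true = ["NBC", "Fox", "Fox", "NBC", "Fox"]
y_hat = ["NBC", "Fox", "NBC", "NBC", "Fox"]

# Accuracy by hand: number of matching pairs divided by the number of samples
manual_accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# The same quantity computed by scikit-learn
sk_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, sk_accuracy)
```

Four of the five toy predictions match, so both values are 4/5.
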
### Model Evaluation

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download
from sklearn.metrics import classification_report

# Mount Google Drive (Colab only) so the test CSV is reachable
from google.colab import drive
drive.mount('/content/drive')

# Load the test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face token (notebook shell command, not plain Python)
!huggingface-cli login

# Download the model and the fitted vectorizer; hf_hub_download returns local file paths
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl")
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl")

# Load both artifacts from disk
model = joblib.load(model_path)
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines and the ground-truth labels from the test set
X_test = test_df['title']
y_test = test_df['label']

# Transform the headlines into TF-IDF features with the fitted vectorizer
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Predict with the Random Forest model
y_pred = model.predict(X_test_transformed)

# Print the classification report
print(classification_report(y_test, y_pred))